Optimize bit iteration and validity checking with SIMD popcount#8261
Optimize bit iteration and validity checking with SIMD popcount#8261joseph-isaacs wants to merge 5 commits into
Conversation
`Mask::valid_counts_for_indices` walked the validity bit buffer one bit at a time via `BitIterator::next`, accumulating a running valid count. The pco/zstd slice decoders call it with `[slice_start, slice_stop]`, so the cost was a bit-by-bit scan of the entire prefix up to `slice_stop`. Replace the per-bit walk with an incremental SIMD popcount over each gap between consecutive indices. Total scanned bits are unchanged, but the work now goes through the vectorized `count_ones` path. Add `BitBuffer::count_range(start, end)`, which counts set bits in a range directly over the backing buffer without cloning a sliced `BitBuffer`, so the many-indices case has no per-gap allocation overhead. Benchmark (vortex-mask/benches/valid_counts.rs, median): slice_bounds 16384 8.7us -> 51ns (~170x) slice_bounds 262144 138us -> 310ns (~445x) slice_bounds 1M 554us -> 1.86us (~300x) many_indices 16384 10.1us -> 1.68us (~6x) many_indices 1M 613us -> 3.33us (~185x) Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Two more spots that walked a bitmap one bit at a time where a chunked
operation is available.
runend filter (`filter_run_end_primitive`): the per-run loop read the
predicate mask bit-by-bit via `value_unchecked`, accumulating a popcount
and an OR. Replace with a single `BitBuffer::count_range` (SIMD popcount)
per run; `keep` becomes `run_trues > 0`. Semantics are identical. Expose
the helper via the existing `_benchmarking` module and add a divan bench
that drives the kernel directly (note: `ArrayRef::filter` is lazy and only
builds a `FilterArray`, so it does not exercise the kernel).
run_end_filter (run=16), median:
16k 19.7us -> 8.5us (~2.3x)
256k 320us -> 133us (~2.4x)
1M 1.27ms -> 0.54ms (~2.3x)
duckdb bool export (`BoolExporter::export`): unpack bits directly into the
DuckDB byte-bool destination instead of `.iter().collect::<Vec<bool>>()`
followed by `copy_from_slice`, dropping a throwaway allocation and a copy
pass. Add a self-contained microbench (no live DuckDB vector needed).
bool_export (offset 1), median:
1k ~equal
16k ~equal
256k ~328us -> ~229us (~1.3x)
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ow is_valid
`Validity::is_valid(i)` on the `Validity::Array` variant spins up a fresh
execution context and runs a scalar lookup on every call. Two loops in
`l2_denorm` (`normalize_as_l2_denorm` and the `validate_l2_normalized_rows_
against_norms` constructor check, which runs on every `L2Denorm::try_new_array`)
called it once per row, so building/validating a nullable tensor paid that
cost `row_count` times.
Materialize the validity into a `Mask` once via `execute_mask` and read it
with O(1) `Mask::value`. For non-nullable inputs `execute_mask` returns
`Mask::AllTrue` with no work, so the common path is unchanged.
Add a microbench (`vortex-array/benches/validity_is_valid.rs`) isolating the
cost that was removed — per-element `is_valid` vs `execute_mask` + `value`
over an array-backed validity:
is_valid_per_element vs execute_mask_then_value (median):
256 18.6us -> 0.37us (~50x)
1024 74.0us -> 0.81us (~91x)
4096 299.6us -> 3.07us (~98x)
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
… filter; document the pattern Generalizes the fix from the l2_denorm change: `Validity::is_valid(i)` on the `Validity::Array` variant allocates an execution context and runs a scalar lookup per call, so calling it (or re-deriving validity) inside an O(n) loop is the dominant cost. Materialize the validity into a `Mask` once, then read it with O(1) `Mask::value`. - listview `rebuild_by_offset` / `rebuild_list_by_list`: hoisted `self.validity()` (which re-derives the listview validity every iteration) and replaced per-row `is_valid` with a once-materialized mask. - varbin `filter_select_var_bin_by_index`: materialize the validity mask once (threading the real `ExecutionCtx`) and index it per selected row. Prevention: - AGENTS.md: new "Performance: avoid hidden-cost accessors in hot loops" section describing the generalizable pattern (a per-element accessor that hides amortizable work in an O(n) loop), an accessor->bulk replacement table, and a per-site decision rule (sequential -> bulk; gather -> materialize once then O(1) reads; already-O(1) -> leave it). Plus a Common Mistakes checklist item. - `Validity::is_valid`/`is_null`: `# Performance` doc warnings pointing at `execute_mask` + `Mask::value`. - Benchmarks `vortex-array/benches/validity_is_valid.rs` and `vortex-mask/benches/valid_counts.rs` serve as examples and regression guards. Verified: vortex-array varbin filter (6) and listview (143) tests pass; clippy and fmt clean. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…t for Mask indices Benchmarking the strategies for a validity-gated loop (process each set bit) showed that per-element `value(i)` and arrow's `BitIterator` are both dominated by the per-element branch and are 2-45x slower than iterating a `u64` word at a time. So materializing validity into a mask and then looping with `value(i)` — as several call sites (and our earlier fixes) do — still leaves a lot on the table. Add `BitBuffer::for_each_set_index`, which iterates a word at a time with all-set / all-unset fast paths and walks set bits via `trailing_zeros`. It self-adapts from sparse to dense and needs no index/slice materialization. It also beats collecting `set_indices()` ~2x at mid/high density (the iterator's per-`next` state inlines less well). `mask_iteration` bench, 65536 elements, median (sum over set bits): density value(i) BitIterator for_each_set_index set_slices set_indices 0.01 47.4us 46.6us 2.0us 3.8us 1.9us 0.10 52.6us 47.4us 4.9us 6.3us 8.2us 0.50 69.6us 55.5us 20.0us 19.1us 40.0us 0.90 69.9us 53.0us 25.5us 28.6us 65.0us 0.99 70.4us 54.2us 26.3us 35.5us 53.9us Use it to build `MaskValues::indices()` instead of `extend(set_indices())`. Docs: extend the AGENTS.md performance section to recommend word-at-a-time iteration over per-element `value(i)` and over `set_indices()` for dense masks. Verified: vortex-buffer (10 new) and vortex-mask (117) tests pass; clippy/fmt clean. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Merging this PR will not alter performance
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | compare[31] |
157.6 µs | 213.7 µs | -26.26% |
| ❌ | Simulation | compare[30] |
155.1 µs | 209.3 µs | -25.9% |
| ❌ | Simulation | compare[29] |
153.2 µs | 205.6 µs | -25.47% |
| ❌ | Simulation | compare[28] |
150.5 µs | 201 µs | -25.11% |
| ❌ | Simulation | compare[27] |
149.2 µs | 197.9 µs | -24.59% |
| ❌ | Simulation | compare[26] |
146.8 µs | 193.5 µs | -24.14% |
| ❌ | Simulation | compare[33] |
190.5 µs | 250.1 µs | -23.83% |
| ❌ | Simulation | compare[25] |
144.9 µs | 189.8 µs | -23.65% |
| ❌ | Simulation | compare[24] |
140.8 µs | 183.9 µs | -23.41% |
| ❌ | Simulation | compare[23] |
140.6 µs | 181.8 µs | -22.67% |
| ❌ | Simulation | compare[29] |
181.1 µs | 233.2 µs | -22.35% |
| ❌ | Simulation | compare[22] |
138.2 µs | 177.5 µs | -22.16% |
| ❌ | Simulation | compare[21] |
136.3 µs | 173.8 µs | -21.54% |
| ❌ | Simulation | compare[20] |
133.2 µs | 168.8 µs | -21.06% |
| ❌ | Simulation | compare[25] |
171.7 µs | 216.4 µs | -20.64% |
| ❌ | Simulation | compare[19] |
132 µs | 165.6 µs | -20.32% |
| ❌ | Simulation | compare[18] |
129.4 µs | 161.3 µs | -19.77% |
| ❌ | Simulation | compare[16] |
117.8 µs | 145.8 µs | -19.25% |
| ❌ | Simulation | compare[17] |
127.6 µs | 157.6 µs | -19.05% |
| ❌ | Simulation | compare[16] |
120.5 µs | 148.6 µs | -18.91% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing claude/bit-iterator-perf-0IMQ5 (096369c) with develop (bfe88b8)
Polar Signals Profiling ResultsLatest Run
Previous Runs (1)
Powered by Polar Signals Cloud |
Benchmarks: PolarSignals ProfilingVortex (geomean): 1.036x ➖ How to read Verdict and Engines
datafusion / vortex-file-compressed (1.036x ➖, 0↑ 1↓)
No file size changes detected. |
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.118x ❌, 0↑ 7↓)
datafusion / vortex-compact (1.026x ➖, 0↑ 1↓)
datafusion / parquet (1.021x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.086x ➖, 0↑ 3↓)
duckdb / vortex-compact (1.104x ❌, 0↑ 6↓)
duckdb / parquet (1.056x ➖, 0↑ 1↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
|
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.093x ➖, 0↑ 9↓)
datafusion / vortex-compact (1.192x ❌, 0↑ 18↓)
datafusion / parquet (1.049x ➖, 2↑ 7↓)
datafusion / arrow (1.060x ➖, 0↑ 6↓)
duckdb / vortex-file-compressed (1.292x ❌, 0↑ 20↓)
duckdb / vortex-compact (1.111x ❌, 0↑ 11↓)
duckdb / parquet (1.052x ➖, 1↑ 5↓)
duckdb / duckdb (1.081x ➖, 0↑ 6↓)
File Size Changes (10 files changed, -0.1% overall, 4↑ 6↓)
Totals:
|
Benchmarks: TPC-DS SF=1 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.038x ➖, 1↑ 15↓)
datafusion / vortex-compact (0.999x ➖, 3↑ 3↓)
datafusion / parquet (1.094x ➖, 0↑ 42↓)
duckdb / vortex-file-compressed (1.002x ➖, 1↑ 3↓)
duckdb / vortex-compact (0.982x ➖, 3↑ 0↓)
duckdb / parquet (1.003x ➖, 0↑ 0↓)
duckdb / duckdb (0.998x ➖, 1↑ 1↓)
File Size Changes (6 files changed, -0.0% overall, 1↑ 5↓)
Totals:
|
Benchmarks: FineWeb S3Verdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.265x ➖, 0↑ 3↓)
datafusion / vortex-compact (1.097x ➖, 0↑ 2↓)
datafusion / parquet (1.177x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.064x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.026x ➖, 0↑ 0↓)
duckdb / parquet (1.048x ➖, 0↑ 0↓)
|
Benchmarks: Statistical and Population GeneticsVerdict: No clear signal (low confidence) How to read Verdict and Engines
duckdb / vortex-file-compressed (0.997x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.992x ➖, 0↑ 0↓)
duckdb / parquet (1.008x ➖, 0↑ 0↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
|
Benchmarks: TPC-H SF=10 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.009x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.000x ➖, 0↑ 0↓)
datafusion / parquet (1.011x ➖, 0↑ 0↓)
datafusion / arrow (1.011x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.009x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.000x ➖, 0↑ 0↓)
duckdb / parquet (0.997x ➖, 0↑ 0↓)
duckdb / duckdb (1.002x ➖, 0↑ 0↓)
File Size Changes (26 files changed, +0.0% overall, 18↑ 8↓)
Totals:
|
Benchmarks: TPC-H SF=1 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.009x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.103x ➖, 0↑ 4↓)
datafusion / parquet (1.049x ➖, 1↑ 3↓)
duckdb / vortex-file-compressed (1.019x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.030x ➖, 0↑ 0↓)
duckdb / parquet (1.021x ➖, 0↑ 0↓)
|
Benchmarks: Appian on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.007x ➖, 0↑ 0↓)
datafusion / parquet (0.998x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.003x ➖, 0↑ 0↓)
duckdb / parquet (1.000x ➖, 0↑ 0↓)
duckdb / duckdb (1.000x ➖, 0↑ 0↓)
File Size Changes (4 files changed, -0.0% overall, 2↑ 2↓)
Totals:
|
Benchmarks: TPC-H SF=10 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.017x ➖, 0↑ 2↓)
datafusion / vortex-compact (1.087x ➖, 0↑ 3↓)
datafusion / parquet (1.005x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.096x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.124x ➖, 0↑ 0↓)
duckdb / parquet (1.019x ➖, 0↑ 0↓)
|
Benchmarks: Random AccessVortex (geomean): 0.922x ➖ How to read Verdict and Engines
unknown / unknown (1.005x ➖, 9↑ 6↓)
|
Benchmarks: Clickbench on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.066x ➖, 1↑ 7↓)
datafusion / parquet (1.058x ➖, 0↑ 5↓)
duckdb / vortex-file-compressed (1.052x ➖, 1↑ 5↓)
duckdb / parquet (1.032x ➖, 0↑ 0↓)
duckdb / duckdb (1.043x ➖, 0↑ 0↓)
File Size Changes (105 files changed, -0.0% overall, 52↑ 53↓)
Totals:
|
Benchmarks: CompressionVortex (geomean): 1.007x ➖ How to read Verdict and Engines
unknown / unknown (1.014x ➖, 0↑ 2↓)
|
Summary
This PR improves performance of bit iteration and validity checking by:
Adding
BitBuffer::count_range(start, end)— a SIMD popcount over a bit range without allocating a newBitBuffer. This replaces bit-by-bit walks in hot paths like run-end filtering.Adding
BitBuffer::for_each_set_index(f)— a word-at-a-time iterator that fast-paths all-set/all-unset words and usestrailing_zerosto walk set bits. Benchmarks show 2-45x speedup over per-elementvalue(i)checks, and ~2x over collectingset_indices().Optimizing
Mask::valid_counts_for_indices— replacing bit-by-bit iteration withcount_rangefor SIMD popcount across gaps.Fixing array-backed validity performance anti-patterns — materializing validity into a
Maskonce rather than callingValidity::is_valid(i)per row in loops (which allocates an execution context per call for array-backed validity).Adding comprehensive performance guidance to AGENTS.md — documenting the hidden-cost accessor trap and providing a decision table for when to materialize vs. iterate.
Adding benchmarks to measure and validate these optimizations:
vortex-mask/benches/mask_iteration.rs— compares strategies for processing valid elements across densitiesvortex-mask/benches/valid_counts.rs— measuresMask::valid_counts_for_indicesperformanceencodings/runend/benches/run_end_filter.rs— isolates the run-end filter kernelvortex-array/benches/validity_is_valid.rs— contrasts per-element vs. materialized validityvortex-duckdb/benches/bool_export.rs— measures bit-to-byte unpackingCode Changes
vortex-buffer/src/bit/buf.rs:
count_range(start, end)— O(1) allocation, SIMD popcount over a rangefor_each_set_index(f)— word-at-a-time iteration with all-set/all-unset fast pathsvortex-mask/src/lib.rs:
Mask::valid_counts_for_indicesnow usescount_rangeinstead of bit-by-bit iterationencodings/runend/src/compute/filter.rs:
filter_run_end_primitivenow usescount_rangefor per-run popcount instead of loopingvalue_uncheckedvortex-array/src/validity.rs:
is_validandis_nullwarning against per-element calls in loopsvortex-duckdb/src/exporter/bool.rs:
collect::<Vec<bool>>()+copy_from_slicewith direct zip-write to avoid allocationvortex-array/src/arrays/listview/rebuild.rs and vortex-tensor/src/scalar_fns/l2_denorm.rs and vortex-array/src/arrays/varbin/compute/filter.rs:
Maskonce before looping, avoiding per-element execution context allocationAGENTS.md:
Testing
count_range(various ranges, with offsets) andfor_each_set_index(various lengths, with offsets, all-set case)https://claude.ai/code/session_018J3ZqwcL4Se1wWc5EPY6i1