
perf: branchless primitive zip kernel #8270

Open
joseph-isaacs wants to merge 1 commit into claude/bool-branchless-zip from claude/primitive-branchless-zip

joseph-isaacs (Contributor) commented Jun 5, 2026

Summary

Adds a dedicated primitive zip kernel that selects values branchlessly per row.

The generic zip path copies runs of if_true/if_false between mask boundaries — fast for clustered masks but degrading to per-element work on fragmented masks. This kernel walks the mask as 64-bit chunks and blends both sides per row with no data-dependent branch, so the inner loop stays branch-free and auto-vectorizable regardless of mask shape. Result validity reuses the shared zip_validity helper, which expresses validity selection as a (lazy) zip over the two validity bitmaps.
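The per-row blend described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the function name `zip_branchless`, the fixed `i32` element type, and the LSB-first `&[u64]` mask layout are all assumptions made for the example.

```rust
/// Hypothetical helper (not the PR's kernel): picks `if_true[i]` where bit `i`
/// of the mask is set and `if_false[i]` otherwise.
/// Assumption: the mask is packed LSB-first into 64-bit words.
fn zip_branchless(mask: &[u64], if_true: &[i32], if_false: &[i32]) -> Vec<i32> {
    assert_eq!(if_true.len(), if_false.len());
    let len = if_true.len();
    assert!(mask.len() * 64 >= len, "mask must cover every row");

    let mut out = Vec::with_capacity(len);
    // Walk the values in groups of 64 rows, one mask word per group.
    // The last group may be shorter (the remainder).
    for (chunk_idx, (t_chunk, f_chunk)) in
        if_true.chunks(64).zip(if_false.chunks(64)).enumerate()
    {
        let bits = mask[chunk_idx];
        for (j, (&t, &f)) in t_chunk.iter().zip(f_chunk).enumerate() {
            // Turn the mask bit (0 or 1) into all-zeros or all-ones.
            let select = (((bits >> j) & 1) as i32).wrapping_neg();
            // Blend both sides; no branch depends on the mask value.
            out.push((t & select) | (f & !select));
        }
    }
    out
}
```

Both inputs are read for every row, so the work done does not depend on how the set bits are distributed in the mask. Whether the compiler actually vectorizes a loop like this depends on the element type and target, so it is worth checking the generated code rather than assuming it.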

Stacked on #8275 (branchless boolean zip kernel). The shared zip_validity produces a boolean zip over the validity bitmaps, so the nullable path is only fast once the bool kernel lands — this PR is based on that branch and should merge after it. The diff here is primitive-only.

Changes

  • vortex-array/src/arrays/primitive/compute/zip.rs: branchless per-row value blend; validity via the shared zip_validity; tests spanning the 64-bit chunk boundary + remainder, non-nullable and nullable.
  • vortex-array/src/arrays/primitive/compute/mod.rs, .../vtable/kernel.rs: register the kernel.
  • vortex-array/benches/primitive_zip.rs: a small divan bench — one Fragmented (alternating) mask, non-nullable and nullable.

Performance (divan, local walltime, median)

At LEN = 65_536 (matching the original measurements), the nullable Fragmented case — which routes validity through the shared zip_validity → boolean zip — drops from ~43 ms (generic builder, before the bool kernel) to ~73 µs; non-nullable is ~58 µs. The committed bench uses LEN = 16_384 so each case stays well under a few hundred microseconds under CodSpeed's instruction-count simulation.

Testing

  • vortex-array zip tests pass (24, incl. the new primitive cases); cargo +nightly fmt, clippy -D warnings (default + all-features), and cargo doc -D warnings clean.

https://claude.ai/code/session_01N5ivPiCJy7dGQjMEP7ips9

joseph-isaacs added the changelog/performance label (A performance improvement) Jun 5, 2026, with Claude
codspeed-hq Bot commented Jun 5, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 4 regressed benchmarks
✅ 1510 untouched benchmarks
🆕 2 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | bitwise_not_vortex_buffer_mut[128] | 216.9 ns | 275.3 ns | -21.19% |
| Simulation | bitwise_not_vortex_buffer_mut[1024] | 278.6 ns | 336.9 ns | -17.31% |
| Simulation | bitwise_not_vortex_buffer_mut[2048] | 342.2 ns | 400.6 ns | -14.56% |
| Simulation | chunked_varbinview_canonical_into[(100, 100)] | 274.6 µs | 309.3 µs | -11.25% |
| Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 198.2 µs | 161.9 µs | +22.47% |
| Simulation | 🆕 nonnull | N/A | 252.5 µs | N/A |
| Simulation | 🆕 nullable | N/A | 285.2 µs | N/A |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/primitive-branchless-zip (bdc1c4b) with claude/bool-branchless-zip (7ad4b18)


Add a dedicated `ZipKernel for Primitive` that selects values branchlessly per
64-bit mask chunk instead of routing through the generic run-copy builder. For
each chunk it blends `if_true`/`if_false` per row without a data-dependent
branch, so the inner loop is auto-vectorizable and mask-shape-independent
(memory-bandwidth-bound): up to ~900x faster on fragmented masks and ~8x on
clustered masks, so no density-adaptive fallback is needed. Covers every native
ptype.

Result validity is computed by zipping the two boolean validity arrays with the
same mask -- `Validity::Array(zip(mask, if_true_valid, if_false_valid))` -- reusing
the zip machinery rather than re-deriving the mask algebra, with fast paths when
both sides' validity already agrees.
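
As a rough illustration of why validity selection is itself a boolean zip, the sketch below applies the same mask to two packed validity bitmaps one word at a time. The name `zip_validity_words` and the `&[u64]` bitmap layout are assumptions for the example; the shared `zip_validity` helper is described as lazy, whereas this sketch computes the result eagerly.

```rust
/// Hypothetical sketch (not the shared `zip_validity` helper): for each bit,
/// result = if mask { true_valid } else { false_valid }.
/// Assumption: all three bitmaps are packed into u64 words of equal length.
fn zip_validity_words(mask: &[u64], true_valid: &[u64], false_valid: &[u64]) -> Vec<u64> {
    assert_eq!(mask.len(), true_valid.len());
    assert_eq!(mask.len(), false_valid.len());
    mask.iter()
        .zip(true_valid.iter().zip(false_valid))
        // Take bits of `t` where the mask is set and bits of `f` elsewhere.
        .map(|(&m, (&t, &f))| (m & t) | (!m & f))
        .collect()
}
```

When both sides carry identical validity words, the expression reduces to that shared word whatever the mask is, which is the algebraic reason a fast path for agreeing validity is possible.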

Adds a `primitive_zip` divan benchmark across fragmented/block/sparse/dense masks
for nullable and non-nullable inputs.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs force-pushed the claude/primitive-branchless-zip branch from 241e46d to bdc1c4b Compare June 5, 2026 18:29
@joseph-isaacs joseph-isaacs requested a review from a team June 5, 2026 18:29
@joseph-isaacs joseph-isaacs changed the base branch from develop to claude/bool-branchless-zip June 5, 2026 18:29
