perf: branchless primitive zip kernel by joseph-isaacs · Pull Request #8270 · vortex-data/vortex

joseph-isaacs · 2026-06-05T15:38:25Z

Summary

Adds a dedicated primitive zip kernel that selects values branchlessly per row.

The generic zip path copies runs of if_true/if_false between mask boundaries — fast for clustered masks but degrading to per-element work on fragmented masks. This kernel walks the mask as 64-bit chunks and blends both sides per row with no data-dependent branch, so the inner loop stays branch-free and auto-vectorizable regardless of mask shape. Result validity reuses the shared zip_validity helper, which expresses validity selection as a (lazy) zip over the two validity bitmaps.

Stacked on #8275 (branchless boolean zip kernel). The shared zip_validity produces a boolean zip over the validity bitmaps, so the nullable path is only fast once the bool kernel lands — this PR is based on that branch and should merge after it. The diff here is primitive-only.

Changes

vortex-array/src/arrays/primitive/compute/zip.rs: branchless per-row value blend; validity via the shared zip_validity; tests spanning the 64-bit chunk boundary + remainder, non-nullable and nullable.
vortex-array/src/arrays/primitive/compute/mod.rs, .../vtable/kernel.rs: register the kernel.
vortex-array/benches/primitive_zip.rs: a small divan bench — one Fragmented (alternating) mask, non-nullable and nullable.

Performance (divan, local walltime, median)

At LEN = 65_536 (matching the original measurements), the nullable Fragmented case — which routes validity through the shared zip_validity → boolean zip — drops from ~43 ms (generic builder, before the bool kernel) to ~73 µs; non-nullable is ~58 µs. The committed bench uses LEN = 16_384 so each case stays well under a few hundred microseconds under CodSpeed's instruction-count simulation.

Testing

vortex-array zip tests pass (24, incl. the new primitive cases); cargo +nightly fmt, clippy -D warnings (default + all-features), and cargo doc -D warnings clean.

https://claude.ai/code/session_01N5ivPiCJy7dGQjMEP7ips9

codspeed-hq · 2026-06-05T15:47:31Z

Merging this PR will not alter performance

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 4 regressed benchmarks
✅ 1510 untouched benchmarks
🆕 2 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`bitwise_not_vortex_buffer_mut[128]`	216.9 ns	275.3 ns	-21.19%
❌	Simulation	`bitwise_not_vortex_buffer_mut[1024]`	278.6 ns	336.9 ns	-17.31%
❌	Simulation	`bitwise_not_vortex_buffer_mut[2048]`	342.2 ns	400.6 ns	-14.56%
❌	Simulation	`chunked_varbinview_canonical_into[(100, 100)]`	274.6 µs	309.3 µs	-11.25%
⚡	Simulation	`chunked_varbinview_canonical_into[(1000, 10)]`	198.2 µs	161.9 µs	+22.47%
🆕	Simulation	`nonnull`	N/A	252.5 µs	N/A
🆕	Simulation	`nullable`	N/A	285.2 µs	N/A

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing claude/primitive-branchless-zip (bdc1c4b) with claude/bool-branchless-zip (7ad4b18)}

Add a dedicated `ZipKernel for Primitive` that selects values branchlessly per 64-bit mask chunk instead of routing through the generic run-copy builder. For each chunk it blends `if_true`/`if_false` per row without a data-dependent branch, so the inner loop is auto-vectorizable and mask-shape-independent (memory-bandwidth-bound): up to ~900x faster on fragmented masks and ~8x on clustered masks, so no density-adaptive fallback is needed. Covers every native ptype. Result validity is computed by zipping the two boolean validity arrays with the same mask -- `Validity::Array(zip(mask, if_true_valid, if_false_valid))` -- reusing the zip machinery rather than re-deriving the mask algebra, with fast paths when both sides' validity already agrees. Adds a `primitive_zip` divan benchmark across fragmented/block/sparse/dense masks for nullable and non-nullable inputs. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

joseph-isaacs added the changelog/performance A performance improvement label Jun 5, 2026 — with Claude

joseph-isaacs force-pushed the claude/primitive-branchless-zip branch from 241e46d to bdc1c4b Compare June 5, 2026 18:29

joseph-isaacs requested a review from a team June 5, 2026 18:29

joseph-isaacs changed the base branch from develop to claude/bool-branchless-zip June 5, 2026 18:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: branchless primitive zip kernel#8270

perf: branchless primitive zip kernel#8270
joseph-isaacs wants to merge 1 commit into
claude/bool-branchless-zipfrom
claude/primitive-branchless-zip

joseph-isaacs commented Jun 5, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joseph-isaacs commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Performance (divan, local walltime, median)

Testing

Uh oh!

codspeed-hq Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Performance Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joseph-isaacs commented Jun 5, 2026 •

edited

Loading

codspeed-hq Bot commented Jun 5, 2026 •

edited

Loading