perf: branchless primitive zip kernel #8270
Open
joseph-isaacs wants to merge 1 commit into
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] | 216.9 ns | 275.3 ns | -21.19% |
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[1024] | 278.6 ns | 336.9 ns | -17.31% |
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[2048] | 342.2 ns | 400.6 ns | -14.56% |
| ❌ | Simulation | chunked_varbinview_canonical_into[(100, 100)] | 274.6 µs | 309.3 µs | -11.25% |
| ⚡ | Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 198.2 µs | 161.9 µs | +22.47% |
| 🆕 | Simulation | nonnull | N/A | 252.5 µs | N/A |
| 🆕 | Simulation | nullable | N/A | 285.2 µs | N/A |
Comparing claude/primitive-branchless-zip (bdc1c4b) with claude/bool-branchless-zip (7ad4b18)
Add a dedicated `ZipKernel for Primitive` that selects values branchlessly per 64-bit mask chunk instead of routing through the generic run-copy builder.

For each chunk it blends `if_true`/`if_false` per row without a data-dependent branch, so the inner loop is auto-vectorizable and mask-shape-independent (memory-bandwidth-bound): up to ~900x faster on fragmented masks and ~8x on clustered masks, so no density-adaptive fallback is needed. Covers every native ptype.

Result validity is computed by zipping the two boolean validity arrays with the same mask -- `Validity::Array(zip(mask, if_true_valid, if_false_valid))` -- reusing the zip machinery rather than re-deriving the mask algebra, with fast paths when both sides' validity already agrees.

Adds a `primitive_zip` divan benchmark across fragmented/block/sparse/dense masks for nullable and non-nullable inputs.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
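As a rough illustration of the per-chunk blend this commit message describes, the sketch below selects between two `i32` slices using a bit-packed mask. It is not the kernel from this PR: the function name `zip_branchless`, the LSB-first `u64` mask layout, and the fixed `i32` element type are all assumptions made for the example.

```rust
/// Hypothetical helper, not the Vortex API.
/// Picks `if_true[i]` where bit `i` of `mask` is set, else `if_false[i]`.
/// Assumes one mask bit per row, packed LSB-first into u64 words.
fn zip_branchless(mask: &[u64], if_true: &[i32], if_false: &[i32]) -> Vec<i32> {
    assert_eq!(if_true.len(), if_false.len());
    let len = if_true.len();
    assert!(mask.len() * 64 >= len, "mask must cover every row");

    let mut out = Vec::with_capacity(len);
    // `chunks(64)` yields full 64-row chunks followed by the remainder, if any.
    for (word, (t_chunk, f_chunk)) in if_true.chunks(64).zip(if_false.chunks(64)).enumerate() {
        let bits = mask[word];
        for (i, (&t, &f)) in t_chunk.iter().zip(f_chunk).enumerate() {
            // Expand the mask bit to all-zeros (0) or all-ones (-1).
            let m = (((bits >> i) & 1) as i32).wrapping_neg();
            // Bitwise blend: no branch depends on the mask contents.
            out.push((t & m) | (f & !m));
        }
    }
    out
}
```

The work per row is the same whether the mask alternates every bit or is one long run, which is the property the commit message calls mask-shape-independent. Whether the compiler actually vectorizes a loop like this depends on the target and optimization level.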
241e46d to
bdc1c4b
Summary
Adds a dedicated primitive zip kernel that selects values branchlessly per row.
The generic zip path copies runs of `if_true`/`if_false` between mask boundaries, which is fast for clustered masks but degrades to per-element work on fragmented masks. This kernel walks the mask as 64-bit chunks and blends both sides per row with no data-dependent branch, so the inner loop stays branch-free and auto-vectorizable regardless of mask shape. Result validity reuses the shared `zip_validity` helper, which expresses validity selection as a (lazy) zip over the two validity bitmaps.
Changes
- `vortex-array/src/arrays/primitive/compute/zip.rs`: branchless per-row value blend; validity via the shared `zip_validity`; tests spanning the 64-bit chunk boundary + remainder, non-nullable and nullable.
- `vortex-array/src/arrays/primitive/compute/mod.rs`, `.../vtable/kernel.rs`: register the kernel.
- `vortex-array/benches/primitive_zip.rs`: a small divan bench with one Fragmented (alternating) mask, non-nullable and nullable.
Performance (divan, local walltime, median)
At `LEN = 65_536` (matching the original measurements), the nullable Fragmented case, which routes validity through the shared `zip_validity` → boolean zip, drops from ~43 ms (generic builder, before the bool kernel) to ~73 µs; non-nullable is ~58 µs. The committed bench uses `LEN = 16_384` so each case stays well under a few hundred microseconds under CodSpeed's instruction-count simulation.
Testing
`vortex-array` zip tests pass (24, incl. the new primitive cases); `cargo +nightly fmt`, `clippy -D warnings` (default + all-features), and `cargo doc -D warnings` clean.
https://claude.ai/code/session_01N5ivPiCJy7dGQjMEP7ips9
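To make the Summary's point about validity concrete, here is a minimal sketch of validity selection written as a zip over two validity bitmaps. It does not reproduce the real `zip_validity` helper: the `Validity` alias (with `None` meaning "all rows valid"), the function name `zip_validity_sketch`, and the eager word-at-a-time evaluation are assumptions for illustration, whereas the PR describes the actual helper as lazy.

```rust
/// Hypothetical stand-in for a validity bitmap, not the Vortex `Validity` type.
/// `None` means every row is valid; `Some(words)` holds one bit per row, LSB-first.
type Validity = Option<Vec<u64>>;

/// Hypothetical sketch: row `i` of the result is valid if the side chosen by
/// mask bit `i` is valid at row `i`. Bits past the logical length are unspecified.
fn zip_validity_sketch(mask: &[u64], if_true: &Validity, if_false: &Validity) -> Validity {
    match (if_true, if_false) {
        // Fast path: both sides are fully valid, so the result is too.
        (None, None) => None,
        _ => Some(
            mask.iter()
                .enumerate()
                .map(|(w, &m)| {
                    // Treat a missing bitmap as all-ones for this word.
                    let t = if_true.as_ref().map_or(u64::MAX, |v| v[w]);
                    let f = if_false.as_ref().map_or(u64::MAX, |v| v[w]);
                    // Same select as for values, applied 64 rows at a time.
                    (t & m) | (f & !m)
                })
                .collect(),
        ),
    }
}
```

The sketch only shows that validity selection has the same shape as value selection, which is why the boolean zip path can be reused for it.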