perf: branchless mask-select for listview zip #8264
Open
joseph-isaacs wants to merge 3 commits into
Merging this PR will degrade performance by 10.64%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | compare[15] | 119.9 µs | 145.6 µs | -17.7% |
| ❌ | Simulation | compare[14] | 117.5 µs | 141.3 µs | -16.88% |
| ❌ | Simulation | compare[13] | 115.5 µs | 137.6 µs | -16.05% |
| ⚡ | Simulation | compare[5] | 76.9 µs | 69.3 µs | +11.04% |
| 🆕 | Simulation | nonnull | N/A | 301.4 µs | N/A |
| 🆕 | Simulation | nullable | N/A | 322.2 µs | N/A |
Comparing claude/wizardly-carson-6Cixf (a9516cf) with develop (e06d80b)
Replace the per-element, data-dependent branch in the listview zip kernel's offset/size selection with a branchless, chunk-at-a-time mask select that the compiler can auto-vectorize. For each 64-bit mask chunk, each bit is expanded to a full-width lane mask and both sides are blended with `(t & m) | (f & !m)` via a shared `select_column` helper, so the inner loop is branch-free regardless of mask shape. `if_false` offsets are shifted into the second half of the concatenated elements as before. Adds a `listview_zip` divan benchmark across fragmented/block/sparse/dense masks for nullable and non-nullable inputs. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
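The blend described in this commit message can be sketched roughly as follows. This is not the code from `zip.rs`: the signature, the fixed `u32` element type, and the LSB-first bit order (bit `i` of chunk `c` maps to element `64 * c + i`) are assumptions made for illustration, and only the name `select_column` and the `(t & m) | (f & !m)` blend come from the PR text.

```rust
/// Hypothetical stand-in for the PR's `select_column` helper: yields `t[i]`
/// where mask bit `i` is set and `f[i]` otherwise, without branching on the
/// mask bits. Assumes LSB-first bit order within each 64-bit chunk.
fn select_column(mask: &[u64], t: &[u32], f: &[u32]) -> Vec<u32> {
    assert_eq!(t.len(), f.len(), "both sides must have the same length");
    assert!(mask.len() * 64 >= t.len(), "mask is too short for the inputs");

    let mut out = vec![0u32; t.len()];
    // Walk up to 64 lanes at a time, pairing each group with one mask chunk.
    // The last group may be shorter than 64 lanes (the remainder).
    for (((out_chunk, t_chunk), f_chunk), &bits) in out
        .chunks_mut(64)
        .zip(t.chunks(64))
        .zip(f.chunks(64))
        .zip(mask)
    {
        for (lane, ((o, &tv), &fv)) in
            out_chunk.iter_mut().zip(t_chunk).zip(f_chunk).enumerate()
        {
            // Expand bit `lane` to a full-width lane mask:
            // 1 becomes 0xFFFF_FFFF and 0 stays 0.
            let m = (((bits >> lane) & 1) as u32).wrapping_neg();
            // Blend both sides; no data-dependent branch is taken.
            *o = (tv & m) | (fv & !m);
        }
    }
    out
}
```

With this sketch, `select_column(&[0b1010], &[1, 2, 3, 4], &[10, 20, 30, 40])` returns `[10, 2, 30, 4]`: bits 1 and 3 are set, so those lanes come from `t`. Whether the compiler actually vectorizes the inner loop depends on the target and optimization level.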
Reduce the divan bench to one Fragmented (alternating) mask with non-nullable and nullable inputs, and lower LEN to 8192 so each case stays well under a few hundred microseconds. The branchless chunked select is mask-shape-independent, so a single shape suffices; drop the now-unused MaskShape matrix. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
CodSpeed's instruction-count simulation runs ~10x local walltime, putting the 8192-list bench at ~550us there. Drop to 4096 lists so each case stays well under a few hundred microseconds in CI while still exercising the select. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Summary
Replaces the per-element, data-dependent branch in the listview zip kernel's offset/size selection with a branchless, chunk-at-a-time mask select that the compiler can auto-vectorize.
For each 64-bit mask chunk, each bit is expanded to a full-width lane mask and both sides are blended with `(t & m) | (f & !m)` via a shared `select_column` helper, so the inner loop is branch-free regardless of mask shape. `if_false` offsets are shifted into the second half of the concatenated `elements` as before; sizes are taken verbatim from the chosen side.
Changes
- `vortex-array/src/arrays/listview/compute/zip.rs`: branchless chunked offset/size select via `select_column`; added a test that spans the 64-bit chunk boundary + remainder.
- `vortex-array/benches/listview_zip.rs`: a small divan bench with one Fragmented (alternating) mask and non-nullable and nullable inputs.
Benchmark
Because the branchless select is mask-shape-independent, the bench uses a single Fragmented mask (the worst case for the replaced per-element branch) with both nullability variants.
`LEN` is 8,192: listview zip cost is dominated by element concatenation and per-list canonicalization, so a few thousand lists already exercise the select while keeping each case well under a few hundred microseconds (~37 µs non-null, ~41 µs nullable locally).
The kernel change itself is a single- to low-double-digit-percent win on the offset/size select; end-to-end listview zip is dominated by element concatenation + canonicalization, so the select is a small slice of total cost.
Testing
`vortex-array` listview zip tests pass (incl. the new multi-chunk-span test); `cargo +nightly fmt` and `clippy -D warnings` (default + all-features) clean.
https://claude.ai/code/session_01N5ivPiCJy7dGQjMEP7ips9
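For readers who want a feel for what a multi-chunk-span check covers, here is a minimal sketch that compares a branchy reference against a branchless chunked select. It is not the test added in this PR. All function names, the `Vec<bool>` mask, the `u32` offsets, and the choice of `shift` (assumed to be the length of the `if_true` elements, so that `if_false` offsets land in the second half of the concatenation) are hypothetical.

```rust
/// Branchy reference: one data-dependent branch per list.
fn zip_offsets_reference(mask: &[bool], t: &[u32], f: &[u32], shift: u32) -> Vec<u32> {
    mask.iter()
        .zip(t.iter().zip(f))
        .map(|(&keep, (&tv, &fv))| if keep { tv } else { fv + shift })
        .collect()
}

/// Branchless chunked form: pack up to 64 mask bits into one word, then blend.
fn zip_offsets_branchless(mask: &[bool], t: &[u32], f: &[u32], shift: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(t.len());
    for ((m_chunk, t_chunk), f_chunk) in mask.chunks(64).zip(t.chunks(64)).zip(f.chunks(64)) {
        // Pack this chunk's booleans into a u64, LSB first.
        let bits = m_chunk
            .iter()
            .enumerate()
            .fold(0u64, |acc, (i, &b)| acc | ((b as u64) << i));
        for (lane, (&tv, &fv)) in t_chunk.iter().zip(f_chunk).enumerate() {
            let lane_mask = (((bits >> lane) & 1) as u32).wrapping_neg();
            // The shifted value is computed for every lane, selected or not,
            // so use wrapping_add to avoid a panic on lanes that are discarded.
            out.push((tv & lane_mask) | (fv.wrapping_add(shift) & !lane_mask));
        }
    }
    out
}

/// Returns true when both forms agree for `len` lists under an irregular mask.
fn agrees_for_len(len: usize) -> bool {
    let mask: Vec<bool> = (0..len).map(|i| i % 3 == 0).collect();
    let t: Vec<u32> = (0..len as u32).map(|i| i * 2).collect();
    let f: Vec<u32> = (0..len as u32).map(|i| i * 3).collect();
    let shift = 2 * len as u32; // stand-in for the if_true elements length
    zip_offsets_reference(&mask, &t, &f, shift) == zip_offsets_branchless(&mask, &t, &f, shift)
}
```

A length such as 130 exercises two full 64-bit chunks plus a 2-lane remainder, which is the shape the description calls out; `agrees_for_len(130)` should hold, as should lengths of exactly 64 and 1.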