Skip to content

perf: branchless mask-select for listview zip#8264

Open
joseph-isaacs wants to merge 3 commits into
developfrom
claude/wizardly-carson-6Cixf
Open

perf: branchless mask-select for listview zip#8264
joseph-isaacs wants to merge 3 commits into
developfrom
claude/wizardly-carson-6Cixf

Conversation

@joseph-isaacs
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs commented Jun 5, 2026

Summary

Replaces the per-element, data-dependent branch in the listview zip kernel's offset/size selection with a branchless, chunk-at-a-time mask select that the compiler can auto-vectorize.

For each 64-bit mask chunk, each bit is expanded to a full-width lane mask and both sides are blended with (t & m) | (f & !m) via a shared select_column helper, so the inner loop is branch-free regardless of mask shape. if_false offsets are shifted into the second half of the concatenated elements as before; sizes are taken verbatim from the chosen side.

The primitive-array zip kernel that was previously bundled here has been split into its own PR.

Changes

  • vortex-array/src/arrays/listview/compute/zip.rs: branchless chunked offset/size select via select_column; added a test that spans the 64-bit chunk boundary + remainder.
  • vortex-array/benches/listview_zip.rs: a small divan bench — one Fragmented (alternating) mask with non-nullable and nullable inputs.

Benchmark

Because the branchless select is mask-shape-independent, the bench uses a single Fragmented mask (the worst case for the replaced per-element branch) with both nullability variants. LEN is 8,192: listview zip cost is dominated by element concatenation and per-list canonicalization, so a few thousand lists already exercise the select while keeping each case well under a few hundred microseconds (~37 µs non-null, ~41 µs nullable locally).

The kernel change itself is a single-to-low-double-digit-percent win on the offset/size select; end-to-end listview zip is dominated by element concatenation + canonicalization, so the select is a small slice of total cost.

Testing

  • vortex-array listview zip tests pass (incl. the new multi-chunk-span test); cargo +nightly fmt and clippy -D warnings (default + all-features) clean.

https://claude.ai/code/session_01N5ivPiCJy7dGQjMEP7ips9

@joseph-isaacs joseph-isaacs added the changelog/performance A performance improvement label Jun 5, 2026 — with Claude
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Jun 5, 2026

Merging this PR will degrade performance by 10.64%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 3 regressed benchmarks
✅ 1509 untouched benchmarks
🆕 2 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation compare[15] 119.9 µs 145.6 µs -17.7%
Simulation compare[14] 117.5 µs 141.3 µs -16.88%
Simulation compare[13] 115.5 µs 137.6 µs -16.05%
Simulation compare[5] 76.9 µs 69.3 µs +11.04%
🆕 Simulation nonnull N/A 301.4 µs N/A
🆕 Simulation nullable N/A 322.2 µs N/A

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/wizardly-carson-6Cixf (a9516cf) with develop (e06d80b)

Open in CodSpeed

@joseph-isaacs joseph-isaacs marked this pull request as draft June 5, 2026 10:43
@joseph-isaacs joseph-isaacs force-pushed the claude/wizardly-carson-6Cixf branch from d11d3b8 to ddeb10d Compare June 5, 2026 14:33
@joseph-isaacs joseph-isaacs changed the title Add branchless zip kernel for primitive arrays perf: branchless mask-select for primitive and listview zip Jun 5, 2026
@joseph-isaacs joseph-isaacs marked this pull request as ready for review June 5, 2026 14:34
@joseph-isaacs joseph-isaacs force-pushed the claude/wizardly-carson-6Cixf branch from ddeb10d to 8ea7fb9 Compare June 5, 2026 14:47
Replace the per-element, data-dependent branch in the listview zip kernel's
offset/size selection with a branchless, chunk-at-a-time mask select that the
compiler can auto-vectorize. For each 64-bit mask chunk, each bit is expanded
to a full-width lane mask and both sides are blended with `(t & m) | (f & !m)`
via a shared `select_column` helper, so the inner loop is branch-free
regardless of mask shape. `if_false` offsets are shifted into the second half
of the concatenated elements as before.

Adds a `listview_zip` divan benchmark across fragmented/block/sparse/dense
masks for nullable and non-nullable inputs.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs force-pushed the claude/wizardly-carson-6Cixf branch from 8ea7fb9 to f05b6a0 Compare June 5, 2026 15:32
@joseph-isaacs joseph-isaacs changed the title perf: branchless mask-select for primitive and listview zip perf: branchless mask-select for listview zip Jun 5, 2026
Reduce the divan bench to one Fragmented (alternating) mask with non-nullable
and nullable inputs, and lower LEN to 8192 so each case stays well under a few
hundred microseconds. The branchless chunked select is mask-shape-independent,
so a single shape suffices; drop the now-unused MaskShape matrix.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs requested a review from a team June 5, 2026 18:17
CodSpeed's instruction-count simulation runs ~10x local walltime, putting the
8192-list bench at ~550us there. Drop to 4096 lists so each case stays well
under a few hundred microseconds in CI while still exercising the select.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants