perf: branchless mask-select for listview zip by joseph-isaacs · Pull Request #8264 · vortex-data/vortex

joseph-isaacs · 2026-06-05T10:27:52Z

Summary

Replaces the per-element, data-dependent branch in the listview zip kernel's offset/size selection with a branchless, chunk-at-a-time mask select that the compiler can auto-vectorize.

For each 64-bit mask chunk, each bit is expanded to a full-width lane mask and both sides are blended with (t & m) | (f & !m) via a shared select_column helper, so the inner loop is branch-free regardless of mask shape. if_false offsets are shifted into the second half of the concatenated elements as before; sizes are taken verbatim from the chosen side.
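The blend described above can be sketched in a few lines. This is an illustration only, not the code in the PR: the signature of `select_column` here is hypothetical (the PR text does not show it), and the LSB-first bit order, the `u64` offset type, and the `false_shift` parameter are assumptions made for the example.

```rust
/// Hypothetical stand-in for the PR's `select_column` helper (real signature not shown in the PR).
/// Bit `i` of the mask (assumed LSB-first within each 64-bit chunk) picks `if_true[i]` when set,
/// otherwise `if_false[i] + false_shift`. Pass `false_shift = 0` for a column taken verbatim.
fn select_column(
    mask_chunks: &[u64],
    if_true: &[u64],
    if_false: &[u64],
    false_shift: u64,
) -> Vec<u64> {
    assert_eq!(if_true.len(), if_false.len());
    let len = if_true.len();
    assert!(mask_chunks.len() * 64 >= len, "mask too short for the inputs");

    let mut out = Vec::with_capacity(len);
    for (chunk_idx, &bits) in mask_chunks.iter().enumerate() {
        let start = chunk_idx * 64;
        if start >= len {
            break;
        }
        // The last chunk may cover fewer than 64 elements (the remainder).
        let end = (start + 64).min(len);
        let lanes = if_true[start..end].iter().zip(&if_false[start..end]);
        for (bit, (&t, &f)) in lanes.enumerate() {
            // Expand one mask bit to a full-width lane mask:
            // 0 - 1 wraps to all ones, 0 - 0 stays all zeros.
            let m = 0u64.wrapping_sub((bits >> bit) & 1);
            let f = f.wrapping_add(false_shift);
            // Branch-free blend of the two sides.
            out.push((t & m) | (f & !m));
        }
    }
    out
}
```

Under these assumptions, offsets would be selected with `false_shift` set to the length of the `if_true` elements (so `if_false` offsets land in the second half of the concatenation), and sizes with `false_shift = 0`.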

The primitive-array zip kernel that was previously bundled here has been split into its own PR.

Changes

vortex-array/src/arrays/listview/compute/zip.rs: branchless chunked offset/size select via select_column; added a test that spans the 64-bit chunk boundary + remainder.
vortex-array/benches/listview_zip.rs: a small divan bench — one Fragmented (alternating) mask with non-nullable and nullable inputs.

Benchmark

Because the branchless select is mask-shape-independent, the bench uses a single Fragmented mask (the worst case for the replaced per-element branch) with both nullability variants. LEN is 8,192: listview zip cost is dominated by element concatenation and per-list canonicalization, so a few thousand lists already exercise the select while keeping each case well under a few hundred microseconds (~37 µs non-null, ~41 µs nullable locally).
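For readers unfamiliar with the mask shape, the following sketch shows one way an alternating mask could be packed into 64-bit chunks, and why it is hard on a per-element branch: the selected side changes at every position. The helper names `fragmented_mask` and `count_transitions` are hypothetical and the LSB-first packing is an assumption; the bench in the PR may build its mask differently.

```rust
/// Packs an alternating (true, false, true, ...) mask of `len` bits into
/// 64-bit chunks, assuming LSB-first bit order. Hypothetical helper.
fn fragmented_mask(len: usize) -> Vec<u64> {
    let mut chunks = vec![0u64; len.div_ceil(64)];
    for i in (0..len).step_by(2) {
        chunks[i / 64] |= 1u64 << (i % 64);
    }
    chunks
}

/// Counts positions where the mask bit differs from the previous bit.
/// A per-element branch changes direction at each such position.
fn count_transitions(chunks: &[u64], len: usize) -> usize {
    let bit = |i: usize| (chunks[i / 64] >> (i % 64)) & 1;
    (1..len).filter(|&i| bit(i) != bit(i - 1)).count()
}
```

With `len = 8192` this gives 128 full chunks and a direction change at every one of the 8,191 element boundaries, whereas a block-shaped mask of the same length would change direction only once.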

The kernel change itself yields a single-digit to low-double-digit percent win on the offset/size select. End-to-end listview zip is dominated by element concatenation and canonicalization, so the select is a small slice of total cost.

Testing

vortex-array listview zip tests pass (incl. the new multi-chunk-span test); cargo +nightly fmt and clippy -D warnings (default + all-features) clean.

https://claude.ai/code/session_01N5ivPiCJy7dGQjMEP7ips9

codspeed-hq · 2026-06-05T10:38:56Z

Merging this PR will degrade performance by 10.64%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 3 regressed benchmarks
✅ 1509 untouched benchmarks
🆕 2 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | `compare[15]` | 119.9 µs | 145.6 µs | -17.7% |
| ❌ | Simulation | `compare[14]` | 117.5 µs | 141.3 µs | -16.88% |
| ❌ | Simulation | `compare[13]` | 115.5 µs | 137.6 µs | -16.05% |
| ⚡ | Simulation | `compare[5]` | 76.9 µs | 69.3 µs | +11.04% |
| 🆕 | Simulation | `nonnull` | N/A | 301.4 µs | N/A |
| 🆕 | Simulation | `nullable` | N/A | 322.2 µs | N/A |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing claude/wizardly-carson-6Cixf (a9516cf) with develop (e06d80b)

Replace the per-element, data-dependent branch in the listview zip kernel's offset/size selection with a branchless, chunk-at-a-time mask select that the compiler can auto-vectorize. For each 64-bit mask chunk, each bit is expanded to a full-width lane mask and both sides are blended with `(t & m) | (f & !m)` via a shared `select_column` helper, so the inner loop is branch-free regardless of mask shape. `if_false` offsets are shifted into the second half of the concatenated elements as before. Adds a `listview_zip` divan benchmark across fragmented/block/sparse/dense masks for nullable and non-nullable inputs.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Reduce the divan bench to one Fragmented (alternating) mask with non-nullable and nullable inputs, and lower LEN to 8192 so each case stays well under a few hundred microseconds. The branchless chunked select is mask-shape-independent, so a single shape suffices; drop the now-unused MaskShape matrix.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

CodSpeed's instruction-count simulation runs ~10x local walltime, putting the 8192-list bench at ~550us there. Drop to 4096 lists so each case stays well under a few hundred microseconds in CI while still exercising the select.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

joseph-isaacs added the changelog/performance label (A performance improvement) Jun 5, 2026, with Claude

joseph-isaacs marked this pull request as draft June 5, 2026 10:43

joseph-isaacs force-pushed the claude/wizardly-carson-6Cixf branch from d11d3b8 to ddeb10d Compare June 5, 2026 14:33

joseph-isaacs changed the title ~~Add branchless zip kernel for primitive arrays~~ perf: branchless mask-select for primitive and listview zip Jun 5, 2026

joseph-isaacs marked this pull request as ready for review June 5, 2026 14:34

joseph-isaacs force-pushed the claude/wizardly-carson-6Cixf branch from ddeb10d to 8ea7fb9 Compare June 5, 2026 14:47

joseph-isaacs force-pushed the claude/wizardly-carson-6Cixf branch from 8ea7fb9 to f05b6a0 Compare June 5, 2026 15:32

joseph-isaacs changed the title ~~perf: branchless mask-select for primitive and listview zip~~ perf: branchless mask-select for listview zip Jun 5, 2026

joseph-isaacs mentioned this pull request Jun 5, 2026

perf: branchless primitive zip kernel #8270

Open

joseph-isaacs requested a review from a team June 5, 2026 18:17

joseph-isaacs wants to merge 3 commits into develop from claude/wizardly-carson-6Cixf
