Skip to content

perf: Replace HashSet with generational vec#261

Open
michael-weigelt wants to merge 2 commits into
bytecodealliance:mainfrom
michael-weigelt:mwe/conflict_set
Open

perf: Replace HashSet with generational vec#261
michael-weigelt wants to merge 2 commits into
bytecodealliance:mainfrom
michael-weigelt:mwe/conflict_set

Conversation

@michael-weigelt
Copy link
Copy Markdown

@michael-weigelt michael-weigelt commented Jun 5, 2026

When compiling certain types of functions (e.g. with many locals), a lot of time is spent in try_to_allocate_bundle_to_reg, in particular on HashSet::insert. This data structure is only used for deduplicating and the cost is quite high (hashing, allocations, bad cache properties).

Claude suggested to replace it with a generational vec, and the results seem significant (in my ad-hoc benchmark):
Compiling a Wasm function with X locals before and after this patch:

1000 locals: 780ms compilation -> 363ms   -53%
10k  locals:  66s compilation ->  23s     -65%
40k  locals: 952s compilation ->  356s    -62%

Claude argues that the non-degenerate cases are unaffected or improved by this change. Are there any benchmarks I could run to confirm this?

Note that I am compiling without optimizations which is currently necessary for my usecase.

@michael-weigelt
Copy link
Copy Markdown
Author

@cfallin I was told you might want to have a look at this.

@michael-weigelt
Copy link
Copy Markdown
Author

This is a small and (maybe too) focused improvement in the direction that is also discussed here: #87

@cfallin
Copy link
Copy Markdown
Member

cfallin commented Jun 5, 2026

@michael-weigelt thanks for this. It's an interesting optimization. Could you build a Wasmtime engine (wasmtime-bench-api target) with this version of regalloc2 and with baseline, and test both with Sightglass over the full suite?

Those speedups are frankly fantastic if real but I'm also more than a little skeptical about purported 2-3x speedups; I've done a lot of profiling of regalloc2 and, at least last time I did this (a few years ago now), didn't see major hotspots outside of the expected ones (e.g. the main allocation maps). That said, I welcome benchmark results that say otherwise if reliable!

Data-structure-wise, I was initially skeptical in reading the patch about the dense representation but I see that it's a reused set so we will only allocate one per function compilation -- so it's O(n) time and space cost overall, which is acceptable. It's still worse than the O(k) cost for k << n (very few conflicts), so I do want a careful evaluation over the suite.

Side-note: in BA we now have the AI tools policy, and while you are welcome to use tools such as Claude privately (many of us do for various purposes), you are fully and solely responsible for interactions and discussions here, so instead of "Claude said [thing you're unsure about]", please speak only about things you've personally understood/vetted and don't relay raw agent thoughts or output. In this case that vetting includes running the benchmarks, and carefully reasoning about the algorithmic tradeoff as humans ourselves. Thanks!

@cfallin
Copy link
Copy Markdown
Member

cfallin commented Jun 5, 2026

Also: I would want to see that Wasmtime+Cranelift pass all tests wth the modified regalloc2, just to make sure we're not missing something wrt changed output (which could also explain huge compile-time wins if e.g. we suddenly treat conflicts differently and allocate less optimally or whatever). The algorithm looks correct but let's verify.

@michael-weigelt
Copy link
Copy Markdown
Author

Thanks for the pointers, I'll run those benchmarks. I don't expect the speedups of the general cases to be this impressive, but if they are at least not worse, I am happy.

you are fully and solely responsible for interactions and discussions here, so instead of "Claude said [thing you're unsure about]", please speak only about things you've personally understood/vetted

Understood and agreed. The whole point of that sentence was: I want to vet this. For the record, because I am a new contributor here: The only AI contributions in this PR are the brainstorming on how to address the problem and code. I would never relay slop to a human or let it talk/write prose for me.

/// Empty the set. O(1) except on the rare generation wraparound.
#[inline]
pub fn clear(&mut self) {
self.generation = self.generation.wrapping_add(1);
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could instead use a u64 and ignore the risk of wrap around.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely; with existing limits on function size it's actually not possible to have more than u64::MAX live bundles (nevermind the implied execution time that would require).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants