fix: move Release and delete out of BlockingCall callback to prevent SIGBUS on macOS #21629
Closed
ludamad wants to merge 1 commit into
9221e1a to
be9b609
dbanks12
approved these changes
Mar 17, 2026
dbanks12
left a comment
Contributor
This diff is large (needs rebase?), but the change to async_op.hpp lgtm
…SIGBUS on macOS

Release() and delete were called inside the BlockingCall callback, which runs on the JS thread while BlockingCall is still blocked on the worker thread. Release() drops the TSFN refcount to 0, tearing down internal state that BlockingCall needs to unwind. delete destroys the TSFN member entirely. Both cause use-after-free when BlockingCall returns on the worker thread. macOS magazine malloc unmaps freed pages aggressively → SIGBUS.

Fix: move both Release() and delete to after BlockingCall returns on the worker thread, where they can execute safely.
bbdb7f7 to
91fab20
charlielye
approved these changes
Mar 17, 2026
Contributor
Not closing quite yet, but marking as draft as we merged #21625 and claude thinks 21629 is still "racy"
Collaborator
This issue was automatically closed because it was referenced in PR #23469 which has been merged to the default branch.
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
…AztecProtocol#23469)

## Summary

`aztec start --local-network` reliably SIGBUSes a few blocks into a run on macOS arm64 (since `v5.0.0-nightly.20260520`, i.e. after AztecProtocol#21625 shipped the `shared_ptr` use-after-free fix). This is a **different** fault from the one AztecProtocol#21625 fixed: a stack-guard violation (stack overflow) on a `nodejs_module.node` worker thread running AVM-simulation code, not a use-after-free.

This pins an explicit, generous stack size on the `ThreadedAsyncOperation` worker thread.

## Root cause

`ThreadedAsyncOperation::Queue()` (introduced in AztecProtocol#21138) runs the AVM simulation (`_fn`) directly on a bare `std::thread(...).detach()`. A `std::thread` uses the OS default stack for non-main threads, which is **512 KB on macOS** versus **8 MB on Linux**. The AVM-simulation call chain is deep enough to overflow 512 KB, so on macOS arm64 the worker writes into its stack-guard page and the process aborts with:

```
EXC_BAD_ACCESS / SIGBUS, KERN_PROTECTION_FAILURE
"Could not determine thread index for stack guard region"
#0  _platform_memmove
#1.. nodejs_module.node  bb::nodejs (AVM simulation path)
```

Linux is unaffected because its 8 MB default is comfortably large. The previous `AsyncOperation` path never hit this either: it ran on the libuv threadpool, whose threads are sized from `RLIMIT_STACK` (8 MB soft on macOS), not the 512 KB raw-thread default.

## Fix

`std::thread` can't set a stack size, so launch the worker via `pthreads` with `pthread_attr_setstacksize` pinned to a generous `WORKER_STACK_SIZE` (32 MB, i.e. 4× the 8 MB that the libuv path proved sufficient, with headroom for deeper future call chains). Falls back to a default-stack `std::thread` only if pthreads is unavailable (`_WIN32`) or `pthread_create` fails.

The shared_ptr lifetime model from AztecProtocol#21625 is preserved exactly: both the worker lambda and the `BlockingCall` completion callback still capture `self`, so this does not reintroduce the use-after-free. Only the thread-launch mechanism changed.

## Testing

- The full bb build is too heavy to run in this session, so this is **not yet a local end-to-end repro/fix verification**: it relies on CI for compilation and on a macOS arm64 `aztec start --local-network` run to confirm the crash is gone.
- The pthread/`std::function` trampoline was compiled and run standalone under `-std=c++20 -Wall -Wextra -Werror`: the worker thread receives a 32 MB stack (`pthread_get_stacksize_np` reports `33554432`), and the work runs and completes.
- **Requested:** verify against tonight's nightly on macOS arm64 (M3), the reporter's exact repro.

## Notes for reviewers

- Targets `next` (not `merge-train/barretenberg`) to match AztecProtocol#21625's base and to make the nightly, since this is an urgent release-affecting crash. Happy to retarget if you'd prefer it go through the merge train.
- 32 MB is a deliberate over-provision; if you'd rather mirror the libuv path precisely we could instead size from `getrlimit(RLIMIT_STACK)`. The fixed constant is simpler and the virtual reservation only commits pages as touched.
- The longer-term fix is the NAPI→IPC migration (AztecProtocol#21331 / AztecProtocol#23196 / AztecProtocol#23238), which removes this in-process worker entirely. This is a targeted stop-gap for the shipping NAPI path.

Related: AztecProtocol#21138 (introduced the threaded model), AztecProtocol#21625 (use-after-free fix), AztecProtocol#21629 (open alternative).

---

*Created by [claudebox](https://claudebox.work/v2/sessions/4bd36dc505c20254) · group: `slackbot`*
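For readers who want to see the shape of the launch mechanism described in the Fix section, here is a minimal standalone sketch. It is not the code from the commit. `WORKER_STACK_SIZE` and its 32 MB value come from the message above, while `worker_trampoline`, `launch_detached_worker` and `run_and_wait` are hypothetical names chosen for illustration. It assumes a POSIX platform for the pthread path and falls back to a default-stack `std::thread` otherwise.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>
#if !defined(_WIN32)
#include <pthread.h>
#endif

// 32 MB, the figure quoted in the commit message.
constexpr std::size_t WORKER_STACK_SIZE = 32UL * 1024 * 1024;

#if !defined(_WIN32)
// pthread entry point (hypothetical name): takes ownership of the boxed
// std::function, runs it, and frees it when the unique_ptr goes out of scope.
extern "C" void* worker_trampoline(void* arg)
{
    std::unique_ptr<std::function<void()>> work(static_cast<std::function<void()>*>(arg));
    (*work)();
    return nullptr;
}
#endif

// Launches `work` on a detached thread. Returns true if the explicit-stack
// pthread path was used, false if it fell back to a default-stack std::thread.
bool launch_detached_worker(std::function<void()> work)
{
#if !defined(_WIN32)
    auto boxed = std::make_unique<std::function<void()>>(std::move(work));
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) == 0) {
        pthread_t tid;
        const bool ok = pthread_attr_setstacksize(&attr, WORKER_STACK_SIZE) == 0 &&
                        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED) == 0 &&
                        pthread_create(&tid, &attr, worker_trampoline, boxed.get()) == 0;
        pthread_attr_destroy(&attr);
        if (ok) {
            boxed.release(); // the trampoline owns the allocation now
            return true;
        }
    }
    work = std::move(*boxed); // pthread path failed: recover the callable
#endif
    std::thread(std::move(work)).detach();
    return false;
}

// Demonstration helper (hypothetical): runs `work` on a worker and blocks
// until it has finished. The promise is shared so it outlives both threads.
bool run_and_wait(std::function<void()> work)
{
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();
    const bool used_pthread = launch_detached_worker([done, work = std::move(work)] {
        work();
        done->set_value();
    });
    finished.wait();
    return used_pthread;
}
```

Calling `run_and_wait([] { /* deep call chain here */ })` runs the callable on the large-stack worker where pthreads is available. On macOS the resulting size could be checked from inside the worker with `pthread_get_stacksize_np(pthread_self())`, as the Testing section mentions.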
Fixes SIGBUS crash on macOS in `ThreadedAsyncOperation` (#21138). Also targeting next via #21625.

`Release()` and `delete op` were inside the `BlockingCall` callback, which runs on the JS thread while `BlockingCall` is still blocked on the worker thread. `Release()` tears down TSFN internals (mutex/condvar) that `BlockingCall` needs to unwind, and `delete` destroys the member entirely. macOS unmaps freed pages aggressively → SIGBUS. Linux → silent use-after-free / segfault.

Fix: move both `Release()` and `delete this` to after `BlockingCall` returns on the worker thread.

Full post mortem with diagrams
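The thread above notes that this ordering-based fix was considered possibly still racy, and that #21625 merged a `shared_ptr` lifetime model in which both the worker lambda and the completion callback capture `self`. The sketch below illustrates that ownership pattern in plain C++ so it can be compiled without Node. It is an assumption-laden toy, not the code from either PR: `ToyLoop` stands in for the JS thread's callback queue (it is not the N-API TSFN), and `ToyOperation`, `post`, `run_one` and `queue` are hypothetical names.

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the JS thread's callback queue.
class ToyLoop {
  public:
    void post(std::function<void()> cb)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(cb));
        }
        ready_.notify_one();
    }

    // Blocks until one callback is queued, then runs it on the calling thread.
    void run_one()
    {
        std::function<void()> cb;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            ready_.wait(lock, [this] { return !queue_.empty(); });
            cb = std::move(queue_.front());
            queue_.pop_front();
        }
        cb();
    }

  private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::function<void()>> queue_;
};

// Toy operation whose lifetime is shared by the worker and the callback.
class ToyOperation : public std::enable_shared_from_this<ToyOperation> {
  public:
    ToyOperation(ToyLoop& loop, std::atomic<int>& destroyed)
        : loop_(loop), destroyed_(destroyed) {}
    ~ToyOperation() { destroyed_.fetch_add(1); }

    // Runs `work` on a worker thread, then hands `on_done(result)` to the loop.
    // The thread is returned (not detached) only so the demo can join it.
    std::thread queue(std::function<int()> work, std::function<void(int)> on_done)
    {
        auto self = shared_from_this();
        return std::thread([self, work = std::move(work), on_done = std::move(on_done)] {
            self->result_ = work();
            // The callback holds its own reference, so it may run and finish
            // on the loop thread at any point after post().
            self->loop_.post([self, on_done] { on_done(self->result_); });
            // The worker still holds `self`, so touching members here is safe.
            self->posted_.store(true);
        });
    }

  private:
    ToyLoop& loop_;
    std::atomic<int>& destroyed_;
    std::atomic<bool> posted_{false};
    int result_ = 0;
};
```

With this shape neither thread deletes the object explicitly. It is destroyed exactly once, by whichever of the worker or the callback drops the last reference, so the question of which side calls `delete` does not arise.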