fix: move Release and delete out of BlockingCall callback to prevent SIGBUS on macOS#21629

Closed
ludamad wants to merge 1 commit into next from fix/threaded-async-op-sigbus-v4

Conversation

@ludamad

@ludamad ludamad commented Mar 16, 2026

Collaborator

Fixes SIGBUS crash on macOS in ThreadedAsyncOperation (#21138). Also targeting next via #21625.

Release() and delete op were inside the BlockingCall callback, which runs on the JS thread while BlockingCall is still blocked on the worker thread. Release() tears down TSFN internals (mutex/condvar) that BlockingCall needs to unwind, and delete destroys the member entirely. macOS unmaps freed pages aggressively → SIGBUS. Linux → silent use-after-free / segfault.

Fix: move both Release() and delete this to after BlockingCall returns on the worker thread.

Full post mortem with diagrams

@ludamad ludamad added the ci-barretenberg Run all barretenberg/cpp checks. label Mar 16, 2026
@ludamad ludamad force-pushed the backport-to-v4-staging branch from 9221e1a to be9b609 Compare March 17, 2026 18:16

@dbanks12 dbanks12 left a comment

Contributor


This diff is large (needs rebase?), but the change to async_op.hpp lgtm

…SIGBUS on macOS

Release() and delete were called inside the BlockingCall callback, which runs
on the JS thread while BlockingCall is still blocked on the worker thread.
Release() drops the TSFN refcount to 0, tearing down internal state that
BlockingCall needs to unwind. delete destroys the TSFN member entirely.
Both cause use-after-free when BlockingCall returns on the worker thread.

macOS magazine malloc unmaps freed pages aggressively → SIGBUS.

Fix: move both Release() and delete to after BlockingCall returns on the
worker thread, where they can execute safely.
@ludamad ludamad force-pushed the fix/threaded-async-op-sigbus-v4 branch from bbdb7f7 to 91fab20 Compare March 17, 2026 18:57
@ludamad ludamad changed the base branch from backport-to-v4-staging to next March 17, 2026 18:57
@ludamad ludamad removed the ci-barretenberg Run all barretenberg/cpp checks. label Mar 17, 2026
@dbanks12 dbanks12 marked this pull request as draft May 19, 2026 16:45
@dbanks12

Contributor

Not closing quite yet, but marking as draft as we merged #21625 and claude thinks 21629 is still "racy"

@ludamad ludamad added the claudebox Owned by claudebox. it can push to this PR. label May 20, 2026
@AztecBot

Collaborator

This issue was automatically closed because it was referenced in PR #23469 which has been merged to the default branch.


@AztecBot AztecBot closed this May 22, 2026
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
…AztecProtocol#23469)

## Summary

`aztec start --local-network` reliably SIGBUSes a few blocks into a run
on macOS arm64 (since `v5.0.0-nightly.20260520`, i.e. after AztecProtocol#21625
shipped the `shared_ptr` use-after-free fix). This is a **different**
fault from the one AztecProtocol#21625 fixed: a stack-guard violation (stack
overflow) on a `nodejs_module.node` worker thread running AVM-simulation
code, not a use-after-free.

This pins an explicit, generous stack size on the
`ThreadedAsyncOperation` worker thread.

## Root cause

`ThreadedAsyncOperation::Queue()` (introduced in AztecProtocol#21138) runs the AVM
simulation (`_fn`) directly on a bare `std::thread(...).detach()`. A
`std::thread` uses the OS default stack for non-main threads, which is
**512 KB on macOS** versus **8 MB on Linux**. The AVM-simulation call
chain is deep enough to overflow 512 KB, so on macOS arm64 the worker
writes into its stack-guard page and the process aborts with:

```
EXC_BAD_ACCESS / SIGBUS, KERN_PROTECTION_FAILURE
"Could not determine thread index for stack guard region"
  #0 _platform_memmove
  #1.. nodejs_module.node  bb::nodejs (AVM simulation path)
```

Linux is unaffected because its 8 MB default is comfortably large. The
previous `AsyncOperation` path never hit this either: it ran on the
libuv threadpool, whose threads are sized from `RLIMIT_STACK` (8 MB soft
on macOS), not the 512 KB raw-thread default.
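
To check the default-stack claim on a given machine, a small probe along these lines may help. This is only a sketch: `default_thread_stack_size` is a hypothetical helper name, not something in the codebase, and it reports whatever the platform's default pthread attributes say rather than asserting a specific value.

```cpp
#include <pthread.h>

#include <cstddef>

// Hypothetical helper (not part of the PR): returns the stack size that a
// thread created with default pthread attributes would receive, or 0 if the
// attributes cannot be queried.
std::size_t default_thread_stack_size()
{
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) {
        return 0;
    }
    std::size_t size = 0;
    if (pthread_attr_getstacksize(&attr, &size) != 0) {
        size = 0;
    }
    pthread_attr_destroy(&attr);
    return size;
}
```

Printing the result on macOS and on Linux should make the gap described above visible, though the exact numbers depend on system configuration.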

## Fix

`std::thread` can't set a stack size, so launch the worker via
`pthreads` with `pthread_attr_setstacksize` pinned to a generous
`WORKER_STACK_SIZE` (32 MB — 4× the 8 MB that the libuv path proved
sufficient, with headroom for deeper future call chains). Falls back to
a default-stack `std::thread` only if pthreads is unavailable (`_WIN32`)
or `pthread_create` fails.

The shared_ptr lifetime model from AztecProtocol#21625 is preserved exactly — both
the worker lambda and the `BlockingCall` completion callback still
capture `self`, so this does not reintroduce the use-after-free. Only
the thread-launch mechanism changed.
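
For readers unfamiliar with that lifetime model, here is a toy illustration using only the standard library. It is not the real `ThreadedAsyncOperation`: `ToyOperation`, `Started`, and the promise standing in for the callback queued to the JS thread are all assumptions made for the sketch.

```cpp
#include <atomic>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Toy stand-in for the operation object; counts destructions so the
// lifetime can be observed from outside.
class ToyOperation : public std::enable_shared_from_this<ToyOperation> {
  public:
    struct Started {
        std::thread worker;                           // joined by the caller in this sketch
        std::future<std::function<int()>> completion; // stand-in for the queued completion callback
    };

    explicit ToyOperation(std::atomic<int>& destroyed)
        : destroyed_(destroyed)
    {}
    ~ToyOperation() { destroyed_.fetch_add(1); }

    Started start()
    {
        std::promise<std::function<int()>> posted;
        auto completion = posted.get_future();
        auto self = shared_from_this();
        // The worker lambda captures `self`, so the object outlives the work.
        std::thread worker([self, posted = std::move(posted)]() mutable {
            self->result_ = 42; // placeholder for the real computation
            // The completion callback captures `self` too, so the object
            // also outlives the hand-off to the other thread.
            posted.set_value([self] { return self->result_; });
            self.reset(); // the worker's reference is no longer needed
        });
        return Started{ std::move(worker), std::move(completion) };
    }

  private:
    int result_ = 0;
    std::atomic<int>& destroyed_;
};
```

In this toy, the object is destroyed only once the caller, the worker, and the completion callback have all dropped their references, so neither side can observe a freed object regardless of which finishes first.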

## Testing

- The full bb build is too heavy to run in this session, so this is
**not yet a local end-to-end repro/fix verification** — it relies on CI
for compilation and on a macOS arm64 `aztec start --local-network` run
to confirm the crash is gone.
- The pthread/`std::function` trampoline was compiled and run standalone
under `-std=c++20 -Wall -Wextra -Werror`: the worker thread receives a
32 MB stack (`pthread_get_stacksize_np` reports `33554432`), and the
work runs and completes.
- **Requested:** verify against tonight's nightly on macOS arm64 (M3) —
the reporter's exact repro.

## Notes for reviewers

- Targets `next` (not `merge-train/barretenberg`) to match AztecProtocol#21625's base
and to make the nightly, since this is an urgent release-affecting
crash. Happy to retarget if you'd prefer it go through the merge train.
- 32 MB is a deliberate over-provision; if you'd rather mirror the libuv
path precisely we could instead size from `getrlimit(RLIMIT_STACK)`. The
fixed constant is simpler and the virtual reservation only commits pages
as touched.
- The longer-term fix is the NAPI→IPC migration (AztecProtocol#21331 / AztecProtocol#23196 /
AztecProtocol#23238), which removes this in-process worker entirely. This is a
targeted stop-gap for the shipping NAPI path.
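
If the `getrlimit(RLIMIT_STACK)` alternative from the notes above were preferred, the sizing could look roughly like this. `stack_size_from_rlimit` is a hypothetical helper, and treating an unlimited or zero soft limit as "use the fallback" is an assumption of the sketch, not a description of what libuv does.

```cpp
#include <sys/resource.h>

#include <cstddef>

// Hypothetical helper: derives a worker stack size from the soft
// RLIMIT_STACK limit, returning `fallback` when the limit cannot be read
// or is not a usable finite value.
std::size_t stack_size_from_rlimit(std::size_t fallback)
{
    struct rlimit lim {};
    if (getrlimit(RLIMIT_STACK, &lim) != 0) {
        return fallback;
    }
    if (lim.rlim_cur == RLIM_INFINITY || lim.rlim_cur == 0) {
        return fallback;
    }
    return static_cast<std::size_t>(lim.rlim_cur);
}
```

The result could then be passed to `pthread_attr_setstacksize` in place of the fixed constant, at the cost of making the worker's stack depend on the user's shell limits.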

Related: AztecProtocol#21138 (introduced the threaded model), AztecProtocol#21625 (use-after-free
fix), AztecProtocol#21629 (open alternative).

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/4bd36dc505c20254) ·
group: `slackbot`*