Skip to content

feat: merge-train/spartan-v5#24196

Open
AztecBot wants to merge 16 commits into
v5-nextfrom
merge-train/spartan-v5
Open

feat: merge-train/spartan-v5#24196
AztecBot wants to merge 16 commits into
v5-nextfrom
merge-train/spartan-v5

Conversation

@AztecBot

@AztecBot AztecBot commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

BEGIN_COMMIT_OVERRIDE
fix(p2p): re-seed discovery from persisted peer ENRs after restart (#24169)
docs(e2e): annotate e2e tests with setup/category notes (#24191)
chore: merge v5-next into merge-train/spartan-v5 (raw, conflict markers) (#24221)
fix(archiver): index zero-field logs under empty tag instead of throwing (A-1253) (#24212)
fix(prover-node): report awaiting-root and publishing-proof phases in EpochSession (A-1212) (#24216)
fix(p2p): bound declared contract-class bytecode length before allocating (A-1258) (#24213)
fix(p2p): frame gossipsub msgId and restrict allowedTopics (A-1256) (#24214)
chore: merge v5-next into merge-train/spartan-v5 (#24226)
fix(p2p): guard ENR address parsing against malformed TCP fields (A-1255) (#24215)
END_COMMIT_OVERRIDE

…24169)

## Problem

A node that restarts after its bootstrap nodes are gone cannot rejoin
the network on its own, even though it has a list of peers persisted on
disk.

discv5 builds its routing table only from **bootstrap** and **trusted**
ENRs (`discV5_service.ts` `start()`), and keeps that table **in memory**
— it is never persisted. The peer-manager dial queue is fed exclusively
by discv5 discovery events, and libp2p auto-dial is disabled
(`minConnections: 0`). So after a restart with no reachable bootstrap
node, a node's DHT starts empty and the only way it reconnects is via
**inbound** dials from peers that still hold its ENR. That is unreliable
and fails entirely on a full-network restart.

This is the root cause of the flaky `e2e_p2p_rediscovery` failure (see
A-1244): the test only passed because it restarted nodes one at a time,
so still-running neighbours dialed the restarted node back. The last
nodes to restart often ended up below quorum (`discv5KadPeers: 0`, stuck
at 2/3 peers).

## Fix

Persist discovered peer ENRs and re-seed discovery from them on startup:

- **Persist** valid, non-bootnode peer ENRs to the p2p peer store
(`onEnrAdded`), keyed by nodeId, bounded to `MAX_PERSISTED_PEER_ENRS`
with simple eviction.
- **Re-seed** on `start()`, alongside the existing bootstrap/trusted-ENR
seeding: load persisted ENRs and `discv5.addEnr(...)` each (dropping
stale/incompatible ones). Re-adding an ENR flows through the existing
`onDiscovered` → `DISCOVERED` → `PeerManager` dial path, so no new dial
logic is required.

ENRs carry both UDP (discv5) and TCP (libp2p) addresses, so this
restores both discovery and dialability without a bootstrap node. The
persisted store is the same `peerStore` libp2p already uses; the
discovery service just opens its own map on it.

## Test

`e2e_p2p_rediscovery` now exercises the real capability:

- Wait for a full mesh before teardown, so every node has persisted all
its peers (replaces a blind `sleep`).
- Stop the bootstrap node and **all** validators, then restart them
**all** — a full-network bounce — instead of the rolling
stop-one/restart-one loop. With no live peers to dial the nodes back,
the mesh can only rebuild by re-seeding from persisted ENRs.

Existing discv5 unit tests pass unchanged. (The pre-existing `should
persist peers without bootnode` unit test remains skipped — it can't be
made faithful without spawning fresh node processes; the e2e test is the
integration coverage.)

Fixes A-1244.
Comment-only annotations across the end-to-end test suite, produced by
the e2e consolidation survey. Every test file under
`yarn-project/end-to-end/src/` gets a short prose note above its
`describe`/`it` describing how it sets up its environment — sequencer
type, node topology, prover, cross-chain, time-warp — and the category
it maps to in the proposed consolidation, plus inline `// REFACTOR:`
markers where a follow-up is expected.
@PhilWindle PhilWindle enabled auto-merge June 19, 2026 14:36
PhilWindle and others added 3 commits June 20, 2026 17:04
…ting (A-1258)

DataTxValidator decodes contract-class publication logs before proof
validation. bufferFromFields trusted the declared byte-length field and
called Buffer.alloc(byteLength) even when it far exceeded the payload
actually present, so a malformed class-publication tx could force
attacker-chosen multi-MiB allocations on every node that validates the
gossiped tx (req/resp, block proposal, RPC, gossip) before the proof
validator rejects it.

Add an optional maxByteLength guard to bufferFromFields (preserving the
documented blob-reconstruction padding for callers that don't pass it);
ContractClassPublishedEvent.fromLog passes the fixed log's payload
capacity, so an over-declared length throws before allocation.
DataTxValidator already wraps fromLog in try/catch, so the throw becomes
a clean pre-proof rejection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The gossipsub msgId was SHA256(topic || data) with no framing between the
topic string and the message bytes, and no allowedTopics was set. A peer
could publish on an unsubscribed topic T+data[0] with data[1:], whose
(topic, data) concatenation is byte-identical to a real message on topic
T, hashing to the same msgId. gossipsub transformed it and inserted the
id into seenCache before the subscription check, so the genuine
proposal/attestation was later dropped as a duplicate, suppressing a
time-sensitive consensus message.

- Frame the topic length into the msgId input (uint32be(topicLen) ||
  topic || data) so the (topic, data) boundary is unambiguous and a
  boundary-shifted pair no longer collides.
- Set exact allowedTopics on the gossipsub config so an unsubscribed-topic
  message is rejected before transform / msgId / seenCache insertion.

Defense in depth: either fix alone breaks the attack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…255)

A discovery peer could advertise a signed, same-version, UDP-contactable
ENR with a malformed TCP field (e.g. 3 bytes instead of a 2-byte port).
validateEnr only checked the aztec key + version, so discv5 emitted
enrAdded; the unguarded async onEnrAdded listener then awaited
getFullMultiaddr('tcp'), which threw RangeError, and as an async listener
the rejection was unhandled and crashed the node.

Audit of every getFullMultiaddr/getLocationMultiaddr site in p2p found
the same pattern on the bootnode's discovered listener (crashes the
bootnode) and config-provided preferredPeers parsing (aborts setup):

- Guard onEnrAdded and the bootstrap discovered listener so a parser
  exception is caught, logged and the ENR dropped (no process exit).
- validateEnr rejects an ENR whose TCP multiaddr won't parse, so it's
  never emitted as a full peer / re-parsed downstream. A missing TCP addr
  stays allowed (those peers are skipped at dial time).
- preferredPeers ENR parsing (peer_manager, libp2p_service) skips and logs
  a malformed configured ENR instead of aborting node setup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@PhilWindle PhilWindle disabled auto-merge June 21, 2026 12:54

@ludamad ludamad left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Auto-approved

@AztecBot AztecBot enabled auto-merge June 21, 2026 13:32
@AztecBot

Copy link
Copy Markdown
Collaborator Author

🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.

PhilWindle and others added 2 commits June 22, 2026 11:29
…258)

Defensive guard from PR review: an empty fields array would destructure
length to undefined and throw on length.toNumber(). A valid bufferAsFields
output always carries the leading byte-length field, so empty encodes no
data — return an empty buffer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rs) (#24221)

## Raw merge — do not merge alone

This is **PR 1 of a two-PR stack** that catches `merge-train/spartan-v5`
up to its base `v5-next` so the long-lived train PR
(`merge-train/spartan-v5` → `v5-next`) becomes conflict-free again.

This PR is the **raw merge of `v5-next` into `merge-train/spartan-v5`
with conflict markers committed** — it is red by design. Review and
merge it only together with the stacked resolution PR (PR 2, head
`cb/merge-train-spartan-v5-fix`), which targets this branch.

### Parents
- train: `merge-train/spartan-v5` @
`bc2c924919fe8880ee9572df779b78f653ef2e15`
- base: `v5-next` @ `4df72438bf01f125e791af7e8bb11a613964457e`

### Conflicted files (1)
- `yarn-project/end-to-end/src/e2e_nested_utility_calls.test.ts`

Resolution lives in the stacked fix PR.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/51d528a3438f2849) ·
group: `slackbot`*

---------

Co-authored-by: Mitchell Tracy <mitchellftracy@gmail.com>
Co-authored-by: Nicolas Chamo <nicolas@chamo.com.ar>
Co-authored-by: Santiago Palladino <santiago@aztec-labs.com>
Co-authored-by: just-mitch <68168980+just-mitch@users.noreply.github.com>
Co-authored-by: Nicolás Venturo <nicolas.venturo@gmail.com>
Co-authored-by: AztecBot <tech@aztecprotocol.com>
PhilWindle and others added 8 commits June 22, 2026 15:20
…ing (A-1253) (#24212)

Fixes A-1253.

## Problem

A protocol-valid public log can carry **zero fields** —
`PublicLog(nonzeroContractAddress, [])` — reachable by any tx that
bypasses the high-level aztec.nr helpers (the AVM `EMITPUBLICLOG` opcode
accepts `logSize = 0`). `LogStore` used `fields[0]` as the tag-index
key, so `fieldHex(undefined)` threw inside the `addLogs` transaction.
Because that runs inside `addProposedBlock`/`addCheckpoints`' larger
store transaction, the throw aborted the entire block/checkpoint update
— a single such tx in an accepted block stalls L1 sync for every node
that ingests it (the same valid checkpoint is retried forever).

## Fix (Fix A — archiver-only, lowest risk)

A log with no fields has no tag. Index it under the **empty tag**
(`tagHexForLog` returns `''` for a zero-field log):

- it stays retrievable via the per-block read (`getPublicLogsForBlock`)
and is dropped on reorg via the per-block secondary index, exactly like
any other log;
- the empty-tag key sorts before every real key for the contract, so a
real 64-hex-char tag query never matches it.

`PublicLog.fields` is a variable-length `Fr[]` (the actual crash
surface); `PrivateLog.fields` is a fixed padded tuple so its `fields[0]`
is always present, but the shared helper is applied to both loops for
consistency and is behaviour-preserving for private logs.

## Test

`log_store.test.ts`: constructs `new PublicLog(CONTRACT, [])`, asserts
`addLogs` resolves (previously threw), the log is returned by
`getPublicLogsForBlock`, and a tag query omits it.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(A-1256)

The gossipsub fast-path dedup cache keyed on fastMsgIdFn — a
non-cryptographic 64-bit xxhash of the raw data only (no topic), seeded
from Math.random (~2^30, non-CSPRNG). A fast-id collision (accidental, or
engineered given the weak seed, or simply same-data-on-a-different-topic)
makes gossipsub drop a different message as a duplicate with no fallback
to the full id, so it's a gossip-suppression vector that the framed full
msgId does not close (the fast path short-circuits before it).

fastMsgIdFn is optional; removing it leaves dedup resting solely on the
cryptographic, topic-framed msgIdFn (SHA-256). Also removes the now-unused
xxhash machinery from encoding.ts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No longer used after removing fastMsgIdFn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… EpochSession (A-1212) (#24216)

Fixes A-1212.

## Problem

`prover_getJobs` exposes each `EpochSession`'s state, but `EpochSession`
only ever assigned `initialized`, `awaiting-checkpoints`, and the
terminal states. The intermediate phases that exist to describe active
work were never set, so a session read `awaiting-checkpoints`
continuously while it was actually (a) proving the top tree or (b)
publishing to L1. Observed on staging-internal: every `startProof`
session sat in `awaiting-checkpoints` even with logs showing `Starting
top-tree prove…`. This is a regression vs the old `EpochProvingJob`,
which advanced through distinct phases.

## Fix

- Add a new **non-terminal** `awaiting-root` state to
`EpochProvingJobState` (the zod schema is derived from the same array,
so it updates automatically). Not added to
`EpochProvingJobTerminalState`.
- Set `awaiting-root` when the top-tree (root) prove begins, via the
`TopTreeJob` `beforeProve` hook — which fires once the sub-tree
(checkpoint block) proofs are ready and the root prove is about to
start. `toTopTreeHooks()` now always wires this transition and layers
any test-provided hooks on top (previously it returned `undefined` when
no test hooks were set).
- Set `publishing-proof` at the top of `submitProof()`, before the L1
submit.

Both transitions are guarded by `isTerminal()` so a concurrent
`cancel()` still wins (the post-submit `isTerminal()` check relies on
this).

Phase progression is now: `initialized → awaiting-checkpoints →
awaiting-root → publishing-proof → terminal`.

## Tests

- `epoch-session.test.ts`: new "state reporting" test asserts
`getState()` is `awaiting-root` during the prove and `publishing-proof`
during the submit, then `completed`. Existing hook-ordering test
(`before`/`prove`/`after`) still holds.
- `prover-node.test.ts`: the JSON-RPC schema round-trip mock now
includes an `awaiting-root` job.

Targets the v5 line — the
`EpochSession`/`TopTreeJob`/`awaiting-checkpoints` code (post-#23552)
exists only on `merge-train/spartan-v5`.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolves conflict in e2e_nested_utility_calls.test.ts (keeps the train's
doc comment above the 'denies...private function' test; the msg_sender
test from v5-next is already present). Real merge commit so v5-next
becomes an ancestor of the train.
…ting (A-1258) (#24213)

Fixes A-1258.

## Problem

`DataTxValidator` decodes contract-class publication logs **before**
proof validation (data validation runs ahead of proof validation on the
req/resp, block-proposal, RPC, and gossip paths). The decode trusts the
declared public-bytecode length field:

- `bufferFromFields` (`stdlib/src/abi/buffer.ts`) reads the declared
`byteLength` and, when the payload is shorter, does
`Buffer.alloc(byteLength)` — trusting the declared length over the bytes
actually present.
- `ContractClassPublishedEvent.fromLog` decodes the bytecode via
`bufferFromFields`, and `DataTxValidator.#hasCorrectContractClassIds`
calls `fromLog` before any class-id / proof rejection.

A contract-class log is a fixed `CONTRACT_CLASS_LOG_SIZE_IN_FIELDS`
array, so the real packed bytecode is capped at ~93 KiB. But the
declared length is unbounded: a ~96 KiB log declaring 16 MiB forces a 16
MiB allocation on every node that validates the tx, before the proof
check rejects it — a memory-pressure / validation-amplification path.

## Fix

- Add an optional `maxByteLength` parameter to `bufferFromFields`; when
provided, it throws (before allocating) if the declared length exceeds
it. Callers that don't pass it keep the documented blob-reconstruction
padding behaviour unchanged.
- `fromLog` passes the fixed log's payload capacity (`payloadFields ×
31`), so an over-declared length is rejected before allocation.
`DataTxValidator` already wraps `fromLog` in try/catch, so the throw
becomes a clean pre-proof `TX_ERROR_MALFORMED_CONTRACT_CLASS_LOG`
rejection.

## Tests

- `data_validator.test.ts`: a log declaring a 16 MiB length is rejected,
and `Buffer.alloc` is never called with the declared size (test time
drops from ~1.2s to ~40ms — no large allocation). Run against the
pre-fix compiled code, the allocation still happened (red); green after
the fix.
- Existing `contract_class_published_event.test.ts` "fits a max-size
public bytecode" and the `buffer.test.ts` padding tests confirm
legitimate max-size bytecode and the padding contract are unaffected.
…24214)

Fixes A-1256.

## Problem

The gossipsub full message id was `SHA256(topic || data)[0..20]` with
**no framing** between the topic string and the message bytes, and
`LibP2PService` set no `allowedTopics`. Raw concatenation isn't
injective: for a real message `(T, D)`, a peer can craft `T' = T +
D[0]`, `D' = D[1:]` so `T' || D'` is byte-identical to `T || D` → same
msgId. ChainSafe gossipsub transforms the arbitrary-topic message and
inserts that id into `seenCache` **before** the subscription check (it
isn't delivered to us, since we're not subscribed to `T'`). When the
genuine proposal/attestation arrives on `T`, gossipsub drops it as a
**duplicate** before application validation, peer scoring, or handling —
suppressing a time-sensitive consensus message within the slot.

## Fix (defense in depth — either alone breaks the attack)

1. **Frame the msgId input** as `uint32be(topicLen) || topic || data`
(`encoding.ts`). The topic length pins the `(topic, data)` boundary, so
a boundary-shifted pair no longer collides. (`getMsgIdFn`'s parameter is
narrowed to `Pick<Message, 'topic' | 'data'>` — the only fields it reads
— which stays assignable to gossipsub's `msgIdFn` slot.)
2. **Set exact `allowedTopics`** (`libp2p_service.ts`) to the subscribed
Aztec topic strings. Verified against the installed gossipsub: the
allowlist is enforced in `handleReceivedRpc` *before*
`handleReceivedMessage`/`seenCache.put`, so an unsubscribed-topic
message is dropped before transform / msgId / seenCache.

## Test

`encoding.test.ts`: builds a real `P2PMessage.toMessageData()` buffer
(confirming `data[0] === 0x00`), constructs the shifted `(T', D')`, and
asserts the two msg ids now differ (they collided before the framing
change); plus a determinism check.

## Compatibility

`msgId` is computed locally for dedup; changing the function doesn't
change any on-wire format. During a rolling upgrade, mixed nodes briefly
compute different ids for the same message (minor IHAVE/IWANT
inefficiency), with no correctness impact.
## Merge `v5-next` into `merge-train/spartan-v5` — MUST merge as a merge
commit (do not squash)

### Why the conflicts came back
The previous stack (#24221/#24222) resolved the conflict correctly, but
#24221 was **squash-merged** onto the train. A squash creates a
brand-new single commit and does **not** record `v5-next` as a parent,
so git still sees the train and `v5-next` as diverged (`git merge-base
--is-ancestor origin/v5-next origin/merge-train/spartan-v5` → false).
The merge-train auto-pull then re-attempts `v5-next → train` and hits
the same conflict again.

### What this PR does
This is a real **merge commit** of `v5-next` into the train (two
parents: train tip `eb7d64d` + `v5-next` `4df7243`), with the one
conflict resolved. The resulting tree is **identical to the current
train** (0 file changes) because the content already landed via the
earlier squash — this PR exists purely to record `v5-next` as an
ancestor and advance the merge base.

After this lands, `v5-next` is an ancestor of the train and the
long-lived `merge-train/spartan-v5` → `v5-next` PR is conflict-free.

### Conflict resolved (1)
`yarn-project/end-to-end/src/e2e_nested_utility_calls.test.ts` — kept
the train's doc comment above the `denies…private function` test; the
`msg_sender` test from `v5-next` is already present.

### ⚠️ Merge method
**Merge this with "Create a merge commit"** (or `merge_pr
merge_method=merge`). A squash or rebase merge will recreate the same
divergence and the conflicts will return.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/51d528a3438f2849) ·
group: `slackbot`*
@AztecBot

Copy link
Copy Markdown
Collaborator Author

Flakey Tests

🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry.

\033FLAKED\033 (8;;http://ci.aztec-labs.com/202cd8fd9ef45b3a�202cd8fd9ef45b3a8;;�): yarn-project/scripts/run_test.sh ethereum/src/test/tx_delayer.test.ts (127s) (code: 0)

…255) (#24215)

Fixes A-1255.

## Problem

A discovery peer can advertise a signed ENR with valid `ip`/`udp`
contact data and the expected Aztec ENR version key, but a **malformed**
`tcp` field (e.g. 3 bytes instead of a 2-byte port).
`DiscV5Service.validateEnr` only checked the aztec key + version, so
upstream discv5 accepted the session and emitted `enrAdded`. The
unguarded async `onEnrAdded` listener then awaited
`enr.getFullMultiaddr('tcp')`, which throws `RangeError` on the
malformed value; as an async EventEmitter listener the rejection is
unhandled and **the node process exits**. One malformed,
UDP-contactable, same-version ENR crashes any p2p-enabled node.

## Audit

A sweep of every `getFullMultiaddr` / `getLocationMultiaddr` call in
`p2p/src` found the same unguarded-parse pattern in more than one place:

| Site | ENR source | Crash today | Action |
|------|-----------|-------------|--------|
| `discV5_service.ts` `onEnrAdded` | remote | yes | guard + drop |
| `bootstrap.ts` `discovered` listener | remote | yes (crashes the
**bootnode**) | guard + drop |
| `peer_manager.ts` `handleDiscoveredPeer` | remote | no (caught by
wrapper) | covered by `validateEnr` rejecting malformed-addr ENRs |
| `peer_manager.ts` / `libp2p_service.ts` `preferredPeers` | config |
aborts setup | skip + log |

## Fix

- Guard `onEnrAdded` and the bootnode `discovered` listener so a parser
exception is caught, logged, and the ENR dropped — no unhandled
rejection.
- `validateEnr` rejects an ENR whose TCP multiaddr won't parse, so it's
never emitted as a full peer (`PeerEvent.DISCOVERED`) or re-parsed by
`handleDiscoveredPeer`. A *missing* TCP addr stays allowed — those peers
are simply skipped at dial time; only an *unparseable* one is rejected.
- `preferredPeers` ENR parsing skips and logs a malformed configured ENR
instead of aborting node / preferred-peer setup.

The self-ENR log line (`discV5_service.ts:199-200`) is intentionally
left unguarded — it reads our own ENR, which we build and which can't be
attacker-malformed.

## Tests

`discv5_service.test.ts`: builds a same-version ENR with valid
`ip`/`udp` and a 3-byte `tcp` value, and asserts (a) `validateEnr`
rejects it while a valid control built from the same identity passes —
isolating the TCP check — and (b) `onEnrAdded` resolves (no unhandled
rejection, previously a `RangeError`) and never emits `DISCOVERED`.

Note: `discV5_service.ts` overlapped with the already-merged PR #24169;
this branch is off the post-merge `merge-train/spartan-v5`, so no
conflict.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants