feat: merge-train/spartan-v5#24196
Open
AztecBot wants to merge 16 commits into
Open
Conversation
…24169) ## Problem A node that restarts after its bootstrap nodes are gone cannot rejoin the network on its own, even though it has a list of peers persisted on disk. discv5 builds its routing table only from **bootstrap** and **trusted** ENRs (`discV5_service.ts` `start()`), and keeps that table **in memory** — it is never persisted. The peer-manager dial queue is fed exclusively by discv5 discovery events, and libp2p auto-dial is disabled (`minConnections: 0`). So after a restart with no reachable bootstrap node, a node's DHT starts empty and the only way it reconnects is via **inbound** dials from peers that still hold its ENR. That is unreliable and fails entirely on a full-network restart. This is the root cause of the flaky `e2e_p2p_rediscovery` failure (see A-1244): the test only passed because it restarted nodes one at a time, so still-running neighbours dialed the restarted node back. The last nodes to restart often ended up below quorum (`discv5KadPeers: 0`, stuck at 2/3 peers). ## Fix Persist discovered peer ENRs and re-seed discovery from them on startup: - **Persist** valid, non-bootnode peer ENRs to the p2p peer store (`onEnrAdded`), keyed by nodeId, bounded to `MAX_PERSISTED_PEER_ENRS` with simple eviction. - **Re-seed** on `start()`, alongside the existing bootstrap/trusted-ENR seeding: load persisted ENRs and `discv5.addEnr(...)` each (dropping stale/incompatible ones). Re-adding an ENR flows through the existing `onDiscovered` → `DISCOVERED` → `PeerManager` dial path, so no new dial logic is required. ENRs carry both UDP (discv5) and TCP (libp2p) addresses, so this restores both discovery and dialability without a bootstrap node. The persisted store is the same `peerStore` libp2p already uses; the discovery service just opens its own map on it. ## Test `e2e_p2p_rediscovery` now exercises the real capability: - Wait for a full mesh before teardown, so every node has persisted all its peers (replaces a blind `sleep`). - Stop the bootstrap node and **all** validators, then restart them **all** — a full-network bounce — instead of the rolling stop-one/restart-one loop. With no live peers to dial the nodes back, the mesh can only rebuild by re-seeding from persisted ENRs. Existing discv5 unit tests pass unchanged. (The pre-existing `should persist peers without bootnode` unit test remains skipped — it can't be made faithful without spawning fresh node processes; the e2e test is the integration coverage.) Fixes A-1244.
Comment-only annotations across the end-to-end test suite, produced by the e2e consolidation survey. Every test file under `yarn-project/end-to-end/src/` gets a short prose note above its `describe`/`it` describing how it sets up its environment — sequencer type, node topology, prover, cross-chain, time-warp — and the category it maps to in the proposed consolidation, plus inline `// REFACTOR:` markers where a follow-up is expected.
…ting (A-1258) DataTxValidator decodes contract-class publication logs before proof validation. bufferFromFields trusted the declared byte-length field and called Buffer.alloc(byteLength) even when it far exceeded the payload actually present, so a malformed class-publication tx could force attacker-chosen multi-MiB allocations on every node that validates the gossiped tx (req/resp, block proposal, RPC, gossip) before the proof validator rejects it. Add an optional maxByteLength guard to bufferFromFields (preserving the documented blob-reconstruction padding for callers that don't pass it); ContractClassPublishedEvent.fromLog passes the fixed log's payload capacity, so an over-declared length throws before allocation. DataTxValidator already wraps fromLog in try/catch, so the throw becomes a clean pre-proof rejection. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The gossipsub msgId was SHA256(topic || data) with no framing between the topic string and the message bytes, and no allowedTopics was set. A peer could publish on an unsubscribed topic T+data[0] with data[1:], whose (topic, data) concatenation is byte-identical to a real message on topic T, hashing to the same msgId. gossipsub transformed it and inserted the id into seenCache before the subscription check, so the genuine proposal/attestation was later dropped as a duplicate, suppressing a time-sensitive consensus message. - Frame the topic length into the msgId input (uint32be(topicLen) || topic || data) so the (topic, data) boundary is unambiguous and a boundary-shifted pair no longer collides. - Set exact allowedTopics on the gossipsub config so an unsubscribed-topic message is rejected before transform / msgId / seenCache insertion. Defense in depth: either fix alone breaks the attack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…255)
A discovery peer could advertise a signed, same-version, UDP-contactable
ENR with a malformed TCP field (e.g. 3 bytes instead of a 2-byte port).
validateEnr only checked the aztec key + version, so discv5 emitted
enrAdded; the unguarded async onEnrAdded listener then awaited
getFullMultiaddr('tcp'), which threw RangeError, and as an async listener
the rejection was unhandled and crashed the node.
Audit of every getFullMultiaddr/getLocationMultiaddr site in p2p found
the same pattern on the bootnode's discovered listener (crashes the
bootnode) and config-provided preferredPeers parsing (aborts setup):
- Guard onEnrAdded and the bootstrap discovered listener so a parser
exception is caught, logged and the ENR dropped (no process exit).
- validateEnr rejects an ENR whose TCP multiaddr won't parse, so it's
never emitted as a full peer / re-parsed downstream. A missing TCP addr
stays allowed (those peers are skipped at dial time).
- preferredPeers ENR parsing (peer_manager, libp2p_service) skips and logs
a malformed configured ENR instead of aborting node setup.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass. |
…258) Defensive guard from PR review: an empty fields array would destructure length to undefined and throw on length.toNumber(). A valid bufferAsFields output always carries the leading byte-length field, so empty encodes no data — return an empty buffer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rs) (#24221) ## Raw merge — do not merge alone This is **PR 1 of a two-PR stack** that catches `merge-train/spartan-v5` up to its base `v5-next` so the long-lived train PR (`merge-train/spartan-v5` → `v5-next`) becomes conflict-free again. This PR is the **raw merge of `v5-next` into `merge-train/spartan-v5` with conflict markers committed** — it is red by design. Review and merge it only together with the stacked resolution PR (PR 2, head `cb/merge-train-spartan-v5-fix`), which targets this branch. ### Parents - train: `merge-train/spartan-v5` @ `bc2c924919fe8880ee9572df779b78f653ef2e15` - base: `v5-next` @ `4df72438bf01f125e791af7e8bb11a613964457e` ### Conflicted files (1) - `yarn-project/end-to-end/src/e2e_nested_utility_calls.test.ts` Resolution lives in the stacked fix PR. --- *Created by [claudebox](https://claudebox.work/v2/sessions/51d528a3438f2849) · group: `slackbot`* --------- Co-authored-by: Mitchell Tracy <mitchellftracy@gmail.com> Co-authored-by: Nicolas Chamo <nicolas@chamo.com.ar> Co-authored-by: Santiago Palladino <santiago@aztec-labs.com> Co-authored-by: just-mitch <68168980+just-mitch@users.noreply.github.com> Co-authored-by: Nicolás Venturo <nicolas.venturo@gmail.com> Co-authored-by: AztecBot <tech@aztecprotocol.com>
…ing (A-1253) (#24212) Fixes A-1253. ## Problem A protocol-valid public log can carry **zero fields** — `PublicLog(nonzeroContractAddress, [])` — reachable by any tx that bypasses the high-level aztec.nr helpers (the AVM `EMITPUBLICLOG` opcode accepts `logSize = 0`). `LogStore` used `fields[0]` as the tag-index key, so `fieldHex(undefined)` threw inside the `addLogs` transaction. Because that runs inside `addProposedBlock`/`addCheckpoints`' larger store transaction, the throw aborted the entire block/checkpoint update — a single such tx in an accepted block stalls L1 sync for every node that ingests it (the same valid checkpoint is retried forever). ## Fix (Fix A — archiver-only, lowest risk) A log with no fields has no tag. Index it under the **empty tag** (`tagHexForLog` returns `''` for a zero-field log): - it stays retrievable via the per-block read (`getPublicLogsForBlock`) and is dropped on reorg via the per-block secondary index, exactly like any other log; - the empty-tag key sorts before every real key for the contract, so a real 64-hex-char tag query never matches it. `PublicLog.fields` is a variable-length `Fr[]` (the actual crash surface); `PrivateLog.fields` is a fixed padded tuple so its `fields[0]` is always present, but the shared helper is applied to both loops for consistency and is behaviour-preserving for private logs. ## Test `log_store.test.ts`: constructs `new PublicLog(CONTRACT, [])`, asserts `addLogs` resolves (previously threw), the log is returned by `getPublicLogsForBlock`, and a tag query omits it. --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(A-1256) The gossipsub fast-path dedup cache keyed on fastMsgIdFn — a non-cryptographic 64-bit xxhash of the raw data only (no topic), seeded from Math.random (~2^30, non-CSPRNG). A fast-id collision (accidental, or engineered given the weak seed, or simply same-data-on-a-different-topic) makes gossipsub drop a different message as a duplicate with no fallback to the full id, so it's a gossip-suppression vector that the framed full msgId does not close (the fast path short-circuits before it). fastMsgIdFn is optional; removing it leaves dedup resting solely on the cryptographic, topic-framed msgIdFn (SHA-256). Also removes the now-unused xxhash machinery from encoding.ts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No longer used after removing fastMsgIdFn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… EpochSession (A-1212) (#24216) Fixes A-1212. ## Problem `prover_getJobs` exposes each `EpochSession`'s state, but `EpochSession` only ever assigned `initialized`, `awaiting-checkpoints`, and the terminal states. The intermediate phases that exist to describe active work were never set, so a session read `awaiting-checkpoints` continuously while it was actually (a) proving the top tree or (b) publishing to L1. Observed on staging-internal: every `startProof` session sat in `awaiting-checkpoints` even with logs showing `Starting top-tree prove…`. This is a regression vs the old `EpochProvingJob`, which advanced through distinct phases. ## Fix - Add a new **non-terminal** `awaiting-root` state to `EpochProvingJobState` (the zod schema is derived from the same array, so it updates automatically). Not added to `EpochProvingJobTerminalState`. - Set `awaiting-root` when the top-tree (root) prove begins, via the `TopTreeJob` `beforeProve` hook — which fires once the sub-tree (checkpoint block) proofs are ready and the root prove is about to start. `toTopTreeHooks()` now always wires this transition and layers any test-provided hooks on top (previously it returned `undefined` when no test hooks were set). - Set `publishing-proof` at the top of `submitProof()`, before the L1 submit. Both transitions are guarded by `isTerminal()` so a concurrent `cancel()` still wins (the post-submit `isTerminal()` check relies on this). Phase progression is now: `initialized → awaiting-checkpoints → awaiting-root → publishing-proof → terminal`. ## Tests - `epoch-session.test.ts`: new "state reporting" test asserts `getState()` is `awaiting-root` during the prove and `publishing-proof` during the submit, then `completed`. Existing hook-ordering test (`before`/`prove`/`after`) still holds. - `prover-node.test.ts`: the JSON-RPC schema round-trip mock now includes an `awaiting-root` job. Targets the v5 line — the `EpochSession`/`TopTreeJob`/`awaiting-checkpoints` code (post-#23552) exists only on `merge-train/spartan-v5`. --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolves conflict in e2e_nested_utility_calls.test.ts (keeps the train's doc comment above the 'denies...private function' test; the msg_sender test from v5-next is already present). Real merge commit so v5-next becomes an ancestor of the train.
…ting (A-1258) (#24213) Fixes A-1258. ## Problem `DataTxValidator` decodes contract-class publication logs **before** proof validation (data validation runs ahead of proof validation on the req/resp, block-proposal, RPC, and gossip paths). The decode trusts the declared public-bytecode length field: - `bufferFromFields` (`stdlib/src/abi/buffer.ts`) reads the declared `byteLength` and, when the payload is shorter, does `Buffer.alloc(byteLength)` — trusting the declared length over the bytes actually present. - `ContractClassPublishedEvent.fromLog` decodes the bytecode via `bufferFromFields`, and `DataTxValidator.#hasCorrectContractClassIds` calls `fromLog` before any class-id / proof rejection. A contract-class log is a fixed `CONTRACT_CLASS_LOG_SIZE_IN_FIELDS` array, so the real packed bytecode is capped at ~93 KiB. But the declared length is unbounded: a ~96 KiB log declaring 16 MiB forces a 16 MiB allocation on every node that validates the tx, before the proof check rejects it — a memory-pressure / validation-amplification path. ## Fix - Add an optional `maxByteLength` parameter to `bufferFromFields`; when provided, it throws (before allocating) if the declared length exceeds it. Callers that don't pass it keep the documented blob-reconstruction padding behaviour unchanged. - `fromLog` passes the fixed log's payload capacity (`payloadFields × 31`), so an over-declared length is rejected before allocation. `DataTxValidator` already wraps `fromLog` in try/catch, so the throw becomes a clean pre-proof `TX_ERROR_MALFORMED_CONTRACT_CLASS_LOG` rejection. ## Tests - `data_validator.test.ts`: a log declaring a 16 MiB length is rejected, and `Buffer.alloc` is never called with the declared size (test time drops from ~1.2s to ~40ms — no large allocation). Run against the pre-fix compiled code, the allocation still happened (red); green after the fix. - Existing `contract_class_published_event.test.ts` "fits a max-size public bytecode" and the `buffer.test.ts` padding tests confirm legitimate max-size bytecode and the padding contract are unaffected.
…24214) Fixes A-1256. ## Problem The gossipsub full message id was `SHA256(topic || data)[0..20]` with **no framing** between the topic string and the message bytes, and `LibP2PService` set no `allowedTopics`. Raw concatenation isn't injective: for a real message `(T, D)`, a peer can craft `T' = T + D[0]`, `D' = D[1:]` so `T' || D'` is byte-identical to `T || D` → same msgId. ChainSafe gossipsub transforms the arbitrary-topic message and inserts that id into `seenCache` **before** the subscription check (it isn't delivered to us, since we're not subscribed to `T'`). When the genuine proposal/attestation arrives on `T`, gossipsub drops it as a **duplicate** before application validation, peer scoring, or handling — suppressing a time-sensitive consensus message within the slot. ## Fix (defense in depth — either alone breaks the attack) 1. **Frame the msgId input** as `uint32be(topicLen) || topic || data` (`encoding.ts`). The topic length pins the `(topic, data)` boundary, so a boundary-shifted pair no longer collides. (`getMsgIdFn`'s parameter is narrowed to `Pick<Message, 'topic' | 'data'>` — the only fields it reads — which stays assignable to gossipsub's `msgIdFn` slot.) 2. **Set exact `allowedTopics`** (`libp2p_service.ts`) to the subscribed Aztec topic strings. Verified against the installed gossipsub: the allowlist is enforced in `handleReceivedRpc` *before* `handleReceivedMessage`/`seenCache.put`, so an unsubscribed-topic message is dropped before transform / msgId / seenCache. ## Test `encoding.test.ts`: builds a real `P2PMessage.toMessageData()` buffer (confirming `data[0] === 0x00`), constructs the shifted `(T', D')`, and asserts the two msg ids now differ (they collided before the framing change); plus a determinism check. ## Compatibility `msgId` is computed locally for dedup; changing the function doesn't change any on-wire format. During a rolling upgrade, mixed nodes briefly compute different ids for the same message (minor IHAVE/IWANT inefficiency), with no correctness impact.
## Merge `v5-next` into `merge-train/spartan-v5` — MUST merge as a merge commit (do not squash) ### Why the conflicts came back The previous stack (#24221/#24222) resolved the conflict correctly, but #24221 was **squash-merged** onto the train. A squash creates a brand-new single commit and does **not** record `v5-next` as a parent, so git still sees the train and `v5-next` as diverged (`git merge-base --is-ancestor origin/v5-next origin/merge-train/spartan-v5` → false). The merge-train auto-pull then re-attempts `v5-next → train` and hits the same conflict again. ### What this PR does This is a real **merge commit** of `v5-next` into the train (two parents: train tip `eb7d64d` + `v5-next` `4df7243`), with the one conflict resolved. The resulting tree is **identical to the current train** (0 file changes) because the content already landed via the earlier squash — this PR exists purely to record `v5-next` as an ancestor and advance the merge base. After this lands, `v5-next` is an ancestor of the train and the long-lived `merge-train/spartan-v5` → `v5-next` PR is conflict-free. ### Conflict resolved (1) `yarn-project/end-to-end/src/e2e_nested_utility_calls.test.ts` — kept the train's doc comment above the `denies…private function` test; the `msg_sender` test from `v5-next` is already present. ###⚠️ Merge method **Merge this with "Create a merge commit"** (or `merge_pr merge_method=merge`). A squash or rebase merge will recreate the same divergence and the conflicts will return. --- *Created by [claudebox](https://claudebox.work/v2/sessions/51d528a3438f2849) · group: `slackbot`*
Collaborator
Author
Flakey Tests🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry. |
…255) (#24215) Fixes A-1255. ## Problem A discovery peer can advertise a signed ENR with valid `ip`/`udp` contact data and the expected Aztec ENR version key, but a **malformed** `tcp` field (e.g. 3 bytes instead of a 2-byte port). `DiscV5Service.validateEnr` only checked the aztec key + version, so upstream discv5 accepted the session and emitted `enrAdded`. The unguarded async `onEnrAdded` listener then awaited `enr.getFullMultiaddr('tcp')`, which throws `RangeError` on the malformed value; as an async EventEmitter listener the rejection is unhandled and **the node process exits**. One malformed, UDP-contactable, same-version ENR crashes any p2p-enabled node. ## Audit A sweep of every `getFullMultiaddr` / `getLocationMultiaddr` call in `p2p/src` found the same unguarded-parse pattern in more than one place: | Site | ENR source | Crash today | Action | |------|-----------|-------------|--------| | `discV5_service.ts` `onEnrAdded` | remote | yes | guard + drop | | `bootstrap.ts` `discovered` listener | remote | yes (crashes the **bootnode**) | guard + drop | | `peer_manager.ts` `handleDiscoveredPeer` | remote | no (caught by wrapper) | covered by `validateEnr` rejecting malformed-addr ENRs | | `peer_manager.ts` / `libp2p_service.ts` `preferredPeers` | config | aborts setup | skip + log | ## Fix - Guard `onEnrAdded` and the bootnode `discovered` listener so a parser exception is caught, logged, and the ENR dropped — no unhandled rejection. - `validateEnr` rejects an ENR whose TCP multiaddr won't parse, so it's never emitted as a full peer (`PeerEvent.DISCOVERED`) or re-parsed by `handleDiscoveredPeer`. A *missing* TCP addr stays allowed — those peers are simply skipped at dial time; only an *unparseable* one is rejected. - `preferredPeers` ENR parsing skips and logs a malformed configured ENR instead of aborting node / preferred-peer setup. The self-ENR log line (`discV5_service.ts:199-200`) is intentionally left unguarded — it reads our own ENR, which we build and which can't be attacker-malformed. ## Tests `discv5_service.test.ts`: builds a same-version ENR with valid `ip`/`udp` and a 3-byte `tcp` value, and asserts (a) `validateEnr` rejects it while a valid control built from the same identity passes — isolating the TCP check — and (b) `onEnrAdded` resolves (no unhandled rejection, previously a `RangeError`) and never emits `DISCOVERED`. Note: `discV5_service.ts` overlapped with the already-merged PR #24169; this branch is off the post-merge `merge-train/spartan-v5`, so no conflict.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
BEGIN_COMMIT_OVERRIDE
fix(p2p): re-seed discovery from persisted peer ENRs after restart (#24169)
docs(e2e): annotate e2e tests with setup/category notes (#24191)
chore: merge v5-next into merge-train/spartan-v5 (raw, conflict markers) (#24221)
fix(archiver): index zero-field logs under empty tag instead of throwing (A-1253) (#24212)
fix(prover-node): report awaiting-root and publishing-proof phases in EpochSession (A-1212) (#24216)
fix(p2p): bound declared contract-class bytecode length before allocating (A-1258) (#24213)
fix(p2p): frame gossipsub msgId and restrict allowedTopics (A-1256) (#24214)
chore: merge v5-next into merge-train/spartan-v5 (#24226)
fix(p2p): guard ENR address parsing against malformed TCP fields (A-1255) (#24215)
END_COMMIT_OVERRIDE