fix(archiver): validate checkpoint attestations before fetching blobs (A-1252) #24183
Draft
PhilWindle wants to merge 9 commits into
… (A-1252)
During L1 sync the archiver fetched and decoded checkpoint blob data before validating committee attestations. A selected proposer could publish a checkpoint with invalid/insufficient attestations plus malformed-but-hash-matching blob data; the blob decode threw BlobDeserializationError before the invalid-attestation skip path ran, so the checkpoint was never recorded as rejected and the L1 sync point never advanced past it. The archiver then re-queried the same L1 blocks every poll and re-threw forever, taking any valid checkpoints in the same batch down with it (sync-liveness / DoS).
Validate attestations from calldata first (the signed consensus payload -- header, archive root, fee asset price modifier -- is fully available without blobs), reject invalid and descendant-of-rejected checkpoints without fetching their blobs, then fetch blobs in parallel only for the survivors and ingest.
- validation.ts: add validateCheckpointAttestationsFromCalldata and extract the shared core validator.
- l1_synchronizer.ts: reorder handleCheckpoints to validate-then-fetch; track lastSeenCheckpoint from calldata since rejected checkpoints are no longer built.
- archiver-sync.test.ts: regression test covering invalid attestations + malformed blob, asserting rejection without a BlobDeserializationError retry loop.
Source: AztecProtocol/aztec-claude#1553
… be pruned (A-1252) A checkpoint with valid attestations but a malformed or withheld blob still threw during blob decode in handleCheckpoints, aborting the whole sync iteration and freezing the L1 sync clock (this.l1Timestamp is only advanced at the end of syncFromL1). Frozen clock => every honest proposer's checkSync fails => nobody proposes => the epoch-prune recovery never fires, so a single bribed-committee + gossip-withholding attack permanently halts the chain. Gate the blob failure on rollup.canPruneAtTime: while the checkpoint's epoch can still be proven a blob failure stays fatal (the blob is canonical and must become available, so we keep retrying); once the proof window has expired the checkpoint is destined for pruning, so we skip it and every later checkpoint this iteration and let the existing epoch-prune machinery recover. Sync no longer throws, the clock advances, honest proposers resume, prune-on-propose fires on L1, and the bad checkpoint is dropped. Adds unit tests for the prunable/non-prunable boundary and confirms a matching local proposed checkpoint is still promoted (blob fetch skipped) even when its on-chain blob is malformed.
…esh node syncs past withheld blob (A-1252) Adds an e2e to epochs_invalidate_block: all sequencers post a checkpoint with insufficient attestations, then revert to honest config so it is invalidated. Asserts an already-running honest node detects the bad checkpoint from L1 calldata (InvalidAttestationsCheckpointDetected) before any blob fetch, and that a fresh observer node started afterwards — with the bad checkpoint's blob deleted from the shared file store and no gossiped copy to promote — still syncs past it rather than looping on a blob-fetch error.
…zes sync clock once prunable (A-1252) Adds the rows-4/5 counterpart to the invalid-attestations e2e: a checkpoint with valid attestations but a withheld blob cannot be rejected by attestation validation, so a node must fetch its blob to ingest it. Using a short proof window and no prover, the test freezes the chain on such a checkpoint, deletes its blob from the shared store, and starts a gossip-disabled observer that never promotes. The observer's L1 sync clock stays frozen (getSyncedL1Timestamp undefined) while the epoch is still provable, then advances past the checkpoint once the proof window expires and it becomes prunable — the clock-unfreeze the fix restores. Without the fix the observer would loop on the blob fetch forever.
…hable-blob checkpoint (A-1252) Extends the rows-4/5 e2e: after the observer's sync clock unfreezes (it skipped the unfetchable checkpoint once prunable), production is resumed. The next proposer prunes the doomed unproven epoch on L1 and the chain rebuilds; every node — all validators plus the previously-stuck observer — progresses past the bad checkpoint. Because the observer never ingested that checkpoint, it can only get past it if the chain actually prunes (a descendant would otherwise be unservable to it), so this also asserts the prune occurred. Freezing/resuming is done via minTxsPerBlock + buildCheckpointIfEmpty toggles rather than stopping the sequencer (which closes its store and cannot be cleanly restarted).
…kpoint while canonical (A-1252)
Reworks the two blob-unavailability e2e tests to be deterministic and to exercise the intended paths:
- Production is driven by feeding txs (minTxsPerBlock 1, empty checkpoints disabled) rather than racing sequencer-config changes against a stream of empty checkpoints, so we control exactly which checkpoint is bad and the chain stays frozen on it once we stop feeding txs.
- The invalid-attestations test now keeps the bad checkpoint CANONICAL (invalidation disabled on the sequencers) so the late, no-promote observer rejects it via the L1 calldata attestation check, not the archive-mismatch filter, which only drops already-invalidated/replaced checkpoints. With its blob withheld, the observer's sync clock still advances past it, proving it never fetched the blob. The chain then invalidates and rebuilds, and every node (validators + observer) converges past it.
- The rows-4/5 test keeps the same shape: a valid-attestations checkpoint with a withheld blob freezes a no-promote observer's clock until its proof window expires, after which it skips the checkpoint and the clock advances; resuming production prunes the doomed epoch and all nodes converge.
Asserting on the sync clock (a persistent signal) rather than the transient detection event avoids a flaky race where a late observer detects the checkpoint during its initial sync before a listener attaches.
…ipped prunable one is not ingested (A-1252) Adds a unit test exercising the handleCheckpoints loop-break (not just the in-batch filter): with a tiny batch size, a malformed-blob checkpoint (CP2, prunable) and a good-blob checkpoint that chains off it (CP3) fall in separate L1-block batches. The fix must skip CP3 along with CP2; without the break, addCheckpoints hits InitialCheckpointNumberNotSequentialError on the gap and re-freezes sync. Verified red without the break. Widens the buildArchiver test helper to accept a batchSize override.
Summary
During L1 sync the archiver fetched and decoded checkpoint blob data before validating committee attestations. A selected L1 proposer could publish a checkpoint with invalid/insufficient attestations plus malformed-but-hash-matching blob data; the blob decode threw BlobDeserializationError before the invalid-attestation skip path ran, so the checkpoint was never recorded as rejected and the L1 sync point never advanced past it. The archiver then re-queried the same L1 blocks every poll and re-threw forever, taking any valid checkpoints in the same batch down with it. This is a sync-liveness / DoS issue at the L1 ingestion boundary (no invalid state accepted, no asset loss).
Fix
Validate attestations from L1 calldata first — the signed consensus payload (header, archive root, fee asset price modifier) is fully available without blobs. Reject invalid and descendant-of-rejected checkpoints (emitting the same events and persisting the same rejected entries as before) without fetching their blobs, then fetch blobs in parallel only for the surviving checkpoints and ingest.
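The validate-then-fetch ordering can be illustrated with a minimal sketch. Everything here is hypothetical and simplified: the type and function names (CalldataCheckpoint, planFromCalldata, fetchSurvivors) are not from the PR, the attestationsValid flag stands in for the real committee-signature check, and tracking descendants by parent archive root is an assumption about how "descendant-of-rejected" could be detected.

```typescript
// Hypothetical, simplified view of what L1 calldata gives us per checkpoint.
type CalldataCheckpoint = {
  checkpointNumber: number;
  archiveRoot: string; // archive root after this checkpoint
  parentArchiveRoot: string; // archive root this checkpoint builds on (assumed available)
  attestationsValid: boolean; // stand-in for the real attestation validation
};

type SyncPlan = { rejected: number[]; toFetch: CalldataCheckpoint[] };

// Phase 1: decide from calldata alone which checkpoints deserve a blob fetch.
function planFromCalldata(checkpoints: CalldataCheckpoint[]): SyncPlan {
  const rejectedRoots = new Set<string>();
  const plan: SyncPlan = { rejected: [], toFetch: [] };
  for (const cp of checkpoints) {
    const descendsFromRejected = rejectedRoots.has(cp.parentArchiveRoot);
    if (!cp.attestationsValid || descendsFromRejected) {
      // Record the rejection without touching the blob, so a malformed blob cannot throw here.
      rejectedRoots.add(cp.archiveRoot);
      plan.rejected.push(cp.checkpointNumber);
      continue;
    }
    plan.toFetch.push(cp);
  }
  return plan;
}

// Phase 2: fetch blobs in parallel, but only for the surviving checkpoints.
async function fetchSurvivors<T>(
  plan: SyncPlan,
  fetchBlob: (cp: CalldataCheckpoint) => Promise<T>,
): Promise<T[]> {
  return Promise.all(plan.toFetch.map(cp => fetchBlob(cp)));
}
```

In this sketch a rejected checkpoint never reaches fetchBlob, which is the property the regression test relies on: a malformed blob attached to an invalid-attestations checkpoint has no opportunity to throw.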
- validation.ts: add validateCheckpointAttestationsFromCalldata and extract the shared core validator (validateCheckpointAttestations now delegates to it).
- l1_synchronizer.ts: reorder handleCheckpoints to validate-then-fetch; lastSeenCheckpoint is tracked from calldata ({ checkpointNumber, l1 }) since rejected checkpoints are no longer built into PublishedCheckpoints.
- archiver-sync.test.ts: regression test in which a checkpoint with invalid attestations and malformed blob data is rejected without a BlobDeserializationError retry loop.
Verification
- … BlobDeserializationError) and passes with the fix.
- syncImmediate() throws every time, the sync point never advances (syncedL1=undefined), and no checkpoint is ingested or rejected.
- archiver-sync.test.ts + validation.test.ts: 68/68 pass.
- tsc --noEmit clean.
Source: AztecProtocol/aztec-claude#1553
Follow-up: blob-failure liveness hardening (rows 4/5)
The reorder above fixes the case of a checkpoint with invalid attestations + a bad blob. A checkpoint with valid attestations (a ≥2/3 committee bribe) but a malformed or withheld blob still threw during blob decode, passing attestation validation and then aborting the whole syncFromL1 iteration. Because this.l1Timestamp is only advanced at the end of syncFromL1, the L1 sync clock froze, so every honest proposer's checkSync failed and nobody proposed, meaning the epoch-prune recovery never fired. A single bribed-committee + gossip-withholding attack could permanently halt the chain.
Fix (l1_synchronizer.ts, handleCheckpoints): gate the blob fetch/decode failure on rollup.canPruneAtTime.
Only NoBlobBodiesFoundError / BlobDeserializationError are caught; other errors propagate as before.
Tests
- archiver-sync.test.ts: malformed blob + valid attestations while the epoch can still be proven → still throws; once prunable → does not throw, the checkpoint is skipped, the pending chain rolls back to the proven tip, and the sync clock advances; a matching local proposed checkpoint is still promoted (blob fetch skipped) even when its on-chain blob is malformed.
- e2e_epochs/epochs_invalidate_block (invalid attestations): an invalid-attestations checkpoint is detected from calldata before any blob fetch and a fresh node syncs past it with its blob withheld from the shared store.
- e2e_epochs/epochs_blob_unavailable_prune (valid attestations, rows 4/5): a gossip-disabled observer whose copy of a valid-attestations checkpoint's blob is withheld keeps its L1 sync clock frozen while the epoch is still provable, then unfreezes and advances past the checkpoint once its proof window expires (it becomes prunable), instead of looping on the blob fetch forever.
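The prunability gate from the follow-up fix can be sketched roughly as below. This is an illustration under stated assumptions, not the code in l1_synchronizer.ts: the two error classes are local stand-ins sharing the names used in the PR, canPrune is a hypothetical callback standing in for the rollup.canPruneAtTime check (whose real signature is not shown here), and ingestWithPruneGate is an invented name.

```typescript
// Local stand-ins for the two blob-failure errors named in the PR.
class NoBlobBodiesFoundError extends Error {
  name = "NoBlobBodiesFoundError";
}
class BlobDeserializationError extends Error {
  name = "BlobDeserializationError";
}

// Matching on name keeps this sketch independent of the compile target.
function isBlobFailure(err: unknown): err is Error {
  return (
    err instanceof Error &&
    (err.name === "NoBlobBodiesFoundError" || err.name === "BlobDeserializationError")
  );
}

type Outcome<T> = { ingested: T[]; skippedFrom?: number };

async function ingestWithPruneGate<T>(
  checkpointNumbers: number[],
  fetchAndDecode: (checkpointNumber: number) => Promise<T>,
  canPrune: (checkpointNumber: number) => Promise<boolean>, // hypothetical stand-in
): Promise<Outcome<T>> {
  const ingested: T[] = [];
  for (const n of checkpointNumbers) {
    try {
      ingested.push(await fetchAndDecode(n));
    } catch (err) {
      // Anything other than a blob fetch/decode failure propagates unchanged.
      if (!isBlobFailure(err)) throw err;
      // Epoch still provable: the blob is canonical, so stay fatal and retry next poll.
      if (!(await canPrune(n))) throw err;
      // Proof window expired: skip this checkpoint and every later one this iteration.
      return { ingested, skippedFrom: n };
    }
  }
  return { ingested };
}
```

Returning instead of throwing in the prunable case is what allows the caller to finish the sync iteration, so the sync clock can advance and the existing epoch-prune machinery can take over.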