
refactor(e2e): consolidate the multi-node test category#24201

Open
spalladino wants to merge 16 commits into merge-train/spartan-v5 from spl/e2e-consolidation

@spalladino spalladino commented Jun 19, 2026

Contributor

Motivation

The e2e suite under end-to-end/src/ accumulated many ad-hoc base classes with overlapping
responsibilities. The survey (A-1175) settled on consolidating it into a few setup/lifecycle categories —
automine, single-node, multi-node, p2p (+ out-of-scope infra) — each backed by one base class that
owns environment, with domain behavior composed on top. This PR delivers the multi-node category end
to end.

Approach

Built incrementally — one commit per step (see the map below). In order:

  • Category + base hierarchy. Moved the epochs cluster into end-to-end/src/multi-node/ and split its
    base into SingleNodeTestContext (environment, prover lifecycle, epoch/proof/reorg waiters) ←
    MultiNodeTestContext (validator set + committee/gossip helpers). The 14 single-node-topology tests use
    the parent class directly and live under multi-node/single-node/; a deferred pure git mv will later
    promote them to a top-level single-node/ category.
  • Subfolders by setup shape: single-node/, consensus/, prune/, ha/, slashing/.
  • Shared presets + helpers so each test overrides as little as possible: timing profiles
    (FAST_REORG_TIMING / MV_REORG_TIMING / MV_CONSENSUS_TIMING / MBPS_TIMING over a shared
    REORG_TIMING_BASE), buildMockGossipValidators, MOCK_GOSSIP_MULTI_VALIDATOR_OPTS, setupHaPairs,
    slashing-penalty factories; plus intent-revealing wait-helpers (waitForNodeCheckpoint,
    waitForBlockNumber, …) that replace raw retryUntil / event listeners / polling in test bodies.
  • Slashing → mock gossip. Extracted ValidatorRegistrationHarness and converted the
    duplicate-proposal/attestation slashing tests onto the in-memory mock bus (fixing the previously-inert
    mockGossipSubNetwork flag for them).
  • Dedup. Merged near-identical tests into *.parallel files (proving trio, block-building pair,
    slashing pair) so CI still splits per-it; dissolved the 3 standalone MBPS files, folding their 8 unique
    its into consensus/+prune/ and dropping 1 redundant.
  • Env-value consolidation. Converged incidental timing values onto the shared profiles on a
    run-and-fallback basis: each value change was run, and the FAST/MV reorg-eth unification was
    reverted when it broke l1_reorgs.

No production code changes — test infrastructure + CI test-discovery only.

API changes

Test-infrastructure only (no public/product API). EpochsTestContext → MultiNodeTestContext (now
extending the new SingleNodeTestContext); e2e_epochs/ removed, all its tests live under multi-node/**
with the epochs_ prefix dropped from filenames and describe titles; all importers updated.

Commit → scope map (review in order)

Commit Scope
ea168cd Pilot: create multi-node/; EpochsTestContext → MultiNodeTestContext; first shared wait-helpers; 3 pilot tests + README
6b5384f Empty e2e_epochs/ — 24 tests moved in, epochs_ prefix dropped; CI discovery re-pointed
0e060a1 Extract ValidatorRegistrationHarness; 2 slashing tests → mock gossip
71453d1 waitForNodeCheckpoint (bidirectional); 5 reorg/proving tests off raw retryUntil
8831d55 Extract SingleNodeTestContext base + presets (inboxLag:2 default); e2e_slashing/ → slashing/
2a5d416 single-node/ subfolder; merge proving.parallel + node_reorg_recovery
4520d3d consensus/+prune/+ha/ subfolders; MV presets; merge block_building.parallel; setupHaPairs
15e277b Merge duplicate_*_slash → equivocation_slash.parallel; slashing-penalty presets
62148ef Rewrite multi-node/README.md
c1de638 Dissolve 3 mbps* files → consensus/mbps.parallel + prune/pipeline_prune.parallel
1c91e3a Add timing profiles + low-risk env convergences
60547b8 Verify-risk env convergences (all held; reorg-eth unification fell back)
671a3f16a734abe CI fixes: dead test glob; duplicate/illegal-char test titles breaking docker container names
6fbb112 Scoped skip of 2 CI-deterministic l1_reorgs proof-reorg cases (see below)

Review notes

  • CI is green. The trailing fix(e2e) commits resolve CI test-discovery issues the reorg surfaced
    (an empty glob emitting a bogus job; per-it .parallel test titles producing duplicate / illegal docker
    container names) — root-cause fixes, not skips.
  • One coverage caveat (not introduced here): two l1_reorgs proof-reorg cases are skipped because they
    fail deterministically on CI from a pre-existing node bug (DELETE_FORK: Fork not found during
    single-node prune/rebuild) — config is byte-identical to base and the test was already flaky-flagged
    (test(e2e): unskip pipelining related e2e tests #23642). The other 5 cases in the file stay covered; follow-up tracked separately.
  • Scope: multi-node only. The automine / single-node (prod-seq) / p2p rollouts and the
    domain-harness conversions (A-1176 phases 3–6) are follow-up work, not in this PR.
  • The ci-no-fail-fast label was added to surface all failures at once; remove it before merge.

Part of A-1176 (consolidation roadmap) / A-1064.

spalladino and others added 15 commits June 19, 2026 13:29
…ared wait-helpers

First code commit of the e2e suite consolidation. Establishes the `multi-node`
category (N validators on the in-memory mock-gossip bus) by promoting the epochs
base class and extracting the highest-value shared wait-helpers.

Base class:
- Move end-to-end/src/e2e_epochs/epochs_test.ts ->
  end-to-end/src/multi-node/multi_node_test_context.ts and rename
  EpochsTestContext -> MultiNodeTestContext (EpochsTestOpts -> MultiNodeTestOpts).
  All 24 non-migrated epochs tests updated to import from the new location/name.

Migrated pilot tests into end-to-end/src/multi-node/:
- epochs_simple_block_building.test.ts (control)
- epochs_missed_l1_publish.test.ts (helper-extraction target)
- epochs_mbps.parallel.test.ts (stress + prover)

Helpers introduced/extended:
- fixtures/wait_helpers.ts: waitForBlockNumber / waitForProvenBlock, waitForTxs.
- ChainMonitor.waitUntilCheckpointProven (event-driven, mirrors waitUntilCheckpoint).
- MultiNodeTestContext.waitForAllNodes (+ waitForAllNodesToReachProvenCheckpoint /
  waitForAllNodesToReachBlockAtSlot): multi-node fan-out convergence.
- MultiNodeTestContext.findSlotsWithProposers: extracts the EpochNotStable
  slot-search/warp loop.
- MultiNodeTestContext.waitForSequencerEvent: one-shot sequencer-event waiter.

The migration changes only setup wiring and async-waiting style; no test assertions
were changed. Adds end-to-end/src/multi-node/README.md.
…s_ prefix

Relocates all 24 e2e_epochs/*.test.ts files into the multi-node/ category folder
established by the pilot commit, and drops the redundant epochs_ prefix from both
the filenames and the describe titles (now multi-node/<name>). Also retroactively
renames the 3 pilot files already in multi-node/ for consistency. e2e_epochs/ is
now empty and removed.

Rewires CI test discovery in end-to-end/bootstrap.sh: the epochs globs are replaced
with src/multi-node/!(long_proving_time).test.ts (plus the long_proving_time special
case), and the NAME derivation now strips the src/ prefix so it produces multi-node/
names for the new folder while leaving e2e_<dir>/<file> names unchanged. Re-points the
four .test_patterns.yml entries that referenced e2e_epochs paths / epochs_ test names to
the new multi-node/ paths.

No test assertions change: only file location, titles, and import/CI wiring.

Extracts a ValidatorRegistrationHarness (multi-node/validator_registration_harness.ts)
that composes MultiNodeTestContext: it registers a validator set on the in-memory
mock-gossip bus via initialValidators (genesis staking + validator-set-lag advance,
the mock-gossip replacement for P2PNetworkTest's MultiAdder/GSE flow), exposes the
per-validator keys/addresses, spawns validator nodes on the mock bus, and resolves the
slasher/slashing-proposer L1 contracts.

Converts the duplicate_attestation_slash and duplicate_proposal_slash tests from
P2PNetworkTest (which ran REAL libp2p despite mockGossipSubNetwork:true — the flag was
inert because setup_p2p_test.createNode never passed a p2pServiceFactory) to genuine
mock gossip, and relocates them into multi-node/e2e_slashing/. The generic offense/
proposer helpers in e2e_p2p/shared.ts are reused unchanged via cross-folder import.
No test assertions change — only the gossip transport, registration wiring, and location.

Wires src/multi-node/e2e_slashing/*.test.ts into CI discovery (end-to-end/bootstrap.sh)
and re-points the duplicate_proposal_slash flaky entry in .test_patterns.yml to the new path.
…ests to intent-revealing helpers

Add waitForNodeCheckpoint / waitForNodeProvenCheckpoint to fixtures/wait_helpers.ts,
a single-node checkpoint-number convergence wait supporting eq/gte/gt/lte/lt comparisons
(reorg/prune tests wait for the number to drop, not just rise). Convert l1_reorgs,
optimistic_proving, manual_rollback, partial_proof, and proof_public_cross_chain to the
new and existing helpers (waitForBlockNumber/waitForProvenBlock/waitUntilProvenCheckpointNumber),
removing hand-rolled retryUntil checkpoint/block polls and local getCheckpointNumber helpers.
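
The comparison modes described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the helper that lives in fixtures/wait_helpers.ts: `CheckpointSource`, `compareCheckpoint`, `waitForCheckpoint`, and the timeout/interval parameters are hypothetical names chosen for the example.

```typescript
// Hypothetical sketch: names and signatures are illustrative, not the real wait_helpers.ts API.
type Comparison = 'eq' | 'gte' | 'gt' | 'lte' | 'lt';

/** Anything that can report its current checkpoint number (assumed shape). */
interface CheckpointSource {
  getCheckpointNumber(): Promise<number>;
}

/** Pure predicate: does `actual` satisfy `op` relative to `target`? */
function compareCheckpoint(op: Comparison, actual: number, target: number): boolean {
  switch (op) {
    case 'eq':
      return actual === target;
    case 'gte':
      return actual >= target;
    case 'gt':
      return actual > target;
    case 'lte':
      return actual <= target;
    case 'lt':
      return actual < target;
  }
}

/** Polls until the checkpoint number satisfies the comparison, or throws on timeout. */
async function waitForCheckpoint(
  source: CheckpointSource,
  target: number,
  op: Comparison = 'gte',
  timeoutMs = 30000,
  intervalMs = 500,
): Promise<number> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const actual = await source.getCheckpointNumber();
    if (compareCheckpoint(op, actual, target)) {
      return actual;
    }
    if (Date.now() >= deadline) {
      throw new Error(`Timeout waiting for checkpoint ${op} ${target} (last seen ${actual})`);
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}
```

A reorg or prune test would pass 'lte' or 'lt' to wait for the number to drop, while a forward-progress test would keep the 'gte' default.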
… setup presets

Carve the prod-sequencer single-node lifecycle out of MultiNodeTestContext into a
new SingleNodeTestContext parent (environment, node spawning, prover lifecycle, and
the epoch/checkpoint/proof-window/reorg waiters), leaving MultiNodeTestContext with
only the validator-node spawning and committee-convergence helpers. Set inboxLag: 2
as a base default (the intended value when pipelining). Add shared presets/helpers:
buildMockGossipValidators, MOCK_GOSSIP_MULTI_VALIDATOR_OPTS, FAST_REORG_TIMING,
defaultSlashingPenalties/withOnlyOffense. Rename multi-node/e2e_slashing -> slashing
(git mv + describe titles + .test_patterns.yml + bootstrap.sh discovery). All 29
tests still import MultiNodeTestContext and pass unchanged.
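
The shape of the split can be pictured with the sketch below. It only illustrates the parent/child layering and the inboxLag base default; the members (`createNode`, `createValidatorNodes`, the string node handles, the `opts` shape) are placeholders invented for the example, not the real classes' API.

```typescript
// Hypothetical sketch of the split; members are placeholders, not the real classes' API.
class SingleNodeTestContext {
  protected readonly nodes: string[] = [];

  constructor(public readonly opts: { inboxLag?: number } = {}) {}

  /** Base default described in the commit: inboxLag is 2 unless a test overrides it. */
  get inboxLag(): number {
    return this.opts.inboxLag ?? 2;
  }

  /** Environment-level concern: spawn a node and track it. */
  createNode(name: string): string {
    this.nodes.push(name);
    return name;
  }
}

class MultiNodeTestContext extends SingleNodeTestContext {
  /** Validator-set concern layered on top of the parent's node spawning. */
  createValidatorNodes(count: number): string[] {
    return Array.from({ length: count }, (_, i) => this.createNode(`validator-${i}`));
  }
}
```

Because the child only adds behavior, a test that imports MultiNodeTestContext still gets everything the parent provides, which is consistent with the tests passing unchanged.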
…SingleNodeTestContext

Relocate the 14 single-node-topology tests into multi-node/single-node/ and switch
them onto SingleNodeTestContext. Apply the FAST_REORG_TIMING preset to l1_reorgs and
the six optimistic_proving reorg blocks; drop now-redundant explicit inboxLag: 2.
Merge the proving trio (multiple + empty_blocks_proof + long_proving_time) into
single-node/proving.parallel.test.ts (per-it setup, keeping multiple's world-state-
prune assertion as its own it). Co-locate partial_proof (its unique startProof path
preserved) with partial_proof_multi_root, and manual_rollback with sync_after_reorg
(node_reorg_recovery.test.ts, flattening the redundant nested describe). Update
.test_patterns.yml + bootstrap.sh CI discovery for the new subfolder (proving.parallel
keeps the 15m timeout the long-proving scenario needs).
…bfolders

Create consensus/, prune/, ha/ under multi-node/ and relocate the multi-validator tests,
adopting buildMockGossipValidators + MOCK_GOSSIP_MULTI_VALIDATOR_OPTS to drop the
copy-pasted validator-builder block and the shared mock-gossip setup cluster. Merge
simple_block_building + high_tps_block_building into consensus/block_building.parallel
(per-it setup). Keep missed_l1_publish + orphan_block_prune as two prune/ files,
reconciling orphan_block_prune's hand-rolled slot loop onto findSlotsWithProposers.
Extract setupHaPairs for the two ha/ tests' shared pair wiring (ha_sync keeps its
initial sequencer). Move equivocation + invalidate_block.parallel into slashing/.
Add CI discovery globs for the new subfolders.

Merge duplicate_attestation_slash + duplicate_proposal_slash into
slashing/equivocation_slash.parallel.test.ts (shared harness opts, per-it setup),
preserving the proposal test's all-node offense poll + proposer-for-slot check verbatim.
Apply withOnlyOffense('slashDuplicateProposalPenalty') to equivocation (replacing its
~9-line manual penalty zero-out) and MOCK_GOSSIP_MULTI_VALIDATOR_OPTS +
buildMockGossipValidators to equivocation and invalidate_block (penalties left explicit
where the test relies on config defaults). Update .test_patterns.yml flake entry to the
merged file. The remaining ~10 e2e_p2p slashing/sentinel conversions are deferred: they
run real libp2p (P2PNetworkTest/createNodes) and the heaviest ones stub libp2p internals
the mock bus cannot reproduce — genuine work, not a folder move.
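
For readers unfamiliar with the penalty presets, the idea behind withOnlyOffense can be sketched as below. Only the key 'slashDuplicateProposalPenalty' appears in this PR; the other keys, the numeric amounts, and the exact signatures are assumptions made for illustration.

```typescript
// Hypothetical sketch: only 'slashDuplicateProposalPenalty' is named in the PR;
// the other keys and all amounts are placeholders.
type SlashingPenalties = {
  slashDuplicateProposalPenalty: number;
  slashDuplicateAttestationPenalty: number;
  slashInactivityPenalty: number;
};

function defaultSlashingPenalties(): SlashingPenalties {
  return {
    slashDuplicateProposalPenalty: 10,
    slashDuplicateAttestationPenalty: 10,
    slashInactivityPenalty: 10,
  };
}

/** Zeroes every penalty except the one under test. */
function withOnlyOffense(
  offense: keyof SlashingPenalties,
  base: SlashingPenalties = defaultSlashingPenalties(),
): SlashingPenalties {
  const result = { ...base };
  for (const key of Object.keys(result) as (keyof SlashingPenalties)[]) {
    if (key !== offense) {
      result[key] = 0;
    }
  }
  return result;
}
```

A single call such as `withOnlyOffense('slashDuplicateProposalPenalty')` then stands in for a manual block that sets each unrelated penalty to zero.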
…lder map

Document the SingleNodeTestContext -> MultiNodeTestContext split, the shared presets/helpers
(FAST_REORG_TIMING, buildMockGossipValidators, MOCK_GOSSIP_MULTI_VALIDATOR_OPTS,
defaultSlashingPenalties/withOnlyOffense, setupHaPairs), the single-node/consensus/prune/ha/
slashing subfolder map, the deferred top-level single-node category move, and that the MBPS
dissolution is pending review. CI discovery for all subfolders was added in the prior commits;
this confirms the mbps files stay discovered.

Dissolves the three MBPS test files in the multi-node category, relocating
every unique assertion verbatim:

- consensus/mbps.parallel.test.ts: the proposed-anchor monotonicity, L2->L1,
  L1->L2, non-validator re-exec/cold-sync, and deploy+call sub-slot ordering
  its from mbps.parallel; the proposer-pipelining offset + blob-promotion it
  from mbps.pipeline.parallel; and both redistribution its (budget
  redistribution kept verbatim, multiplier asymmetry) from mbps_redistribution.
- prune/pipeline_prune.parallel.test.ts: the prune-on-skip-publish-under-
  pipelining it from mbps.pipeline.parallel.

mbps.parallel it 1 (checkpointed-anchored MBPS block-count + proven) is dropped
as redundant: the block-count and proven-checkpoint properties are already
covered by consensus/block_building/high_tps.

assertProposerPipelining is lifted onto SingleNodeTestContext (with a shared
BlockProposedEvent type) so both relocated pipelining its reuse it. test_patterns
flake entries for the prune and redistribution its move to their new paths.

This is a behavior-preserving relocation, revertible as a unit.
…isk convergences

Introduce four named timing-only profiles in single_node_test_context.ts (re-exported
from multi_node_test_context.ts) to collapse the byte-identical timing clusters copied
across the multi-node category, plus a shared REORG_TIMING_BASE factored out of the reorg
profiles:

- REORG_TIMING_BASE = { aztecSlotDuration: 36, blockDurationMs: 8000, aztecEpochDuration: 4 }
- FAST_REORG_TIMING = base + { ethereumSlotDuration: 4, anvilSlotsInAnEpoch: 32 } (single-node reorg)
- MV_REORG_TIMING = base + { ethereumSlotDuration: 6, attestationPropagationTime: 0.5 } (MV prune/HA/equivocation)
- MV_CONSENSUS_TIMING = { ethereumSlotDuration: 12, aztecSlotDurationInL1Slots: 3, blockDurationMs: 6000 }
- MBPS_TIMING = { ethereumSlotDuration: 12, aztecSlotDuration: 72, blockDurationMs: 5500, aztecEpochDuration: 4, perBlockAllocationMultiplier: 8, aztecTargetCommitteeSize: 3 }

Profiles are spread BEFORE per-test overrides everywhere so per-test knobs win.
MV_CONSENSUS_TIMING keeps aztecSlotDurationInL1Slots:3 (not an explicit 36) to preserve
eth-coupling. MBPS_TIMING's JSDoc carries the A-914 rationale.
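
The spread-before-overrides rule can be illustrated with the reorg profiles. The profile values below are copied from the list above; `testOpts` and its override are a hypothetical per-test options object, not taken from any test in this PR.

```typescript
// Values copied from the commit message; testOpts is an illustrative per-test options object.
const REORG_TIMING_BASE = { aztecSlotDuration: 36, blockDurationMs: 8000, aztecEpochDuration: 4 };

const FAST_REORG_TIMING = { ...REORG_TIMING_BASE, ethereumSlotDuration: 4, anvilSlotsInAnEpoch: 32 };
const MV_REORG_TIMING = { ...REORG_TIMING_BASE, ethereumSlotDuration: 6, attestationPropagationTime: 0.5 };

// The profile is spread first, so the explicit key written after it wins.
const testOpts = {
  ...MV_REORG_TIMING,
  aztecEpochDuration: 8, // hypothetical per-test knob overriding the profile's 4
};
```

Object spread resolves duplicate keys in source order, which is why placing the profile first keeps each test's explicit knobs authoritative.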

Reorg-profile unification attempt (FAST_REORG eth 4 -> 6 to share one L1 cadence with
MV_REORG): FELL BACK. Under eth=6 the single-node l1_reorgs suite times out on its
proof-removal and proof-restore reorg assertions (TimeoutError: checkpoint proven eq 0 /
checkpointed lte 1) because the longer L1 slot shifts the proof-submission window; 5/7
cases passed but those two failed. FAST_REORG stays at ethereumSlotDuration:4 (the shared
REORG_TIMING_BASE is kept either way). MV_REORG stays at eth=6.

Low-risk naming convergences (timing already equals the profile; byte-identical relocations):
- consensus/block_building simple + high_tps -> MV_CONSENSUS_TIMING (high_tps is 36s per Codex
  correction; keeps fakeProcessingDelayPerTxMs + attestationPropagationTime:1)
- consensus/first_slot -> MV_CONSENSUS_TIMING (keeps epoch 32, committee, propagation, polling)
- consensus/proof_at_boundary -> MV_CONSENSUS_TIMING
- prune/orphan_block_prune + prune/missed_l1_publish -> MV_REORG_TIMING
- consensus/mbps setupMbps + setupPipeline -> MBPS_TIMING (redistribution KEPT on its own 4s/36s/6s timing)
- prune/pipeline_prune -> MBPS_TIMING

Verify-risk convergence in this commit:
- slashing/equivocation -> MV_REORG_TIMING (timing byte-identical; run because of slashing
  offense-detection timing). HELD: test passed (DUPLICATE_PROPOSAL offense detected, chain healed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…iles

Verify-risk value convergences for the multi-node category. Every change alters a value
(not pure naming); each affected test was run individually and sequentially. All HELD —
no fallbacks.

- ha/ha_sync: ethereumSlotDuration 4 -> 6 (adopt MV_REORG_TIMING). Keeps numberOfAccounts:1,
  min/maxTxs, pxeOpts.syncChainTip:'proposed', and the absence of skipInitialSequencer.
  HELD: 1/1 passed (23.7s).
- ha/ha_checkpoint_handoff: adopt MV_REORG_TIMING + aztecEpochDuration 8 -> 4. The
  same-HA-pair consecutive-slot finder did not starve at epoch=4 (found slots 17/18 without
  excess EpochNotStable warping). HELD: 1/1 passed (95s).
- single-node/missed_l1_slot: converge the 48s slot to 36 via ethereumSlotDuration:6 +
  aztecSlotDurationInL1Slots:6 (6*6=36), preserving the load-bearing 6-L1-slots-per-L2-slot
  invariant. Keeps perBlockAllocationMultiplier:8, useHardcodedAccount, mockGossipSubNetwork,
  blockDurationMs:8000. The bug-fix INITIALIZING_CHECKPOINT assertion and the
  account-fitting math hold at the 36s slot. HELD: 1/1 passed (77s).
- slashing/invalidate_block: aztecSlotDuration 32 -> 36 (eth8/block6000 unchanged); keeps the
  6-validator topology, anvil ports, slashing-round config, and committee-invalidation delay.
  The multi-block invalidation timetable holds. HELD: 9/9 passed.
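
The slot arithmetic behind the missed_l1_slot change can be checked with a small sketch. The helper name `aztecSlotSeconds` is hypothetical; it assumes, as the 6*6=36 note suggests, that the L2 slot length is the product of the L1 slot length and the L1-slots-per-L2-slot ratio.

```typescript
/** L2 slot length in seconds, assuming it is L1 slot length times L1 slots per L2 slot. */
function aztecSlotSeconds(ethereumSlotDuration: number, aztecSlotDurationInL1Slots: number): number {
  return ethereumSlotDuration * aztecSlotDurationInL1Slots;
}

// missed_l1_slot after this commit: 6s L1 slots, 6 L1 slots per L2 slot.
const missedL1SlotSeconds = aztecSlotSeconds(6, 6);

// MV_CONSENSUS_TIMING from the profiles commit: 12s L1 slots, 3 L1 slots per L2 slot.
const consensusSlotSeconds = aztecSlotSeconds(12, 3);
```

Both configurations land on the same 36s L2 slot while keeping different L1-slots-per-L2-slot ratios, which is the property the missed_l1_slot test depends on.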

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…I job

No test files live directly under src/multi-node/ (all are in subfolders),
so the src/multi-node/*.test.ts glob matched nothing. nullglob is not set in
ci3/source_options, so the unmatched pattern leaked as a literal into the test
command list, producing a job named multi-node/* that ran docker with
--name 'multi-node_*' and failed with an invalid-container-name error (exit 125).
The per-subfolder globs already cover every real multi-node test.

Both describes in block_building.parallel.test.ts had an
it('builds blocks without any errors'). CI's .parallel discovery
(bootstrap.sh extract_test_names) keys one isolated job per it title, so the
duplicate produced two byte-identical jobs sharing the same docker container
name; the second collided ("container is marked for removal and cannot be
started") and the test never ran. Rename to 'builds simple/high-tps blocks
without any errors' so each job has a unique NAME and its testNamePattern
selects exactly one describe.
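
A guard against this class of collision could look like the sketch below. It is an illustration only: `findDuplicateTitles` is a hypothetical helper, not part of bootstrap.sh or the test suite, and it assumes the it titles of a .parallel file have already been extracted into an array.

```typescript
/** Returns every title that appears more than once, in first-seen order. */
function findDuplicateTitles(titles: string[]): string[] {
  const counts = new Map<string, number>();
  for (const title of titles) {
    counts.set(title, (counts.get(title) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count > 1)
    .map(([title]) => title);
}
```

Run against the titles of a .parallel file, a non-empty result would flag two jobs that are about to share one container name.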
The per-it CI job NAME for a .parallel.test.ts is derived from the test
title, and becomes the docker container name via docker_isolate. The old
sanitizer only replaced spaces, so a title with parentheses (e.g. the mbps
multiplier-asymmetry it: "... (no fair-share re-execution)") produced an
invalid container name and 'docker run --name' rejected it with exit 125
before the test could run, failing the whole fast CI run via fail-fast.

Collapse every character outside docker's allowed set [a-zA-Z0-9_.-] to an
underscore in both the simple and compat parallel-test command builders.
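
The character-class rule can be expressed in a few lines. The actual fix lives in the shell command builders; the TypeScript below is only a sketch of the same substitution, with `sanitizeContainerName` as a hypothetical name. It replaces each disallowed character individually rather than merging runs.

```typescript
/** Replaces each character outside [a-zA-Z0-9_.-] with an underscore. */
function sanitizeContainerName(title: string): string {
  return title.replace(/[^a-zA-Z0-9_.-]/g, '_');
}

// A title with spaces and parentheses, similar in shape to the one that broke CI.
const sanitized = sanitizeContainerName('asymmetry (no fair-share re-execution)');
```

Hyphens, dots, and underscores pass through untouched, so names that were already valid are unchanged.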
@spalladino spalladino added the ci-no-fail-fast label (sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure) Jun 20, 2026
The proof-removal and proof-restore cases in l1_reorgs.parallel assert a transient
post-prune node tip (proven == 0 / checkpointed <= 1) in the single-node self-building
topology. After the proof is reorged out and the node prunes, its own sequencer rebuilds
the pending chain immediately and a DELETE_FORK "Fork not found" world-state inconsistency
disturbs the tip read, so under CI's L1 cadence the asserted value is never observed within
the wait window. Both the original run and the flake retry time out, which makes the flaky
mechanism hard-fail (it only tolerates fail-then-pass). The cases pass locally at the faster
L1=4 cadence.

This is pre-existing timing fragility, not introduced by the multi-node move: the file was
previously skip:true (skip lifted in #23642) and FAST_REORG_TIMING resolves byte-identically
to the prior eth4/slot36/block8000/epoch4/anvil32 config. Scoped per-it so the other five
cases in the file remain covered.
@spalladino spalladino changed the title from "feat(e2e): pilot multi-node category (MultiNodeTestContext + helpers)" to "refactor(e2e): consolidate the multi-node test category" Jun 20, 2026