test(cardano): epoch-boundary resume reproductions + import-path evaluation (#1018) by scarmuega · Pull Request #1019 · txpipe/dolos

scarmuega · 2026-06-18T12:52:02Z

What

Adds tests/boundary_resume.rs, a self-contained, deterministic harness that crosses real epoch boundaries with actual PoolState/AccountState/EpochState, interrupts the boundary in several ways, and compares the resumed ledger against an uninterrupted run (epoch, per-entity snapshot epochs, and pots) — the reproduction called for in #1018's acceptance criteria, plus a detailed evaluation of the import crash/resume path.

It uses a shrunken-but-coherent genesis (epoch_length=100, byron k=1, f=0.05) so the randomness/stability windows land inside the epoch and RUPD/EWRAP/ESTART fire after a few hundred synthetic blocks, driven through ToyDomain.
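As a rough sanity check on these parameters, the sketch below computes the two window lengths commonly quoted for Shelley-era chains, 3k/f and 4k/f slots, with integer arithmetic. The formulas, the `ActiveSlotCoeff` type and `window_slots` are assumptions for illustration and are not taken from the harness; its genesis builder may derive the windows differently.

```rust
/// Active slot coefficient as an exact fraction, to avoid float rounding.
#[derive(Clone, Copy)]
struct ActiveSlotCoeff {
    num: u64,
    den: u64,
}

/// Window of `factor * k / f` slots (assumed formula; hypothetical helper).
fn window_slots(factor: u64, k: u64, f: ActiveSlotCoeff) -> u64 {
    // Ceiling division so a fractional result rounds up to a whole slot.
    (factor * k * f.den).div_ceil(f.num)
}

fn main() {
    let epoch_length: u64 = 100;
    let k = 1;
    let f = ActiveSlotCoeff { num: 5, den: 100 }; // f = 0.05

    let stability = window_slots(3, k, f);
    let randomness = window_slots(4, k, f);
    println!("stability window:  {stability} slots");
    println!("randomness window: {randomness} slots");

    // Both windows fit inside a single 100-slot epoch.
    assert!(stability < epoch_length && randomness < epoch_length);
}
```

Under these assumptions the windows come out at 60 and 80 slots, both shorter than the 100-slot epoch.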

Tests

Test	Result	Meaning
`baseline_clean_run_crosses_boundaries_aligned`	✅	De-risk: a clean run crosses ≥2 boundaries with a registered pool, every entity aligned to `EpochState.number`.
`import_crash_mid_estart_resumes_to_identical_state`	✅	Crash mid-ESTART (commit shards 0..16, no finalize) → resume → byte-identical (incl. pots). ESTART shard resume is safe (shard-skip via `estart_progress` + atomic, guarded finalize).
`rollback_across_boundary_reapplies_to_identical_state`	🔴 `#[ignore]`	Bug 1 — reproduces #1018.
`import_crash_ewrap_finalize_window_resumes_to_identical_state`	🔴 `#[ignore]`	Bug 2 — reproduces #1018.

Both #[ignore]d tests assert the post-fix behavior (byte-identical resume), so CI stays green and they flip to passing once fixed. They run under --ignored.
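A minimal sketch of the comparison these tests rely on, assuming a fingerprint struct with derived equality. The `Fingerprint` type, its field names and the `sample` helper are hypothetical; they only mirror what this description says is compared (ledger epoch, per-entity snapshot epochs, pots).

```rust
use std::collections::BTreeMap;

/// Hypothetical state fingerprint; not the harness's actual type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Fingerprint {
    ledger_epoch: u64,
    /// Entity id -> epoch its snapshot is aligned to.
    snapshot_epochs: BTreeMap<String, u64>,
    reserves: u64,
    treasury: u64,
}

/// Builds a one-pool fingerprint with fixed, made-up pot values.
fn sample(ledger_epoch: u64, pool_epoch: u64) -> Fingerprint {
    let mut snapshot_epochs = BTreeMap::new();
    snapshot_epochs.insert("pool1".to_string(), pool_epoch);
    Fingerprint {
        ledger_epoch,
        snapshot_epochs,
        reserves: 1_000,
        treasury: 10,
    }
}

fn main() {
    let clean = sample(2, 2);
    let resumed_ok = sample(2, 2);
    let resumed_bad = sample(2, 1); // pool snapshot lags the ledger epoch

    // A resumed run passes only if every field matches the clean run.
    assert_eq!(clean, resumed_ok);
    assert_ne!(clean, resumed_bad);
    println!("fingerprints compared");
}
```

Deriving `PartialEq` over the whole struct means any drift in a single field fails the comparison, which is the property the ignored tests are waiting on.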

Bug 1 — rollback across a boundary (sync path)

Cross the slot-100 boundary into epoch 1, roll back to slot 80 (epoch 0), then re-apply to tip:

- After the rollback, EpochState.number is still 1 and the pool snapshot is still @1. The boundary transition was not reverted because boundary transitions are not written to the WAL: only ROLL block deltas are, while EWRAP/RUPD/ESTART use the default no-op commit_wal, and rollback in core/sync.rs only undoes WAL deltas.
- Re-applying re-fires the boundary and double-advances every entity to epoch 4 vs the correct 2, silently (no guard/error).

This is the same WAL/rollback asymmetry behind the reported pool-snapshot lag; here the state ends up ahead of the correct epoch (a lead) rather than behind it (a lag). Fix: make boundary transitions reversible on rollback (WAL-backed or re-derived).
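The asymmetry can be illustrated with a toy model, assuming a ledger whose WAL records block deltas only. `ToyLedger` and its methods are hypothetical and far simpler than the real sync path; the slots and epoch numbers are chosen for the toy, not copied from the test.

```rust
const EPOCH_LENGTH: u64 = 100;

/// Toy ledger: `tip` is the last applied slot, `epoch` the stored epoch number.
#[derive(Debug, Default)]
struct ToyLedger {
    tip: u64,
    epoch: u64,
    /// WAL of block deltas only: each entry is the tip before that block.
    wal: Vec<u64>,
}

impl ToyLedger {
    fn roll_forward(&mut self, slot: u64) {
        // Boundary work fires when a block lands in a later epoch than the tip.
        // It mutates `epoch` directly and records nothing in the WAL.
        if slot / EPOCH_LENGTH > self.tip / EPOCH_LENGTH {
            self.epoch += 1;
        }
        self.wal.push(self.tip);
        self.tip = slot;
    }

    fn rollback_to(&mut self, slot: u64) {
        // Only WAL deltas are undone; `epoch` is left untouched.
        while self.tip > slot {
            self.tip = self.wal.pop().expect("one WAL entry per applied block");
        }
    }
}

/// Applies blocks at slots 80 and 120, optionally rolling back to 80 in between.
fn run(with_rollback: bool) -> u64 {
    let mut ledger = ToyLedger::default();
    ledger.roll_forward(80);
    ledger.roll_forward(120);
    if with_rollback {
        ledger.rollback_to(80);
        ledger.roll_forward(120);
    }
    ledger.epoch
}

fn main() {
    println!("clean run epoch:      {}", run(false)); // 1
    println!("after rollback epoch: {}", run(true)); // 2
}
```

In the toy, recording the previous `epoch` alongside each WAL entry would be enough to restore it on rollback; whether the real fix is WAL-backed or re-derived is left open above.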

Bug 2 — import crash in the EWRAP finalize window

Found while evaluating import crash/resume safety across the full RUPD → EWRAP → ESTART boundary.

- The last EWRAP shard sets ewrap_progress.committed == total (EWrapProgress::apply).
- EpochWrapUpV3, which assembles the final EndStats and rotates rolling/pparams, commits separately and does not touch ewrap_progress.
- On restart, EwrapWorkUnit::initialize reads committed == total → is_complete() → skips both shards and finalize.
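The three steps above can be sketched as a toy state machine. Everything here is illustrative: `Persisted`, its fields and the `gate_on_finalize` switch are made-up names, and the second mode models a marker written by the finalize step itself rather than by the last shard.

```rust
/// Toy EWRAP state that survives a restart (field names are illustrative).
#[derive(Debug, Default, Clone)]
struct Persisted {
    committed: u32,
    total: u32,
    /// Written only by the finalize step.
    finalized: bool,
    /// Stands in for the effects of finalize (final stats, rotation).
    end_stats_ready: bool,
}

fn commit_shard(state: &mut Persisted) {
    state.committed += 1;
}

fn finalize(state: &mut Persisted) {
    state.end_stats_ready = true;
    state.finalized = true;
}

/// Resume after a restart.
/// `gate_on_finalize = false`: complete when `committed == total`.
/// `gate_on_finalize = true`: complete only once finalize has run.
fn resume(state: &mut Persisted, gate_on_finalize: bool) {
    let complete = if gate_on_finalize {
        state.finalized
    } else {
        state.committed == state.total
    };
    if complete {
        return; // skips remaining shards and finalize
    }
    while state.committed < state.total {
        commit_shard(state);
    }
    finalize(state);
}

/// Commits every shard, "crashes" before finalize, then resumes.
/// Returns whether the finalize effects exist afterwards.
fn crash_in_finalize_window(gate_on_finalize: bool) -> bool {
    let mut state = Persisted { total: 4, ..Default::default() };
    for _ in 0..4 {
        commit_shard(&mut state);
    }
    resume(&mut state, gate_on_finalize);
    state.end_stats_ready
}

fn main() {
    println!("shard-count marker: finalized = {}", crash_in_finalize_window(false));
    println!("finalize marker:    finalized = {}", crash_in_finalize_window(true));
}
```

With the shard-count marker the resumed run never finalizes; with a marker owned by finalize, the resume skips the already-committed shards and still runs finalize.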

So a crash in [last EWRAP shard committed → before EpochWrapUpV3 commits] permanently skips EpochWrapUpV3 on resume. ESTART's define_new_pots then reads a non-finalized `end` (default epoch_incentives) and un-rotated rolling stats, so the epoch's monetary expansion is never applied:

Clean: reserves 14,982,005,400,001,109, treasury 17,994,600,000,277
Crashed: reserves 14,991,000,000,000,000, treasury 9,000,000,000,000

→ reserves ~8.99e12 too high, treasury ~8.99e12 too low — exactly the skipped epoch-1 incentive split, silently (no guard/error). The off-by-one: the finalize-skip marker is advanced by the last shard, one commit before finalize runs. Fix: gate EWRAP's is_complete on a marker set by EpochWrapUpV3 itself.
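The two deltas can be checked directly from the figures quoted above. The constants are copied from this description; the function names are only for illustration.

```rust
const CLEAN_RESERVES: u64 = 14_982_005_400_001_109;
const CRASHED_RESERVES: u64 = 14_991_000_000_000_000;
const CLEAN_TREASURY: u64 = 17_994_600_000_277;
const CRASHED_TREASURY: u64 = 9_000_000_000_000;

/// How much higher reserves are after the crashed run.
fn reserves_excess() -> u64 {
    CRASHED_RESERVES - CLEAN_RESERVES
}

/// How much lower the treasury is after the crashed run.
fn treasury_shortfall() -> u64 {
    CLEAN_TREASURY - CRASHED_TREASURY
}

fn main() {
    println!("reserves excess:    {}", reserves_excess()); // 8994599998891
    println!("treasury shortfall: {}", treasury_shortfall()); // 8994600000277
}
```

Both deltas round to 8.99e12. They differ from each other by 1,386 units, which this arithmetic alone does not account for.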

Import-path evaluation summary

- RUPD re-run is safe: PendingRewardState writes are overwrite-by-key, the recompute is deterministic (RUPD fires before its trigger block rolls), finalize overwrites.
- ESTART crash + resume is safe: shard-skip + atomic, guarded (EpochBoundaryIncomplete) finalize; no is_complete skip.
- EWRAP has the finalize-window gap above (Bug 2).
- Graceful forced stop is safe (it lands only after ESTART finalize, at a fully-closed boundary). An ungraceful kill mid-boundary hits the EWRAP window.
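Why overwrite-by-key plus a deterministic recompute makes a re-run safe can be shown with a small sketch. The store is a plain `HashMap` and the reward formula is invented; neither reflects the real PendingRewardState or reward calculation.

```rust
use std::collections::HashMap;

/// Deterministic toy "reward computation": depends only on its inputs.
fn compute_rewards(stakes: &[(&str, u64)]) -> Vec<(String, u64)> {
    stakes
        .iter()
        .map(|(account, stake)| (account.to_string(), stake / 100))
        .collect()
}

/// Writes pending rewards keyed by account; an existing key is overwritten.
fn run_rupd(store: &mut HashMap<String, u64>, stakes: &[(&str, u64)]) {
    for (account, reward) in compute_rewards(stakes) {
        store.insert(account, reward);
    }
}

/// Compares one uninterrupted run against a partial run followed by a full re-run.
fn rerun_matches_single_run() -> bool {
    let stakes = [("acct_a", 5_000), ("acct_b", 12_300)];

    let mut once = HashMap::new();
    run_rupd(&mut once, &stakes);

    let mut crashed = HashMap::new();
    run_rupd(&mut crashed, &stakes[..1]); // "crash" after the first write
    run_rupd(&mut crashed, &stakes); // full re-run on restart

    once == crashed
}

fn main() {
    println!("re-run matches single run: {}", rerun_matches_single_run());
}
```

The same re-run would not be safe if writes accumulated (for example `+=` on an existing entry), since the first account would then be credited twice.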

Notes

Runs in --release (matching the epoch_pots convention). The synthetic block generator funds tx fees and registration deposits from un-pot-backed custom_utxos, which trips the unrelated pots.is_consistent debug_assert at a boundary — a monetary-accounting invariant orthogonal to the snapshot-rotation logic under test. Debug mode skips with a message. Follow-up to make the generator boundary-safe: Synthetic block generator conjures value (unfunded fees/deposits) — breaks pots invariant across epoch boundaries #1020.

cargo test --release --test boundary_resume                 # baseline + ESTART-resume (green); Bug 1 & 2 ignored
cargo test --release --test boundary_resume -- --ignored    # runs the #1018 reproductions (fail, by design)

🤖 Generated with Claude Code

…1018)

Adds `tests/boundary_resume.rs`, a self-contained, deterministic harness that crosses real epoch boundaries with actual PoolState/AccountState/EpochState and interrupts the boundary two ways, classifying lead vs lag.

Uses a shrunken-but-coherent genesis (epoch_length=100, byron k=1, f=0.05) so the randomness/stability windows land inside the epoch and RUPD/EWRAP/ESTART fire after a few hundred synthetic blocks, driven through ToyDomain.

Tests:
- baseline_clean_run_crosses_boundaries_aligned (passes): a clean run crosses >=2 boundaries with every entity aligned to EpochState.number.
- import_crash_mid_estart_resumes_to_identical_state (passes): crash mid-ESTART (commit shards 0..16, no finalize) then resume yields byte-identical state. Refutes the import resume path as the lag source.
- rollback_across_boundary_reapplies_to_identical_state (#[ignore], reproduces #1018): rollback across a boundary does NOT revert the boundary transition (EpochState.number stays advanced) because boundary transitions are not in the WAL; re-applying re-fires the boundary and double-advances every entity, silently. Un-ignore once boundary transitions are reversible on rollback.

Runs in --release: the synthetic generator funds tx fees and registration deposits from un-pot-backed custom_utxos, which trips the unrelated pots.is_consistent debug_assert at a boundary (orthogonal to the snapshot-rotation logic under test). Debug mode skips with a message. Follow-up tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-18T12:52:15Z

Walkthrough

A new Cargo integration test target, boundary_resume, is registered in Cargo.toml and points to tests/boundary_resume.rs. That file implements a deterministic test harness for issue #1018. It provides synthetic genesis and block world setup, epoch fingerprinting and drift classification utilities, and import lifecycle and crash/resume simulation helpers, plus four tests: a baseline clean run, crash/resume idempotency verification (Scenario A), rollback defect reproduction (Scenario B, ignored), and the EWRAP finalize-window edge case (Scenario C, ignored).

Changes

Epoch-boundary resume/rollback integration test

Layer / File(s)	Summary
Test registration and synthetic world setup (`Cargo.toml`, `tests/boundary_resume.rs`)	Registers the `boundary_resume` Cargo test target and defines the release-only gate, a tiny coherent genesis builder with reduced epoch length, and tuned protocol parameters that force epoch boundaries within the test window.
Fingerprinting and drift classification utilities (`tests/boundary_resume.rs`)	Introduces per-ledger and per-entity epoch fingerprint computation and a drift classifier that labels pool snapshot and account stake epochs as lag/lead/genesis relative to the ledger epoch, with human-readable reporting.
Import lifecycle helpers and crash/resume simulation (`tests/boundary_resume.rs`)	Implements full work-unit lifecycle execution without WAL (initialize → shard load/compute/commit/archive/indexes → finalize), driver functions to drain queued work and feed synthetic blocks, and an ESTART crash/resume helper that commits only shards `0..k`, then re-initializes from persisted `estart_progress` to complete the remaining shards and finalize.
Baseline clean-run test (`tests/boundary_resume.rs`)	Runs a clean synthetic import across multiple epoch boundaries, asserts ledger advancement and pool existence, and verifies zero entity drift.
Scenario A: crash/resume idempotency test (`tests/boundary_resume.rs`)	Injects exactly one simulated crash/resume at the first ESTART work unit and asserts the final fingerprint matches the clean baseline run, with drift diagnostics on failure.
Scenario B: rollback defect reproduction, ignored (`tests/boundary_resume.rs`)	Reproduces rollback across an epoch boundary via sync `roll_forward`, observes ledger epoch state not reverting as expected, and asserts resulting fingerprint diverges from baseline with entity lead/lag diagnostics.
Scenario C: EWRAP finalize window crash recovery, ignored (`tests/boundary_resume.rs`)	Simulates an import crash in the EWRAP finalize window at a selectable work unit ordinal, resumes on the same unit (short-circuiting finalize when `committed == total`), and asserts the resumed result matches an uninterrupted clean run.
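A drift classifier of the kind summarized in the table might look like the sketch below. The `Drift` enum and `classify` are hypothetical, and the "genesis" label mentioned in the summary is omitted because its exact meaning is not described here.

```rust
use std::cmp::Ordering;

/// Position of an entity's snapshot epoch relative to the ledger epoch.
#[derive(Debug, PartialEq, Eq)]
enum Drift {
    Aligned,
    /// Entity is behind the ledger by this many epochs.
    Lag(u64),
    /// Entity is ahead of the ledger by this many epochs.
    Lead(u64),
}

fn classify(ledger_epoch: u64, entity_epoch: u64) -> Drift {
    match entity_epoch.cmp(&ledger_epoch) {
        Ordering::Equal => Drift::Aligned,
        Ordering::Less => Drift::Lag(ledger_epoch - entity_epoch),
        Ordering::Greater => Drift::Lead(entity_epoch - ledger_epoch),
    }
}

fn main() {
    // A pool snapshot at epoch 1 while the ledger is at epoch 2 lags by one.
    println!("{:?}", classify(2, 1));
    // An entity at epoch 4 against a ledger epoch of 2 leads by two.
    println!("{:?}", classify(2, 4));
}
```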

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related issues

Epoch boundary resume is not idempotent — root cause of persisted pool-snapshot lag #1018: This PR directly implements the deterministic crash-mid-boundary test harness and fingerprint-verification tests described in issue #1018, including baseline clean-run and crash/resume scenario acceptance criteria.
Synthetic block generator conjures value (unfunded fees/deposits) — breaks pots invariant across epoch boundaries #1020: The test harness includes a release-only gate to work around the synthetic-UTxO invariant problem documented in issue #1020, creating a tight implementation-level coupling between the two issues.

Possibly related PRs

txpipe/dolos#1017: The crash/resume test exercises estart_progress and commit_finalize completion logic that directly overlaps with the fail-loud commit_finalize guards added in PR #1017.
txpipe/dolos#991: The new integration test targets epoch-boundary crash/resume idempotency and EWRAP finalize behavior, directly exercising the EWRAP short-circuit/replay logic and EpochWrapUpV3 boundary finalize behavior from PR #991.

Poem

🐇 Hop through each epoch's boundary gate,
I crash mid-shard but resume just fine!
My fingerprints match—both early and late,
The rollback bug lurks (ignored, by design).
No drift, no lag, the ledger stays aligned,
This bunny tests boundaries of every kind! 🌸

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main changes: adding a deterministic test harness for issue `#1018` that reproduces epoch-boundary resume scenarios and evaluates the import path.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/boundary_resume.rs`:
- Around line 44-53: The require_release() function masks test failures as
passes in debug mode, hiding regression coverage. Remove the require_release()
function entirely and replace its runtime guard pattern with explicit
compile-time semantics. For all test functions that use require_release()
(including those at lines 236-240, 311-314, 374-376), add
#[cfg_attr(debug_assertions, ignore = "tests require --release")] attribute
above each test function declaration and remove the if !require_release() {
return; } guard statement from the test body.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 005bb5a1-ec80-4ff9-87b8-f19e0965c70d

📥 Commits

Reviewing files that changed from the base of the PR and between 9502e29 and 027e49d.

📒 Files selected for processing (2)

Cargo.toml
tests/boundary_resume.rs

coderabbitai · 2026-06-18T12:58:06Z

+fn require_release() -> bool {
+    if cfg!(debug_assertions) {
+        eprintln!(
+            "SKIP: boundary_resume tests require --release \
+             (synthetic pots debug_assert is unrelated to the snapshot-lag bug under test)"
+        );
+        return false;
+    }
+    true
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Release-only guard currently turns unexecuted tests into passing tests.

Returning early in debug mode reports these tests as passed, so a default cargo test run can silently miss this regression coverage. Prefer explicit ignore semantics (e.g., #[cfg_attr(debug_assertions, ignore = "...")]) and remove the runtime early-return gating.

🔧 Proposed fix

-fn require_release() -> bool {
-    if cfg!(debug_assertions) {
-        eprintln!(
-            "SKIP: boundary_resume tests require --release \
-             (synthetic pots debug_assert is unrelated to the snapshot-lag bug under test)"
-        );
-        return false;
-    }
-    true
-}
-
 #[test]
+#[cfg_attr(
+    debug_assertions,
+    ignore = "requires --release (synthetic pots debug_assert is unrelated to the snapshot-lag bug under test)"
+)]
 fn baseline_clean_run_crosses_boundaries_aligned() {
-    if !require_release() {
-        return;
-    }
     let (blocks, domain) = synthetic_world(260);
@@
 #[test]
+#[cfg_attr(
+    debug_assertions,
+    ignore = "requires --release (synthetic pots debug_assert is unrelated to the snapshot-lag bug under test)"
+)]
 fn import_crash_mid_estart_resumes_to_identical_state() {
-    if !require_release() {
-        return;
-    }
@@
 #[test]
 #[ignore = "reproduces `#1018`: rollback across an epoch boundary is not idempotent (un-ignore once fixed)"]
+#[cfg_attr(
+    debug_assertions,
+    ignore = "requires --release (synthetic pots debug_assert is unrelated to the snapshot-lag bug under test)"
+)]
 fn rollback_across_boundary_reapplies_to_identical_state() {
-    if !require_release() {
-        return;
-    }

Also applies to: 236-240, 311-314, 374-376


Extends the boundary-resume harness while evaluating import crash/resume safety across the full RUPD -> EWRAP -> ESTART boundary.

Findings:
- RUPD re-run is idempotent (PendingRewardState writes are overwrite-by-key, recompute is deterministic, finalize overwrites).
- ESTART shard crash + resume is safe (shard-skip via estart_progress + atomic, guarded finalize); Scenario A still passes, now also asserting pots.
- EWRAP has a finalize-window gap: the last shard sets ewrap_progress.committed == total (EWrapProgress::apply), but EpochWrapUpV3 (which assembles the final EndStats and rotates rolling/pparams) commits separately and does NOT touch ewrap_progress. On restart, EwrapWorkUnit::initialize reads committed == total -> is_complete() -> skips BOTH shards and finalize. A crash in [last EWRAP shard committed, before EpochWrapUpV3 commits] therefore permanently skips EpochWrapUpV3 on resume, so ESTART reset consumes a non-finalized `end` (default epoch_incentives) and un-rotated rolling stats: reserves stay too high, treasury too low.

Adds:
- pots (reserves, treasury, utxos, rewards, fees) to the state fingerprint so monetary-accounting corruption is caught, not just per-entity epoch drift.
- import_crash_ewrap_finalize_window_resumes_to_identical_state (#[ignore], reproduces the gap): crashes the 2nd EWRAP after all shards, before finalize, then resumes; pots diverge from a clean run by exactly the skipped epoch-1 incentives (~8.99e12 reserves/treasury). Un-ignore once the EWRAP finalize marker is fixed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scarmuega mentioned this pull request Jun 18, 2026

Synthetic block generator conjures value (unfunded fees/deposits) — breaks pots invariant across epoch boundaries #1020

Open

3 tasks


scarmuega changed the title ~~test(cardano): deterministic epoch-boundary resume reproduction (#1018)~~ test(cardano): epoch-boundary resume reproductions + import-path evaluation (#1018) Jun 18, 2026

scarmuega wants to merge 2 commits into main from test/boundary-resume-repro-1018
