test(cardano): epoch-boundary resume reproductions + import-path evaluation (#1018) #1019

Open
scarmuega wants to merge 2 commits into main from test/boundary-resume-repro-1018

Conversation

@scarmuega

@scarmuega scarmuega commented Jun 18, 2026

Member

What

Adds tests/boundary_resume.rs, a self-contained, deterministic harness that crosses real epoch boundaries with actual PoolState/AccountState/EpochState, interrupts the boundary in several ways, and compares the resumed ledger against an uninterrupted run (epoch, per-entity snapshot epochs, and pots) — the reproduction called for in #1018's acceptance criteria, plus a detailed evaluation of the import crash/resume path.

It uses a shrunken-but-coherent genesis (epoch_length=100, byron k=1, f=0.05) so the randomness/stability windows land inside the epoch and RUPD/EWRAP/ESTART fire after a few hundred synthetic blocks, driven through ToyDomain.
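As a rough check of why these parameters keep the windows inside the epoch, the sketch below assumes the usual Cardano definitions (stability window = 3k/f, randomness stabilisation window = 4k/f) and assumes that k=1 is the security parameter feeding both. The harness may derive the windows differently; the type and function names here are illustrative only.

```rust
/// Active-slot coefficient as an exact fraction, to avoid float rounding.
#[derive(Clone, Copy)]
struct Ratio {
    num: u64,
    den: u64,
}

/// ceil(multiplier * k / f), computed in integers.
fn window(multiplier: u64, k: u64, f: Ratio) -> u64 {
    let top = multiplier * k * f.den;
    (top + f.num - 1) / f.num
}

/// Assumed definition: stability window = 3k/f.
fn stability_window(k: u64, f: Ratio) -> u64 {
    window(3, k, f)
}

/// Assumed definition: randomness stabilisation window = 4k/f.
fn randomness_window(k: u64, f: Ratio) -> u64 {
    window(4, k, f)
}
```

Under those assumptions, k=1 and f=0.05 (written as `Ratio { num: 1, den: 20 }`) give windows of 60 and 80 slots, both shorter than epoch_length=100.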

Tests

| Test | Result | Meaning |
|---|---|---|
| baseline_clean_run_crosses_boundaries_aligned | passes | De-risk: a clean run crosses ≥2 boundaries with a registered pool, every entity aligned to EpochState.number. |
| import_crash_mid_estart_resumes_to_identical_state | passes | Crash mid-ESTART (commit shards 0..16, no finalize) → resume → byte-identical (incl. pots). ESTART shard resume is safe (shard-skip via estart_progress + atomic, guarded finalize). |
| rollback_across_boundary_reapplies_to_identical_state | 🔴 #[ignore] | Bug 1: reproduces #1018. |
| import_crash_ewrap_finalize_window_resumes_to_identical_state | 🔴 #[ignore] | Bug 2: reproduces #1018. |

Both #[ignore]d tests assert the post-fix behavior (byte-identical resume), so CI stays green and they flip to passing once fixed. They run under --ignored.

Bug 1 — rollback across a boundary (sync path)

Cross the slot-100 boundary into epoch 1, roll back to slot 80 (epoch 0), then re-apply to tip:

  • After the rollback, EpochState.number is still 1 and the pool snapshot is still @1 — the boundary transition was not reverted, because boundary transitions are not written to the WAL (only ROLL block deltas are; EWRAP/RUPD/ESTART use the default no-op commit_wal, and rollback in core/sync.rs only undoes WAL deltas).
  • Re-applying re-fires the boundary and double-advances every entity to epoch 4 vs the correct 2 — silently (no guard/error).

Same WAL/rollback asymmetry as the reported pool-snapshot lag; the manifestation here is a lead. Fix: make boundary transitions reversible on rollback (WAL-backed or re-derive).
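The asymmetry can be shown with a toy model. This is a sketch under simplifying assumptions, not the code in core/sync.rs: `ToyLedger`, `ToyChain` and `WalEntry` are hypothetical, one block is produced per slot, and a boundary simply increments the epoch. It therefore does not reproduce the harness's exact epoch numbers, only the mechanism: block deltas are undone from the WAL while the boundary's effect survives the rollback and fires a second time on re-apply.

```rust
/// Toy ledger: `tip_slot` is changed by block deltas (WAL-backed),
/// `epoch` by boundary transitions (not WAL-backed in this model).
#[derive(Debug, Clone, PartialEq)]
struct ToyLedger {
    tip_slot: u64,
    epoch: u64,
}

/// One WAL entry: just enough to undo a single block delta.
struct WalEntry {
    prev_slot: u64,
}

struct ToyChain {
    ledger: ToyLedger,
    wal: Vec<WalEntry>,
    epoch_length: u64,
}

impl ToyChain {
    fn new(epoch_length: u64) -> Self {
        Self {
            ledger: ToyLedger { tip_slot: 0, epoch: 0 },
            wal: Vec::new(),
            epoch_length,
        }
    }

    /// Apply one block. Crossing a multiple of `epoch_length` fires the
    /// boundary, which bumps `epoch` without writing anything to the WAL.
    fn roll_forward(&mut self, slot: u64) {
        let prev = self.ledger.tip_slot;
        if slot / self.epoch_length > prev / self.epoch_length {
            self.ledger.epoch += 1; // boundary transition: no WAL entry
        }
        self.wal.push(WalEntry { prev_slot: prev });
        self.ledger.tip_slot = slot;
    }

    /// Undo block deltas until the tip is at or before `slot`.
    /// Only WAL entries are undone, so `epoch` is left untouched.
    fn rollback_to(&mut self, slot: u64) {
        while self.ledger.tip_slot > slot {
            match self.wal.pop() {
                Some(entry) => self.ledger.tip_slot = entry.prev_slot,
                None => break,
            }
        }
    }
}

/// Uninterrupted run over slots 1..=tip.
fn clean_run(epoch_length: u64, tip: u64) -> ToyLedger {
    let mut chain = ToyChain::new(epoch_length);
    for slot in 1..=tip {
        chain.roll_forward(slot);
    }
    chain.ledger
}

/// Run to `cross`, roll back to `back`, then re-apply up to `tip`.
fn rollback_run(epoch_length: u64, cross: u64, back: u64, tip: u64) -> ToyLedger {
    let mut chain = ToyChain::new(epoch_length);
    for slot in 1..=cross {
        chain.roll_forward(slot);
    }
    chain.rollback_to(back);
    for slot in (back + 1)..=tip {
        chain.roll_forward(slot);
    }
    chain.ledger
}
```

In this model `clean_run(100, 120)` ends in epoch 1, while `rollback_run(100, 110, 80, 120)` ends in epoch 2 at the same tip, with no error raised anywhere along the way.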

Bug 2 — import crash in the EWRAP finalize window

Found while evaluating import crash/resume safety across the full RUPD → EWRAP → ESTART boundary.

  • The last EWRAP shard sets ewrap_progress.committed == total (EWrapProgress::apply).
  • EpochWrapUpV3 — which assembles the final EndStats and rotates rolling/pparams — commits separately and does not touch ewrap_progress.
  • On restart, EwrapWorkUnit::initialize reads committed == total → is_complete() → skips both shards and finalize.
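The resume path described in these three points can be modelled in a few lines. This is a minimal sketch, not the actual EwrapWorkUnit: `ToyStore`, `Gate`, `run` and `crash_and_resume` are hypothetical names, the shard count of 4 is arbitrary, and `incentives_applied` stands in for whatever the finalize step persists. The `FinalizeMarker` variant is one possible way to gate completion on a marker written by finalize itself.

```rust
/// Persisted state in this toy model (all names are illustrative).
#[derive(Debug, Clone, PartialEq)]
struct ToyStore {
    committed: u32,           // shards committed so far (stand-in for ewrap_progress)
    total: u32,               // number of shards in the work unit
    finalized: bool,          // written only by the finalize step
    incentives_applied: bool, // observable effect of the finalize step
}

impl ToyStore {
    fn new(total: u32) -> Self {
        Self { committed: 0, total, finalized: false, incentives_applied: false }
    }
}

/// Which persisted fact decides that the work unit is complete.
#[derive(Clone, Copy)]
enum Gate {
    /// Behaviour described above: complete once all shards are committed.
    ShardCount,
    /// Alternative: complete only once finalize has written its own marker.
    FinalizeMarker,
}

fn is_complete(store: &ToyStore, gate: Gate) -> bool {
    match gate {
        Gate::ShardCount => store.committed == store.total,
        Gate::FinalizeMarker => store.finalized,
    }
}

/// Run (or resume) the work unit against the persisted store.
/// `crash_before_finalize` simulates a kill right after the last shard commit.
fn run(store: &mut ToyStore, gate: Gate, crash_before_finalize: bool) {
    if is_complete(store, gate) {
        return; // skips the shards and the finalize step
    }
    while store.committed < store.total {
        store.committed += 1; // each shard is its own commit
    }
    if crash_before_finalize {
        return;
    }
    // Finalize commits separately from the shards.
    store.incentives_applied = true;
    store.finalized = true;
}

/// Crash in the finalize window, then restart on the same store.
fn crash_and_resume(gate: Gate) -> ToyStore {
    let mut store = ToyStore::new(4);
    run(&mut store, gate, true);
    run(&mut store, gate, false);
    store
}
```

With `Gate::ShardCount` the resumed run returns immediately and `incentives_applied` stays false; with `Gate::FinalizeMarker` the resume skips the already-committed shards but still runs finalize.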

So a crash in [last EWRAP shard committed → before EpochWrapUpV3 commits] permanently skips EpochWrapUpV3 on resume. ESTART's define_new_pots then reads a non-finalized end (default epoch_incentives) and un-rotated rolling, so the epoch's monetary expansion is never applied:

  • Clean: reserves 14,982,005,400,001,109, treasury 17,994,600,000,277
  • Crashed: reserves 14,991,000,000,000,000, treasury 9,000,000,000,000

→ reserves ~8.99e12 too high, treasury ~8.99e12 too low — exactly the skipped epoch-1 incentive split, silently (no guard/error). The off-by-one: the finalize-skip marker is advanced by the last shard, one commit before finalize runs. Fix: gate EWRAP's is_complete on a marker set by EpochWrapUpV3 itself.
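The ~8.99e12 figure follows directly from the four pot values quoted above. The snippet below only redoes that subtraction; the constant and function names are illustrative and the units are whatever the pots are reported in.

```rust
/// Pot values as quoted in the clean and crashed runs above.
const CLEAN_RESERVES: u64 = 14_982_005_400_001_109;
const CLEAN_TREASURY: u64 = 17_994_600_000_277;
const CRASHED_RESERVES: u64 = 14_991_000_000_000_000;
const CRASHED_TREASURY: u64 = 9_000_000_000_000;

/// How much higher the reserves are after the crashed resume.
fn reserves_excess() -> u64 {
    CRASHED_RESERVES - CLEAN_RESERVES
}

/// How much lower the treasury is after the crashed resume.
fn treasury_shortfall() -> u64 {
    CLEAN_TREASURY - CRASHED_TREASURY
}
```

The two differences are 8,994,599,998,891 and 8,994,600,000,277, so both round to 8.99e12 and agree to within 1,386 units.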

Import-path evaluation summary

  • RUPD re-run is safe: PendingRewardState writes are overwrite-by-key, the recompute is deterministic (RUPD fires before its trigger block rolls), finalize overwrites.
  • ESTART crash + resume is safe: shard-skip + atomic, guarded (EpochBoundaryIncomplete) finalize; no is_complete skip.
  • EWRAP has the finalize-window gap above (Bug 2).
  • Graceful forced stop is safe (it lands only after ESTART finalize, at a fully-closed boundary). An ungraceful kill mid-boundary hits the EWRAP window.
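The first point, that overwrite-by-key writes plus a deterministic recompute make an RUPD re-run safe, can be illustrated with a small model. It is a sketch only: `PendingRewards`, the reward formula and the accumulating variant are hypothetical and exist purely to contrast an idempotent write with a non-idempotent one.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for the pending-reward store, keyed by account.
type PendingRewards = BTreeMap<String, u64>;

/// Deterministic toy recompute: the same stakes always give the same rewards.
fn compute_rewards(stakes: &[(&str, u64)]) -> Vec<(String, u64)> {
    stakes
        .iter()
        .map(|(account, stake)| (account.to_string(), stake / 100))
        .collect()
}

/// Overwrite-by-key write: a second run replaces each value with itself.
fn write_overwrite(store: &mut PendingRewards, rewards: &[(String, u64)]) {
    for (account, amount) in rewards {
        store.insert(account.clone(), *amount);
    }
}

/// Accumulating write, shown only for contrast: a second run doubles the values.
fn write_accumulate(store: &mut PendingRewards, rewards: &[(String, u64)]) {
    for (account, amount) in rewards {
        *store.entry(account.clone()).or_insert(0) += *amount;
    }
}

/// Run the toy reward update `times` times against one store.
fn run_rupd(times: usize, overwrite: bool) -> PendingRewards {
    let stakes = [("alice", 1_000u64), ("bob", 2_500u64)];
    let mut store = PendingRewards::new();
    for _ in 0..times {
        let rewards = compute_rewards(&stakes);
        if overwrite {
            write_overwrite(&mut store, &rewards);
        } else {
            write_accumulate(&mut store, &rewards);
        }
    }
    store
}
```

Running the overwrite variant once or twice leaves the same store, which is the property the re-run relies on; the accumulating variant would not survive a replay.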

Notes

cargo test --release --test boundary_resume                 # baseline + ESTART-resume (green); Bug 1 & 2 ignored
cargo test --release --test boundary_resume -- --ignored    # runs the #1018 reproductions (fail, by design)

🤖 Generated with Claude Code

…1018)

Adds `tests/boundary_resume.rs`, a self-contained, deterministic harness that
crosses real epoch boundaries with actual PoolState/AccountState/EpochState and
interrupts the boundary two ways, classifying lead vs lag.

Uses a shrunken-but-coherent genesis (epoch_length=100, byron k=1, f=0.05) so
the randomness/stability windows land inside the epoch and RUPD/EWRAP/ESTART
fire after a few hundred synthetic blocks, driven through ToyDomain.

Tests:
- baseline_clean_run_crosses_boundaries_aligned (passes): a clean run crosses
  >=2 boundaries with every entity aligned to EpochState.number.
- import_crash_mid_estart_resumes_to_identical_state (passes): crash mid-ESTART
  (commit shards 0..16, no finalize) then resume yields byte-identical state.
  Refutes the import resume path as the lag source.
- rollback_across_boundary_reapplies_to_identical_state (#[ignore], reproduces
  #1018): rollback across a boundary does NOT revert the boundary transition
  (EpochState.number stays advanced) because boundary transitions are not in
  the WAL; re-applying re-fires the boundary and double-advances every entity,
  silently. Un-ignore once boundary transitions are reversible on rollback.

Runs in --release: the synthetic generator funds tx fees and registration
deposits from un-pot-backed custom_utxos, which trips the unrelated
pots.is_consistent debug_assert at a boundary (orthogonal to the snapshot-
rotation logic under test). Debug mode skips with a message. Follow-up tracked
separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

📝 Walkthrough

A new Cargo integration test target boundary_resume is registered in Cargo.toml, pointing to tests/boundary_resume.rs. That file implements a deterministic test harness for issue #1018: synthetic genesis and block-world setup, epoch fingerprinting and drift-classification utilities, import lifecycle and crash/resume simulation helpers, and four tests: a baseline clean-run alignment check, crash/resume idempotency verification (Scenario A), rollback defect reproduction (Scenario B, ignored), and the EWRAP finalize-window edge case (Scenario C, ignored).
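A drift classifier of the kind mentioned here could look roughly like the following. This is a hypothetical sketch, not the harness's implementation: the `Drift` enum and `classify` function are invented for illustration, and any extra labels the harness uses are not modelled.

```rust
use std::cmp::Ordering;

/// Where an entity's snapshot epoch sits relative to the ledger epoch.
#[derive(Debug, PartialEq)]
enum Drift {
    Aligned,
    /// Entity is behind the ledger by this many epochs.
    Lag(u64),
    /// Entity is ahead of the ledger by this many epochs.
    Lead(u64),
}

/// Compare one entity's epoch against the reference ledger epoch.
fn classify(ledger_epoch: u64, entity_epoch: u64) -> Drift {
    match entity_epoch.cmp(&ledger_epoch) {
        Ordering::Equal => Drift::Aligned,
        Ordering::Less => Drift::Lag(ledger_epoch - entity_epoch),
        Ordering::Greater => Drift::Lead(entity_epoch - ledger_epoch),
    }
}
```

For example, an entity at epoch 4 measured against a reference epoch of 2 classifies as `Drift::Lead(2)`.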

Changes

Epoch-boundary resume/rollback integration test

| Layer | File(s) | Summary |
|---|---|---|
| Test registration and synthetic world setup | Cargo.toml, tests/boundary_resume.rs | Registers the boundary_resume Cargo test target and defines the release-only gate, tiny coherent genesis builder with reduced epoch length, and tuned protocol parameters to force epoch boundaries within test window. |
| Fingerprinting and drift classification utilities | tests/boundary_resume.rs | Introduces per-ledger and per-entity epoch fingerprint computation and drift classifier labeling pool snapshot and account stake epochs as lag/lead/genesis relative to ledger epoch, with human-readable reporting. |
| Import lifecycle helpers and crash/resume simulation | tests/boundary_resume.rs | Implements full work-unit lifecycle execution without WAL (initialize → shard load/compute/commit/archive/indexes → finalize), driver functions to drain queued work and feed synthetic blocks, and ESTART crash/resume helper that commits only shards 0..k, then re-initializes from persisted estart_progress to complete remaining shards and finalize. |
| Baseline clean-run test | tests/boundary_resume.rs | Baseline test runs clean synthetic import across multiple epoch boundaries, asserts ledger advancement and pool existence, and verifies zero entity drift. |
| Scenario A: crash/resume idempotency test | tests/boundary_resume.rs | Injects exactly one simulated crash/resume at first ESTART work unit and asserts final fingerprint matches clean baseline run, with drift diagnostics on failure. |
| Scenario B: rollback defect reproduction (ignored) | tests/boundary_resume.rs | Reproduces rollback across epoch boundary via sync roll_forward, observes ledger epoch state not reverting as expected, and asserts resulting fingerprint diverges from baseline with entity lead/lag diagnostics. |
| Scenario C: EWRAP finalize window crash recovery (ignored) | tests/boundary_resume.rs | Simulates import crash in EWRAP finalize window at selectable work unit ordinal, resumes on same unit (short-circuiting finalize when committed == total), and asserts resumed result matches uninterrupted clean run. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related PRs

  • txpipe/dolos#1017: The crash/resume test exercises estart_progress and commit_finalize completion logic that directly overlaps with the fail-loud commit_finalize guards added in PR #1017.
  • txpipe/dolos#991: The new integration test targets epoch-boundary crash/resume idempotency and EWRAP finalize behavior, directly exercising the EWRAP short-circuit/replay logic and EpochWrapUpV3 boundary finalize behavior from PR #991.

Poem

🐇 Hop through each epoch's boundary gate,
I crash mid-shard but resume just fine!
My fingerprints match—both early and late,
The rollback bug lurks (ignored, by design).
No drift, no lag, the ledger stays aligned,
This bunny tests boundaries of every kind! 🌸

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main changes: adding a deterministic test harness for issue #1018 that reproduces epoch-boundary resume scenarios and evaluates the import path. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/boundary_resume.rs`:
- Around line 44-53: The require_release() function masks test failures as
passes in debug mode, hiding regression coverage. Remove the require_release()
function entirely and replace its runtime guard pattern with explicit
compile-time semantics. For all test functions that use require_release()
(including those at lines 236-240, 311-314, 374-376), add
#[cfg_attr(debug_assertions, ignore = "tests require --release")] attribute
above each test function declaration and remove the if !require_release() {
return; } guard statement from the test body.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 005bb5a1-ec80-4ff9-87b8-f19e0965c70d

📥 Commits

Reviewing files that changed from the base of the PR and between 9502e29 and 027e49d.

📒 Files selected for processing (2)
  • Cargo.toml
  • tests/boundary_resume.rs

Comment thread tests/boundary_resume.rs
Comment on lines +44 to +53
fn require_release() -> bool {
    if cfg!(debug_assertions) {
        eprintln!(
            "SKIP: boundary_resume tests require --release \
             (synthetic pots debug_assert is unrelated to the snapshot-lag bug under test)"
        );
        return false;
    }
    true
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Release-only guard currently turns unexecuted tests into passing tests.

Returning early in debug mode marks these tests as pass, so default cargo test can silently miss this regression coverage. Prefer explicit ignore semantics (e.g., #[cfg_attr(debug_assertions, ignore = "...")]) and remove runtime early-return gating.

🔧 Proposed fix
-fn require_release() -> bool {
-    if cfg!(debug_assertions) {
-        eprintln!(
-            "SKIP: boundary_resume tests require --release \
-             (synthetic pots debug_assert is unrelated to the snapshot-lag bug under test)"
-        );
-        return false;
-    }
-    true
-}
-
 #[test]
+#[cfg_attr(
+    debug_assertions,
+    ignore = "requires --release (synthetic pots debug_assert is unrelated to the snapshot-lag bug under test)"
+)]
 fn baseline_clean_run_crosses_boundaries_aligned() {
-    if !require_release() {
-        return;
-    }
     let (blocks, domain) = synthetic_world(260);
@@
 #[test]
+#[cfg_attr(
+    debug_assertions,
+    ignore = "requires --release (synthetic pots debug_assert is unrelated to the snapshot-lag bug under test)"
+)]
 fn import_crash_mid_estart_resumes_to_identical_state() {
-    if !require_release() {
-        return;
-    }
@@
 #[test]
 #[ignore = "reproduces `#1018`: rollback across an epoch boundary is not idempotent (un-ignore once fixed)"]
+#[cfg_attr(
+    debug_assertions,
+    ignore = "requires --release (synthetic pots debug_assert is unrelated to the snapshot-lag bug under test)"
+)]
 fn rollback_across_boundary_reapplies_to_identical_state() {
-    if !require_release() {
-        return;
-    }

Also applies to: 236-240, 311-314, 374-376

Extends the boundary-resume harness while evaluating import crash/resume
safety across the full RUPD -> EWRAP -> ESTART boundary.

Findings:
- RUPD re-run is idempotent (PendingRewardState writes are overwrite-by-key,
  recompute is deterministic, finalize overwrites).
- ESTART shard crash + resume is safe (shard-skip via estart_progress +
  atomic, guarded finalize) — Scenario A still passes, now also asserting pots.
- EWRAP has a finalize-window gap: the last shard sets
  ewrap_progress.committed == total (EWrapProgress::apply), but EpochWrapUpV3
  (which assembles the final EndStats and rotates rolling/pparams) commits
  separately and does NOT touch ewrap_progress. On restart,
  EwrapWorkUnit::initialize reads committed == total -> is_complete() -> skips
  BOTH shards and finalize. A crash in [last EWRAP shard committed, before
  EpochWrapUpV3 commits] therefore permanently skips EpochWrapUpV3 on resume,
  so ESTART reset consumes a non-finalized `end` (default epoch_incentives) and
  un-rotated rolling stats: reserves stay too high, treasury too low.

Adds:
- pots (reserves, treasury, utxos, rewards, fees) to the state fingerprint so
  monetary-accounting corruption is caught, not just per-entity epoch drift.
- import_crash_ewrap_finalize_window_resumes_to_identical_state (#[ignore],
  reproduces the gap): crashes the 2nd EWRAP after all shards, before finalize,
  then resumes; pots diverge from a clean run by exactly the skipped epoch-1
  incentives (~8.99e12 reserves/treasury). Un-ignore once the EWRAP finalize
  marker is fixed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@scarmuega scarmuega changed the title from "test(cardano): deterministic epoch-boundary resume reproduction (#1018)" to "test(cardano): epoch-boundary resume reproductions + import-path evaluation (#1018)" Jun 18, 2026