fix(spartan): give internal bootstrap nodes a stable ENR (ClusterIP announce + persisted peer id) #24197

Closed
AztecBot wants to merge 1 commit into merge-train/spartan from cb/bootstrap-stable-enr

@AztecBot

Collaborator

Problem

Internal (non-public) bootstrap nodes bake two ephemeral values into the ENR they serve, and both rotate on every pod restart/rollout:

  1. Ephemeral pod IP: spartan/aztec-node/templates/_pod-template.yaml sets the announce address to hostname -i (the pod IP) whenever service.p2p.publicIP is false.
  2. Ephemeral node identity: the bootstrap ran as a Deployment with no persistence, so the discv5/peer-id key at DATA_DIRECTORY/p2p-private-key (yarn-project/p2p/src/util.ts) was regenerated on every start, rotating the node id/pubkey.

Peers fetch the bootstrap ENR once at startup (bootstrap_getEncodedEnr) and cache it, so a single bootstrap rollout invalidates everyone's cached ENR. This is what stranded the staging-public prover node at 0 peers (it had cached an ENR for a bootstrap pod that no longer existed), which in turn caused network reorgs.
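
To make the identity rotation concrete, here is a minimal shell sketch of a "load the key if present, otherwise generate one" routine. It is not the code in yarn-project/p2p/src/util.ts: the function name load_or_create_key is hypothetical, and the random hex string is only a placeholder for real key material. It shows why an empty data directory on every start yields a new identity, while a persisted directory does not.

```shell
#!/bin/sh
# Sketch only: models the behaviour described above, not the real implementation.

# load_or_create_key DATA_DIR
# Prints the key stored in DATA_DIR/p2p-private-key, generating one first
# if the file is missing or empty.
load_or_create_key() {
  data_dir="$1"
  key_file="$data_dir/p2p-private-key"
  if [ ! -s "$key_file" ]; then
    mkdir -p "$data_dir"
    # Placeholder key material: 32 random bytes rendered as 64 hex characters.
    od -An -N32 -tx1 /dev/urandom | tr -d ' \n' > "$key_file"
  fi
  cat "$key_file"
}
```

Calling the function twice against the same directory returns the same key, which is the situation a persistent volume creates. Calling it against a fresh directory, as an unpersisted Deployment effectively does on each restart, returns a different key.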

Fix (internal IPs only)

Make the bootstrap ENR immutable across pod restarts:

  1. Stable internal address (ClusterIP). New service.p2p.clusterIP option on the aztec-node chart. When enabled (and publicIP is false), the chart creates a stable cluster-internal *-p2p ClusterIP Service (TCP+UDP on the announce port) and the pod advertises that Service's ClusterIP — resolved at startup via getent — instead of its pod IP. A ClusterIP is stable for the cluster's lifetime and survives pod rollovers, so the ENR's address no longer changes. Enabled for the bootstrap in p2p-bootstrap.yaml.

  2. Persisted node identity. The bootstrap now runs as a StatefulSet with a small PVC for the data directory, so p2p-private-key survives restarts and the node id/pubkey stays constant. Enabled in p2p-bootstrap-resources-{dev,prod}.yaml.

Together the full ENR (id + ip + ports) is constant across restarts, so a bootstrap rollout can no longer strand peers on a dead ENR.
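
The announce-address selection described above can be sketched roughly as follows. This is an illustration under assumptions, not the chart's actual template: PUBLIC_IP, CLUSTER_IP_ENABLED and P2P_SERVICE_HOST are hypothetical variable names standing in for the Helm values, and announce_env is a made-up helper. Only hostname -i, getent, P2P_IP and P2P_QUERY_FOR_IP come from the description.

```shell
#!/bin/sh
# Sketch only: PUBLIC_IP, CLUSTER_IP_ENABLED and P2P_SERVICE_HOST are
# hypothetical stand-ins for the chart values, not names from the template.

# Prints the environment assignment the pod would use for its announce address.
announce_env() {
  if [ "${PUBLIC_IP:-false}" = "true" ]; then
    # Public mode is unchanged: the node queries for its own IP.
    echo "P2P_QUERY_FOR_IP=true"
    return 0
  fi

  if [ "${CLUSTER_IP_ENABLED:-false}" = "true" ]; then
    # getent hosts prints "<address> <name>"; keep the address of the first entry.
    ip=$(getent hosts "$P2P_SERVICE_HOST" | awk 'NR == 1 { print $1 }')
  else
    # Default: the pod IP, which changes whenever the pod is recreated.
    ip=$(hostname -i | awk 'NR == 1 { print $1 }')
  fi

  if [ -z "$ip" ]; then
    echo "could not determine announce IP" >&2
    return 1
  fi
  echo "P2P_IP=$ip"
}
```

Failing when the lookup returns nothing is a design choice in this sketch: announcing an empty address would produce an unusable ENR, so it seems safer to stop at startup.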

The extra resilience from persisting the peer store (so previously-connected nodes can reconnect without the bootnode) is intentionally left out — that comes through a separate PR currently in review.

Scope / safety

  • Public deployments are unchanged: when service.p2p.publicIP is true, the pod still uses P2P_QUERY_FOR_IP and no ClusterIP Service is created (the new Service is created only when publicIP is false).
  • Non-bootstrap nodes are unaffected: service.p2p.clusterIP.enabled defaults to false, so they keep using hostname -i and remain Deployments.

Verification

  • helm lint ./aztec-node passes.
  • helm template of the bootstrap release in internal mode renders: the *-p2p ClusterIP Service (TCP+UDP), P2P_IP resolved via getent ... -p2p.<ns>.svc.cluster.local, and a StatefulSet with a 2Gi volumeClaimTemplate.
  • helm template in public mode renders P2P_QUERY_FOR_IP=true and no -p2p Service.
  • Default node render is unchanged (hostname -i, no -p2p Service, still a Deployment).
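
The internal-mode render check could be scripted along these lines. This is a sketch under assumptions: check_internal_render is a hypothetical helper, and the grep patterns guess at how the rendered YAML is spelled (for example, that the storage request appears as "storage: 2Gi"), so they may need adjusting against the real output.

```shell
#!/bin/sh
# Sketch only: the patterns below are assumptions about the rendered manifest.

# check_internal_render FILE
# Succeeds only if FILE contains every marker expected in the internal-mode render.
check_internal_render() {
  manifest="$1"
  for pattern in \
    'kind: Service' \
    'name: .*-p2p$' \
    'protocol: TCP' \
    'protocol: UDP' \
    'getent' \
    'kind: StatefulSet' \
    'storage: 2Gi'
  do
    if ! grep -q -- "$pattern" "$manifest"; then
      echo "missing in render: $pattern" >&2
      return 1
    fi
  done
  return 0
}
```

A possible use is to save the output of helm template for the bootstrap release to a file (the release name and values files are left out here because they depend on the deployment) and then run check_internal_render on that file.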

Files

  • spartan/aztec-node/values.yaml — new service.p2p.clusterIP option (default off).
  • spartan/aztec-node/templates/_pod-template.yaml — announce the ClusterIP when enabled.
  • spartan/aztec-node/templates/svc.p2p.yaml — new ClusterIP Service (internal mode only).
  • spartan/terraform/deploy-aztec-infra/values/p2p-bootstrap.yaml — enable clusterIP for the bootstrap.
  • spartan/terraform/deploy-aztec-infra/values/p2p-bootstrap-resources-{dev,prod}.yaml — StatefulSet + persistence for the bootstrap.

Created by claudebox · group: slackbot

@AztecBot added the labels ci-draft (Run CI on draft PRs), ci-no-fail-fast (Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure), and claudebox (Owned by claudebox; it can push to this PR) on Jun 19, 2026
@alexghr

alexghr commented Jun 22, 2026

Contributor

Superseded by #24220

@alexghr alexghr closed this Jun 22, 2026
@alexghr alexghr deleted the cb/bootstrap-stable-enr branch June 22, 2026 11:26