fix(spartan): give internal bootstrap nodes a stable ENR (ClusterIP announce + persisted peer id) #24197
Closed
AztecBot wants to merge 1 commit into
Superseded by #24220
Problem
Internal (non-public) bootstrap nodes bake two ephemeral values into the ENR they serve, and both rotate on every pod restart or rollout:
- spartan/aztec-node/templates/_pod-template.yaml set the announce address to hostname -i (the pod IP) whenever service.p2p.publicIP is false.
- The bootstrap ran as a Deployment with no persistence, so the discv5/peer-id key at DATA_DIRECTORY/p2p-private-key (yarn-project/p2p/src/util.ts) was regenerated on every start, rotating the node id/pubkey.
Peers fetch the bootstrap ENR once at startup (bootstrap_getEncodedEnr) and cache it, so a single bootstrap rollout invalidates every peer's cached ENR. This is what stranded the staging-public prover node at 0 peers (it had cached an ENR for a bootstrap pod that no longer existed), which in turn caused network reorgs.
Fix (internal IPs only)
Make the bootstrap ENR immutable across pod restarts:
- Stable internal address (ClusterIP). New service.p2p.clusterIP option on the aztec-node chart. When enabled (and publicIP is false), the chart creates a stable cluster-internal *-p2p ClusterIP Service (TCP+UDP on the announce port) and the pod advertises that Service's ClusterIP, resolved at startup via getent, instead of its pod IP. A ClusterIP is stable for the lifetime of the Service and survives pod rollovers, so the ENR's address no longer changes. Enabled for the bootstrap in p2p-bootstrap.yaml.
- Persisted node identity. The bootstrap now runs as a StatefulSet with a small PVC for the data directory, so p2p-private-key survives restarts and the node id/pubkey stays constant. Enabled in p2p-bootstrap-resources-{dev,prod}.yaml.
Together the full ENR (id + ip + ports) is constant across restarts, so a bootstrap rollout can no longer strand peers on a dead ENR.
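The persisted-identity half of the fix relies on a load-or-generate pattern for the key file. The following is a minimal TypeScript sketch of that pattern, not the actual code in yarn-project/p2p/src/util.ts: the function name loadOrCreatePrivateKey is hypothetical, and 32 random hex-encoded bytes stand in for real key generation.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";
import { randomBytes } from "node:crypto";

// Hypothetical helper (assumption, not the project's API): returns the key
// stored at <dataDirectory>/p2p-private-key, generating it on first use.
function loadOrCreatePrivateKey(dataDirectory: string): string {
  const keyPath = path.join(dataDirectory, "p2p-private-key");
  if (fs.existsSync(keyPath)) {
    // A key from an earlier start exists: reuse it so the node id is unchanged.
    return fs.readFileSync(keyPath, "utf8").trim();
  }
  // Stand-in for real key generation: 32 random bytes, hex-encoded.
  const key = randomBytes(32).toString("hex");
  fs.mkdirSync(dataDirectory, { recursive: true });
  fs.writeFileSync(keyPath, key, { mode: 0o600 });
  return key;
}

// A directory that survives restarts (the PVC case) yields the same key.
const persistentDir = fs.mkdtempSync(path.join(os.tmpdir(), "bootstrap-pvc-"));
const firstStart = loadOrCreatePrivateKey(persistentDir);
const secondStart = loadOrCreatePrivateKey(persistentDir);

// A fresh directory (the Deployment-without-persistence case) yields a new key.
const freshDir = fs.mkdtempSync(path.join(os.tmpdir(), "bootstrap-ephemeral-"));
const afterRollout = loadOrCreatePrivateKey(freshDir);

console.log("stable across restarts:", firstStart === secondStart);
console.log("rotated without persistence:", firstStart !== afterRollout);
```

With an empty data directory on every start, the first branch is never taken, which is why the Deployment-based bootstrap rotated its identity on each rollout.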
The extra resilience from persisting the peer store (so previously-connected nodes can reconnect without the bootnode) is intentionally left out — that comes through a separate PR currently in review.
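The announce-address rule this PR introduces can be summarized as a small decision function. The real logic lives in the Helm pod template, so this TypeScript sketch is only a model of it: the P2pServiceValues shape, the AnnounceMode labels, and the announceMode function are hypothetical names, and only publicIP and clusterIP.enabled are taken from the PR.

```typescript
// Hypothetical model of the chart's service.p2p values (assumption).
interface P2pServiceValues {
  publicIP: boolean;
  clusterIP: { enabled: boolean };
}

// Illustrative labels for the three ways the pod can pick its announce address.
type AnnounceMode = "query-public-ip" | "cluster-ip" | "pod-ip";

function announceMode(values: P2pServiceValues): AnnounceMode {
  // Public mode wins: the pod uses P2P_QUERY_FOR_IP and no ClusterIP Service exists.
  if (values.publicIP) return "query-public-ip";
  // Internal mode with the new option: advertise the *-p2p Service's ClusterIP.
  if (values.clusterIP.enabled) return "cluster-ip";
  // Default internal mode: advertise the pod IP from hostname -i.
  return "pod-ip";
}

const bootstrap = announceMode({ publicIP: false, clusterIP: { enabled: true } });
const defaultInternal = announceMode({ publicIP: false, clusterIP: { enabled: false } });
const publicNode = announceMode({ publicIP: true, clusterIP: { enabled: true } });
console.log(bootstrap, defaultInternal, publicNode);
```

Checking publicIP first mirrors the guard described in the PR, where the new Service is only created when publicIP is false.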
Scope / safety
- When service.p2p.publicIP is true, the pod still uses P2P_QUERY_FOR_IP and no ClusterIP Service is created (the new Service is guarded on not publicIP).
- service.p2p.clusterIP.enabled defaults to false, so releases that do not opt in keep using hostname -i and remain Deployments.
Verification
- helm lint ./aztec-node passes.
- helm template of the bootstrap release in internal mode renders: the *-p2p ClusterIP Service (TCP+UDP), P2P_IP resolved via getent ... -p2p.<ns>.svc.cluster.local, and a StatefulSet with a 2Gi volumeClaimTemplate.
- helm template in public mode renders P2P_QUERY_FOR_IP=true and no -p2p Service.
- With the option left at its default, the rendered output is unchanged (hostname -i, no -p2p Service, still a Deployment).
Files
- spartan/aztec-node/values.yaml: new service.p2p.clusterIP option (default off).
- spartan/aztec-node/templates/_pod-template.yaml: announce the ClusterIP when enabled.
- spartan/aztec-node/templates/svc.p2p.yaml: new ClusterIP Service (internal mode only).
- spartan/terraform/deploy-aztec-infra/values/p2p-bootstrap.yaml: enable clusterIP for the bootstrap.
- spartan/terraform/deploy-aztec-infra/values/p2p-bootstrap-resources-{dev,prod}.yaml: StatefulSet + persistence for the bootstrap.
Created by claudebox · group: slackbot