Architecture
Yggdrasil is organized as a Rust workspace with explicit crate boundaries so protocol, ledger, storage, and integration work can evolve independently.
Crate Topology
crates/crypto: hashing, signatures, VRF, KES, and cryptographic encoding boundaries.crates/ledger: transaction and block state transitions plus era-aware domain modeling. Per-era CBOR codecs (crates/ledger/src/eras/*/cbor.rs) are hand-coded against upstream CDDL (.reference-haskell-cardano-node/deps/cardano-ledger/eras/<era>/impl/cddl/data/<era>.cddl); CDDL is treated as authoritative documentation, not as input for code generation, because real upstream parity needs Byron / array-vs-map / optional-field semantics that CDDL underspecifies.crates/storage: immutable storage, rollback-aware volatile storage, ledger snapshot facilities, slot-indexed chain-dependency sidecar helpers, and a minimal ChainDB-style coordination layer.crates/consensus: chain selection, leader election, epoch math, and rollback coordination.crates/consensus/src/mempool: transaction admission, prioritization, and block-application eviction.crates/network: handshake, mini-protocol state machines, peer management, topology domain types, root-provider snapshots, peer registry state, peer candidate ordering, and multiplexing.node: runtime wiring, CLI, sync loop, and operational entry points.
Dependency Order
cryptoledgerandstorageconsensusandmempoolnetworknode
Design Principles
- Keep public interfaces small and spec-traceable.
- Separate generated types from handwritten state-machine logic.
- Let storage and network depend on stable domain interfaces rather than concrete implementations.
- Build parity tooling alongside implementation rather than as a final afterthought.
- Keep
nodeas an orchestration layer: configuration loading, CLI overrides, runtime startup, and shutdown belong there, but reusable peer policy, tracer transports, and protocol-facing state machines belong in crates. - Extract logic out of
nodewhen any of these become true: the code is reused by more than one runtime path, it owns non-trivial protocol or peer-selection state, or it would need independent tests that do not depend on the CLI/runtime entrypoint.
Current Milestone
The project has a complete Cardano-era type system, a functional node binary, and a fully-tested multi-peer dispatch layer:
- Full era type coverage from Byron through Conway with typed CBOR codecs.
- Multi-era UTxO validation with coin and multi-asset preservation checks.
- Network transport + mux + handshake + peer lifecycle with all five mini-protocol state machines, wire codecs, and typed client/server drivers.
- Reusable topology domain types, topology-root configuration parsing, root-provider snapshots, peer registry state, peer candidate ordering, bootstrap-target sequencing, reconnect attempt ordering, and preferred-peer retry state now live in
crates/network;nodeonly feeds those helpers into runtime startup. - Multi-era block decode (all 7 era tags) with consensus header verification (KES/OpCert).
- Node binary with
clapCLI (run,validate-config,status,default-config), JSON configuration, upstream-aligned tracing config fields, local runtime trace emission, and managed sync service with graceful shutdown. - Epoch-boundary wiring in runtime sync paths now includes real per-pool performance inputs for rewards and Shelley PPUP application at epoch transition.
- Mempool with TTL-aware admission, fee ordering, and block-application eviction.
- File-backed storage implementations behind
ImmutableStore,VolatileStore, andLedgerStoretraits. - Storage crate now also exposes a minimal
ChainDbcoordination layer for best-known tip recovery, typed ledger checkpoint restore and replay, checkpoint retention/truncation, volatile-prefix promotion into immutable storage, rollback-time snapshot truncation, and opaque slot-indexedchain_dep_state/<slot-hex>.cborsidecar persistence without moving sync policy intonode. - Consensus hardening with
SecurityParam,ChainStatevolatile chain tracker, rollback depth enforcement, and stability window detection.
Live runtime data flows (2026-Q2 audit closure)
End-to-end wired and tested between the consensus, network, and node crates:
- Genesis density signal — ChainSync
RollForwardpushes per-peerslotobservations intocrates/consensus::DensityWindow; the runtimenode::sync::DensityRegistryaggregates per peer; thecrates/network::governor::run_governor_loopreads density intoPeerMetrics.densitybefore each tick;combined_scoreadds aHIGH_DENSITY_BONUSfor peers aboveLOW_DENSITY_THRESHOLD = 0.6, biasing hot demotion toward laggards. - Hot-peer scheduling weights —
crates/network::governor::HotPeerSchedulingcarries upstream-default per-MiniProtocolNumweights (BlockFetch=10, ChainSync=3, TxSubmission=2, KeepAlive=1, PeerSharing=1);node::runtime::apply_hot_weightsreads from the table on every promote-to-hot call so operator overrides viaset_hot_protocol_weightland at the next promotion; the mux writer’s per-round weighted round-robin readsWeightHandleatomically each round so updates take effect immediately. - Multi-peer BlockFetch dispatch primitives —
node::sync::partition_fetch_range_across_peerstranslates themax_concurrent_block_fetch_peersconfig knob into per-peerBlockFetchAssignments usingcrates/network::blockfetch_pool::split_range;node::sync::execute_multi_peer_blockfetch_plan<B>dispatches assignments concurrently viatokio::JoinSet, propagates errors withabort_all, and reassembles chunks in chain order viaReorderBuffer<B>;node::sync::dispatch_range_with_tentativewraps the above in the consensus-correctness contract (announcetry_set_tentative_headerbefore dispatch, callclear_tentative_trapon any chunk failure). The dispatcher itself stays tentative-state-agnostic so async tasks cannot race on mutation — the consensus boundary lives in the single layer.
The 2026-Q2 audit (docs/AUDIT_VERIFICATION_2026Q2.md) closed every confirmed-active parity slice plus the runtime integrations originally tracked as follow-ups (+117 cycle delta plus E-Phase6-Seam, E-Inline, E-Workers, E-Production-Spawn, E-Migration, E-Wire, E-Promote, E-Runbook, and Phase 6 observability). The 2026-Q3 operational pass on main then closed the audit C-1/H-1/H-2/M-1..M-8/L-1..L-9 findings and surfaced + fixed the byron→shelley fee-validation parity bug at preprod slot 518 460 (see docs/REAL_PREPROD_POOL_VERIFICATION.md). R238 closes the code-level Phase D.1 rollback sidecar hardening slice: nonce/OpCert ChainDepState now restores from the newest slot-indexed sidecar at or before the rollback point and replays stored blocks to the selected target. R239 closes Phase E.1 upstream pin maintenance by refreshing the SHA-anchored cardano-base vector tree and aligning all 6 documentary pins with live HEAD. R246 closes the observed preview Plutus blockers through refscan slot 901725 and fixes runtime resume preservation of recovered pool block counts. R247 fixes the Origin-start verified BlockFetch prefix window so clean preview replay stores the initial Byron slots before continuing. Live workspace coverage is 4.7K+ passing tests with cargo check-all, cargo test-all, and cargo lint green after the latest R246 patches, plus the focused R247 prefix regression.
Upstream parity testing is complete with CBOR golden round-trip tests and cross-subsystem integration tests. Wire-format field names align with official Cardano CDDL schemas.
The remaining production-readiness gate is operator-side: the mainnet sync endurance run (docs/MANUAL_TEST_RUNBOOK.md §2–9) and ongoing operational hardening; the parallel-fetch rehearsal (§6.5) was completed at R218 (docs/operational-runs/archive/2026-04-30-round-218-mainnet-multipeer-fetch-rate.md), measured a 67% throughput delta, and graduated the default max_concurrent_block_fetch_peers from 1 to 2 at R258 (docs/operational-runs/archive/2026-05-06-round-258-multipeer-default-graduation.md).
Topology parsing and preset-specific config resolution currently stay in node because they are operational concerns tied to the node binary’s config format. Once peer selection grows into ledger peers, peer sharing, or long-lived governor policy, that logic should move behind a network-crate boundary rather than continuing to grow in node.
Upstream-Aligned Networking Plan
- Phase 1: topology-model parity in
crates/networkis complete. Local and public root topology types now live inyggdrasil-network, with local-root support forhotValency,warmValency,diffusionMode, trustability, legacyvalencycompatibility, and upstream-styleuseBootstrapPeersanduseLedgerPeerssemantics. - Phase 2: root-set providers in
crates/networkis complete. The crate now exposes a resolved startup snapshot, mutable root-provider state, a refresh-oriented provider API for local, bootstrap, and public roots with disjointness and precedence handling, and a DNS-backed provider that covers local roots, bootstrap peers, and configured public roots. An optionalDnsRefreshPolicyadds time-gated re-resolution with exponential backoff on stale results (upstream-aligned 60 s base / 900 s max). That refresh path can also reconcile the peer registry directly. - Phase 3: peer registry state in
crates/networkis complete. The crate now exposes a minimal registry for peer source and status aligned with upstreamPeerSourceandPeerStatusconcepts, including local root, public root, bootstrap, ledger, big-ledger, and peer-share origins plus cold, cooling, warm, and hot states. Root-provider refreshes already reconcile through this registry, and the crate now also exposes set-reconciliation helpers for ledger, big-ledger, and peer-share inputs sonodedoes not need to hand-roll source bookkeeping. Ledger peer provider layer is complete:LedgerPeerProvidertrait,LedgerPeerSnapshotnormalization (deduplicates and enforces disjoint ledger/big-ledger sets),LedgerPeerProviderRefresh(combined/per-kind),apply_ledger_peer_refresh()helper,refresh_ledger_peer_registry()orchestration, andScriptedLedgerPeerProviderfor testing. Provider refreshes reconcile thePeerRegistryon crate-owned paths without node involvement. - Phase 4: consensus-network bridge for ledger peers is complete. The network crate owns the live orchestration seam (
live_refresh_ledger_peer_registry_observed) and applies policy reconciliation from consensus-fed(latest_slot, judgement, ledger_snapshot)plus snapshot-file observations. Node runtime now provides storage-backed source adapters only, and consumes the same observed judgement returned by the network orchestration for governor mode/churn decisions, removing duplicate node-side ledger judgement derivation while preserving startup/reconnect ledger-peer refresh behavior. - Phase 5: governor-style policy. Only after the previous phases exist should Yggdrasil add promotion, demotion, peer sharing, public-root refresh backoff, churn, or Genesis-specific security behavior. The implementation should keep policy separate from mechanism, as in upstream
PeerSelectionActions,PeerSelectionPolicy, and governor state modules. Status: complete. Promotion/demotion logic, peer sharing, public-root and big-ledger backoff, two-phase churn cycle, sensitive/normal mode, association mode, hot-peer scheduling, and density-biased demotion all live incrates/network::governor. -
Phase 6: runtime multi-session orchestration. Status: complete; default-on as of R258. End-to-end multi-peer concurrent BlockFetch is wired and active by default (
max_concurrent_block_fetch_peers = 2, matching upstreambfcMaxConcurrencyBulkSync). Operators wanting strict single-peer behaviour for replay/audit set the knob to1. The consensus-correctness contract is locked indispatch_range_with_tentativeand tested (announce → dispatch → clear-trap-on-failure).-
Warm-peer BlockFetch handle accessor. Done (Slice E-Phase6-Seam, commit
5d44c70).OutboundPeerManager::with_hot_block_fetch_clients<R>(&mut self, f: FnOnce(&mut [(SocketAddr, &mut BlockFetchClient)]) -> R) -> Rexposes hot peers’ BlockFetch handles as a borrow-checked slice;hot_peer_addrs()is the cheap snapshot for sizing concurrency. -
Sync-loop dispatch consumer. Done (Slice E-Wire, commit
9f87447).sync_batch_verified_with_tentativeacceptsblock_fetch: Option<&mut BlockFetchClient>plus an optionalMultiPeerDispatchContext { pool: SharedFetchWorkerPool, max_concurrent_knob }. WhenSomeANDeffective_block_fetch_concurrency(workers, knob) > 1, the per-RollForward fetch step reads the shared pool under a brieftokio::sync::RwLock::readguard, partitions the range viapartition_fetch_range_across_peers, callspool.dispatch_plan(...), and clears the tentative trap on error; otherwise the legacy single-peer path runs unchanged. -
Async-borrow lifetime constraint. Done (Slice E-Workers, commit
434af60; production spawncafc31a; migration0f612aa+7c06baf). Resolved by per-peer worker tasks: each peer’sBlockFetchClientis owned by its own tokio task that drains ampsc::Receiver<FetchRequest>queue. The sync loop dispatches viampsc::Sender::send(no&mut BlockFetchClientborrow crosses the await), and per-requestoneshot::Senderreturns the result. Workers run in parallel because each is its own task. Mirrors upstreamBlockFetch.ClientRegistryper-peerFetchClientStateVars+ STM exactly.PeerSession.block_fetch: Option<BlockFetchClient>plustake_block_fetch()lets the runtime move the client out into a worker without dropping the session. -
Connection-manager coordination. Done (Slice E-Migration
0f612aa+ Slice E-Promote1249f7f). The governor’sevaluate_hot_promotionsproduces N promotions per tick;apply_cm_actionscallsOutboundPeerManager::migrate_session_to_worker(peer)after successfulpromote_to_warmwhenmax_concurrent_block_fetch_peers > 1, emitting aNet.BlockFetch.Workerinfo trace. On peer disconnect, the now-asyncdemote_to_coldcallsunregister_worker(peer)to remove the worker andprune_closed()to GC dead workers. On reconnect, the next promote re-spawns a worker viaFetchWorkerHandle::spawn_with_block_fetch_client. The dispatcher’s error-propagation path (drop pending oneshot receivers, propagate first error) is correct for mid-fetch peer loss — surviving workers stay alive for subsequent iterations; only the offending request’s response is lost. -
Operational rollout. Done (R258,
2026-05-06). Defaultmax_concurrent_block_fetch_peers = 2ships matching upstreambfcMaxConcurrencyBulkSync = 2. Graduated from the prior single-peer default after R218 measured a 67% mainnet throughput delta on the multi-peer path. The §6.5 parallel-fetch rehearsal (steps 6.5a–6.5f covering 2- and 4-peer soak with hash-comparison vs. Haskell node and restart-resilience cycles) remains indocs/MANUAL_TEST_RUNBOOK.mdfor operators stress-testing knob > 2 or running endurance soaks. Phase 6 observability (yggdrasil_blockfetch_workers_registeredgauge +_migrated_totalcounter) gives operator dashboards the instrumentation needed to alert on stuck migration.
Reference: upstream
Ouroboros.Network.BlockFetch.ClientRegistry(per-peerFetchClientStateVars) +Ouroboros.Network.BlockFetch.Decision.fetchDecisions+Ouroboros.Network.BlockFetch.State.completeBlockDownload. -
Planning Constraints
- Prefer the official type split over local simplifications. Upstream distinguishes local root groups from public roots, public roots from bootstrap peers, and ledger peers from all configured root sets.
- Keep the dynamic parts asynchronous. Upstream treats local roots, public roots, ledger peers, and snapshot data as time-varying sources observed by the networking layer rather than one-shot startup inputs.
- Preserve root-set invariants. Official peer-selection state enforces that local roots and public roots do not overlap and that root counts respect peer-selection targets.
- Keep
nodefocused on orchestration. It should provide config loading, CLI overrides, and consensus-facing signals, but the network crate should own peer sources, peer state, retry policy, and future governor behavior.