Monitoring

A production node needs three observability surfaces: metrics for dashboards and alerting, structured traces for debugging, and a health endpoint for orchestrator probes.

Prometheus metrics

Enable with --metrics-port:

$ yggdrasil-node run ... --metrics-port 12798

The node binds an HTTP server on 127.0.0.1:12798 exposing:

GET /metrics — Prometheus text exposition format.
GET /metrics/json — JSON snapshot of the same counters.
GET /health — JSON liveness probe (status, uptime_seconds, blocks_synced, current_slot).
GET /debug, GET /debug/metrics, GET /debug/metrics/prometheus, GET /debug/health — upstream-style aliases.

The server intentionally binds to 127.0.0.1 only. To scrape from a remote Prometheus, put a reverse proxy (nginx, Caddy) in front of it or use an SSH tunnel.
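For ad-hoc checks outside Prometheus, the text exposition output of /metrics can be read with a few lines of Python. This is a minimal sketch, not part of Yggdrasil: the function name parse_prometheus_text is hypothetical, and the sample body is illustrative rather than captured from a real node.

```python
from typing import Dict


def parse_prometheus_text(body: str) -> Dict[str, float]:
    """Parse samples from Prometheus text exposition format.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    Labelled samples, such as histogram buckets, are stored under
    their full name with the label set included.
    """
    samples: Dict[str, float] = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A sample line is "<name>[{labels}] <value> [timestamp]".
        if "}" in line:
            name, _, rest = line.rpartition("}")
            name += "}"
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        samples[name] = float(fields[0])
    return samples


# Illustrative body; a live one could be fetched on the node host with
# urllib.request.urlopen("http://127.0.0.1:12798/metrics").
SAMPLE = """\
# TYPE yggdrasil_blocks_synced counter
yggdrasil_blocks_synced 523109
# TYPE yggdrasil_current_slot gauge
yggdrasil_current_slot 117425831
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["yggdrasil_current_slot"])
```

The parser keeps only the value of each sample and ignores optional timestamps, which is enough for spot checks but not a replacement for a full Prometheus client.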

Counters and gauges

Yggdrasil emits 40+ counters, gauges, and histograms. Selected highlights:

Sync

Metric	Type	Description
`yggdrasil_blocks_synced`	counter	Total blocks applied during this process lifetime.
`yggdrasil_current_block_number`	gauge	Latest block number applied.
`yggdrasil_current_slot`	gauge	Slot of the latest applied block.
`yggdrasil_checkpoint_slot`	gauge	Slot of the most recent ledger checkpoint.
`yggdrasil_rollbacks`	counter	RollBackward events received.
`yggdrasil_rollback_depth_blocks`	histogram	Rollback depth distribution for R225/R238 rollback recovery validation.
`yggdrasil_stable_blocks_promoted`	counter	Volatile → immutable promotions.
`yggdrasil_reconnects`	counter	Sync session reconnects.
`yggdrasil_batches_completed`	counter	Verified batches applied.
`yggdrasil_apply_batch_duration_seconds`	histogram	Ledger apply duration per verified batch.
`yggdrasil_fetch_batch_duration_seconds`	histogram	BlockFetch duration per verified batch.

Mempool

Metric	Type	Description
`yggdrasil_mempool_tx_count`	gauge	Current transactions in the mempool.
`yggdrasil_mempool_bytes`	gauge	Current mempool byte total.
`yggdrasil_mempool_tx_added`	counter	Successfully admitted transactions.
`yggdrasil_mempool_tx_rejected`	counter	Rejected transactions.

Connection manager

Metric	Type	Description
`yggdrasil_cm_full_duplex_conns`	gauge	Full-duplex peer count.
`yggdrasil_cm_duplex_conns`	gauge	Duplex (uni- + bi-directional) peer count.
`yggdrasil_cm_unidirectional_conns`	gauge	One-way peer count.
`yggdrasil_cm_inbound_conns`	gauge	Currently accepted inbound.
`yggdrasil_cm_outbound_conns`	gauge	Currently established outbound.
`yggdrasil_inbound_connections_accepted`	counter	Cumulative inbound accept count.
`yggdrasil_inbound_connections_rejected`	counter	Inbound connections rejected by rate limit.

BlockFetch workers (Phase 6)

Metric	Type	Description
`yggdrasil_blockfetch_workers_registered`	gauge	Per-peer fetch workers currently active.
`yggdrasil_blockfetch_workers_migrated_total`	counter	Cumulative warm-to-hot migrations into the worker pool.

In a healthy multi-peer setup, yggdrasil_blockfetch_workers_registered is roughly equal to the number of hot peers (usually 2 with max_concurrent_block_fetch_peers = 2) and yggdrasil_blockfetch_workers_migrated_total keeps increasing over the run.

Peer lifetime and egress

Metric	Type	Description
`yggdrasil_peer_lifetime_sessions_total`	counter	Cumulative warm-peer sessions across reconnects.
`yggdrasil_peer_lifetime_failures_total`	counter	Cumulative peer session failures.
`yggdrasil_peer_lifetime_bytes_in_total`	counter	Aggregate bytes received from peer block fetch.
`yggdrasil_peer_lifetime_unique_peers`	gauge	Distinct peer addresses observed by the runtime.
`yggdrasil_peer_lifetime_handshakes_total`	counter	Successful NtN handshakes across peers.
`yggdrasil_blockfetch_server_bytes_served_total`	counter	Bytes served by Yggdrasil’s BlockFetch server.
`yggdrasil_chainsync_server_bytes_served_total`	counter	Bytes served by Yggdrasil’s ChainSync server.
`yggdrasil_keepalive_server_bytes_served_total`	counter	Bytes served by Yggdrasil’s KeepAlive server.
`yggdrasil_txsubmission_server_bytes_served_total`	counter	Bytes served by Yggdrasil’s TxSubmission2 server.
`yggdrasil_peersharing_server_bytes_served_total`	counter	Bytes served by Yggdrasil’s PeerSharing server.

The server egress counters are aggregate-only to avoid high-cardinality Prometheus labels. Per-peer egress is folded into runtime lifetime stats internally.

Process

Metric	Type	Description
`yggdrasil_uptime_seconds`	gauge	Process uptime.

Sample Prometheus scrape config

- job_name: yggdrasil
  scrape_interval: 15s
  static_configs:
    - targets: ['127.0.0.1:12798']
      labels:
        node_role: relay
        network: mainnet

Grafana

Grafana dashboards built for upstream cardano-node will work against Yggdrasil with one substitution: replace the cardano_node_metrics_* prefix with yggdrasil_. The metric semantics are aligned where the upstream metric exists, with a couple of names differing where Yggdrasil added new instrumentation (e.g. the blockfetch_workers_* family is Yggdrasil-specific).
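The prefix substitution can be applied to an exported dashboard file in one pass. The sketch below assumes the dashboard is available as a JSON export and that a plain textual replacement of the prefix is sufficient; retarget_dashboard is a hypothetical helper, and the panel expression in the example is illustrative, not taken from a real upstream dashboard.

```python
import json

UPSTREAM_PREFIX = "cardano_node_metrics_"
YGGDRASIL_PREFIX = "yggdrasil_"


def retarget_dashboard(dashboard_json: str) -> str:
    """Return dashboard JSON with the upstream metric prefix swapped."""
    # Round-trip through json so malformed input fails early.
    dashboard = json.loads(dashboard_json)
    text = json.dumps(dashboard, indent=2)
    return text.replace(UPSTREAM_PREFIX, YGGDRASIL_PREFIX)


# Illustrative one-panel dashboard with a made-up metric name.
example = json.dumps(
    {"panels": [{"targets": [{"expr": "rate(cardano_node_metrics_example[5m])"}]}]}
)
print(retarget_dashboard(example))
```

After conversion, it is worth comparing the metric names in the dashboard against the output of /metrics, since names differ where Yggdrasil added its own instrumentation.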

Structured tracing

The trace dispatcher writes namespace-scoped events. The namespace is dotted, e.g. Net.BlockFetch.Worker, ChainDB.AddBlockEvent.AddedBlock, Mempool.Eviction, Node.BlockProduction. Per-namespace settings control:

Severity threshold — Debug, Info, Notice, Warning, Error, Critical.
Backends — list of destinations (Stdout MachineFormat, Forwarder, etc.).
maxFrequency — Hz cap on emission per namespace.
detail — DMinimal, DNormal, DDetailed, DMaximum.

Configure in config.json:

{
  "TraceOptions": {
    "": {
      "severity": "Notice",
      "backends": ["Stdout HumanFormatColoured"]
    },
    "ChainDB": {
      "severity": "Info",
      "detail": "DDetailed"
    },
    "Net.BlockFetch": {
      "severity": "Info",
      "maxFrequency": 5.0
    },
    "Node.Recovery.Checkpoint": {
      "severity": "Info",
      "maxFrequency": 1.0
    }
  }
}

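The empty-string key is the root default. When several keys match a namespace, the longest matching prefix wins: Net.BlockFetch.Worker is governed by the Net.BlockFetch entry, while Mempool.Eviction, which has no entry of its own, falls back to the root.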
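The longest-prefix rule can be illustrated with a small lookup function. This is a sketch of the rule as described here, not Yggdrasil's dispatcher code: matching_key is a hypothetical name, and matching on whole dotted segments (so that "Net" would not match "Network.Foo") is an assumption. How individual fields are inherited from shorter prefixes is not covered.

```python
from typing import Any, Dict


def matching_key(namespace: str, trace_options: Dict[str, Any]) -> str:
    """Return the TraceOptions key with the longest prefix match.

    A key matches when it is the empty string (root), equals the
    namespace, or is followed by a "." in the namespace.
    """
    best = None
    for key in trace_options:
        if key == "" or namespace == key or namespace.startswith(key + "."):
            if best is None or len(key) > len(best):
                best = key
    if best is None:
        raise KeyError(f"no TraceOptions entry matches {namespace!r}")
    return best


# Keys taken from the config.json example; values trimmed for brevity.
TRACE_OPTIONS = {
    "": {"severity": "Notice"},
    "ChainDB": {"severity": "Info"},
    "Net.BlockFetch": {"severity": "Info"},
    "Node.Recovery.Checkpoint": {"severity": "Info"},
}

for ns in ("Net.BlockFetch.Worker", "ChainDB.AddBlockEvent.AddedBlock", "Mempool.Eviction"):
    print(ns, "->", repr(matching_key(ns, TRACE_OPTIONS)))
```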

Forwarder backend

For aggregation across many nodes, configure the Forwarder backend with a Unix socket destination. The wire format is CBOR-encoded trace events compatible with upstream cardano-tracer, so you can plug a single tracer in front of Haskell and Yggdrasil nodes interchangeably.

{
  "TraceOptions": {
    "": {
      "backends": ["Forwarder"]
    }
  },
  "TraceOptionForwarder": {
    "address": {
      "filePath": "/var/run/cardano-tracer.sock"
    },
    "mode": "Initiator"
  }
}

Health endpoint

$ curl -s http://127.0.0.1:12798/health
{"status":"ok","uptime_seconds":86412,"blocks_synced":523109,"current_slot":117425831}

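Use this for Kubernetes liveness probes, load-balancer health checks, and similar. Because the server binds to 127.0.0.1 only, any probe that connects from outside the node's network namespace (including kubelet httpGet probes, which target the pod IP) needs a proxy in front of the port.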

A Kubernetes example:

livenessProbe:
  httpGet:
    path: /health
    port: 12798
  initialDelaySeconds: 30
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health
    port: 12798
  initialDelaySeconds: 60
  periodSeconds: 15
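Where an HTTP probe cannot reach the loopback-bound port, a check executed next to the node is one alternative. The sketch below is an assumption-laden example, not a shipped tool: health_ok and check_node are hypothetical names, a Python interpreter is assumed to be available where the check runs, and only the "ok" status shown in the sample response is treated as healthy because other status values are not documented here.

```python
import json
from urllib.request import urlopen

HEALTH_URL = "http://127.0.0.1:12798/health"


def health_ok(body: str) -> bool:
    """Return True when the payload parses and reports status "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_node(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint and evaluate it; False on any failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return health_ok(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or undecodable body.
        return False


sample = '{"status":"ok","uptime_seconds":86412,"blocks_synced":523109,"current_slot":117425831}'
print(health_ok(sample))
```

A wrapper script would call check_node() and exit non-zero when it returns False, which is the convention exec-style probes rely on.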

Suggested alerts

A starting alerting baseline:

Alert	Expression	Severity
Node down	`up{job="yggdrasil"} == 0`	critical
Slot lag > 600 s (mainnet)	`time() - (yggdrasil_current_slot + 1591566291) > 600`	warning
Slot lag > 3600 s (mainnet)	as above, threshold 3600	critical
Frequent reconnects	`rate(yggdrasil_reconnects[5m]) > 1`	warning
Excessive rollbacks	`rate(yggdrasil_rollbacks[10m]) > 0.1`	warning
Deep rollback spike	`histogram_quantile(0.99, rate(yggdrasil_rollback_depth_blocks_bucket[1h])) > 2160`	warning
Stuck migration	`yggdrasil_blockfetch_workers_registered < hot_peer_count`	warning
Mempool growing unbounded	`yggdrasil_mempool_bytes > 10485760`	warning
Inbound rate-limit hits	`rate(yggdrasil_inbound_connections_rejected[5m]) > 0.5`	info

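Adjust thresholds based on your network and traffic profile. The slot-lag rules must convert the slot to wall-clock time before comparing it with time(), because a slot number is not a Unix timestamp; on mainnet, a Shelley-era slot s corresponds to Unix time s + 1591566291.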
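The slot-lag arithmetic can be checked offline before it goes into an alert rule. The sketch below assumes mainnet parameters that are not stated in this document: the Shelley era starts at slot 4492800, which corresponds to Unix time 1596059091, and every slot since then lasts one second. Other networks need their own constants, and the function names are hypothetical.

```python
import time
from typing import Optional

# Mainnet assumptions: Shelley start slot, its Unix time, slot length.
SHELLEY_START_SLOT = 4_492_800
SHELLEY_START_UNIX = 1_596_059_091
SLOT_LENGTH_SECONDS = 1


def slot_to_unix(slot: int) -> int:
    """Convert a Shelley-era mainnet slot number to Unix seconds."""
    if slot < SHELLEY_START_SLOT:
        raise ValueError("Byron-era slots use a different slot length")
    return SHELLEY_START_UNIX + (slot - SHELLEY_START_SLOT) * SLOT_LENGTH_SECONDS


def slot_lag_seconds(current_slot: int, now: Optional[float] = None) -> float:
    """Seconds between `now` and the wall-clock time of `current_slot`."""
    if now is None:
        now = time.time()
    return now - slot_to_unix(current_slot)


# Slot from the /health sample, evaluated 10 minutes after that slot.
slot = 117_425_831
print(slot_lag_seconds(slot, now=slot_to_unix(slot) + 600))
```

With one-second slots the conversion reduces to adding a constant (1596059091 - 4492800 = 1591566291) to the slot number, which is the form a PromQL expression can use directly.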

What “synced” means

Practical synced check:

abs(yggdrasil_current_slot - <expected_mainnet_tip>) < 60

Where <expected_mainnet_tip> comes from a trusted second source (e.g. another node or an explorer’s API). Within 60 slots of the upstream tip (about 1 minute on mainnet, where a slot lasts one second) is operationally synced for most purposes; within 10 slots is “block-production ready”.
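The two thresholds can be folded into a single classification. This is a minimal sketch: sync_state and its return labels are hypothetical, and treating "within N slots" as a strict less-than comparison follows the form of the check above but is otherwise an assumption.

```python
SYNCED_TOLERANCE_SLOTS = 60
PRODUCTION_TOLERANCE_SLOTS = 10


def sync_state(current_slot: int, expected_tip_slot: int) -> str:
    """Classify the node by its slot distance from a trusted tip."""
    distance = abs(current_slot - expected_tip_slot)
    if distance < PRODUCTION_TOLERANCE_SLOTS:
        return "block-production-ready"
    if distance < SYNCED_TOLERANCE_SLOTS:
        return "synced"
    return "behind"


# current_slot would come from yggdrasil_current_slot or /health;
# the expected tip from the trusted second source.
print(sync_state(117_425_831, 117_425_850))
```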

Where to go next

Block Production — extend monitoring to track forge-loop events.
Troubleshooting — interpret common error traces.