
SRE Team Update

· 5 min read
John Lotoski
Service Reliability Engineer

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • The trace dispatcher was migrated out of cardano-node and into its own repo: hermod-tracing.

  • A number of CI improvements to Darwin builder configuration and general CI monitoring and alerting were merged.

  • The cardano-node 10.7.0 pre-release SRE contribution work was completed.

  • On the Dijkstra network, the van Rossem cost model for PV11 preparation was submitted, ratified and enacted.

  • Some cloud resources were relocated to more stable areas after disruption due to conflict in the Middle East.

Repository Work -- Merged

Cardano-monitoring

cardano-monitoring PR#6:

  • Enables Loki alert rule evaluation, and makes the alerts visible in Grafana

Cardano-node

cardano-node PR#6478:

  • Bumps iohkNix, updates MinNodeVersion to 10.7.0, and refreshes mainnet-peer-snapshot.json and other CI files.
  • Adds independent lsmDatabasePath NixOS option with uniqueness assertion and mutual-exclusion check against LMDB per instance.
  • Adds kes-agent/kes-agent-control (Linux only) and dmq-node (all platforms) to release binaries.
  • Adds cardano-node-dbtools NixOS test covering db-synthesizer, db-analyser, db-truncater, and the GHC-asserted synthesizer binary against a cardano-testnet create-env environment.
  • Adds --shelley-kes-agent-socket support to run-node and cardano-node-service.nix. Expands the KES assertion to cover three valid forging configurations: relay (none), direct KES key, and KES agent socket.
  • Adds CARDANO_TRACER_SOCKET_NETWORK_{ACCEPT,CONNECT} tracer socket options to run-node.
  • Hardens all node and tracer entrypoint/launch scripts with set -euo pipefail, safe ${VAR:-} expansion throughout, pre-flight file existence checks, and exec for clean process replacement.
  • Consolidates separate relay/block-producer run functions into a single runNode. Derives GENESIS_JSON from CARDANO_CONFIG directory to support non-mainnet deployments.
  • Renames runCommandNoCCLocal → runCommandLocal for nixpkgs 25.11.
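
The script hardening and forging-configuration checks described above can be illustrated with a minimal bash sketch. This is not the contents of run-node. The variable names KES_KEY and KES_AGENT_SOCKET and the helper functions forging_mode, require_file and launch are hypothetical. Rejecting a setup where both a KES key and an agent socket are given is an assumption based on the three valid configurations listed.

```shell
#!/usr/bin/env bash
# Sketch only: variable and function names are illustrative assumptions,
# not the actual contents of run-node.
set -euo pipefail

# Classify the forging setup from two optional variables.
# ${VAR:-} expands to an empty string instead of aborting under `set -u`.
forging_mode() {
  local kes_key="${KES_KEY:-}"
  local kes_socket="${KES_AGENT_SOCKET:-}"

  if [[ -n "$kes_key" && -n "$kes_socket" ]]; then
    # Assumption: a direct key and an agent socket are mutually exclusive.
    echo "error: set either a KES key or a KES agent socket, not both" >&2
    return 1
  elif [[ -n "$kes_key" ]]; then
    echo "direct-kes-key"
  elif [[ -n "$kes_socket" ]]; then
    echo "kes-agent-socket"
  else
    echo "relay"
  fi
}

# Pre-flight check: fail early with a clear message if a file is missing.
require_file() {
  local path="$1"
  if [[ ! -f "$path" ]]; then
    echo "error: required file not found: $path" >&2
    return 1
  fi
}

# Launch step: verify the given file, then exec the remaining arguments.
# exec replaces the shell process, so the child receives signals directly.
launch() {
  require_file "$1"
  shift
  exec "$@"
}
```

With neither variable set, forging_mode prints relay. A call such as launch /path/to/config.json some-command would first verify the file exists and then replace the shell with some-command.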

Devx-ci

devx-ci PR#145:

  • Upgrades Darwin CI infrastructure with version bumps, guest VM lifecycle management, and maintenance improvements.
  • Darwin-related infrastructure upgrades:
    • Nix 2.32 → 2.33-maintenance (hosts and guests); nix.package now set explicitly
    • nix-darwin 25.05 → 25.11 (guests)
    • UTM versioned per architecture:
      • aarch64-darwin: 4.5.4 → 5.0.2
      • x86_64-darwin: pinned to 4.6.5 as UTM > 4.6.5 breaks display driver compatibility with macOS Sequoia+ on Intel
    • ca-derivations experimental feature enabled on hosts and guests
    • Guest bootstrap nix version bumped 2.28.3 → 2.32.5
    • Adds a small C binary at /usr/local/bin/nix-daemon-launcher to work around macOS launchd sandbox blocking .dylib loads from the APFS /nix volume
    • darwin.sh gains --system/-s (default: aarch64-darwin) flags; argbash upgraded 2.10.0 → 2.11.0
    • Auth-keys-hub is now utilized by the guests and legacy ops-lib usage has been removed
    • Percentage-based garbage collection thresholds have been implemented, derived from disk size rather than fixed values
  • Bumped nixpkgs-gh-runners v2.330.0 → v2.332.0
  • See the PR description for additional details
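
The idea of deriving garbage collection thresholds from disk size can be sketched in bash as follows. The helpers percent_of and gc_settings and the 10% and 20% figures are hypothetical. Mapping the result onto Nix's min-free and max-free settings is an assumption about how such thresholds might be applied, not a description of what PR#145 does.

```shell
#!/usr/bin/env bash
# Sketch only: percent_of, gc_settings and the 10%/20% figures are
# illustrative assumptions, not values taken from the PR.
set -euo pipefail

# Print PCT percent of BYTES using integer arithmetic.
percent_of() {
  local bytes="$1" pct="$2"
  echo $(( bytes * pct / 100 ))
}

# Emit nix.conf-style lines for a disk of the given size in bytes.
gc_settings() {
  local disk_bytes="$1"
  echo "min-free = $(percent_of "$disk_bytes" 10)"
  echo "max-free = $(percent_of "$disk_bytes" 20)"
}
```

For a 500 GB disk, gc_settings 500000000000 prints min-free = 50000000000 and max-free = 100000000000, so the same configuration scales to hosts and guests with different volume sizes.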

devx-ci PR#146:

  • Deploy hydra-tools hydra-github-bridge 0.2.1.0

devx-ci PR#147:

  • Accommodate the nixos hydra-github-bridge module with an extra secrets file

devx-ci PR#148:

  • Pause the new hydra-github-bridge usage until after pre-release of 10.7.0
  • Remove zramSwap to make more physical RAM available to Hydra
  • Reduce max concurrent evals/jobs to 4 to keep throughput manageable

devx-ci PR#149:

  • Rekey all secrets to accommodate contributions from another SRE

devx-ci PR#150:

  • Improves alerting infrastructure using Mimir Alertmanager and Loki ruler.
  • Sets up Dead Man's Snitch.
  • Infrastructure changes:
    • Moved OpenTofu configuration from perSystem/packages/opentofuConfig/ to flake/opentofu/ for consistency with other SRE repos
    • Added Mimir provider for Prometheus-style alerting rules
    • Added Loki provider for log-based alerting rules
    • Configured Mimir Alertmanager to route all alerts to PagerDuty
    • Added various metrics-based and log-based alerts
  • See the PR description for additional details

devx-ci PR#151:

  • Alert on every OOM and set up annotations to show Alertmanager alerts.

Hermod-tracing

hermod-tracing PR#2:

  • Adds hydraJobs to flake top level attrs
  • Adds aarch64-linux
  • Tests hydra integration
  • Makes explicit required and nonrequired aggregate jobs
  • Moves default pkgs to trace-dispatcher

Iohk-nix

iohk-nix PR#610:

  • Include updated configs from respun dijkstra net from 2026-02-19
  • Re-add sanchonet to the available environments since it is persisting as a long-lived community test network
  • Update per-environment useLedgerAfterSlot values
  • Update per-environment peer-snapshot.json files
  • Update MinNodeVersion to 10.7.0 as the peer-snapshot files introduced a version-breaking change

Usdcx-infra

usdcx-infra PR#5:

  • Adds missing series eval resolution targets

Repository Work In Progress -- PRs and Branches