75 posts tagged with "sre"

SRE Team Update

· 4 min read
John Lotoski
Service Reliability Engineer

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Cardano-node 11.0.1 was released which supports the PV11 hard fork!

  • Preview network was upgraded to node 11.0.1 in preparation for the PV11 van Rossem hard fork.

  • A van Rossem PV11 cost model governance vote was cast on preprod.

  • A Leios testnet was stood up in cardano-playground with block producers, relays, dbsync, faucet, and custom alloy-based monitoring.

  • Cardano-parts and cardano-playground were updated with cardano-node 10.7.1, cardano-db-sync 13.7.0.4, mithril 2617.0, and Linux kernel 6.18 with ZFS 2.4 for LSM compatibility; EC2 metadata was also hardened to require IMDSv2.

  • x86_64-darwin support was dropped from a number of repos ahead of the planned nixpkgs deprecation.

Repository Work -- Merged

Cardano-haskell-packages

cardano-haskell-packages PR#1363:

  • Adds cardano-node-11.0.1 to CHaP

Cardano-monitoring

cardano-monitoring PR#7:

  • Adds a sandbox monitoring server with explicit OAuth allow list
  • Bumps org tags on resources
  • Sets http_tokens to required on EC2 resources for IMDSv2 enforcement

Cardano-node

cardano-node PR#6541:

  • Prepares cardano-node for PV11 as the default protocol version and the experimental hard fork gated to PV12; SRE contributed CI and iohkNix updates

cardano-node PR#6555:

  • Bumps iohkNix for the blst flake input narHash and lastModified update
  • Sets cardano-node cabal version to 11.0.1

Cardano-mainnet

cardano-mainnet PR#44:

  • Resizes most root EBS volumes from 300 GB to 600 GB across all boot, bscale, iog, and iogp machine groups to accommodate chain growth
  • Hardens EC2 metadata to require IMDSv2
  • Updates kernel to 6.18 with ZFS 2.4 overlay for LSM compatibility
  • See the PR description for additional details

Cardano-parts

cardano-parts PR#82:

  • Bumps cardano-node to 10.7.1, cardano-cli to 10.16.0.0, cardano-db-sync to 13.7.0.4, and mithril to 2617.0
  • Sets the default Linux kernel to 6.18 for cardano-node >= 10.7.0 LSM compatibility and updates ZFS to 2.4
  • Fixes the ZFS ARC max null check in the AMI module
  • Hardens EC2 metadata to require IMDSv2 (http_tokens = "required")
  • Adds extraJournalReceivers option to the Grafana Alloy nixosModule for additional Loki journal forwarding targets
  • Adds NODE_CONFIG_SKIP_COPY env var and CARDANO_NODE_SHELL_BIN support to the entrypoint
  • Adds leios environment support to the Justfile with start-node, stop-all, query-tip recipes
  • Adds nix-copy-to-machine and nix-store-pin recipes
  • Adds cardano-node binary override support via flakeModules/pkgs.nix
  • Fixes IPv6 AAAA DNS record creation to be conditional on VPC IPv6 availability
  • See the PR description for additional details
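
For readers unfamiliar with what requiring IMDSv2 changes for software running on an instance, the following is a minimal Python sketch of the token-based request flow. It is not taken from cardano-parts; the helper names are hypothetical, and only the endpoint and header names come from the AWS IMDSv2 protocol. With http_tokens = "required", a metadata GET without the session token is rejected.

```python
import urllib.request

# Link-local address of the EC2 instance metadata service.
IMDS_BASE = "http://169.254.169.254/latest"


def build_token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Build the PUT request that asks IMDS for a session token (hypothetical helper)."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(path: str, token: str) -> urllib.request.Request:
    """Build a GET request for a metadata path, carrying the session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path.lstrip('/')}",
        method="GET",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_instance_id(timeout: float = 2.0) -> str:
    """Fetch the instance id; this only succeeds when run on an EC2 instance."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    request = build_metadata_request("instance-id", token)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode()
```

The two builder functions can be inspected anywhere; fetch_instance_id needs to run on an actual instance.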

Cardano-playground

cardano-playground PR#57:

  • Adds Leios testnet environment with block producers, relays, dbsync, faucet, and custom alloy-based monitoring filtering for Leios-specific trace namespaces
  • Adds nushell scripts for pool delegation management, UTxO defragmentation, and bulk fund transfers
  • Adds low-threshold guardrails plutus script for playground governance testing
  • Converts mainnet1-rel-a-2 from LMDB to LSM storage backend
  • Consolidates dijkstra relay fleet by removing redundant relay nodes
  • Hardens EC2 metadata to require IMDSv2
  • Updates kernel to 6.18 with ZFS 2.4 overlay for LSM compatibility
  • Updates cardano-book for 10.7.1 release and pre-release configurations
  • See the PR description for additional details

Cardano-sandbox

cardano-sandbox:

  • Creates a new environment for testing major stack refactors, migrations and other complex changes that are difficult or risky to execute on other live environments.

Iohk-nix

iohk-nix PR#612:

  • Updates useLedgerAfterSlot values and peer-snapshot.json files for node 11.0.0

iohk-nix PR#613:

  • Sets ExperimentalHardFork to false for node 11.0.0 on networks not yet forked to PV11, ensuring compatibility with older node versions

iohk-nix PR#614:

  • Fixes blst flake input with correct lastModified and narHash and sets libblst to explicit version 0.3.15
  • Adds GHA validate flake lock CI for push and PR to catch flake input regressions

Repository Work In Progress -- PRs and Branches

SRE Team Update

· 4 min read
John Lotoski
Service Reliability Engineer

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Cardano-parts, cardano-playground and cardano-mainnet were updated with cardano-node 10.6.4, cardano-db-sync release 13.6.0.8 and pre-release 13.7.0.2, and nix was patched for security vulnerabilities GHSA-g3g9-5vj6-r3gj / CVE-2026-39860.

  • The ZFS AMI module was enhanced with a configurable percentage-based ARC cache sizing option derived from the node RAM.

  • Buildkite infrastructure was updated to accommodate Daedalus Linux CI support.

  • A van Rossem PV11 cost model governance vote was cast on preview.

Repository Work -- Merged

Cardano-mainnet

cardano-mainnet PR#43:

  • Bumps cardano-node to 10.6.3, and then 10.6.4 with corresponding deployments
  • Bumps cardano-db-sync to 13.6.0.8 and deploys to dbsyncs
  • Adjusts alerts for the remaining block producer to reflect current stake levels
  • Migrates resources out of me-central-1 due to stability issues and into ap-southeast-6
  • Destroys retired block producer machines and secrets
  • Updates webserver and DNS resources to properly serve IOGP metadata for remaining pools that are unused and unfunded but not retired
  • Adds CPU/memory usage panels and totals to cardano-node.json and cardano-node-new-tracing.json Grafana dashboards
  • See the PR description for additional details

Cardano-parts

cardano-parts PR#81:

  • Bumps cardano-node release to 10.6.4, cardano-db-sync release to 13.6.0.8, and cardano-db-sync pre-release to 13.7.0.2
  • Bumps nix to address security vulnerabilities GHSA-g3g9-5vj6-r3gj and CVE-2026-39860
  • Extends the ZFS AMI ami.nix nixosModule with a configurable boot.zfs.zfsArcPct option for percentage-based ARC cache sizing
  • Updates the AWS EC2 spec to include new machine types missing in the existing spec
  • Fixes a race condition in profile-aws-ec2-ephemeral.nix where chown could fail on a disappeared ephemeral file
  • Fixes a tcpTxOpt colmena module breaking change introduced in nixpkgs 25.11
  • Adds CPU/memory usage panels and totals to cardano-node.json and cardano-node-new-tracing.json Grafana dashboards
  • Adds non-NixOS machine handling to consistency-checking and update-ips recipes
  • See the PR description for additional details
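
As an illustration of percentage-based ARC sizing, here is a minimal Python sketch of the arithmetic, assuming the percentage is applied to total system RAM and the result is used as a byte value for the ZFS zfs_arc_max parameter. The function names are hypothetical and the actual boot.zfs.zfsArcPct implementation in ami.nix may differ.

```python
def read_total_ram_bytes(meminfo_path: str = "/proc/meminfo") -> int:
    """Read total RAM from a Linux meminfo file (MemTotal is reported in KiB)."""
    with open(meminfo_path) as meminfo:
        for line in meminfo:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError(f"MemTotal not found in {meminfo_path}")


def arc_max_bytes(total_ram_bytes: int, arc_pct: int) -> int:
    """Return an ARC size cap as a whole-number percentage of total RAM."""
    if not 1 <= arc_pct <= 100:
        raise ValueError("arc_pct must be between 1 and 100")
    # Integer arithmetic avoids floating point rounding on large byte counts.
    return total_ram_bytes * arc_pct // 100
```

For example, a 16 GiB machine with a 50 percent setting yields an 8 GiB cap, so the same setting scales across machine sizes without per-machine byte values.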

Cardano-playground

cardano-playground PR#56:

  • Bumps cardano-node to 10.6.4, cardano-db-sync to 13.6.0.8, and cardano-db-sync pre-release to 13.7.0.2 with deployments to release environments
  • Extends ami.nix with configurable boot.zfs.zfsArcPct option for percentage-based ZFS ARC cache sizing
  • Fixes buildkite NixOS container startup race condition with sops and repurposes a buildkite machine for a Daedalus queue
  • Adds CPU/memory usage panels and totals to Grafana dashboards
  • Updates cardano-book for 10.6.3 and 10.6.4 node releases
  • Casts a governance vote on preview for the van Rossem PV11 cost model update with signed rationale and vote transaction
  • See the PR description for additional details

Devx-ci

devx-ci PR#154:

  • This should bring hydra-tools back up to a sufficiently recent release (hydra-github-bridge 0.2.1.0), which will make it possible to layer other fixes on top of it (for example, recovering hung PostgreSQL connections and not crashing while reading build logs).

devx-ci PR#155:

  • Adds 3 types of Oakhost Darwin machines, each with 3 available hydra build slots initially, pending further tuning
  • These machines will likely be short-lived until a new hardware offering from Oakhost is available in a few months
  • Adjusts the number of hydra eval worker threads to 3, as 4 tends to cause semi-regular OOMs with 4 concurrent large evals

SRE Team Update

· 4 min read
John Lotoski
Service Reliability Engineer

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Cardano-parts and cardano-playground were updated with cardano-node 10.6.2, cardano-node pre-release 10.7.0, nixpkgs 25.11, ZFS AMI support, new Loki log dashboards, and extensive monitoring improvements including per-machine absent metrics alerting and mempool timeout alerts.

  • The dijkstra network was fully respun with updated secrets, configs, and a Van Rossem PV11 cost model governance action prepared.

  • CloudFormation stack hardening was applied: dedicated S3 server access logs bucket, TLS-only bucket policies, DynamoDB deletion protection with PITR, and KMS encryption.

  • Ouroboros-network-ops was brought up to a recent cardano-parts release with new resource tagging for CloudFormation and OpenTofu resources.

Repository Work -- Merged

Cardano-airgap

cardano-airgap PR#13:

  • Adds midnight-cli to the air-gapped signing toolset

Cardano-mainnet

cardano-mainnet PR#42:

  • Deploys all nodes to 10.6.2, and all dbsyncs to 13.6.0.7
  • Upgrades nixpkgs to 25.11 and nix to 2.33-maint
  • Adds bootstrap OpenTofu environment and ZFS AMI NixOS module support
  • Adds Loki log shipping with four new log dashboards; removes superseded node-exporter Loki dashboard
  • Adds per-machine machine_metrics_absent alert, tx mempool timeout alerts, and tightened blockHeight threshold
  • Hardens CloudFormation stack with TLS-only policies, DynamoDB deletion protection and PITR, and KMS encryption
  • Rotates the mainnet pool KES keys
  • See the PR description for additional details

Cardano-parts

cardano-parts PR#79:

  • Bumps cardano-node release to 10.6.2, pre-release to 10.7.0, cardano-db-sync release to 13.6.0.7, pre-release to 13.7.0.1, and other component updates
  • Bumps nixpkgs to 25.11 and nix to 2.33-maint with required compatibility fixes
  • Introduces ZFS AMI support via a new ami.nix nixosModule with tank/{root,nix,home,state} dataset layout and new bootstrap OpenTofu environment
  • Removes the deprecated Grafana Agent (EOL 2025-11-01), migrating fully to Grafana Alloy with Loki log shipping support
  • Adds four new Loki log dashboards: cardano-node-logs.json, cardano-node-logs-json.json, systemd-logs.json, and systemd-logs-json.json
  • Adds per-machine machine_metrics_absent alert with multi-offset detection; adds tx mempool timeout alerts; tightens blockHeight unchanged alert from 10 to 7 minutes
  • Hardens CloudFormation stack: dedicated S3 server access logs bucket, TLS-only bucket policies, DynamoDB deletion protection with PITR, and KMS encryption
  • Adds Van Rossem PV11 cost model JSON to template cost-models
  • Restructures cardano-node.json dashboard with mempool timeout panels, instance filtering, and restart/version-change annotations
  • Re-adds sanchonet support to process-compose stacks and template scripts
  • See the PR description for additional details
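
To illustrate what a "blockHeight unchanged" check with a 7 minute window evaluates, here is a minimal Python sketch. It is an assumption-laden analogue of the alert, not the actual alert rule from cardano-parts; the function name and the sample format are hypothetical.

```python
from typing import Sequence, Tuple


def block_height_stalled(
    samples: Sequence[Tuple[float, int]],
    now: float,
    window_seconds: float = 7 * 60,
) -> bool:
    """Return True if the latest block height has not changed for the whole window.

    samples holds (unix_timestamp, block_height) pairs, oldest first.
    """
    if not samples:
        # No data is a different condition (absent metrics), not a stall.
        return False
    latest_height = samples[-1][1]
    first_seen = samples[-1][0]
    # Walk backwards through the trailing run of samples with the same height.
    for timestamp, height in reversed(samples):
        if height != latest_height:
            break
        first_seen = timestamp
    return now - first_seen >= window_seconds
```

Tightening the window from 10 to 7 minutes means the same stalled tip is reported three minutes sooner, at the cost of a higher chance of firing during a naturally long gap between blocks.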

Cardano-playground

cardano-playground PR#55:

  • Sets cardano-node release to 10.6.2, pre-release to 10.7.0, cardano-db-sync to 13.6.0.7, pre-release to 13.7.0.1
  • Upgrades nixpkgs to 25.11 and nix to 2.33-maint
  • Adds bootstrap OpenTofu environment and ZFS AMI NixOS module support
  • Adds Loki log shipping with four new log dashboards; removes superseded node-exporter Loki dashboard
  • Adds per-machine machine_metrics_absent alert, tx mempool timeout alerts, and tightened blockHeight threshold
  • Creates dijkstra respin with new secrets, updated network configs, and Van Rossem PV11 cost model governance action
  • Converts preview3-bp-c-1 and mainnet1-rel-a-3 to LSM storage backend
  • Hardens CloudFormation stack with TLS-only policies, DynamoDB deletion protection and PITR, and KMS encryption
  • Large colmena cleanup: group-based import system, removes metrics-scraper module
  • Re-integrates sanchonet via upstream iohk-nix
  • See the PR description for additional details

Ouroboros-network-ops

ouroboros-network-ops PR#30:

  • Bumps cardano-parts from v2025-06-24 to post-v2025-08-14
  • Adds new resource tags to CloudFormation and OpenTofu resources: owner, project, costCenter
  • Updates pre-existing organization and environment tags
  • Applies breaking change updates from cardano-parts release

Devx-ci

devx-ci PR#152:

  • Bumps nix in linux and darwin hosts and guests to resolve: GHSA-g3g9-5vj6-r3gj / CVE-2026-39860
  • Also bumps the darwin guest bootstrap nixpkgs version in apply.sh from 25.05 to 25.11

SRE Team Update

· 5 min read
John Lotoski
Service Reliability Engineer

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Trace dispatcher was migrated out of cardano-node and into its own repo: hermod-tracing

  • A number of CI improvements to Darwin builder configuration and general CI monitoring and alerting were merged.

  • The cardano-node 10.7.0 pre-release SRE contribution work was completed.

  • The van Rossem cost model was submitted, ratified and enacted on the dijkstra network in preparation for PV11.

  • Some cloud resources were relocated to more stable areas after disruption due to conflict in the Middle East.

Repository Work -- Merged

Cardano-monitoring

cardano-monitoring PR#6:

  • Enables Loki alert rule evaluation, and makes the alerts visible in Grafana

Cardano-node

cardano-node PR#6478:

  • Bumps iohkNix, updates MinNodeVersion to 10.7.0 and refreshes mainnet-peer-snapshot.json and other CI files.
  • Adds independent lsmDatabasePath NixOS option with uniqueness assertion and mutual-exclusion check against LMDB per instance.
  • Adds kes-agent/kes-agent-control (Linux only) and dmq-node (all platforms) to release binaries.
  • Adds cardano-node-dbtools NixOS test covering db-synthesizer, db-analyser, db-truncater, and the GHC-asserted synthesizer binary against a cardano-testnet create-env environment.
  • Adds --shelley-kes-agent-socket support to run-node and cardano-node-service.nix. Expands the KES assertion to cover three valid forging configurations: relay (none), direct KES key, and KES agent socket.
  • Adds CARDANO_TRACER_SOCKET_NETWORK_{ACCEPT,CONNECT} tracer socket options to run-node.
  • Hardens all node and tracer entrypoint/launch scripts with set -euo pipefail, safe ${VAR:-} expansion throughout, pre-flight file existence checks, and exec for clean process replacement.
  • Consolidates separate relay/block-producer run functions into a single runNode. Derives GENESIS_JSON from CARDANO_CONFIG directory to support non-mainnet deployments.
  • Renames runCommandNoCCLocal to runCommandLocal for nixpkgs 25.11.

Devx-ci

devx-ci PR#145:

  • Upgrades darwin CI infrastructure with version bumps, guest VM lifecycle management, and maintenance improvements.
  • Darwin related infrastructure upgrades:
    • Nix 2.32 to 2.33-maintenance (hosts and guests); nix.package now set explicitly
    • nix-darwin 25.05 to 25.11 (guests)
    • UTM versioned per architecture:
      • aarch64-darwin: 4.5.4 to 5.0.2
      • x86_64-darwin: pinned to 4.6.5 as UTM > 4.6.5 breaks display driver compatibility with macOS Sequoia+ on Intel
    • ca-derivations experimental feature enabled on hosts and guests
    • Guest bootstrap nix version bumped 2.28.3 to 2.32.5
    • Adds a small C binary at /usr/local/bin/nix-daemon-launcher to work around macOS launchd sandbox blocking .dylib loads from the APFS /nix volume
    • darwin.sh gains --system/-s (default: aarch64-darwin) flags; argbash upgraded 2.10.0 to 2.11.0
    • Auth-keys-hub is now utilized by the guests and legacy ops-lib usage has been removed
    • A percentage-based threshold garbage collection has been implemented, deriving thresholds from disk size rather than fixed values
  • Bumped nixpkgs-gh-runners v2.330.0 to v2.332.0
  • See the PR description for additional details
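
The percentage-based garbage collection thresholds mentioned above can be sketched as follows. This is a minimal Python illustration, assuming thresholds analogous to the nix min-free and max-free settings; the percentages, names and structure are hypothetical and are not taken from the devx-ci implementation.

```python
from typing import NamedTuple


class GcThresholds(NamedTuple):
    min_free_bytes: int  # start collecting when free space drops below this
    max_free_bytes: int  # stop collecting once this much space is free


def gc_thresholds(
    disk_size_bytes: int,
    min_free_pct: int = 10,  # hypothetical default
    max_free_pct: int = 30,  # hypothetical default
) -> GcThresholds:
    """Derive garbage collection thresholds from the disk size."""
    if disk_size_bytes <= 0:
        raise ValueError("disk_size_bytes must be positive")
    if not 0 < min_free_pct < max_free_pct <= 100:
        raise ValueError("require 0 < min_free_pct < max_free_pct <= 100")
    return GcThresholds(
        min_free_bytes=disk_size_bytes * min_free_pct // 100,
        max_free_bytes=disk_size_bytes * max_free_pct // 100,
    )


def should_collect(free_bytes: int, thresholds: GcThresholds) -> bool:
    """Return True when free space has fallen below the lower threshold."""
    return free_bytes < thresholds.min_free_bytes
```

Deriving both values from the disk size means hosts and guests with different volume sizes can share one configuration instead of carrying fixed byte values per machine.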

devx-ci PR#146:

  • Deploy hydra-tools hydra-github-bridge 0.2.1.0

devx-ci PR#147:

  • Accommodate the nixos hydra-github-bridge module with an extra secrets file

devx-ci PR#148:

  • Pause the new hydra-github-bridge usage until after pre-release of 10.7.0
  • Remove zramSwap to make more physical RAM available to Hydra
  • Reduce max concurrent evals/jobs down to 4 to keep throughput manageable

devx-ci PR#149:

  • Rekey all secrets to accommodate contributions from another SRE

devx-ci PR#150:

  • Improves alerting infrastructure using Mimir Alertmanager and Loki ruler.
  • Sets up Dead Man's Snitch.
  • Infrastructure changes:
    • Moved OpenTofu configuration from perSystem/packages/opentofuConfig/ to flake/opentofu/ for consistency with other SRE repos
    • Added Mimir provider for Prometheus-style alerting rules
    • Added Loki provider for log-based alerting rules
    • Configured Mimir Alertmanager to route all alerts to PagerDuty
    • Added various metrics based and log based alerts
  • See the PR description for additional details

devx-ci PR#151:

  • Alert on every OOM and set up annotations to show Alertmanager alerts.

Hermod-tracing

hermod-tracing PR#2

  • Adds hydraJobs to flake top level attrs
  • Adds aarch64-linux
  • Tests hydra integration
  • Makes explicit required and nonrequired aggregate jobs
  • Moves default pkgs to trace-dispatcher

Iohk-nix

iohk-nix PR#610:

  • Include updated configs from respun dijkstra net from 2026-02-19
  • Re-add sanchonet to the available environments since it is persisting as a long-lived community test network
  • Update per-environment useLedgerAfterSlot values
  • Update per-environment peer-snapshot.json files
  • Update MinNodeVersion to 10.7.0 as the peer-snapshot files made a version breaking change

Usdcx-infra

usdcx-infra PR#5:

  • Adds missing series eval resolution targets

Repository Work In Progress -- PRs and Branches

SRE Team Update

· 2 min read
John Lotoski
Service Reliability Engineer

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Preparation for the 10.7.0 pre-release is underway and SRE is working on integrations for kes-agent and dmq-node for the release binaries, node nixos service and OCI containers as appropriate. CI tests for Consensus db-tooling (i.e., db-analyser, db-truncater, db-synthesizer) are being added to a nixos test run on Hydra to ensure the bundled node and db-tools versions remain compatible.

  • Iterative deployments of 10.7.0 pre-release candidates to select pre-release environments are ongoing, with issues being reported back to developers.

  • Darwin CI build machine updates are underway along with some optimizations and fixes to reduce flaky Darwin platform bugs and noisy alerts as well as a refactor to reduce code complexity. A number of these improvements will appear in the next SRE biweekly update.

  • Loki logging has been added to more of our cardano-parts environments (ie: cardano-playground and cardano-mainnet). Custom Loki dashboards are also being prepared to improve the Loki experience and will appear in the next cardano-parts PR.

Repository Work -- Merged

Cardano-monitoring

cardano-monitoring PR#4:

  • Adds Loki to playground, mainnet and networkteam monitoring servers
  • Raises max_outstanding_per_tenant to accommodate large dashboards without errors

cardano-monitoring PR#5:

  • Adjusts Loki log retention to a per-environment setting

Devx-ci

devx-ci PR#144:

  • Increases nofile soft/hard limit to avoid failures on higher nofile requirement builds like virtiofs virtualized images

Repository Work In Progress -- PRs and Branches