
64 posts tagged with "sre"


· 2 min read
John Lotoski

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Within approximately one day of the preview network partition mentioned in the last biweekly SRE update, mainnet experienced a similar partition. Again, after an intense collaborative effort across multiple teams and the community, in which the SRE team participated from the beginning, cardano-node 10.5.3 was released to fix the issue. All network participants on affected versions were encouraged to upgrade immediately so that the partition would resolve itself once the majority of stake was running node versions with correct ledger hash handling. Thanks to the quick reaction of the community as a whole, the partition resolved within approximately 14 hours. Various after action reports detail the event, such as Pi Lanningham's Poison Piggy - After Action Report.

  • Following the mainnet partition above, 10.6.1 has also been pre-released and deployed to IOE pre-release infrastructure. The SRE team has also been participating in further after action investigation.

Repository Work -- Merged

Cardano-mainnet

  • Bumps and deploys cardano-node release to 10.5.3

  • Fixed Mithril scripts due to Mithril aggregator upstream API breaking changes

  • Updates the save-ssh-config recipe to match opentofu ssh config artifact output

  • Cleaned up flake.nix inputs and some colmena modules code

  • Modifies two bootstrap EBS gp3 volumes to accommodate some data handling work

    cardano-mainnet-pr-40

Cardano-parts

  • Bumps cardano-node release to 10.5.3, pre-release to 10.6.1, and cardano-db-sync pre-release to 13.6.0.6

  • Mithril script code for the node entrypoint, systemd ExecStartPre and process-compose jobs was fixed to accommodate breaking upstream API changes

  • The pkgs.nix flakeModule was refactored to make switching between localFlake and capkgs pins easier

    cardano-parts-release-v2025-12-04

Cardano-playground

  • Bumps and deploys cardano-node release to 10.5.3, pre-release to 10.6.1, and cardano-db-sync pre-release to 13.6.0.6

  • Updates the cardano-book for the corresponding cardano-node release and pre-release updates

  • Adds a metrics scraper nixosModule for perf team custom data analysis

  • Fixed Mithril scripts due to Mithril aggregator upstream API breaking changes (via cardano-parts pin update)

  • Updates the save-ssh-config recipe to match opentofu ssh config artifact output

  • Cleaned up flake.nix inputs and some colmena modules code

    cardano-playground-pr-52

Repository Work In Progress -- PRs and Branches

· 4 min read
John Lotoski

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Cardano-node 10.6.0 has been pre-released with the corresponding long running SRE PRs now merged into this release! See the release notes for details.

  • The SRE team identified a ledger replay bug in the 10.6.0 release candidate whereby the legacy tracing system no longer logged ledger replay update statistics. A fix was implemented prior to tagging and pre-release.

  • Near the end of this biweekly reporting period, the preview network experienced a network partition. After an intense multi-team and community collaborative effort which the SRE team was participating in from the beginning, the bug causing the partition was identified and a new cardano-node version 10.5.2 was released to fix the issue. This version was deployed to the IOE preview machines immediately upon release and shortly afterwards to the rest of the IOE testnet and mainnet infra.

Repository Work -- Merged

Cardano-mainnet

  • Bump cardano-parts for v2025-11-18

  • Updated CloudFormation terraformState.nix and opentofu/cluster.nix for corresponding tagging updates

  • Added new required flake cluster attribute declaration for new required resource tagging

  • Added a matomo nix module prototype in prep for a legacy bitte cluster matomo migration to prod

  • Fixed script breakages caused by cardano-cli breaking changes

  • Adds a smash delisting

  • Rotates mainnet KES

  • Adjusts the pool 1 infrequent forging alert threshold to reduce noise

    cardano-mainnet-pr-39
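KES rotation is a recurring task because the KES key bound to a pool's operational certificate is only valid for a bounded number of KES periods. The following is a minimal sketch of that arithmetic, assuming the mainnet Shelley genesis values slotsPerKESPeriod = 129600 and maxKESEvolutions = 62 (with one-second slots, roughly 36 hours per period and about 93 days per key). The helper names are hypothetical and are not part of cardano-cli or the cardano-mainnet scripts.

```python
# Assumed mainnet Shelley genesis parameters.
SLOTS_PER_KES_PERIOD = 129_600  # slotsPerKESPeriod
MAX_KES_EVOLUTIONS = 62         # maxKESEvolutions


def current_kes_period(slot: int, slots_per_kes_period: int = SLOTS_PER_KES_PERIOD) -> int:
    """Return the KES period that contains the given absolute slot."""
    return slot // slots_per_kes_period


def kes_expiry_period(opcert_start_period: int, max_evolutions: int = MAX_KES_EVOLUTIONS) -> int:
    """Return the first KES period in which a key issued at opcert_start_period is no longer valid."""
    return opcert_start_period + max_evolutions


def periods_until_rotation(slot: int, opcert_start_period: int) -> int:
    """Return how many KES periods remain before the key must be rotated (0 means rotate now)."""
    return max(0, kes_expiry_period(opcert_start_period) - current_kes_period(slot))


if __name__ == "__main__":
    slot = 129_600 * 100 + 5  # a slot inside KES period 100
    print(current_kes_period(slot), periods_until_rotation(slot, opcert_start_period=50))
```

Run as a script, this prints `100 12`: a certificate starting at period 50 expires at period 112, leaving 12 periods from period 100.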

Cardano-parts

  • Bumps cardano-node pre-release to 10.6.0, mithril to 2543.1-hotfix, and blockperf to a fix branch that includes a patch for proper blockperf configuration under the new tracing system

  • Added rsync ssm help bash function and alias to the common machine profile

  • Added peer snapshot files to the ops library function generateStaticHTMLConfigs

  • Added new flakeModule cluster.nix options of infra.generic costCenter, owner and project

  • Added zsh devShell command completion

  • Updated a number of nixosModules to support both new and legacy tracing systems as well as 10.6.0 and 10.5.1 configuration differences

  • Updated template CloudFormation terraform state and opentofu cluster resource definitions for corresponding tagging updates

  • Fixed template script breakages caused by cardano-cli breaking changes included in the 10.6.0 pre-release

  • Fixed the profile-blockperf.nix nixosModule new tracing system configuration

    cardano-parts-release-v2025-11-18

Cardano-playground

  • Added book config updates for 10.6.0 pre-release environments: preprod, preview

  • Added sanchonet environment configs for community disaster test participation

  • Added a "New Pool" document explainer at docs/explain/new-pool.md

  • Added new required flake cluster attribute declaration for new required resource tagging

  • Added a matomo nix module prototype in prep for a legacy bitte cluster matomo migration to prod

  • Added wireguard tunnel endpoints as temporary R2 colo http streaming/timeout bucket workarounds

  • Added misc improvements to playground scripts for governance voting

  • Updated CloudFormation terraform state and opentofu cluster resource declarations for corresponding tagging updates

  • Updated CI to a smaller representative machine subset

  • Updated preview, preprod and non-prod test forgers with KES rotation

  • Fixed script breakages caused by cardano-cli breaking changes

  • Voted on a preview and preprod governance action with DRep, pool and CC votes

    cardano-playground-pr-51

Iohk-nix

  • Merges non-forger and forger configs, with the node handling the differences internally based on forger status (i.e., PeerSharing, TargetNumberOfKnownPeers, TargetNumberOfRootPeers)

  • Includes peerSnapshotFile for all networks, now at v2

  • Allows SRV records for bootstrap resource definitions

  • Makes p2p the default networking mode without an explicit declaration, since p2p is the only mode for node versions >= 10.6.0

  • Bumps minNodeVersion to 10.6.0 for the default config changes

  • Testnet templates were adjusted to use PlutusV3 cost model parameters matching mainnet's 251 parameters, and a Dijkstra genesis was added

    iohk-nix-pr-602
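Making p2p the implicit default amounts to a simple version gate. A minimal sketch, assuming plain "X.Y.Z" version strings; `networking_mode` and the "legacy" label are hypothetical names for illustration, not iohk-nix functions, and the real logic lives in Nix.

```python
from typing import Optional, Tuple

# Per the bullets above: from node 10.6.0 onward, p2p is the only networking mode.
P2P_ONLY_SINCE = (10, 6, 0)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a plain "X.Y.Z" string, padding missing parts with 0 so "10.6" compares equal to "10.6.0"."""
    parts = [int(p) for p in version.split(".")][:3]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def networking_mode(node_version: str, declared: Optional[str] = None) -> str:
    """Resolve the networking mode; p2p is the default when nothing is declared."""
    if parse_version(node_version) >= P2P_ONLY_SINCE:
        # Only p2p exists here, so an explicit declaration is redundant at best.
        if declared not in (None, "p2p"):
            raise ValueError(f"{declared!r} networking is unavailable in node {node_version}")
        return "p2p"
    # Older nodes (hypothetical "legacy" label) still honour an explicit declaration.
    return declared if declared is not None else "p2p"
```

Padding the version tuple matters: without it, Python would order `(10, 6)` below `(10, 6, 0)`.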

Repository Work In Progress -- PRs and Branches

· 2 min read
John Lotoski

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • A new cloud resource tagging effort is underway to better attribute and track project expenses, implemented largely via infrastructure-as-code updates with appropriate secrets handling.

  • As part of the SRE to dev feedback loop for the 10.6.0 pre-release candidate, a memory leak was identified in the candidate when the LMDB ledger backend was used. Further testing, deployment and debugging work helped identify the root cause, and a fix was implemented.

  • SRE team facilitated the passage of a governance action on the preprod network, likely to be followed by a mainnet proposal and full community vote.

  • Aarch64 cardano-node nix builds, binary release artifacts and OCI containers will be added to the cardano-node repo in the near future.

  • The SRE team is providing support to the Midnight Scavenger Mine project, which is now live.

Repository Work -- Merged

Cardano-monitoring

  • Adds new resource tags to both CloudFormation and tofu resources: owner, project, costCenter

  • Updates the pre-existing organization and environment tags

  • Treats costCenter tag as secret and adds a corresponding secrets file

  • Updates the justfile, CloudFormation state template and tofu cluster definition files to accommodate the new tags

  • Adds some default tofu resource definitions in prep for IPv6 and default resource tagging

    cardano-monitoring-pr-3
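Treating costCenter as a secret means the tag value is merged in from a secrets file at deploy time rather than committed alongside the public tags. A minimal sketch of that merge, assuming the secrets file has already been decrypted to a JSON object; the function names and the redaction helper are hypothetical, not taken from the repositories above.

```python
import json
from typing import Dict

# Tag keys kept out of the public config; the bullets above name costCenter.
SECRET_TAG_KEYS = ("costCenter",)


def load_secret_tags(raw_json: str) -> Dict[str, str]:
    """Parse decrypted secrets-file contents, keeping only the secret tag keys."""
    data = json.loads(raw_json)
    return {key: str(data[key]) for key in SECRET_TAG_KEYS}


def resource_tags(public_tags: Dict[str, str], raw_secret_json: str) -> Dict[str, str]:
    """Merge public tags with secret tags, refusing to let a public tag shadow a secret one."""
    clash = set(public_tags) & set(SECRET_TAG_KEYS)
    if clash:
        raise ValueError(f"secret tag(s) must not be set publicly: {sorted(clash)}")
    return {**public_tags, **load_secret_tags(raw_secret_json)}


def redacted(tags: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the tags that is safe to print in plans or logs."""
    return {k: ("<redacted>" if k in SECRET_TAG_KEYS else v) for k, v in tags.items()}


if __name__ == "__main__":
    # Placeholder values for illustration only.
    public = {"organization": "example-org", "environment": "example-env",
              "owner": "example-owner", "project": "example-project"}
    print(redacted(resource_tags(public, '{"costCenter": "0000"}')))
```

Keeping a redaction step next to the merge reflects the intent of the change: the value reaches the cloud provider but not the repository or casual log output.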

Cardano-perf

  • Adds new resource tags to both CloudFormation and tofu resources: owner, project, costCenter

  • Updates the pre-existing organization and environment tags

  • Treats costCenter tag as secret and adds a corresponding secrets file

  • Updates the justfile, CloudFormation state template and tofu cluster definition files to accommodate the new tags

  • Adds some default tofu resource definitions in prep for IPv6 and default resource tagging

    cardano-perf-pr-6

Ouroboros-network-ops (Ready to merge -- waiting to align with next cardano-parts release)

  • cardano-parts was bumped from v2025-06-24 to next-2025-08-14, a post-v2025-08-14 release ref

  • Updates for breaking changes were applied.

  • Adds new resource tags to both CloudFormation and tofu resources: owner, project, costCenter

  • Updates the pre-existing organization and environment tags

  • Treats costCenter tag as secret and adds a corresponding secrets file

  • Updates the justfile, CloudFormation state template and tofu cluster definition files to accommodate the new tags

    ouroboros-network-ops-pr-30

Repository Work In Progress -- PRs and Branches

· 3 min read
John Lotoski

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • IOE participated in a community driven Sanchonet chain disaster recovery test event, which purposely broke Sanchonet and then recovered it using mechanisms and tools discussed in CIP-0135, including db-truncater and db-synthesizer.

  • Faster test deployment iteration of a 10.6.0 pre-release candidate is underway to some preview and preprod testnet machine deployments with a tight feedback loop between SRE and dev team for observations and debugging when issues are found.

  • A new cardano-tracer OCI container will be provided with the upcoming 10.6.0 pre-release now that the new tracing system will be default.

  • SRE team facilitated the passage of a governance action on the preview network, soon to be submitted to preprod and likely followed by a mainnet proposal and full community vote.

  • SRE team has begun providing some additional support to the Midnight Scavenger Mine project as needed until the Scavenger Mine phase completes.

Repository Work -- Merged

Cardano-node

  • This PR includes various SRE related changes for 10.6.0 pre-release readiness:

    • Bumps iohk-nix for config updates from iohk-nix PR#602
    • Fixes configuration change related CI checks.
    • Merges bp and non-bp configurations into a single config whereby ouroboros-network now automatically determines PeerSharing and Target* parameters which previously required being explicitly declared.
    • The new tracing system is now set as the default configuration; the legacy tracing system config is still made available.
    • The mainnet default topology configuration now includes a peerSnapshot declaration for making testing of GenesisMode more convenient.
    • Adjusts OCI containers for the new config setups and also includes peer-sharing configs for each network
    • Updates the nixos cardano-node service for the deprecation of useNewTopology given P2P is now the only networking mode as of 10.6.0.
    • Updates the nixos cardano-node service for new tracerSocketNetworkAccept and tracerSocketNetworkConnect cardano-tracer connection options.
    • Updates the nixos cardano-node service to support SRV peer records.
    • Updates the nixos cardano-tracer service for option name changes of acceptingSocket to acceptAt and connectingToSocket to connectTo; related workbench services were also updated accordingly.
    • For the binary releases, the cardano-submit-api config and peer-sharing config were added.
    • The default cardano-submit-api config was made compatible with the new tracing system.

    cardano-node-pr-6300
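Configurations that still use the old cardano-tracer option names need a mechanical rename. A minimal sketch of that rename, modelling an option set as a plain Python dict; this is an assumption for illustration, since the actual change is to NixOS module options, and `migrate_tracer_options` is a hypothetical helper.

```python
from typing import Any, Dict

# Option renames listed above for the nixos cardano-tracer service.
TRACER_OPTION_RENAMES = {
    "acceptingSocket": "acceptAt",
    "connectingToSocket": "connectTo",
}


def migrate_tracer_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of options with old cardano-tracer option names mapped to the new ones."""
    migrated: Dict[str, Any] = {}
    source_of: Dict[str, str] = {}  # new key -> the original key it came from
    for key, value in options.items():
        new_key = TRACER_OPTION_RENAMES.get(key, key)
        if new_key in migrated:
            # Both the old and the new name were set; refuse to guess which one wins.
            raise ValueError(f"{source_of[new_key]!r} and {key!r} both map to {new_key!r}")
        migrated[new_key] = value
        source_of[new_key] = key
    return migrated
```

Failing loudly on a clash, instead of letting one name silently override the other, keeps a half-migrated config from deploying with an unexpected socket setting.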

Repository Work In Progress -- PRs and Branches

· One min read
John Lotoski

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • A sanchonet relay and pool have been spun up to participate in the community driven disaster recovery testing happening in the near future.

  • Cardano-submit-api configs will be updated to be compatible with both the legacy and new tracing system in the next cardano-node release.

  • A legacy Matomo deployment is being migrated to a newer stack so deprecated resources can be turned off in the near future.

Repository Work -- Merged

Capkgs

  • Adds an exclusions.json file listing packages known not to build, along with compatibility code in packages.cr and a new justfile recipe, filter-packages. The exclusions file was populated with all currently failing evals, each with a reason. Go package exclusions were re-added after the missing Hydra Go dependencies were re-cached, which followed working around a failing test certificate during haskell.nix bootstrap tests.

    capkgs-pr-7
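The exclusions mechanism can be sketched as a simple split of the package list. This is a minimal Python illustration, not a port of packages.cr or the filter-packages recipe; the exclusions.json schema used here (a JSON object mapping a package name to a reason string) is an assumption.

```python
import json
from typing import Dict, List, Tuple


def load_exclusions(raw_json: str) -> Dict[str, str]:
    """Parse exclusions.json, assumed here to map a package name to the reason it is excluded."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of package -> reason")
    for name, reason in data.items():
        # Every exclusion should say why, so it can be revisited once the package builds again.
        if not isinstance(reason, str) or not reason.strip():
            raise ValueError(f"exclusion for {name!r} needs a non-empty reason")
    return data


def filter_packages(packages: List[str], exclusions: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Split packages into those to build and those skipped, keeping each skip's recorded reason."""
    kept = [p for p in packages if p not in exclusions]
    skipped = {p: exclusions[p] for p in packages if p in exclusions}
    return kept, skipped
```

Requiring a reason per entry mirrors the note above that each failing eval was recorded together with why it fails.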

Repository Work In Progress -- PRs and Branches