42 posts tagged with "sre"

· 3 min read
John Lotoski

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • The preprod network was hard forked to Conway era.

  • The nixosModule profile-blockperf in cardano-parts now includes prometheus metrics, which are automatically scraped by grafana-agent, along with a dashboard.

  • A nixosModule profile-tcpdump in cardano-parts is now available to push ongoing pcaps to s3 for historical reference.

  • Old dev environments were cleaned up and retired after the completion of the ouroboros-network-ops cluster migration to the cardano-parts stack.

  • Delayed block headers on mainnet relays, as indicated by blockperf, were investigated and reduced through adjustments to RTS parameters and machine class.

  • An increase in Conway-era mempool log volume was investigated and resolved with ouroboros-network improvements.

  • Scaling capability was added to the cardano-mainnet bootstrap cluster.

Repository Work

Cardano Parts

  • Sets cardano-db-sync (release) to 13.4.0.0. Includes nixosModule improvements: a manual trigger for the cardano-db-sync snapshots module, new prom metrics for the blockperf module, an auto-blockperf scrape config for the grafana-agent module, and a new tcpdump module for persistent pcaps to s3. Recipes have been improved for configuration consistency checking, and openTofu AMI and DNS filtering have been improved. The AWS machine reference spec has been updated and one alert tuned for better sensitivity. More detail is available in the PR description: cardano-parts-pull-46

Cardano-mainnet

  • Deploys cardano-db-sync (release) 13.4.0.0. Deploys nixosModule improvements: a manual trigger for the cardano-db-sync snapshots module, new prom metrics for the blockperf module, an auto-blockperf scrape config for the grafana-agent module, and a new tcpdump module for persistent pcaps to s3. Recipes have been improved for configuration consistency checking, and openTofu AMI and DNS filtering have been improved. Makes changes to pool group relays to eliminate or reduce delayed block headers. Tests additional dev patches for missingBlock errors. Adds bootstrap cluster scaling capability and a bootstrap cluster dashboard. Improvements made in cardano-parts PR#46 are included in this PR. More detail is available in the PR description: cardano-mainnet-pull-20

Cardano-ops (Legacy Mainnet)

  • Over a two-week period, the legacy relay nodes were scaled down a further 50% from the recent machine quantity peak. commit-compare

Cardano-playground

  • Preprod was hard forked to Conway. Deploys cardano-db-sync 13.4.0.0. Recipes have been improved for configuration consistency checking, and openTofu AMI and DNS filtering have been improved. Improvements made in cardano-parts PR#46 are included in this PR. More detail is available in the PR description: cardano-playground-pull-30

Cardano-world

  • Updates openssh to 9.8p1 on the remaining cardano-world (soon-to-be-retired) cluster machines: commit

· 5 min read
Michael Fellinger

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Preview network was hard forked to Conway era.
  • Cardano-db-sync was updated to 13.4.0.0 across all environments.

Repository Work

Cardano Airgap

Commit-compare

  • Update the image to cardano-cli 9.2.1.0 and credential-manager to HEAD.
  • Finish testing the airgap image using ext4 partitions and add ventoy to the devShell.

Cardano Parts

Node 9.1.0, Mithril 2430.0, Chang readiness

Overview

Sets cardano-node (release) and cardano-node-ng (pre-release) versions to 9.1.0 and mithril to 2430.0. Includes nixosModule improvements for the new tracing system, a new template-clone recipe, various recipe improvements and fixes. A Chang readiness query has been added.

Details

  • Important versioning updates:
    • cardano-node and cardano-node-ng are now 9.1.0
  • Bumps capkgs:
    • For node releases of 9.1.0
    • For mithril 2430.0 and mithril-unstable
  • Improves the profile-cardano-node-new-tracing nixosModule and new tracing system in general by better cleaning residual legacy config items, restructuring the options for more flexibility in config composition, and configuring the new tracing system to log close to parity volume with the legacy tracing system when using UTXO-HD in memory mode.
  • Adds additional default Tcp and TcpExt metrics to the profile-grafana-agent nixosModule's metrics scrape list
  • Adds curl and the pre-push script to the default cardano-parts devShell
  • Adds template alert cardano_node_elevated_restarts
  • Adds a new template recipe of template-clone for when downstream users know they simply want to mirror upstream templates rather than diff or patch them
  • Adds a new template sql script with a Chang era readiness sql query
  • Improves template recipe dbsync-pool-analyze and sql query with parameters to query any CTE in the large dbsync-pool-perf sql from cli
  • Improves template recipe dbsync-pool-analyze to handle queries that result in no non-performing pools
  • Improves template recipe dedelegate-pools with a mempool query instead of a fixed time to handle UTxO on-chain settlement
  • Removes template recipes that are now mostly cardano-playground specific
  • Fixes template dashboard for cardano-node legacy and new tracing application metrics to always show the full environment KES period
  • Fixes template recipe apply-bootstrap
  • Fixes outdated service option name and db-sync snapshot schema description
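
The dedelegate-pools change above replaces a fixed wait with a mempool query. The general pattern can be sketched as a poll loop with a deadline. This is a minimal illustration only, not the recipe's actual implementation: `wait_until_settled` and the `tx_in_mempool` callable are hypothetical names, and treating "the transaction has left the mempool" as the settlement signal is an assumption of this sketch.

```python
import time
from typing import Callable


def wait_until_settled(
    tx_in_mempool: Callable[[], bool],  # hypothetical: returns True while the tx is still in the mempool
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the transaction has left the mempool or the timeout expires.

    Returns True if the transaction left the mempool, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        # Query first so an already-settled transaction returns immediately.
        if not tx_in_mempool():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Compared with a fixed sleep, the loop returns as soon as the condition holds and reports a timeout explicitly instead of silently proceeding.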

Cardano-mainnet

Node 9.1.0, Mithril 2430.0, Bp scheduled restart module

Schedule restart initial prototyping

Overview

Sets cardano-node version to 9.1.0 and mithril to 2430.0. Adds block producer scheduled restart capability.

Details:

  • Bumps cardano-parts for:
    • Important versioning updates:
      • cardano-node and cardano-node-ng are now 9.1.0
    • Capkgs updates:
      • For node releases of 9.1.0
      • For mithril 2430.0 and mithril-unstable
  • KES rotates mainnet block producers
  • Optimizes bootstrap nodes for -N4 RTS usage
  • Adds cardano-node-schedule-restart nixosModule and associated perSystem packages
  • Adds new alerts for cardano_node_elevated_restarts
  • Fixes dashboard for cardano-node legacy and new tracing application metrics to always show the full environment KES period
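
KES rotation and the KES period shown on the dashboards both come down to simple slot arithmetic. The sketch below shows one way to estimate how many KES periods an operational certificate has left. The function name is hypothetical, and the default values for `slots_per_kes_period` and `max_kes_evolutions` are assumed to match the mainnet Shelley genesis; other environments may differ, so the values should be read from the relevant genesis file.

```python
def kes_periods_remaining(
    current_slot: int,
    opcert_start_kes_period: int,
    slots_per_kes_period: int = 129_600,  # assumed: mainnet Shelley genesis value
    max_kes_evolutions: int = 62,         # assumed: mainnet Shelley genesis value
) -> int:
    """Return the number of KES periods left before the key expires.

    A result of zero or less means the key has expired and needs rotation.
    """
    # The chain's current KES period is the slot number divided by the period length.
    current_kes_period = current_slot // slots_per_kes_period
    # Periods already consumed since the operational certificate was issued.
    used = current_kes_period - opcert_start_kes_period
    return max_kes_evolutions - used
```

For example, at a slot inside KES period 100 with a certificate issued at period 90, ten periods have been used and 52 remain under the assumed defaults.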

Cardano-ops (Legacy Mainnet)

Commit-compare

Over a two-week period, the legacy relay nodes were scaled down to running only one instance of cardano-node per machine, and then the number of running machines was further reduced by 25%.

Cardano-playground

Node 9.1.0, Mithril 2430.0, Preview hardfork to Conway

Overview:

Sets cardano-node (release) and cardano-node-ng (pre-release) versions to 9.1.0 and mithril to 2430.0. Hard forks preview network to Conway. Adds recipe and other improvements, including an improved pool performance query recipe interface and a Chang readiness query.

Details:

  • Bumps cardano-parts for:
    • Important versioning updates:
      • cardano-node and cardano-node-ng are now 9.1.0
    • Capkgs updates:
      • For node releases of 9.1.0
      • For mithril 2430.0 and mithril-unstable
  • Adds a new template sql script with a Chang era readiness sql query
  • Adds a babbage-to-conway cost model to the Cardano book
  • Adds a new recipe kes-rotate for easy kes rotation
  • Adds new alerts for cardano_node_elevated_restarts
  • Adds a commit stamp marker for Cardano book updates
  • Adds a new template-clone recipe for mirroring upstream template files when diffing or patching isn't needed
  • Updates the Cardano book environment for cardano-node 9.1.0
  • Updates the explainer docs for kes-rotation, chain-manipulation, new-network
  • Updates the preview faucet for new govtool operations
  • Rotates sanchonet KES, resizes the metadata server
  • Investigates mempool rejections with new tracing system and modified logging
  • Tests a comparison set of machines in mainnet environment for node 9.1.0 and utxo-hd-9.0
  • Tests a new tracing system branch for metrics renaming and KES metrics update calculations
  • Moves some cardano-playground specific recipes to the scripts/recipes-custom.just module
  • Improves template recipe dbsync-pool-analyze and sql query with parameters to query any CTE in the large dbsync-pool-perf sql from cli
  • Improves template recipe dbsync-pool-analyze to handle queries that result in no non-performing pools
  • Improves template recipe dedelegate-pools with a mempool query instead of a fixed time to handle UTxO on-chain settlement
  • Hard forks preview environment to Conway and resizes one relay member of each preview pool group

Iohk-nix

Add conway config for mainnet/preprod/preview

Devx-ci

Fix Hydra alerting immediately on no data

  • Migrate from Grafana Cloud to our self-hosted cardano-monitoring stack
    • Do not filter metrics to keep down number of unique series
    • This allows unlimited collection of metrics from our CI machines for better alerts and measurements for ongoing performance tuning.
  • Upgrade disko partition names manually on remaining machines ci{2,3,4,6,7,8} so boot does not break on the next deployment
  • Grafana dashboard: fix memory usage graph
  • Add alerts

Cardano-monitoring

Commit-compare

  • Add preliminary support for Loki for log collection

· 4 min read
Michael Fellinger

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Our new baseline version of Cardano Node is 9.1.0 and all environments have been updated. The main change from node 9.0.0 is that node 9.1.0 requires a Conway genesis file at startup, whereas the genesis file was optional in node 9.0.0.

  • Sanchonet had another respin for node 9.0.0 with new Conway genesis parameters and has since been upgraded to node 9.1.0.

  • The cardano-monitoring cluster received a lot of documentation and improvements and now also serves as the home for devx-ci metrics.

  • Hydra CI performance was improved once again by changes to our custom Nix evaluator. We also found further ways to improve the resource usage of waiting for IFDs.
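
Because node 9.1.0 requires a Conway genesis file at startup, a small preflight check before an upgrade can catch a missing entry early. The following is a rough sketch, not part of any of the repositories above: the function name is hypothetical, and it assumes the node configuration is a JSON file that names the genesis file under the `ConwayGenesisFile` key, with relative paths resolved against the directory containing the config file.

```python
import json
from pathlib import Path
from typing import List, Sequence


def missing_genesis_files(
    config_path: Path,
    required_keys: Sequence[str] = ("ConwayGenesisFile",),  # assumed config key name
) -> List[str]:
    """Return a list of problems found; an empty list means the check passed."""
    config = json.loads(config_path.read_text())
    problems = []
    for key in required_keys:
        value = config.get(key)
        if not value:
            problems.append(f"{key}: not set in {config_path.name}")
            continue
        # Assumption: relative paths are resolved against the config file's directory.
        # An absolute value replaces the parent directory when joined.
        genesis = config_path.parent / value
        if not genesis.is_file():
            problems.append(f"{key}: file not found at {genesis}")
    return problems
```

Running this against each environment's config before the bump to 9.1.0 would flag configs that still rely on the genesis file being optional.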

Cardano Airgap

A new project that provides a completely airgapped environment for constitution members to sign proposals and transactions. It consists of an image for USB sticks and ensures all private data is stored securely with strong encryption.

We'll keep it updated as required with the latest Cardano versions.

cardano-airgap

Cardano Parts

  • cardano-node and cardano-node-ng are now at version 9.0.0
  • cardano-db-sync and cardano-db-sync-ng are now at version 13.3.0.0
  • Several NixOS modules and recipes have been fixed and improved.
  • Bump dependency of capkgs for node, db-sync, mithril, and cardano-wallet updates.
  • Update profile-cardano-db-sync-snapshots for schema 13.3 docs and with script edge case fixes
  • Update profile-cardano-node-group to use a SIGINT instead of SIGTERM for systemd stop
  • Update profile-common to deploy atd service
  • Update template recipe dbsync-prep to match faucet script defaults
  • Update template recipe update-ips to fix a nushell breaking change
  • Update .envrc with a newer direnv version and allow symlinks for .envrc.local and ~/.age/credentials

PR#44

Cardano Playground

  • All networks are now running cardano-node 9.1.0 in preparation for the Chang hard fork.
  • Also upgraded db-sync to 13.3.0.0
  • Added the cardano-ipfs module and a derivation for pinata-go-cli that is used to store and distribute documents that can be referenced on chain.
  • Some updates to the Cardano Operations Book about:
    • UseLedgerPeerAfter updates
    • Sanchonet respins configs
    • Dbsync EnableFutureGenesis flag
  • Add a block header block producer readiness test
  • Respin of sanchonet for node 9.0.0, then upgraded to 9.1.0
  • Tune webserver size and Varnish RAM to improve caching efficiency
  • Updates govtool module for multi-nginx module compatibility
  • Updates update-ips recipe for nushell breaking change in nixpkgs 24.05
  • Updates .envrc with a newer direnv version and allows symlinks for the config files used by direnv

PR#28

Cardano Mainnet

  • Upgraded Cardano Node to 9.0.0
  • Upgraded Cardano DB Sync to 13.3.0.0
  • Bump capkgs dependency
  • Investigate bootstrap missingBlock error and deploy fixes for it.
  • Update scripts to be compatible with latest nushell version

PR#17

Cardano Monitoring

  • Write comprehensive documentation for all the Nix code, as well as detailed instructions for usage and deployment.
  • Overhaul most Just tasks to bring them more in sync with the other repositories
  • Upgrade all machines to NixOS 24.05
  • Upgrade auth-keys-hub to prevent lockout in case SOPS is unable to decrypt
  • Fix SOPS decryption failure on boot because of missing network.
  • Limit bootloader entries to 5 since the /boot partition is tiny
  • Additionally add fallback SSH keys for emergency use

PR#1

IOHK Nix

  • Update ledger peers to be after a more recent epoch boundary to improve bootstrapping and fix a private chain p2p delayed sync config issue.
  • Update sanchonet conway-genesis for respin
    • DRep voting thresholds both need to be 65%
    • Set govActionLifetime above the guardrail because of the short epochs.
    • Set minCommitteeSize to 5 (from recommended 7) because only 5 ICC members were able to provide keys for the respin.

PR#584 PR#585

Cardano Ops

  • Tweak the stop timeout and change the Cardano Node kill signal to SIGINT for clean restarts.
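
The stop behaviour being tuned here follows the usual signal, wait, then force-kill sequence that systemd applies when stopping a unit: the configured kill signal is sent first, and SIGKILL follows only if the process has not exited within the stop timeout. A minimal POSIX-only sketch of that sequence is shown below; the function name and default timeout are illustrative and are not taken from the cardano-ops change.

```python
import signal
import subprocess


def stop_gracefully(
    proc: subprocess.Popen,
    stop_signal: int = signal.SIGINT,
    timeout_s: float = 10.0,  # illustrative value, plays the role of the stop timeout
) -> int:
    """Send stop_signal, wait up to timeout_s, then force-kill if still running.

    Returns the process exit code (negative signal number if killed by a signal).
    """
    proc.send_signal(stop_signal)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The process did not shut down in time: fall back to SIGKILL.
        proc.kill()
        return proc.wait()
```

A timeout that is too short forces the fallback path, which is the situation that can leave a node without a clean shutdown, so the signal choice and the timeout need to be tuned together.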

Diff

CAPkgs

Added the following packages:

  • cardano-node 9.0.0 and 9.1.0
  • cardano-db-sync releases sancho-5.1.0 and 13.3.0.0
  • mithril 2428.0 and mithril-unstable
  • cardano-wallet v2024-07-19

· 3 min read
John Lotoski

High level summary

The SRE team continues work on Cardano environment improvements and general environment maintenance.

Some notable recent changes, updates or improvements include:

  • Cardano-node 9.0.0 is now deployed to mainnet, preprod, preview, private and shelley-qa environments. The last several weeks have been very busy with pre-release and release activity and environment upgrades involving cardano-node versions 8.9.3, 8.9.4, 8.12.0-pre, 8.12.1, 8.12.2 and now 9.0.0 as of this update.

  • Sanchonet environment remains pinned at cardano-node version 8.11.0-pre until the next respin which will support 9.0.0 or greater.

  • Ogmios service and package options were added to cardano-parts.

  • Four documents were added to cardano-playground to better explain some operational procedures: debugging of peer-to-peer connections; governance voting with the playground stakepools; faucet setup; faucet pool de-delegation. Found at: docs/explain

  • One document was added to cardano-mainnet to explain cardano-snapshot operations. Found at: docs/explain

  • Private chain was stopped and re-spun with 2 hr epochs for testing.

  • Hydra and performance cluster machines had their configuration updated to be more robust to transient nix store cache outages, which may recur in the future.

  • All machines in cardano-playground and cardano-mainnet clusters were updated to nixpkgs 24.05.

Lower level summary

Cardano-mainnet

  • Sets cardano-node to 8.12.2 and uses a custom gc delay parameter branch for bootstrap nodes. Updates all machines to nixpkgs 24.05 with openssh 9.8p1. Adds one new explainer readme document, new alerts and various script, recipe, and other improvements. See the PR description for more details: cardano-mainnet-pull-16

Cardano-ops

  • Bumps to cardano-node 9.0.0, adds coredump metrics, adds OOM/coredump alerting, adjusts systemd stop timeout to avoid some unnecessary chain replays: cardano-ops-compare

Cardano-parts

  • Sets cardano-node (release) and cardano-node-ng (pre-release) versions to 8.12.2 and cardano-db-sync-ng to sancho-5-0-0. Updates nixpkgs to 24.05. Includes nixosModule, dashboard, metric, alert and recipe improvements and new features. More detail is available in the PR description: cardano-parts-pull-43

Cardano-perf

  • Adjusts nix config to avoid R2 500 errors on transient cache problems and adds explorer to perf class: cardano-perf-compare

Cardano-playground

  • Sets cardano-node (release) and cardano-node-ng (pre-release) versions to 8.12.2 and cardano-db-sync-ng to sancho-5-0-0. Updates all machines to nixpkgs 24.05 with openssh 9.8p1. Respins private chain and KES rotates multiple chains. Adds four new explainer readme documents, new alerts and various script, recipe, and other improvements. See the PR description for more details: cardano-playground-pull-27

Iohk-nix

Ops-lib

  • Updates deployers with recent nixpkgs and nix, refactors to preserve legacy nixops usage, adds starship and fzf: ops-lib-pull-134

  • Bumps openssh to 9.8p1: ops-lib-pull-135

· 2 min read
John Lotoski

High level summary

The SRE team continues work on Cardano environment improvements and general environment maintenance.

Some notable recent changes, updates or improvements include:

  • Sanchonet was respun for cardano-node 8.11.0-pre

  • Private chain was respun twice for pre-sancho respin testing and short epoch testing with cardano-node 8.11.0-pre

  • Shelley-qa, two-thirds of preview and one-third of preprod networks were deployed to cardano-node 8.11.0-pre

  • Sanchonet, private chain and shelley-qa networks had dbsync sancho-4-3-0 deployed

  • A dbsync show_current_forging prepared statement was added to the cardano-parts profile-cardano-postgres nixosModule to aid with debugging chain quality issues

  • Three documents were added to cardano-playground to better explain some operations procedures: KES rotation, chain quality debugging and new network creation. Found at: docs/explain

  • A new mithril dashboard template is available in cardano-parts

Lower level summary

Capkgs:

  • Avoids git API rate limit errors in the update github action via netrc usage and a corresponding secret: capkgs-commit

Cardano-parts

  • Sets cardano-node-ng to 8.11.0-pre and cardano-db-sync-ng to sancho-4-3-0. Adds a dbsync prepared statement, mithril dashboard template, updates the node application dashboard template, improves justfile recipe templates and tunes some systemd dependencies. Iohk-nix-ng was updated for sanchonet and private chain respins. More detail is available in the PR description: cardano-parts-pull-41

Cardano-mainnet

  • Rotates KES, pins iogp4 as -ng, adds a mithril dashboard, updates the node application dashboard, improves justfile recipes and tunes systemd node and mithril services to avoid some edge case errors. See the PR description for more details: cardano-mainnet-pull-15

Cardano-ogmios

Cardano-playground

  • Respins sancho and private chains and deploys cardano-node 8.11.0-pre and cardano-db-sync sancho-4-3-0 to appropriate envs and machines. Adds a mithril dashboard template, updates the node application dashboard template, improves justfile recipe templates. Adds three new explainer readme documents. See the PR description for more details: cardano-playground-pull-24