
SRE Team Update

· 2 min read
John Lotoski
Service Reliability Engineer

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • The IOG private mainnet pools were retired this week. The IOG1 public stakepool remains active and forging.

  • An internal Dijkstra network was spun up for testing of the upcoming intra-era hard fork, followed by Dijkstra hard fork testing.

Repository Work -- Merged

Cardano-airgap

cardano-airgap PR#11:

  • Updates to nixpkgs 25.11
  • Updates adawallet, which also moves to nixpkgs 25.11, and fixes its docopts
  • Bumps capkgs and the corresponding bech32 package

Cardano-node

cardano-node PR#6401:

  • Bumps iohkNix flake input and adjusts configuration files for new tracing system parameter changes.

Cardano-parts

cardano-parts PR#78:

  • Adds CI tests for process-compose validation of node and db-sync stacks on the public networks.

Devx-ci

devx-ci PR#140:

  • Provides improvements to hydra-tools, including support for multiple GitHub organizations and GitHub app installations.

Repository Work In Progress -- PRs and Branches

Mithril Team Update

· 3 min read
Jean-Philippe Raynaud
Mithril Tech Lead

High level overview

The Mithril team focused on implementing the SNARK-friendly STM library and began developing the non-recursive SNARK circuit MVP within it. They completed the preparation phase of the SNARK circuit by assessing infrastructure costs and the audit status of the Midnight ZK library, and by detecting an under-constrained circuit. They also continued working on the prototype of the recursive SNARK circuit.

Additionally, the team completed the DMQ infrastructure implementation and prepared for its deployment. They wrote a new guide for setting up a follower aggregator and published two development blog posts: one about the upcoming DMQ testing program with SPOs, and one about the multiple aggregators testing program.

Finally, they switched to the Blockfrost API to fetch SPO tickers and names in the aggregator, and fixed CI issues caused by running out of disk space on GitHub runners.

Low level overview

Features

  • Published a dev blog post DMQ testing program with SPOs
  • Published a dev blog post Multiple aggregators testing program
  • Completed the issue Add a new guide on how to setup a follower aggregator #2815
  • Completed the issue Evaluate SNARK infrastructure for production/testing #2860
  • Completed the issue Simplify code of STM library #2794
  • Completed the issue Support test mode for the Halo2 circuit #2798
  • Completed the issue Detect an under constrained Halo2 circuit #2801
  • Completed the issue Assess constraints on Halo2 circuit verification #2799
  • Completed the issue Add AVK chaining verification to the recursive IVC circuit #2861
  • Completed the issue Midnight ZK library audit status #2802
  • Completed the issue Change hash function for support in Plutus #2766
  • Worked on the issue DMQ testing with SPOs on preview #2833
  • Worked on the issue Document recursive SNARK solution #2767
  • Worked on the issue Update protocol parameters to SNARK friendly values #2813
  • Worked on the issue Release 2603 distribution #2830
  • Worked on the issue Implement SNARK-friendly changes in STM library #2795
  • Worked on the issue Use Midnight ZK backends for Jubjub and Poseidon in STM #2888
  • Worked on the issue Update the Midnight library dependency in circuit prototype #2910

Protocol maintenance

  • Completed the issue Enhance signer/signature registration metrics in aggregator #2855
  • Completed the issue No more available disk space on GitHub runners #2906
  • Completed the issue Nightly tests does not fetch latest main artifacts #2879
  • Completed the issue Replace SPO ticker API in aggregator #2878
  • Worked on the issue Enhance protocol security page on website #2703

SRE Team Update

· One min read
John Lotoski
Service Reliability Engineer

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Much of the SRE team is on vacation during this biweekly update period.

Happy holidays to all of the Cardano community!

Repository Work -- Merged

Capkgs

capkgs Range:

  • Updates the content-addressed package repository CI job to use a netrc token to handle GitHub API rate limits, and adds URL redirection handling.
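As a rough illustration of the netrc approach, the sketch below reads a token for `api.github.com` from a netrc file and attaches it as an `Authorization` header. It is a minimal Python sketch, not the capkgs CI job itself: the function names `github_auth_header` and `fetch` are hypothetical, and the netrc path and host are assumptions.

```python
import netrc
import urllib.request


def github_auth_header(netrc_path: str, host: str = "api.github.com") -> dict:
    """Build an Authorization header from the netrc entry for `host`, or {} if none."""
    entry = netrc.netrc(netrc_path).authenticators(host)
    if entry is None:
        return {}  # unauthenticated requests get much lower GitHub API rate limits
    _login, _account, password = entry
    return {"Authorization": f"token {password}"}


def fetch(url: str, netrc_path: str) -> bytes:
    """Fetch `url` with the netrc token attached (hypothetical helper)."""
    req = urllib.request.Request(url, headers=github_auth_header(netrc_path))
    # urllib's default opener follows HTTP redirects (301/302/303/307/308) for us.
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Keeping the token in a netrc file rather than on the command line keeps it out of process listings and job logs.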

Devx-ci

devx-ci PR#139:

  • Adds extra x86_64-linux build farm machines, ci11 and ci12, to the build cluster and re-keys secrets

Repository Work In Progress -- PRs and Branches

Performance & Tracing Update

· 4 min read
Michael Karg
Performance and Tracing Team Lead

High level summary

  • Benchmarking: Node 10.6 benchmarks confirming the heap size fix; first LSM-trees benchmarks.
  • Infrastructure: New typesetting tool for reporting pipeline.
  • Tracing: Increased robustness of the PrometheusSimple metrics backend; previous quality-of-life improvements released.
  • Leios: Linear temporal logic based trace verifier demo for Leios.

Low level overview

Benchmarking

The underlying cause of the increased RAM usage in Node 10.6.0 has been identified and addressed. While the heap size increase is still present outside of our benchmarking environment, its extent there is negligible. We've re-run cluster benchmarks to confirm the fix is successful.

Additionally, we've performed and analyzed benchmarks on several LSM-trees integration branches. This feature has not yet been released in any Node version, so it is not yet fully configurable, and the benchmarks should be understood as a very early performance assessment. We've benchmarked both in-memory and on-disk backing stores. Especially for the on-disk benchmarks, we observed a clear decrease in RAM usage, with only small increases in CPU usage. While there is some extra cost to block adoption, cluster diffusion metrics remain almost identical to the in-memory benchmarks, mostly due to header pipelining. As we didn't artificially constrain memory, the benchmarks illustrate LSM-trees behaviour when there's no pressure from the garbage collector. Given that, does the on-disk LSM-trees backend use caching and buffering efficiently, or does it perform redundant disk I/O? The answer is the former.
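The question of efficient caching versus redundant disk I/O can be made concrete with a toy model that counts how many page reads actually reach "disk". This is a hypothetical Python sketch, not how lsm-trees implements its I/O: `ToyDisk` and `make_cached_reader` are invented names, and a `functools.lru_cache` stands in for whatever caching and buffering the real backend relies on.

```python
from functools import lru_cache


class ToyDisk:
    """Toy stand-in for an on-disk store; counts page reads that reach the 'disk'."""

    def __init__(self, pages: dict[int, bytes]):
        self.pages = pages
        self.reads = 0

    def read_page(self, page_no: int) -> bytes:
        self.reads += 1
        return self.pages[page_no]


def make_cached_reader(disk: ToyDisk, capacity: int):
    """Wrap `disk` in an LRU page cache holding up to `capacity` pages."""

    @lru_cache(maxsize=capacity)
    def read(page_no: int) -> bytes:
        return disk.read_page(page_no)  # only executed on a cache miss

    return read


disk = ToyDisk({i: bytes([i]) for i in range(8)})
read = make_cached_reader(disk, capacity=4)
for page in [0, 1, 0, 1, 0, 1]:
    read(page)
```

With the cache, six lookups of two hot pages cost only two disk reads; reading `disk.read_page` directly would cost six. Efficient caching shows up as the first pattern, redundant I/O as the second.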

Infrastructure

For convenient creation of reporting documents, we're integrating a new typesetting tool: The brilliant, open-source Typst project promises fully typesettable and scriptable documents, while maintaining a syntax that is (almost) as easy to grasp as Markdown. Typst extensions even render our gnuplots inline - and fast. Easily scriptable styling enables us to deliver an often requested feature: Colorizing individual result metrics based on how risky (or beneficial) a deviation from the baseline is deemed to be. Up to now, our reporting pipeline depended on Emacs Org mode and a medium-sized LaTeX distro as part of the Performance Workbench; we might be able to drop these heavy dependencies in favor of something more modern soon.
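The colorizing idea boils down to classifying each metric's relative deviation from the baseline. The report itself would do this in Typst scripting; the Python sketch below only shows the classification logic. The 5% tolerance, the `lower_is_better` flag and the colour names are assumptions for illustration, not the Performance Workbench's actual thresholds.

```python
def classify_deviation(baseline: float, value: float,
                       lower_is_better: bool = True,
                       tolerance: float = 0.05) -> str:
    """Label a metric as 'neutral', 'beneficial' or 'risky' relative to its baseline.

    `tolerance` is a hypothetical relative threshold below which a change is ignored.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    rel = (value - baseline) / abs(baseline)
    if abs(rel) <= tolerance:
        return "neutral"
    # For e.g. RAM or latency, a decrease is an improvement; for throughput, an increase.
    improved = rel < 0 if lower_is_better else rel > 0
    return "beneficial" if improved else "risky"


# Hypothetical colour scheme for the rendered result tables.
COLOURS = {"neutral": "black", "beneficial": "green", "risky": "red"}
```

For example, `COLOURS[classify_deviation(100, 120)]` marks a 20% RAM increase in red, while the same change in a throughput metric (`lower_is_better=False`) would be green.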

Tracing

The Node's internal PrometheusSimple backend for exposing metrics has received several robustness improvements. All of them aim to mitigate factors in the host environment that can contribute to the backend becoming unreachable. It now reaps dangling socket connections more eagerly, preventing false positives in DoS protection. Furthermore, there is now a restart/retry wrapper in place, should the backend fail unexpectedly. All start and stop events are traced in the logs, exposing a potential error cause. Merged in cardano-node PR#6396.
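The restart/retry pattern can be sketched as a loop that re-runs a failing server function and logs every start and stop. This is a Python illustration of the general pattern only; the actual backend lives in the Haskell code base, and `run_with_restart`, `max_restarts` and the linear backoff are assumptions.

```python
import logging
import time
from typing import Callable

log = logging.getLogger("metrics-backend")


def run_with_restart(serve: Callable[[], None], max_restarts: int = 3,
                     backoff_s: float = 1.0) -> int:
    """Run `serve`, restarting it if it raises; return how many restarts were needed.

    Every start and stop is logged, so an unexpected failure leaves a trace.
    """
    restarts = 0
    while True:
        log.info("backend starting (attempt %d)", restarts + 1)
        try:
            serve()
            log.info("backend stopped normally")
            return restarts
        except Exception as exc:
            log.warning("backend stopped unexpectedly: %r", exc)
            if restarts >= max_restarts:
                log.error("giving up after %d restarts", restarts)
                raise  # re-raise the last failure to the caller
            restarts += 1
            time.sleep(backoff_s * restarts)  # linear backoff between attempts
```

Bounding the number of restarts keeps a persistently failing backend from looping forever, while the log lines record why each attempt ended.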

The previous batch of quality-of-life improvements in cardano-node PR#6377 has also been merged and released. It includes Prometheus HTTP service discovery for cardano-tracer, more robust recovery from (and tracing of) forwarding connection interruptions, as well as stability improvements for engineers implementing tracers.
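Prometheus HTTP service discovery expects an endpoint that returns a JSON array of target groups, each with a `targets` list of `host:port` strings and an optional `labels` object; Prometheus polls it via an `http_sd_configs` scrape entry. The Python sketch below renders such a payload from a mapping of node names to metrics endpoints. The `node_name` label and the example addresses are hypothetical; cardano-tracer's actual endpoint and labels may differ.

```python
import json


def http_sd_payload(nodes: dict[str, str]) -> str:
    """Render Prometheus HTTP SD JSON with one target group per node.

    `nodes` maps a (hypothetical) node name to the host:port of its metrics endpoint.
    """
    groups = [
        {"targets": [addr], "labels": {"node_name": name}}
        for name, addr in sorted(nodes.items())  # stable order for reproducible output
    ]
    return json.dumps(groups)
```

Attaching the node name as a label lets dashboards tell scraped nodes apart without hard-coding target lists in the Prometheus configuration.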

Leios

Our conformance testing framework, which evaluates linear temporal logic propositions against trace output, has matured. It has received performance and usability improvements: for instance, helpful human-readable output showing the minimal sequence of traces that caused a proposition to be false, and the ability to consume traces from an arbitrary number of nodes instead of only one. We've already created several propositions targeting the well-behavedness of the block forging loop; diffusion-related propositions for Praos and eventually Leios are the logical next steps.
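To give a flavour of checking a temporal proposition against traces, the Python sketch below evaluates a tiny fragment of LTL over finite traces and checks a response property, "every forge start is eventually followed by a forge end". It is not the team's framework: the formula constructors, the `ns`/`at` trace fields and the `Forge.Start`/`Forge.End` names are hypothetical, and `first_violation` returns the suffix from the first failing position rather than a computed minimal sequence.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Union

Event = dict  # one trace message; the "ns" and "at" keys used here are hypothetical


@dataclass(frozen=True)
class Atom:
    pred: Callable[[Event], bool]


@dataclass(frozen=True)
class Implies:
    lhs: "Formula"
    rhs: "Formula"


@dataclass(frozen=True)
class Eventually:
    f: "Formula"


@dataclass(frozen=True)
class Always:
    f: "Formula"


Formula = Union[Atom, Implies, Eventually, Always]


def holds(f: Formula, trace: Sequence[Event], i: int = 0) -> bool:
    """Finite-trace LTL semantics: does `f` hold at position `i` of `trace`?"""
    if isinstance(f, Atom):
        return i < len(trace) and f.pred(trace[i])
    if isinstance(f, Implies):
        return (not holds(f.lhs, trace, i)) or holds(f.rhs, trace, i)
    if isinstance(f, Eventually):
        return any(holds(f.f, trace, j) for j in range(i, len(trace)))
    if isinstance(f, Always):
        return all(holds(f.f, trace, j) for j in range(i, len(trace)))
    raise TypeError(f"unsupported formula: {f!r}")


def first_violation(prop: Always, trace: Sequence[Event]) -> Optional[List[Event]]:
    """Return the trace suffix starting where the body of `prop` first fails, or None."""
    for j in range(len(trace)):
        if not holds(prop.f, trace, j):
            return list(trace[j:])
    return None


def merge_traces(*per_node: Sequence[Event]) -> List[Event]:
    """Interleave per-node traces (each already sorted by "at") into one stream."""
    return list(heapq.merge(*per_node, key=lambda e: e["at"]))


starts = Atom(lambda e: e["ns"] == "Forge.Start")
ends = Atom(lambda e: e["ns"] == "Forge.End")
prop = Always(Implies(starts, Eventually(ends)))
```

One design point shows up quickly when consuming traces from several nodes: in a merged stream, one node's `Forge.End` can satisfy another node's `Forge.Start`, so per-node properties need either per-node evaluation or atoms that also match a node identifier.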

Even though this framework was built with Node Diversity in mind, we were able to showcase it at this month's Leios event and demonstrate what it could deliver for that project as well, and we were very satisfied with the reception it got.

Performance & Tracing wishes you Happy Holidays...

...and a Joyful New Year!

SRE Team Update

· 2 min read
John Lotoski
Service Reliability Engineer

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Starting with the next node release, version 10.6.2, release binaries and OCI images will be generated for arm64 architectures.

Repository Work -- Merged

Acropolis

acropolis PR#482:

  • Removes unused packages to free up disk space for running CI tests

acropolis PR#483:

  • Schedules a run of the omnibus bootstrap process every morning at 00:15
  • Fails the job if the process does not complete within 3 hours
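The "fail after 3 hours" rule amounts to running the process under a hard timeout and turning a timeout into a non-zero exit status. The Python sketch below shows that idea with `subprocess.run(timeout=...)`, which kills the child and raises `TimeoutExpired` when the limit is hit. It is an assumption-laden illustration, not the acropolis workflow: `run_bootstrap` is a hypothetical name, and the real job presumably relies on its CI scheduler's own timeout setting. In cron syntax, a daily 00:15 trigger would be written `15 0 * * *`.

```python
import subprocess
import sys
from typing import List

THREE_HOURS_S = 3 * 60 * 60


def run_bootstrap(cmd: List[str], timeout_s: float = THREE_HOURS_S) -> int:
    """Run `cmd`; return its exit code, or 1 if it is still running after `timeout_s`."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child at this point.
        print(f"bootstrap did not finish within {timeout_s} s", file=sys.stderr)
        return 1
```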

Cardano-node

cardano-node PR#6376:

  • This PR improves support for multiple architectures in the following ways:

    1. Adds aarch64-linux nix packages, including musl static and OCI tarball generation package variants;

    2. Bumps GHC from 9.6.6 -> 9.6.7 as well as the cardano-automation flake input for aarch64-linux support;

    3. Updates the release-ghcr GHA workflow to produce linux multi-arch manifest OCI and corresponding release images which auto-resolve on container pull to the appropriate arch (amd64 or arm64);

    4. Updates the release-upload GHA workflow to produce new linux and darwin aarch64 artifacts, and produces new default OCI images whose names align with the OCI/goarch standard architecture names.

    More details available in the PR description.
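With a multi-arch manifest, the container runtime picks the matching image itself on pull; the only subtlety is that hosts report machine names (`x86_64`, `aarch64`) that differ from the OCI/goarch names (`amd64`, `arm64`). The Python sketch below shows that mapping for the two architectures mentioned above. `oci_arch` is a hypothetical helper for illustration and is not part of the cardano-node release tooling.

```python
import platform
from typing import Optional

# Machine names as reported by uname/platform, mapped to OCI (goarch-style) names.
_OCI_ARCH = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}


def oci_arch(machine: Optional[str] = None) -> str:
    """Return the OCI architecture a multi-arch manifest would select for `machine`.

    Defaults to the current host; only amd64 and arm64 are covered.
    """
    name = (machine or platform.machine()).lower()
    if name not in _OCI_ARCH:
        raise ValueError(f"no amd64/arm64 image for machine type {name!r}")
    return _OCI_ARCH[name]
```

For example, a Linux Raspberry Pi 5 reports `aarch64` and a Apple Silicon Mac reports `arm64`; both resolve to the `arm64` image.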

cardano-node PR#6391:

  • Adds db-analyser, db-synthesizer, and db-truncater to the Cardano Node container image.

Devx-ci

devx-ci PR#136:

  • Updates hydra version from 2.28 -> 2.32 + issue patch and explicitly allows IFD
  • Applies nix version 2.32-maintenance to hydra and linux builders
  • Adds ssh stabilization params to the hydra module for connection to remote builders
  • Disables nixos optimise on hydra to avoid GC performance degradation
  • Removes the r2 wireguard tunnel from the remote builders as it is not currently required

Repository Work In Progress -- PRs and Branches