
Marcin Szamotulski
Marcin Wójtowicz

Overview of sprint 100 and 101

Cardano Incident

On November 12th, Cardano experienced a chain fork due to a bug in ledger (de-)serialisation: some nodes accepted an invalid transaction while others rejected it, splitting the chain.

The network team worked closely with the other Cardano teams, Intersect, the Cardano Foundation, Emurgo, and stake pool operators to monitor and resolve the incident.

The network layer was not affected by the incident, and its resiliency played a role in the recovery of the network. We identified some areas that can further improve the network layer's robustness in such situations, and we'll be working on addressing these issues.

Churn Mitigation

We rolled out a churn mitigation in the latest cardano-node 10.5 and 10.6 releases. This change ensures that hot peers churn at least as fast as established peers, and that established peers churn at least as fast as known peers. This way, we avoid a situation where, over a long period of time, established peers accumulate already-tried hot peers and cold peers accumulate already-tried established peers. See #5238 for more details.
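The ordering constraint can be sketched as follows. This is a minimal illustration with hypothetical names, not the cardano-node implementation; a smaller interval means faster churn, so the constraint is hot ≤ established ≤ known:

```haskell
-- Hypothetical sketch: clamp configured churn intervals so that hot peers
-- churn at least as fast as established peers, and established peers at
-- least as fast as known peers.

data ChurnIntervals = ChurnIntervals
  { knownInterval       :: Int  -- seconds between churning known (cold) peers
  , establishedInterval :: Int  -- seconds between churning established peers
  , hotInterval         :: Int  -- seconds between churning hot peers
  } deriving (Show, Eq)

-- | Enforce hotInterval <= establishedInterval <= knownInterval by clamping.
normaliseChurn :: ChurnIntervals -> ChurnIntervals
normaliseChurn (ChurnIntervals k e h) =
  let e' = min e k   -- established churns at least as fast as known
      h' = min h e'  -- hot churns at least as fast as established
  in ChurnIntervals k e' h'
```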

The issue was identified using CF's cardano-ignite tool, with analysis provided by Karl Knutsson.

DMQ-Node

We are initiating a public repository for dmq-node to host its codebase.

We removed the KES evolution configuration; as a consequence, the genesis file option is no longer needed in the configuration of dmq-node, see #5244.

Ouroboros Leios development

Leios Demo

We have contributed improvements to the consensus team's Leios demo, which includes minimal prototypes of a few mini-protocols and a simple network emulation layer for exchanging data between a Leios server and patched Cardano nodes. Our updates addressed unrealistic results stemming from the use of toxiproxy for modelling bandwidth and delays. We have also improved our packet capture tooling and gathered some preliminary data for analysis. Based on this early investigation, we have identified a few areas where non-trivial changes may have to be made to the network stack to support the new protocol, as well as new features which could be introduced to deal with greatly expanded traffic requirements while maintaining the base Praos protocol's timeliness guarantees. More details were provided at the recent November Leios demo presentation, and further work will continue along these lines.

Server-side re-ordering for a request-response mini-protocol

Leios requires support for Freshest-First delivery. For that purpose, we started work on a prototype implementation of a request-response mini-protocol which allows for server-side re-ordering while maintaining typed-protocols' safety guarantees of deadlock and livelock freedom.
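The core idea can be sketched in plain Haskell, outside of typed-protocols (names and representation here are hypothetical, not the prototype's): requests carry an identifier so the server may answer out of order, and the server serves pending requests freshest-first, i.e. most recently received first:

```haskell
-- Illustrative sketch of Freshest-First server-side re-ordering: each request
-- is tagged with an id and an arrival time; the client matches responses to
-- requests by id, so the server is free to answer newest-first.

import Data.List (sortOn)
import Data.Ord (Down (..))

data Request  = Request  { reqId :: Int, arrival :: Int } deriving Show
data Response = Response { respFor :: Int } deriving (Show, Eq)

-- | Serve pending requests freshest-first (largest arrival time first).
serveFreshestFirst :: [Request] -> [Response]
serveFreshestFirst = map (Response . reqId) . sortOn (Down . arrival)
```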

Peer Selection Improvements

Refined peer selection for local root peers behind firewalls: instead of polling, it now waits for incoming connections and reuses them outbound. See #5241 for details.

Michael Karg

High level summary

  • Benchmarking: Node 10.6 (pre-)release benchmarks; compiler benchmarks: GHC9.12.
  • Development: Ongoing work on queryable timeseries prototype and message-based cardano-tracer library API.
  • Infrastructure: Optimization of trace forwarding.
  • Tracing: New quality-of-life features like Prometheus service discovery; significant tech debt removal.
  • Leios: Process specification for Leios conformance and performance testing / benchmarks; simulation result re-evaluation.
  • Node diversity: New framework for system conformance testing based on linear temporal logic.

Low level overview

Benchmarking

We've performed and analysed (pre-)release benchmarks for Node 10.6.0. Our benchmarks exhibited a clear increase in RAM usage (~15%) for a block producer. We're currently investigating the underlying cause, and whether it's reproducible outside of the benchmark, without the constant, artificial stress we subject the system to. The published final results can be found in the Performance report for 10.6.0-pre.

Furthermore, we've performed compiler benchmarks for the Node built with GHC9.12 instead of our current default GHC9.6. After getting both the 10.5 and the 10.6 branches to build with the new version, we performed cluster benchmarks for both, resulting in a cross-comparison to their respective baselines. We could determine that the GHC9.12 build is on par with the GHC9.6 one for Node 10.5, and even slightly better for Node 10.6 as far as block production, diffusion and adoption metrics are concerned. However, it exhibits unexplained increases in CPU time used (~9%) and Allocations + Minor GCs (~6%). We also found that by disabling a particular optimization on GHC9.12 called 'speculative evaluation of dictionary functions', the increase in CPU time is roughly halved. So while the code base seems to be ready overall to switch to the more recent compiler version, that increase, though not dramatic, is still being investigated.

Development

Our prototype for aggregating timeseries of Node metrics, and evaluating PromQL-like queries on them, has made significant progress. The query language is nearly complete, and is evaluated with reasonable speed. Our current work focuses on an efficient in-memory representation of timeseries - before considering any on-disk or database backed solution. To that end, we've created a microbenchmark tracking space usage, and will use real-life scraped metrics data from nodes deployed and monitored by SRE.
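To illustrate the kind of query being evaluated, here is a minimal sketch of a PromQL-like `rate` over an in-memory timeseries. The representation and names are assumptions for illustration, not the prototype's actual design:

```haskell
-- Illustrative sketch: a timeseries as (timestamp, value) samples, and a
-- PromQL-like rate computing the per-second increase of a counter over a
-- lookback window ending at time t.

type Sample = (Double, Double)  -- (unix timestamp, counter value)

-- | Per-second rate of increase over the window [t - w, t]. Assumes a
-- monotonically increasing counter and samples in ascending time order;
-- returns Nothing if fewer than two samples fall inside the window.
rateOver :: Double -> Double -> [Sample] -> Maybe Double
rateOver w t samples =
  case [s | s@(ts, _) <- samples, ts >= t - w, ts <= t] of
    inWindow@((t0, v0) : _ : _) ->
      let (t1, v1) = last inWindow
      in Just ((v1 - v0) / (t1 - t0))
    _ -> Nothing
```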

The clean separation of cardano-tracer into a library and an application part has entered testing. As it is built for highly concurrent operations, we chose a message-passing API where messages are always typed, and guaranteed to be consumed in the order they were emitted. Custom applications will be able to implement handlers in a composable manner that react to certain message types (or values); at the same time, they can emit messages that won't interfere with the library's internal messaging - which is assured by Haskell's type system.
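The ordering guarantee can be illustrated with a small STM sketch (the message type and helpers are hypothetical, not the cardano-tracer library API): an STM `TQueue` preserves the order in which messages were written, and handlers can pattern-match on message constructors:

```haskell
-- Illustrative sketch: typed messages over an STM queue; writeTQueue and
-- flushTQueue preserve FIFO (emission) order.

import Control.Concurrent.STM

data TracerMsg
  = NodeConnected String
  | TraceReceived String
  | NodeDisconnected String
  deriving (Show, Eq)

-- | Emit a message; the queue preserves emission order.
emit :: TQueue TracerMsg -> TracerMsg -> IO ()
emit q = atomically . writeTQueue q

-- | Drain all currently queued messages, in the order they were emitted.
drain :: TQueue TracerMsg -> IO [TracerMsg]
drain q = atomically (flushTQueue q)
```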

Infrastructure

We've simplified the trace-forward implementation (a non-breaking change). When forwarding traces, the host application buffers those objects in a bounded queue until they're requested. The implementation used to switch between queue sizes based on congestion, and to mitigate short, temporary service interruptions. After taking a look into the specifics of the queue implementation (in a dependency), we found that there's no overhead for the unused capacity of any queue - meaning switching the queue size to the shorter variant does not yield any advantage. We were able to simplify the implementation based on that insight - which also turned out to be an optimization. Merged: cardano-node PR#6361.

Tracing

The new tracing system is the default since the Node 10.6 (pre-)release. Based on SRE's internal deployments and feedback, we've been adding several quality-of-life features and addressing some tech debt to enhance the new system's user experience. This includes: a Prometheus HTTP service discovery feature for cardano-tracer, which allows for dynamically registering and scraping connected nodes; a CBOR formatting for trace messages, which can be used to implement a binlog in the future (for space efficiency); and a basic Haskell type with parser instances for machine-readable logs (JSON or CBOR), which greatly simplifies building any tooling that consumes trace output / log files.
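For reference, Prometheus HTTP service discovery expects the standard HTTP SD response shape shown below; the target addresses and labels here are purely illustrative, not cardano-tracer's actual output:

```json
[
  {
    "targets": ["10.0.0.1:12798", "10.0.0.2:12798"],
    "labels": {
      "job": "cardano-node"
    }
  }
]
```

Prometheus polls the configured HTTP SD endpoint periodically, so newly connected nodes appear as scrape targets without a configuration reload.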

Additionally, there are many small changes that improve robustness, or code quality; such as safe garbage collection of threaded writers, simplification of the LogFormatting typeclass which is an interface of how traces can be rendered, properly tracing forwarding service interruptions, and fixing delay increments in the retry policy for re-establishing a forwarding connection.

The corresponding cardano-node PR#6377 is currently in draft state.

Leios

We're finalizing the Leios performance testing plan. This substantiates our approach based on the impact analysis, and formulates concrete steps regarding formal specification of system traces, generalisation of how benchmarking workloads are defined declaratively (and generated), taking advantage of common cases in performance and conformance testing (reducing the necessity for domain-specific tooling), and setting up dedicated microbenchmarks for key components (like e.g. crypto) which are designed to provide comparability over long time.

Node diversity

For conformance testing across diverse node implementations, we've designed a framework based on linear temporal logic (LTL). This allows for system or protocol properties to be expressed as logical propositions; it does not require any executable specification or similar, and is thus language independent. Evaluation of LTL tends to be very fast and can potentially scale up to routine verifications as part of CI. Even though Haskell was chosen as a language for the project, it can consume trace evidence from Cardano implementations in any language.

The project Cardano Trace LTL is still in prototype stage; it's able to verify a few very basic Praos properties, expressed as LTL propositions, based on our benchmarking environment's trace evidence.
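To illustrate the approach, here is a minimal LTL evaluator over finite traces. This is illustrative only, not the Cardano Trace LTL implementation; it uses a common finite-trace semantics where "always" quantifies over all non-empty suffixes and "eventually" over some non-empty suffix:

```haskell
-- Illustrative sketch: propositions in linear temporal logic, evaluated
-- against a finite trace of events (finite-trace semantics).

import Data.List (tails)

data LTL a
  = Atom (a -> Bool)        -- holds of the current event
  | Not (LTL a)
  | And (LTL a) (LTL a)
  | Next (LTL a)            -- holds at the next position
  | Always (LTL a)          -- holds at every remaining position
  | Eventually (LTL a)      -- holds at some remaining position

-- | All non-empty suffixes of a trace.
nonEmptySuffixes :: [a] -> [[a]]
nonEmptySuffixes = filter (not . null) . tails

-- | Evaluate a proposition against a finite trace.
holds :: LTL a -> [a] -> Bool
holds (Atom p)       tr = case tr of (x : _) -> p x; [] -> False
holds (Not f)        tr = not (holds f tr)
holds (And f g)      tr = holds f tr && holds g tr
holds (Next f)       tr = case tr of (_ : rest) -> holds f rest; [] -> False
holds (Always f)     tr = all (holds f) (nonEmptySuffixes tr)
holds (Eventually f) tr = any (holds f) (nonEmptySuffixes tr)
```

Because evaluation only walks suffixes of the trace, checking a proposition against recorded trace evidence stays cheap, which is what makes routine CI verification plausible.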

John Lotoski

High level summary

The SRE team continues work on Cardano environment improvements and general maintenance.

Some notable recent changes, updates or improvements include:

  • Cardano-node 10.6.0 has been pre-released with the corresponding long running SRE PRs now merged into this release! See the release notes for details.

  • SRE team identified a ledger replay bug in the 10.6.0 release candidate whereby the legacy tracing system would no longer log ledger replay update statistics. A fix was implemented prior to tagging and pre-releasing.

  • Near the end of this biweekly reporting period, the preview network experienced a network partition. After an intense multi-team and community collaborative effort, in which the SRE team participated from the beginning, the bug causing the partition was identified and a new cardano-node version 10.5.2 was released to fix the issue. This version was deployed to the IOE preview machines immediately upon release, and shortly afterwards to the rest of the IOE testnet and mainnet infra.

Repository Work -- Merged

Cardano-mainnet

  • Bump cardano-parts for v2025-11-18

  • Updated CloudFormation terraformState.nix and opentofu/cluster.nix for corresponding tagging updates

  • Added new required flake cluster attribute declaration for new required resource tagging

  • Added a matomo nix module prototype in prep for a legacy bitte cluster matomo migration to prod

  • Fixed script breakages caused by cardano-cli breaking changes

  • Adds a smash delisting

  • Rotates mainnet KES

  • Adjusts an alert for pool 1 infrequent forging threshold noise

    cardano-mainnet-pr-39

Cardano-parts

  • Bumps cardano-node pre-release to 10.6.0, mithril to 2543.1-hotfix, and blockperf to a fix branch which includes a patch for proper blockperf configuration under the new tracing system

  • Added rsync ssm help bash function and alias to the common machine profile

  • Added peer snapshot files to the ops library function generateStaticHTMLConfigs

  • Added new flakeModule cluster.nix options of infra.generic costCenter, owner and project

  • Added zsh devShell command completion

  • Updated a number of nixosModules to support both new and legacy tracing systems as well as 10.6.0 and 10.5.1 configuration differences

  • Updated template CloudFormation terraform state and opentofu cluster resource definitions for corresponding tagging updates

  • Fixed template script breakages caused by cardano-cli breaking changes included in the 10.6.0 pre-release

  • Fixed the profile-blockperf.nix nixosModule new tracing system configuration

    cardano-parts-release-v2025-11-18

Cardano-playground

  • Added book config updates for 10.6.0 pre-release environments: preprod, preview

  • Added sanchonet environment configs for community disaster test participation

  • Added a "New Pool" document explainer at docs/explain/new-pool.md

  • Added new required flake cluster attribute declaration for new required resource tagging

  • Added a matomo nix module prototype in prep for a legacy bitte cluster matomo migration to prod

  • Added wireguard tunnel endpoints as temporary R2 colo http streaming/timeout bucket workarounds

  • Added misc improvements to playground scripts for governance voting

  • Updated CloudFormation terraform state and opentofu cluster resource declarations for corresponding tagging updates

  • Updated CI to a smaller representative machine subset

  • Updated preview, preprod and non-prod test forgers with KES rotation

  • Fixed script breakages caused by cardano-cli breaking changes

  • Voted on a preview and preprod governance action with drep/pools and CC

    cardano-playground-pr-51

Iohk-nix

  • Merges non-forger and forger configs, with node handling differences internally based on forger status (ie: PeerSharing, TargetNumberOfKnownPeers, TargetNumberOfRootPeers)

  • Includes peerSnapshotFile for all networks, now at v2

  • Allows SRV records for bootstrap resource definitions

  • Adjusts the default networking mode to p2p without explicit declaration, as p2p is the only mode for >= 10.6.0

  • Bump minNodeVersion to 10.6.0 for default config changes

  • Adjusts testnet templates for Plutus V3 cost model params, matching mainnet's 251 parameters, and adds Dijkstra genesis

    iohk-nix-pr-602

Repository Work In Progress -- PRs and Branches

Alexey Kuleshevich

High level summary

The ledger team has successfully defined what a sub-transaction really means for Nested Transactions in the Dijkstra era. We've also reduced quite a bit of duplication in how the CDDL is specified, with further improvements still in progress. We have fixed some serialization issues for the Dijkstra era that we couldn't fix for previous eras, related to preventing duplicates from being supplied over the wire. Besides these improvements, we've also cleaned up some deprecated functionality, reduced memory overhead in the ledger state, improved the performance of the epoch boundary transition, and fixed some issues in conformance tests.

Low level summary

Features

  • PR-5361 - Disable old redeemer deserialization
  • PR-5316 - Store delegators in pool state
  • PR-5375 - More CDDL de-duplication
  • PR-5376 - Remove all deprecated functionality introduced before the latest release
  • PR-5384 - Remove Generic instance from BoundedRatio type
  • PR-5382 - Even more CDDL
  • PR-5391 - Store future pools in PState as StakePoolParams
  • PR-5396 - Replace okeyL method with toOKey
  • PR-5386 - Add a subtransactions field to DijkstraTxBodyRaw
  • PR-5402 - Add custom Show instance for the Mismatch type, to show the Relation between the values
  • PR-5398 - Intern stake credentials in reverse delegations
  • PR-5411 - Switch to using TypeData extension
  • PR-5394 - CDDL: Consolidate certificates and pool params
  • PR-5417 - Fix Monoid instance for VMap
  • PR-5397 - Limit protocol version to Word32 from version 12
  • PR-5373 - Make decoders fail when encountering duplicate elements in TxWits
  • PR-5430 - Reflect subtransactions in dijkstra CDDL

Testing

  • PR-5347 - Move or replace BabbageFeatures tests in cardano-ledger-test
  • PR-5385 - Fix stakepool-test in nightly build
  • PR-5184 - Write an ImpTest to reproduce #5170
  • PR-5370 - Add a golden test for duplicate certificates
  • PR-5406 - Update fls and enable test
  • PR-5415 - Run the specification rules of EPOCH and NEWEPOCH in conformance tests
  • PR-5423 - Clean cardano-ledger-conformance
  • PR-5428 - Update formal-ledger-specifications

Infrastructure and releasing

  • PR-5410 - Use ghc 9.10 for Haddocks in GH Pages CI
  • PR-5422 - Add support for nothunks == 0.3.*
  • PR-5348 - Update SECURITY.md
  • PR-5425 - Upgrade the version of Ruby used in GH CI to match the nix flake

Jean-Philippe Raynaud

High level overview

This week, the Mithril team completed the first phase of decentralizing configuration parameters and made good progress on implementing a simple aggregator discovery mechanism. Additionally, they kept working on the SNARK-friendly STM library by designing its architecture, implementing the Schnorr signature scheme, refactoring error handling, and experimenting with a Jubjub curve implementation in the BLST library.

Finally, they fixed some bugs, made enhancements to the CI, and continued improving the protocol security page.

Low level overview

Features

  • Completed the issue Decentralization of configuration parameters - Phase 1 #2692
  • Worked on the issue Implement a simple aggregator discovery mechanism #2726
  • Worked on the issue Architecture of the SNARK-friendly STM library #2763
  • Worked on the issue Implement Schnorr signature scheme in STM #2756
  • Worked on the issue Refactor error handling in STM library #2764
  • Worked on the issue Experimental blst-Jubjub #2772
  • Worked on the issue Test Haskell DMQ node message authentication (KES signature) #2786

Protocol maintenance

  • Completed the issue Refactor AggregatorClient trait in signer #2759
  • Completed the issue Add fast bootstrap of Cardano node with LMDB UTxO-HD verification in CI #2679
  • Completed the issue Some tests fail in CI due to no space left on GitHub runners #2782
  • Worked on the issue Support optional cardano_transactions_signing_config #2780
  • Worked on the issue Enhance protocol security page on website #2703