51 posts tagged with "performance-tracing"

Performance & Tracing Update

May 26, 2025 · 4 min read

Michael Karg

Performance and Tracing Team Lead

High level summary

Benchmarking: Feature benchmarks for ledger metrics tracer and InboundGovernor optimizations.
Development: Ledger metrics merged; 2 hotfixes for old tracing.
Infrastructure: Migration plan for on-disk benchmarks (LMDB, LSM-tree); initial Leios impact analysis.
New Tracing: Tracer service now independent of Node; new feature enabling forwarding over TCP.

Low level overview

Benchmarking

We've completed two distinct feature benchmarks: The new periodic ledger metrics tracer and InboundGovernor optimizations on the network layer. Both features have shown a positive performance impact; the former improves CPU usage and block production metrics, the latter slightly improves diffusion metrics.

Development

Having finalized and benchmarked the periodic ledger metrics tracer feature, it was merged to master and will be part of the upcoming 10.5 release. The feature decorrelates obtaining several metrics from the beginning of the forging loop. This avoids competition for synchronization primitives during the "hot phase" of block production. Furthermore, by decoupling those metrics from a forging tracer, we enable exposing those metrics from a relay as well. cardano-node PR#6180
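The decoupling described above can be illustrated with a minimal Python sketch (the Node itself is written in Haskell, so this is not its implementation). It assumes two hypothetical callbacks, `read_ledger_metrics` and `emit`, that stand in for whatever the real tracer reads and writes. The point is only that sampling runs on its own timer thread rather than at the start of each forging iteration.

```python
import threading


def start_periodic_tracer(read_ledger_metrics, emit, interval_s):
    """Sample metrics every `interval_s` seconds on a background thread.

    `read_ledger_metrics` and `emit` are hypothetical callbacks used for
    illustration only; they do not mirror cardano-node's real interfaces.
    Returns a `stop` function that ends sampling and joins the thread.
    """
    stop_event = threading.Event()

    def run():
        # Event.wait returns False on timeout and True once stop is
        # requested, so the loop samples once per interval until stopped.
        while not stop_event.wait(interval_s):
            emit(read_ledger_metrics())

    thread = threading.Thread(target=run, daemon=True)
    thread.start()

    def stop():
        stop_event.set()
        thread.join()

    return stop
```

Because the sampler owns its schedule, a forging loop running alongside it never has to touch the metrics path during its time-critical phase.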

Additionally, we've been instrumental in creating two hotfixes for the old tracing system:

The old tracing system metric utxoSize was missing due to using the pre-UTxO-HD variant of querying the set size. The fix ports the correct solution from the new tracing system to the old one: cardano-node PR#6217
On the upcoming Node 10.5 integration branch only, the old tracing system could leak file descriptors. Again, the fix was ported from the new tracing system to the old one - kudos to Karl Knutsson: iohk-monitoring PR#654

Infrastructure

We've discussed and set up a migration plan for our benchmarking cluster hardware. For fair and representative performance measurements of on-disk backing stores of UTxO-HD, we require direct SSD storage on the machine instance in the cloud; running disk I/O through additional layers to and from some shared SSD device, even in the same data center, would introduce significant confounding factors. The plan includes invalidating as few of our existing performance baselines as possible when migrating to the new hardware. We're looking forward to benchmarking the current on-disk backend (LMDB) for block producers - as well as the future LSM-tree based one.

We've also discussed an initial Leios impact analysis. To fairly and reliably benchmark a future Leios implementation, our infrastructure and tooling will need to be extended significantly. Several metrics won't have the same weight they currently carry for Praos, due to Leios' later finality; other metrics will need to be introduced for different new Leios block types, adding appropriate observability to the implementation. Finally, creating and submitting a saturation workload for a system which is built for extremely high throughput will be a challenge in itself.

New Tracing

We've been working on a medium-sized refactoring that eliminates the cardano-node dependency from cardano-tracer. This means the tracer service can now be built independently of the Node; all shared data types have been moved to more basic packages of the new tracing system. This also enables us to issue releases of the tracer service independently of the Node's release cycle. cardano-node PR#6125

Last but not least, we've kicked off development of a new feature that's been motivated by community feedback: forwarding observables (trace messages, metrics) over TCP. Forwarding to different hosts currently assumes a UNIX domain socket that connects the Node and the tracer service through an SSH tunnel. This is a portable, versatile, and probably one of the most secure ways to transmit sensitive data. However, in an environment where an operator controls all network port mapping and firewalls, one can argue that forwarding over TCP/IP is equally viable, as it can be properly isolated - and it is much more convenient to set up and configure. Once completed, the feature aims to offer both forwarding routes and let the end user decide.
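As a rough illustration of what offering both routes means at the socket level, here is a small Python sketch. The `route` tuple is a made-up configuration shape for this example, and nothing here reflects the actual forwarding protocol or configuration format of cardano-tracer; it only shows that both transports yield the same kind of stream socket to the caller.

```python
import socket


def connect_forwarder(route):
    """Open a stream socket for one of two forwarding routes.

    `route` is a hypothetical config value for this sketch:
      ("unix", "/path/to/forwarder.sock")  -- UNIX domain socket
      ("tcp", "host", port)                -- TCP/IP
    """
    kind = route[0]
    if kind == "unix":
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(route[1])
    elif kind == "tcp":
        # create_connection resolves the host and handles IPv4/IPv6.
        sock = socket.create_connection((route[1], route[2]))
    else:
        raise ValueError(f"unknown forwarding route: {kind!r}")
    return sock
```

Everything above the transport can stay the same for both routes; only the isolation guarantees (SSH tunnel versus firewall rules) differ.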

Performance & Tracing Update

May 7, 2025 · 3 min read

Michael Karg

Performance and Tracing Team Lead

High level summary

Benchmarking: 10.4.1 release benchmarks; UTxO-HD, GC settings and socket I/O feature benchmarks.
Development: Abstracting over quick queries and trace queries; enabling query processing on remote hosts.
Infrastructure: Workbench simplification merged; GHC8.10 tech debt removed.
New Tracing: Provided hotfix for several metrics.

Low level overview

Benchmarking

We've completed release benchmarks for Node 10.4.1. It is the first mainline release of a UTxO-HD node featuring LedgerDB. Leading up to the release, we performed and analysed UTxO-HD benchmarks. We were able to document a regression in RAM usage, and assisted in pinpointing its origin, leading to it being fixed swiftly for the 10.4 release.

Additionally, we ran feature benchmarks for a potential socket I/O optimization in the network layer, and GC setting changes catering to the now-default GHC9.6 compiler. Both benchmarks have shown moderate improvements in various performance metrics. This might enable the network team to pick up the optimization for 10.5. Also, we might be able to update the recommended GC settings for block producers, and add them to our own nix service configs for deployment.

The 10.4.1 performance report has been published on Cardano Updates.

Development

We've further evolved the (still experimental) quick query feature of our analysis tool locli. Parametrizable quick queries allow for arbitrary queries into raw benchmarking data, uncovering correlations that are not part of standard analysis. They are implemented using composable definitions inside a filter-reduce framework. With locli's DB storage backend, we can leverage the DB engine to do much of the work. Now, we're integrating a precursor to quick queries - so-called trace queries - into the framework. Those can process raw trace data from archived log files. Currently, we're adding an abstraction layer such that it is opaque to the framework whether the data was retrieved (and possibly pre-processed) from a DB or from raw traces.

Furthermore, we added a custom (CBOR-based) serialization for intermediate results so a query can be evaluated on a remote machine - like the system archiving all benchmarking runs - but triggered, and its results visualized, on your localhost.
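To give a feel for what composable definitions in a filter-reduce framework can look like, here is a minimal Python sketch. It is not locli's implementation (locli is written in Haskell); the helper names and the record fields are invented for illustration. Filters are built from predicates, chained with `compose`, and the surviving records are folded by a reducer.

```python
from functools import reduce


def where(predicate):
    """Build a filter stage from a predicate over one record."""
    return lambda records: (r for r in records if predicate(r))


def compose(*stages):
    """Chain filter stages left to right into a single stage."""
    return lambda records: reduce(lambda acc, stage: stage(acc), stages, records)


def run_query(records, filter_stage, reducer, initial):
    """Apply the filter stage, then fold the survivors with the reducer."""
    return reduce(reducer, filter_stage(records), initial)


# Hypothetical raw benchmarking records; field names are made up.
records = [
    {"host": "node-0", "kind": "forge", "ms": 12},
    {"host": "node-1", "kind": "adopt", "ms": 30},
    {"host": "node-0", "kind": "forge", "ms": 18},
]

forged_on_node0 = compose(
    where(lambda r: r["kind"] == "forge"),
    where(lambda r: r["host"] == "node-0"),
)
total_ms = run_query(records, forged_on_node0, lambda acc, r: acc + r["ms"], 0)
```

Since a query is just a filter stage plus a reducer, the same definition can in principle be evaluated against records from a DB or from raw traces, as long as both sources yield records of the same shape.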

Infrastructure

The workbench nix code optimization has finally been merged. Redundant derivations and recursions have been replaced; many nix store entries have been consolidated. Among other things, the new code also aims to maximize nix cache hits. Furthermore, as GHC8.10 has now been officially retired from all build pipelines, we were able to clean up all tech debt in our automations that we had to keep around due to supporting the old compiler version.

Exactly as we had hoped, this has brought down CI time for the Node substantially: first from over an hour to around 15 min, then to under 10 min. Also, all workbench shell invocations are significantly faster, and clutter in the nix store is greatly reduced.

New Tracing

We've been hurrying to provide hotfixes for connectionManager_* and slotsMissed metrics that were faulty on Node 10.3. They have been successfully integrated into the Node 10.4 release.

Performance & Tracing Update

April 15, 2025 · 4 min read

Michael Karg

Performance and Tracing Team Lead

High level summary

Benchmarking: 10.3.1 release benchmarks.
Development: Plutus script calibration tool and profile maintenance updates about to be merged.
Infrastructure: Workbench simplification about to be merged.
New Tracing: System dependencies untangled; preparing 'Periodic tracer' feature for production.
Node Diversity: Participation in Conformance Testing workshop in Paris.

Low level overview

Benchmarking

We're currently running release benchmarks for the upcoming Node 10.3.1 version - a candidate for Mainnet release. Having taken previous measurements on the release integration branch, we expect the results to be closely aligned with these.

Node 10.3.1 will support two compiler versions, namely GHC8.10.7 and GHC9.6.5. As a consequence, we benchmark both Node builds and compare against the previous performance baseline 10.2. So far, the release benchmarks confirm performance improvements in both resource usage and block production metrics seen on the integration branch - for both compiler versions. A final report will be published on Cardano Updates.

Development

The first version of our new tool calibrate-script is about to be merged. It is part of the tx-generator project, and calibrates Plutus benchmarking scripts according to a range of constraints on the expected workload. The tool documents the result and all intermediate steps in a developer-friendly way. A CSV report is generated which shows all properties of a calibration at a glance: how much execution budget was given and how much of it was used, whether memory or CPU steps were the limiting factor for the script, how large the resulting transaction will be, what it will cost, and more. Apart from greatly speeding up development of Plutus benchmarks for our team, this tool can also be used to assess changes to Plutus cost models, or the efficiency of different Plutus compiler outputs - without running a full benchmark.
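The kind of per-calibration summary described above could be sketched as follows in Python. This is not the output format of calibrate-script; the column names and the rule for picking the limiting factor (the resource using the larger fraction of its budget) are assumptions made for this example.

```python
import csv
import io


def calibration_row(name, steps_used, steps_budget, mem_used, mem_budget):
    """Summarize one calibration; all field names are hypothetical."""
    steps_frac = steps_used / steps_budget
    mem_frac = mem_used / mem_budget
    return {
        "script": name,
        "steps_used_pct": round(100 * steps_frac, 1),
        "mem_used_pct": round(100 * mem_frac, 1),
        # The resource closest to exhausting its budget limits the script.
        "limiting_factor": "steps" if steps_frac >= mem_frac else "memory",
    }


def to_csv(rows):
    """Render rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Calling `to_csv` on a list of such rows yields one line per calibrated script, which makes it easy to compare runs under different cost models side by side.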

Furthermore, the benchmarking profiles defined in cardano-profile have undergone a large maintenance cycle. Besides a cleanup, several profiles were fixed wrt. transaction fees or duration, others now run on a more appropriate performance baseline. The era-dependency of a profile requiring a minimum protocol version has been solved such that it's now impossible to construct incompatible profiles by definition - e.g. a PlutusV3 benchmark in any era prior to Conway. The corresponding PR is about to be merged.
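The idea of ruling out incompatible profiles at construction time can be approximated in Python with a validating constructor. This is only an analogy: cardano-profile is Haskell code and may enforce the constraint differently (for instance through its types), and the `PlutusProfile` class below is hypothetical. The era ordering and the era in which each Plutus version became available are the only facts taken from Cardano itself.

```python
from dataclasses import dataclass
from enum import IntEnum


class Era(IntEnum):
    ALONZO = 1
    BABBAGE = 2
    CONWAY = 3


# Era in which each Plutus language version became available.
MIN_ERA = {"PlutusV1": Era.ALONZO, "PlutusV2": Era.BABBAGE, "PlutusV3": Era.CONWAY}


@dataclass(frozen=True)
class PlutusProfile:
    """Hypothetical profile type; construction fails if the era is too early."""
    name: str
    era: Era
    plutus_version: str

    def __post_init__(self):
        required = MIN_ERA[self.plutus_version]
        if self.era < required:
            raise ValueError(
                f"{self.plutus_version} requires {required.name} or later"
            )
```

With this shape, `PlutusProfile("example", Era.BABBAGE, "PlutusV3")` raises a `ValueError` instead of producing a profile that would only fail later during a benchmark run.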

Infrastructure

A large PR simplifying the build of our performance workbench has been finalized and passed testing. The nix code has been greatly optimized to avoid redundant derivations and creating an abundance of nix store paths. This not only makes the workbench more maintainable, it also greatly reduces time and size requirements for CI jobs. In testing, we could observe a speedup of 40% - 50% for CI. Additionally, this PR prepares for the future removal of GHC8.10 as a release compiler - which will reduce CI cost even more. The PR is currently under review and will be merged soon.

New Tracing

The work on untangling dependencies in the new tracing system has entered the testing phase. The cardano-tracer service no longer depends on the Node - common data types and typeclass instances have been refactored into a more basic package of the tracing system. Once merged, this will allow for the service to be built, released and operated independently of cardano-node, widening its range of use cases.

On Node 10.1, we've built a prototype of the 'Periodic tracer' feature. It decorrelates tracing ledger metrics from the start of a block producer's forging loop, thus removing competition on certain synchronization primitives. We've already shown in past benchmarks it had a positive impact on block production performance. This prototype is now being developed for production release, complete with configuration options, and we aim to land it in Node 10.4.

Node Diversity

We've contributed to the recent Conformance Testing workshop in Paris. The topic was how to approach detection and documentation of system behaviour across diverse Cardano Node implementations: where does the behaviour conform to some blueprint, and where does it deviate, intentionally or accidentally? Our tracing system is the prime provider of observability - and all evidence of program execution could in theory be checked against a machine-readable model of the blueprint. This of course assumes observables are implemented uniformly across diverse Node projects, i.e. without changing semantics. Thankfully, our tracing system lead engineer Jürgen Nicklisch was able to join that workshop and add to the discussions around that approach.

Performance & Tracing Update

March 28, 2025 · 5 min read

Michael Karg

Performance and Tracing Team Lead

High level summary

Benchmarking: Preliminary 10.3 benchmarks; GHC8 / GHC9 compiler version comparison; Plutus budget scaling; runtime parameter tuning on GHC9.
Development: Started new Plutus script calibration tool; maintenance updates to benchmarking profiles.
Infrastructure: Adjusted tooling to latest Cardano API version; simplification of performance workbench nearing completion.
New Tracing: Battle-tested metrics monitoring on mainnet; generalized nix service config for cardano-tracer.

Low level overview

Benchmarking

We've run and analyzed several benchmarks these last two weeks:

Preliminary `10.3` integration

As performance improvement is a stated goal for the 10.3 release, we became involved early in the release cycle. Benchmarking the earliest version of the 10.3 integration branch, we could already determine that the effort put in has yielded promising results and confirm improvements in both resource usage and block production metrics. A regular release benchmark will be performed, and published, from the final release tag.

Compiler versions: `GHC9.6.5` vs. `GHC8.10.7`

So far, code generation with GHC9.6 has resulted in a performance regression for block production under heavy load - we've established that in various past benchmarks. The optimization efforts on 10.3 also focused on removing that performance blocker. Benchmarking the integration branch with the newer compiler version has now confirmed that the regression has vanished; moreover, code generated with GHC9.6 exhibited slightly more favourable performance characteristics. So in all likelihood, Node 10.3 will be the last release to support GHC8.10, and we will recommend GHC9.6 as the default build platform for it.

Plutus budget scaling

We've repeated several Plutus budget scaling benchmarks on Node version 10.3 / GHC9.6. By scaling execution budgets to 1.5x and 2x their current mainnet values, we can determine the performance impact on the network of potential increases of said budgets. We independently measured bumping the steps (CPU) limit with a CPU-intensive script, and bumping the memory limit with a script performing lots of allocations. We could observe the performance impact to correspond linearly with the limit bump in each case. This gives certainty and predictability of the impact when suggesting changes to mainnet protocol parameters.

Our team presented those findings and the data to the Parameter Committee for discussion.

Runtime system (RTS) settings

The recommended RTS settings for cardano-node encompass number of CPU cores to use, behaviour of the allocator, and behaviour of the garbage collector. The recommended settings so far are tuned to GHC8.10's RTS - one cannot assume the same settings are optimal for GHC9.6's RTS, too. So we've started a series of short, exploratory benchmarks comparing a matrix of promising changes to update our recommendation in the future.

Development

We've started to develop a new tool that calibrates our Plutus benchmarking scripts given a range of constraints on the expected workload. These entail exhausting a certain budget (block or transaction), or calibrating for a constant number of transactions per block while exhausting available steps, or memory, budget(s). The result of that directly serves as input to our benchmarking profile definition. This tool may also be of wider interest, as it allows for modifying various inputs, such as Plutus cost models, or script serializations generated by different compilers or compiler versions. That way, one can compare at a glance how effectively a given script makes use of the available budgets, given a specific cost model.
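The core of such a calibration, finding the largest workload that still fits the budget, can be sketched in a few lines of Python. The sketch assumes that a script's cost grows linearly with a loop counter, which is a simplification made for illustration and not a statement about how the actual tool or the Plutus cost model works; the function and argument names are hypothetical.

```python
def calibrate_iterations(base_cost, cost_per_iteration, budget):
    """Largest loop count whose total cost fits every budgeted resource.

    Assumes (for this sketch only) that cost grows linearly with the loop
    count. All three arguments are dicts keyed by resource name.
    """
    fits = []
    for resource, limit in budget.items():
        headroom = limit - base_cost[resource]
        if headroom < 0:
            raise ValueError(f"base cost exceeds {resource} budget")
        # Whole iterations that fit into the remaining headroom.
        fits.append(headroom // cost_per_iteration[resource])
    # The scarcest resource decides how far the script can be scaled.
    return min(fits)
```

The resource that yields the smallest count is the one the script exhausts first, which is exactly the information needed to target either the steps or the memory budget.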

Additionally, our benchmarking profiles are currently undergoing a maintenance cycle. This means that setups for which motivation has ceased to exist are removed, several are updated to use the Voltaire performance baseline, and others need to be tested for their conformity with the Plomin hard-fork protocol updates.

Infrastructure

The extensive work of simplifying the performance workbench is almost finished and about to enter the testing phase. We have been moving away from scripting to declarative (Haskell) definitions of all benchmark profiles and workloads in past PRs. The simplification work now reaps the benefits of that: We can optimize away many recursive / redundant invocations or nix evaluations, collate many nix store paths into just a few, and reduce the workbench's overall closure size and complexity. Apart from saving significant resources and time for CI runners, this will reduce the maintenance effort necessary on our end.

Furthermore, we've done maintenance on our tooling by adjusting to the latest changes in cardano-api. This included porting the ProtocolParameters type and type class instances over to us, as our use case requires we continue supporting it. However, it's considered deprecated in the API, and so this unblocks the team for eventually removing it.

New Tracing

Having addressed all feature and change requests relevant for the Node 10.3 release, we performed thorough mainnet testing of the new system's metrics in a monitoring context. We relied on the extremely diligent and helpful feedback from the SRE team. This enabled us to iron out a couple of remaining inconsistencies - a big shout-out and thank you to John Lotoski.

Additionally, again with SRE, a nix service configuration (SVC) has been created for cardano-tracer that was generalized and aligned in structure with the existing cardano-node SVC. It was evolved from the existing SVC in our performance workbench, which however was tied tightly to our team's use case. With the general approach we hope other teams, and the community, can reliably and easily set up and deploy cardano-tracer.

Performance & Tracing Update

March 11, 2025 · 3 min read

Michael Karg

Performance and Tracing Team Lead

High level summary

Development: New benchmark epoch timeline using db-sync; raw benchmark data now with DB storage layer as default - enabling quick queries.
Infrastructure: Merged workbench 'playground' profiles - easing benchmark calibration.
New Tracing: Plenty of new features based on community feedback - including a new direct Prometheus backend; untangling system dependencies.
Community: Participation in the first episode of the Cardano Dev Pulse podcast.

Low level overview

Development

For keeping a history of comparable benchmarks, it's essential to have an accurate timeline of mainnet protocol parameter updates by epoch. They represent the environment in which specific measurements took place, and are thus tied inherently to the observation process. Additionally, to reproduce specific benchmarking metrics from the past, our performance workbench has the capability to "roll back" those updates, and perform a benchmark given the protocol parameters of any given epoch. Instead of maintaining this epoch timeline by hand, we've now created an automated way to extract all key epochs applying parameter updates using db-sync. This approach will prove both more robust and lower in maintenance overhead.
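Conceptually, reconstructing the parameters of a given epoch is a fold over all updates up to that epoch. The Python sketch below shows this under stated assumptions: the data layout is invented, and the parameter names and values are placeholders rather than real mainnet data or the workbench's actual representation.

```python
def params_at_epoch(genesis_params, timeline, epoch):
    """Fold all parameter updates up to and including `epoch`.

    `timeline` maps the epoch in which an update took effect to the
    changed parameters. Names and values here are placeholders, not
    real mainnet data.
    """
    params = dict(genesis_params)
    for update_epoch in sorted(timeline):
        if update_epoch > epoch:
            break
        params.update(timeline[update_epoch])
    return params


genesis = {"max_block_size": 100, "max_tx_size": 10}
timeline = {
    5: {"max_block_size": 150},
    9: {"max_tx_size": 20, "max_block_size": 200},
}
```

For example, `params_at_epoch(genesis, timeline, 5)` yields the genesis values with only the epoch 5 update applied, which is the "rolled back" environment a benchmark for that epoch would run in.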

Furthermore, the new DB storage backend for raw benchmarking data in locli is now set to be the default for the performance workbench. Apart from cutting down analysis time for a benchmarking run and reducing the required on-disk size for archiving, this enables the new (still under development) quick queries into raw performance data.

Infrastructure

When creating the Plutus memory scaling benchmarks, we developed so-called 'playground' profiles for the workbench. These allow for easier dynamic change of individual profile parameters, building a resulting benchmark setup including Plutus script calibration, and observing the effect in a short local cluster run. Applying these changes to established profiles is strictly forbidden, as it would put comparability with past benchmarks at risk. So by introducing this separation, we keep that safety guarantee, while still lifting it somewhat for the development cycle only.

New Tracing

We've been extremely busy implementing new features and optimizations for the new tracing system, motivated by the feedback we received from the SPO community. This includes:

A brand new backend that allows for Prometheus exposition of metrics directly from the application - without running cardano-tracer and forwarding to it.
A configurable reconnection interval for the forwarder to cardano-tracer.
An always up-to-date symlink pointing to the most recent log file in a cardano-tracer log rotation.
Optimizations in metrics forwarding and trace message formatting, which should lower the base CPU usage, at least in low congestion scenarios.

All those will be part of the upcoming Node 10.3 release.
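Of the features listed above, the always up-to-date symlink is easy to illustrate. The following Python sketch shows one common way to repoint a symlink atomically on POSIX systems; it is not taken from cardano-tracer, and the function name and the temporary-file naming are assumptions for this example.

```python
import os


def point_symlink_at(link_path, target_path):
    """Atomically repoint `link_path` to `target_path` (POSIX).

    A temporary link is created first and then renamed over the old one,
    so readers never observe a missing symlink.
    """
    tmp_path = f"{link_path}.tmp"
    # Clear a stale temporary link left behind by an interrupted run.
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(target_path, tmp_path)
    # Renaming replaces the old link itself, not the file it points to.
    os.replace(tmp_path, link_path)
```

Calling this once after each log rotation keeps a stable path that tools such as log shippers can follow without knowing the rotation scheme.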

Currently, the cardano-tracer service still depends on the Node for a few data type definitions. We're working on a refactoring so we can untangle this dependency. This will allow for the service to be built independently of the Node - simplifying a setup where other processes and applications can forward observables to cardano-tracer and benefit from its features.

Community

We had the opportunity to talk about benchmarking and performance impact of UTxO-HD on the very first episode of the Cardano Dev Pulse Podcast (YouTube). Thank you Sam and Carlos for having us!
