
Performance & Tracing Update

Michael Karg
Performance and Tracing Team Lead

High level summary

  • Benchmarking: Release benchmarks for 11.0.1; Feature benchmarks for: TxSubmissionLogicV2; Compiler version.
  • Development: Removal of legacy tracing completed - not yet merged.
  • Infrastructure: Genesis caching and post-processing completed - not yet merged.
  • Tracing: cardano-tracer HTTP API for metrics timeseries queries and Grafana datasource - not yet merged.
  • Leios: Leios/Mempool benchmarks using tx-centrifuge.
  • Node Diversity: Formal trace schema definition merged; Conformance framework to be presented at Porto workshop.

Low level overview

Benchmarking

We've performed, analysed and published release benchmarks for Node version 11.0.1. The release shows no performance regressions compared to 10.7.1. These benchmarks ran under Protocol Version 11 and were required to ensure there's no performance risk in using this version.

Furthermore, we've run feature benchmarks for a new incarnation of v2 of the tx submission logic. The new logic is an optimization and aims, among other things, to reduce redundancy in tx diffusion. While the feature is experimental, the benchmarks provided valuable measurements and data for the network team to move it forward.

Additionally, we've re-run benchmarks using the GHC9.12 compiler version on the new 11.0.1 baseline; since 10.6.2, there have been many changes in Ledger which impact generated code and compiler optimizations. While there's no fundamental performance blocker to using this more recent compiler on our code base, there are still a few unknowns. The data is currently still under review and discussion.

Development

With the upcoming 11.1 release, the legacy tracing system 'iohk-monitoring-framework' will finally be removed from the Node. The change is extensive, as it involves large differences in project dependencies, in code, in configuration and in test suites. The old and new tracing systems have been part of the Node build side-by-side for roughly two years now, with the new tracing system gaining wider adoption over the last half year. Removing the need to stay backwards compatible with the legacy system within the same build unblocks several planned features for the new system, and finally allows moving it out into its own self-contained Hermod Tracing System project repository.

While the implementation is complete, cardano-node PR#6580 is currently still in draft state, awaiting full verification and testing.

Infrastructure

The modularization of our automation's genesis cache is completed. In addition to quickly stitching together a custom genesis with a huge amount of injected staking data, it allows for all protocol-relevant fields of genesis to be freshly generated by cardano-cli - and not taken from the cache. This means the post-processing has now been reduced to a minimum, which improves confidence in the benchmarking profiles: there's no longer a need to test whether workbench changes are still correctly patched onto potentially very long-lived cache entries on a variety of hosts.

Moreover, this change includes a proper profile overlay for Protocol Version 11, which includes changes to Plutus cost models and execution budgets that have already been submitted as a gov action on Mainnet. The (quite extensive) PR is currently in draft state and under testing: cardano-node PR#6544.

Tracing

The new version of cardano-tracer will come with an (opt-in via config) HTTP REST API to query metrics timeseries directly. As cardano-tracer can now store metrics of all connected Nodes, it's able to evaluate PromQL-like queries directly. This can be used as an alternative to having Prometheus scrape all of those processes. With the new release, we've aligned the (previously experimental) API as closely as possible with what users are accustomed to from Prometheus, so that it has reached reasonable stability.

Moreover, we built - from scratch - a Grafana datasource using that API. This datasource contains a dashboard to replace the deprecated 'RTView' component of cardano-tracer, and is intended to serve as a reference for the community to define their own dashboards and queries according to their monitoring needs.

This PR, too, is fairly extensive, and also contains several improvements and fixes to the underlying cardano-timeseries-io package: cardano-node PR#6562, currently in testing phase.
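To illustrate what a Prometheus-aligned timeseries query could look like from a client's perspective, here is a minimal sketch that only builds a request URL. Everything specific in it is an assumption: the endpoint path is borrowed from Prometheus's own HTTP API and may not match what cardano-tracer actually exposes, and the host, port and metric name are hypothetical placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# ASSUMPTION: this path is the one Prometheus itself uses for range queries.
# Whether cardano-tracer exposes the same path is not confirmed by this post.
QUERY_RANGE_PATH = "/api/v1/query_range"


def build_range_query_url(base_url: str, query: str,
                          start: int, end: int, step: str) -> str:
    """Build a Prometheus-style range query URL (no request is sent)."""
    params = urlencode({
        "query": query,   # PromQL-like expression
        "start": start,   # start of the range, Unix seconds
        "end": end,       # end of the range, Unix seconds
        "step": step,     # resolution, e.g. "60s"
    })
    return f"{base_url.rstrip('/')}{QUERY_RANGE_PATH}?{params}"


# Hypothetical host, port and metric name, for illustration only.
url = build_range_query_url(
    "http://localhost:3300/",
    "rate(example_metric[5m])",
    1700000000,
    1700003600,
    "60s",
)

# Decode the query string again to show what the server would receive.
parsed = parse_qs(urlsplit(url).query)
print(url)
print(parsed["query"][0])
```

The actual paths, parameters and supported query subset should be taken from the cardano-tracer documentation once the PR is merged.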

Leios

We've created and performed full cluster benchmarks for Leios, using our new high-pressure submission tool tx-centrifuge. The point of interest of these benchmarks was observing Mempool behaviour under various levels of fragmentation and various configurations of its capacity. These benchmarks are meant to close a gap to the Leios simulations by providing evidence: they measure concrete timings of a concrete Mempool implementation. The benchmarks have shown that a standard Mempool tuned to Praos will likely throttle maximum throughput for Leios. With this benchmark at hand, and Mempool identified as a potential bottleneck, the necessary adjustments or optimizations can always be confirmed and backed up by evidence.

Node Diversity

The comprehensive formal schema definition of the Node's existing trace messages has been merged (cardano-node PR#6527). This encodes the syntax and semantics of all the observable events that the Haskell Node implementation provides. Thus, it can serve as a reference for what diverse clients may implement - to gain comparability in protocol conformance, network performance, and the reuse of existing tooling relying on those observables.

The cardano-recon-framework is one example of such tooling. We've continuously improved our Linear Temporal Logic based trace verifier for system behaviour; we've defined several interesting properties that can be checked continuously from Node logs. A member of our team will attend the Node Diversity workshop in Porto at the beginning of June, and contribute a presentation and a demo of this framework.
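For readers unfamiliar with Linear Temporal Logic, the following is a minimal sketch of what checking a temporal property against a sequence of trace events can look like. It is not the cardano-recon-framework's API or implementation: it evaluates formulas over a finite, already-parsed trace, whereas the real verifier works continuously on Node logs. The event namespaces `Forge.Started` and `Forge.Finished` are hypothetical and chosen for illustration only.

```python
# A formula is a function (trace, position) -> bool.
# A trace is a non-empty list of events; each event is a dict.

def atom(pred):
    """Holds at position i if the predicate holds for the event at i."""
    return lambda trace, i: pred(trace[i])

def implies(f, g):
    """Holds at i if f does not hold at i, or g holds at i."""
    return lambda trace, i: (not f(trace, i)) or g(trace, i)

def eventually(f):
    """Holds at i if f holds at some position j >= i (finite-trace reading)."""
    return lambda trace, i: any(f(trace, j) for j in range(i, len(trace)))

def always(f):
    """Holds at i if f holds at every position j >= i."""
    return lambda trace, i: all(f(trace, j) for j in range(i, len(trace)))

def holds(formula, trace):
    """Evaluate a formula from the start of the trace."""
    return formula(trace, 0)

# Hypothetical event namespaces, not taken from the Node's trace schema.
started = atom(lambda ev: ev["ns"] == "Forge.Started")
finished = atom(lambda ev: ev["ns"] == "Forge.Finished")

# "Every started forge is eventually finished."
prop = always(implies(started, eventually(finished)))

trace_ok = [{"ns": "Forge.Started"}, {"ns": "Other"}, {"ns": "Forge.Finished"}]
trace_bad = [{"ns": "Forge.Started"}, {"ns": "Other"}]

print(holds(prop, trace_ok))
print(holds(prop, trace_bad))
```

Representing formulas as plain functions keeps the sketch short; a verifier that runs continuously would instead need to evaluate formulas incrementally as new events arrive.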