Performance & Tracing Update
High level summary
- Benchmarking: Release benchmarks for 11.0.1; Feature benchmarks for: TxSubmissionLogicV2; Compiler version.
- Development: Removal of legacy tracing completed - not yet merged.
- Infrastructure: Genesis caching and post-processing completed - not yet merged.
- Tracing: cardano-tracer HTTP API for metrics timeseries queries and Grafana datasource - not yet merged.
- Leios: Leios/Mempool benchmarks using tx-centrifuge.
- Node Diversity: Formal trace schema definition merged; Conformance framework to be presented at Porto workshop.
Low level overview
Benchmarking
We've performed, analysed and published release benchmarks for Node version 11.0.1 - the release shows no performance regressions compared to 10.7.1. These benchmarks ran under Protocol Version 11, and were required to ensure there's no performance risk in using this version.
Furthermore, we've run feature benchmarks for a new incarnation of v2 of the tx submission logic. The new logic is an optimization and aims, among other things, to reduce redundancy in tx diffusion. While the feature is experimental, the benchmarks provided valuable measurements and data for the network team to move it forward.
Additionally, we've re-run benchmarks using the GHC 9.12 compiler version on the new 11.0.1 baseline; since 10.6.2, there have been many changes in Ledger which impact generated code and compiler optimizations. While there's no fundamental performance blocker to using this more recent compiler on our code base, there are still a few unknowns. The data is currently still under review and discussion.
Development
With the upcoming 11.1 release, the legacy tracing system 'iohk-monitoring-framework' will finally be removed from the Node. The change is extensive, as it involves large differences in project dependencies, in code, in configuration and in test suites. The old and new tracing systems have been part of the Node build side-by-side for roughly two years now, with the new tracing system gaining wider adoption over the last half year. Removing the need to stay backwards compatible with the legacy system within the same build unblocks several planned features for the new system, and finally allows moving it out into its own self-contained Hermod Tracing System project repository.
While the implementation is complete, cardano-node PR#6580 is currently still in draft state, awaiting full verification and testing.
Infrastructure
The modularization of our automation's genesis cache is completed. In addition to quickly stitching together a custom genesis with a huge amount of injected staking data, it allows for all protocol-relevant fields of genesis to be freshly generated by cardano-cli - and not taken from the cache. This means the post-processing has now been reduced to a minimum. That improves confidence in the benchmarking profiles, insofar as it eliminates the need to test whether workbench changes are still being correctly patched onto potentially very long-lived cache entries on a variety of hosts.
Moreover, this change includes a proper profile overlay for Protocol Version 11, which includes changes to Plutus cost models and execution budgets that have already been submitted as a gov action on Mainnet. The (quite extensive) PR is currently in draft state and under testing: cardano-node PR#6544.
Tracing
The new version of cardano-tracer will come with an (opt-in via config) HTTP REST API to query metrics timeseries directly. As cardano-tracer can now store metrics of all connected Nodes, it's able to evaluate PromQL-like queries directly. This can be used as an alternative to having Prometheus scrape all of those processes. With the new release, we've aligned the (previously experimental) API as closely as possible with what users are accustomed to from Prometheus, and it has reached reasonable stability.
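To give a rough idea of what "aligned with Prometheus" can mean for a client, here is a minimal sketch that builds a range query URL in the shape of Prometheus' own HTTP API (`query`, `start`, `end`, `step` parameters). The host, port, path and metric name below are placeholders chosen for illustration; they are assumptions and not taken from cardano-tracer, whose actual endpoints may differ.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# ASSUMPTION: host, port and path are placeholders mirroring the shape of
# Prometheus' range-query endpoint; they are not taken from cardano-tracer.
BASE_URL = "http://localhost:8080/api/v1/query_range"


def build_range_query(base_url, expr, start, end, step):
    """Build a Prometheus-style range query URL.

    expr  -- the query expression (here: a hypothetical metric name)
    start -- start of the range as a Unix timestamp
    end   -- end of the range as a Unix timestamp
    step  -- resolution step, e.g. "15s"
    """
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{base_url}?{urlencode(params)}"


# "example_metric" is a made-up name, used only for illustration.
url = build_range_query(BASE_URL, "example_metric", 1700000000, 1700003600, "15s")

# Round-trip the query string to show which parameters a server would receive.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["query"], decoded["step"])
```

A client written against Prometheus conventions would send such a URL as a GET request; whether a given query expression is supported depends on the PromQL-like subset the API implements.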
Moreover, we built - from scratch - a Grafana datasource using that API. This datasource contains a dashboard to replace the deprecated 'RTView' component of cardano-tracer, and is intended to serve as a reference for the community to define their own dashboards and queries according to their monitoring needs.
This PR, too, is fairly extensive, and also contains several improvements and fixes of the underlying cardano-timeseries-io package: cardano-node PR#6562, currently in testing phase.
Leios
We've created, and performed, full cluster benchmarks for Leios - using our new high-pressure submission tool tx-centrifuge. The point of interest of these benchmarks was observing Mempool behaviour, under various levels of fragmentation, and various configurations as to its capacity. These benchmarks are meant to close a gap to the Leios simulations: they provide evidence by measuring concrete timings of a concrete Mempool implementation. The benchmarks have shown that a standard Mempool tuned to Praos will likely throttle maximum throughput for Leios. With this benchmark at hand, and Mempool identified as a potential bottleneck, the necessary adjustments or optimizations can always be confirmed and backed up by evidence.
Node Diversity
The comprehensive formal schema definition of the Node's existing trace messages has been merged (cardano-node PR#6527). This encodes the syntax and semantics of all the observable events that the Haskell Node implementation provides. Thus, it can serve as a reference to what diverse clients may implement - to gain comparability in protocol conformance, network performance, and the reuse of existing tooling relying on those observables.
That being said, the cardano-recon-framework is one such example. We've continuously improved our Linear Temporal Logic based trace verifier for system behaviour; we've set several interesting properties that can be checked continuously from Node logs. One of our team members will attend the Node Diversity workshop in Porto at the beginning of June, and contribute a presentation and a demo of this framework.
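As an illustration of the kind of property a Linear Temporal Logic based verifier can check, the sketch below evaluates the pattern G(p -> F q) ("whenever p holds, q holds at that point or later") over a finite list of events. It is not the cardano-recon-framework's implementation or API; the function name, the event shape and the event kinds are all hypothetical and chosen only for this example.

```python
def unanswered_triggers(trace, trigger, response):
    """Check G(trigger -> F response) on a finite trace.

    Returns the indices of all events that satisfy `trigger` but are not
    followed (at the same or a later position) by any event satisfying
    `response`. An empty result means the property holds on this trace.
    """
    violations = []
    response_seen = False
    # Scan backwards so we always know whether a response occurs later on.
    for i in range(len(trace) - 1, -1, -1):
        if response(trace[i]):
            response_seen = True
        if trigger(trace[i]) and not response_seen:
            violations.append(i)
    violations.reverse()
    return violations


# Hypothetical event kinds; real trace messages are defined by the Node's schema.
def is_request(event):
    return event["kind"] == "ExampleRequested"


def is_completed(event):
    return event["kind"] == "ExampleCompleted"


trace = [
    {"kind": "ExampleRequested"},
    {"kind": "ExampleCompleted"},
    {"kind": "ExampleRequested"},  # never completed within this trace
]

print(unanswered_triggers(trace, is_request, is_completed))  # → [2]
```

On a finite log, an unanswered trigger near the end may simply mean the response has not arrived yet, so a continuous checker would typically treat such cases as pending rather than failed.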
