Performance & Tracing Update
High level summary
- Benchmarking: Compiler benchmarks on 10.6.2; Trace evaluation feature benchmarks.
- Development: Started new project tx-centrifuge: A tx submission service generating extremely high, continuous workload.
- Infrastructure: Small maintenance items, such as fixing profiled nix builds for local benchmarking.
- Tracing: New tracing system now its own project: Hermod Tracing; New library cardano-timeseries-io, which accumulates metrics into queryable timeseries, released.
- Leios: cardano-recon-framework (formerly LTL Trace Verifier) integrated and in use.
- Node Diversity: Formal trace schema definition nearing merge; Trace forwarding in native Rust on hiatus.
Low level overview
Benchmarking
We've repeated the GHC 9.12 compiler benchmarks on Node 10.6.2, which we now know to be completely free of regressions and space leaks. This confirmed our earlier findings: code generated by GHC 9.12
is on par performance-wise as far as block production, diffusion and adoption metrics go, but it exhibits unexplained increases in CPU time used, allocations and minor GCs. A profiled build has identified several potential suspects
for these increases. However, many of those will be replaced or changed in the 10.7 release, so this benchmark will have to be re-run on Node 10.7.
The feature for the new tracing system which forces a lazy trace value in a controlled section of code is slated for inclusion in Node 10.7. To that end, we backported it to Node 10.6.2 and performed feature benchmarks
for it, to ensure it won't distort the upcoming 10.7 performance baseline. We found the performance impact of that feature to be negligible in all categories of observed metrics.
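To illustrate what forcing a lazy trace value in a controlled section means, here is a minimal sketch using only base. It is not the code from the Node or from trace-dispatcher: the type TraceMsg and the functions forceTrace and traceForced are hypothetical names chosen for this example. The idea shown is that an unevaluated thunk inside a trace message is evaluated at a known point, before the message is handed on, rather than at some later, unpredictable point in a backend.

```haskell
import Control.Exception (evaluate)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical trace message type, for illustration only.
data TraceMsg = TraceMsg
  { tmLabel   :: String
  , tmPayload :: Int -- lazy field: may still be an unevaluated thunk
  }

-- Evaluate the payload to weak head normal form at this exact point.
-- For an Int, weak head normal form is the fully computed number.
forceTrace :: TraceMsg -> IO TraceMsg
forceTrace msg = do
  _ <- evaluate (tmPayload msg)
  pure msg

-- Hypothetical tracer: force first, then record a rendered line in a sink.
traceForced :: IORef [String] -> TraceMsg -> IO ()
traceForced sink msg = do
  msg' <- forceTrace msg
  modifyIORef' sink ((tmLabel msg' ++ "=" ++ show (tmPayload msg')) :)

main :: IO ()
main = do
  sink <- newIORef []
  -- The sum is only a thunk here; it is computed inside forceTrace.
  traceForced sink (TraceMsg "sum" (sum [1 .. 1000]))
  readIORef sink >>= mapM_ putStrLn -- prints "sum=500500"
```

For structured payloads, weak head normal form only evaluates the outermost constructor; a deep evaluation (for example via the deepseq package) would be needed to force nested fields as well.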
Development
We've started a new project, tx-centrifuge, for transaction submission (i.e. workload generation) during benchmarks and other scenarios. It is meant to complement the existing
tx-generator, which is tailored closely to our Praos benchmarking use case and follows a rather monolithic design. tx-centrifuge takes a different approach.
It's built for seamless scaling, both horizontally and vertically. This means it will be able to saturate a network running Leios over extended periods of time, due to its massive tx output. Furthermore, it's able
to cut down the setup phase (where UTxOs are created for benchmarking) and immediately launch into the benchmark phase. This also enables it to function as a potentially long-running, configurable submission service
for scenarios other than benchmarking. The implementation is currently in prototype stage.
Infrastructure
As far as infrastructure is concerned, we've addressed various small maintenance tasks. These include fixing profiled nix builds for local benchmarks, migrating benchmarking profiles and configs to the upcoming
Node 10.7 release, and making the locli analysis tool more robust when dealing with incomplete or partial trace output.
Tracing
Our new tracing system has been set up as its own project, named the Hermod Tracing System. As of now, we've only migrated the core package trace-dispatcher. This marks the first step
toward moving all tracing and metrics related packages out of the cardano-node project, and bundling them with consistent branding, API and documentation. Eventually, the system will
be generalized so that it can be used by any Haskell application, not just cardano-node. Seeing that the dmq-node has already adopted it, we have reason to assume it might be considered
by the broader community as a go-to choice for adding principled observability to an application.
We've built and released a new Haskell library, cardano-timeseries-io (cardano-node PR#6495). The library builds and stores timeseries of metrics from multiple source applications, much like Prometheus. It can process queries over those timeseries in a query language quite similar to PromQL. Integration into cardano-tracer, the trace / metrics processing service, is ongoing work. It will allow for custom monitoring solutions and alerts directly from cardano-tracer, without the need to scrape metrics and maintain them externally. It is not meant to replace existing Prometheus endpoints, but rather to provide richer functionality out of the box if desired (cardano-node PR#6473).
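As a rough picture of what accumulating metrics into queryable timeseries involves, the following sketch keeps samples per (source, metric) key and answers a range query with an average, loosely comparable to an avg_over_time query in PromQL. It does not reflect the API, storage layout or query language of cardano-timeseries-io; every type and function name here (Sample, Store, insertSample, rangeQuery, avgOverTime) is hypothetical, and the metric name is made up.

```haskell
import Data.List (sortOn)
import Data.Maybe (fromMaybe)

type Timestamp = Int -- seconds; the unit is an assumption for this sketch

data Sample = Sample { sTime :: Timestamp, sValue :: Double }
  deriving (Eq, Show)

-- One series per (source application, metric name).
type Key = (String, String)

-- Association list of series; samples are kept sorted by time.
type Store = [(Key, [Sample])]

-- Add a sample to its series, creating the series if necessary.
insertSample :: Key -> Sample -> Store -> Store
insertSample k s store =
  case lookup k store of
    Nothing -> (k, [s]) : store
    Just ss -> (k, sortOn sTime (s : ss)) : filter ((/= k) . fst) store

-- All samples of a series within the closed interval [from, to].
rangeQuery :: Key -> Timestamp -> Timestamp -> Store -> [Sample]
rangeQuery k from to store =
  [ s | s <- fromMaybe [] (lookup k store), sTime s >= from, sTime s <= to ]

-- Average value over an interval; Nothing if there are no samples.
avgOverTime :: Key -> Timestamp -> Timestamp -> Store -> Maybe Double
avgOverTime k from to store =
  case map sValue (rangeQuery k from to store) of
    [] -> Nothing
    vs -> Just (sum vs / fromIntegral (length vs))

main :: IO ()
main = do
  let key   = ("node-1", "mempool_txs")
      store = foldr (insertSample key) []
                [Sample 0 10, Sample 10 20, Sample 20 60]
  print (avgOverTime key 0 10 store) -- prints "Just 15.0"
```

A real implementation would presumably use an indexed container and bounded retention instead of an association list; the list keeps the sketch dependent on base only.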
Leios
We've released cardano-recon-framework, formerly known as the Linear Temporal Logic (LTL) Trace Verifier (cardano-node PR#6454). It's already seen adoption, and is used productively to verify
system properties and conformance exclusively based on live trace output. We've been asked by Formal Methods Engineering to extend the LTL fragment the framework uses, such that a wider range of
properties can be expressed; work on that is already ongoing.
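For readers unfamiliar with checking temporal properties against trace output, here is a minimal sketch of evaluating LTL formulas over a finite, already recorded trace. It is not the cardano-recon-framework API and makes no claim about which LTL fragment the framework supports or how it handles live traces; the types LTL and Event and the functions holds, implies and suffixes are hypothetical and use a simple finite-trace semantics chosen for this example.

```haskell
-- Formulas over events of type a, with predicates as atoms.
data LTL a
  = Atom (a -> Bool)
  | Not (LTL a)
  | And (LTL a) (LTL a)
  | Next (LTL a)
  | Always (LTL a)
  | Eventually (LTL a)
  | Until (LTL a) (LTL a)

-- Derived connective: f implies g.
implies :: LTL a -> LTL a -> LTL a
implies f g = Not (And f (Not g))

-- All non-empty suffixes of a trace, longest first.
suffixes :: [a] -> [[a]]
suffixes []         = []
suffixes s@(_ : xs) = s : suffixes xs

-- Finite-trace semantics: atoms and Next are false on an empty trace.
holds :: LTL a -> [a] -> Bool
holds (Atom p) (x : _)    = p x
holds (Atom _) []         = False
holds (Not f) t           = not (holds f t)
holds (And f g) t         = holds f t && holds g t
holds (Next f) (_ : xs)   = holds f xs
holds (Next _) []         = False
holds (Always f) t        = all (holds f) (suffixes t)
holds (Eventually f) t    = any (holds f) (suffixes t)
holds (Until f g) t       = go t
  where
    go []           = False
    go s@(_ : xs)   = holds g s || (holds f s && go xs)

-- Hypothetical trace events, for illustration only.
data Event = Forged | Adopted
  deriving (Eq, Show)

-- "Every forged block is eventually followed by an adoption."
forgedThenAdopted :: LTL Event
forgedThenAdopted =
  Always (Atom (== Forged) `implies` Eventually (Atom (== Adopted)))

main :: IO ()
main = do
  print (holds forgedThenAdopted [Forged, Adopted, Forged, Adopted]) -- True
  print (holds forgedThenAdopted [Forged, Adopted, Forged])          -- False
```

The second trace fails because its final Forged event has no later Adopted event within the recorded trace.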
Node Diversity
The comprehensive formal schema definition of all the Node's existing trace messages is nearing integration / merging. The initial version will be able to extract all definitions from the actual implementation into a fully validated JSON schema. Future work will address completing the automated verification suite, adding a mechanism to amend the extracted schema manually (e.g. with comments or refinement types) and a pipeline to facilitate usage, such as automatic derivation of a parser, or rendering of a human-readable specification PDF.
Due to resourcing issues, the trace / metrics forwarding mini-protocol implementation in native Rust unfortunately had to be put on hiatus for the foreseeable future.
