Skip to main content

Performance & Tracing Update

· 5 min read
Michael Karg

High level summary

  • Benchmarking: We've performed and analysed benchmarks in the Conway era, with DReps injected.
  • Development: Tracing DRep data has been implemented; improved error reporting in tx-generator and analysis quick queries are ongoing work.
  • Workbench: We now fully supports the new CLI create-testnet-data command and DRep injection into Conway genesis. Haskell profile definition work is ongoing.
  • Tracing: Various additions to Node metrics are being worked on, such as build info and block producer role. Metrics naming will be further harmonized.
  • UTxO Growth: We've finalized analysis and reports of all benchmarks targeting UTxO scaling scenarios.
  • UTxO-HD / LMDB: We've performed multiple runs benchmarking the LMDB (on-disk) backend of UTxO-HD.

Low level overview

Benchmarking

We've run and analyzed a full set of benchmarks comparing the Conway ledger against the Babbage one, on Node 8.10.1-pre. For Conway, our additional goal was to measure a vanilla ledger state against one with a large amount of DReps - and delegations to those DReps - present. The benchmarks used our existing value and Plutus workloads to remain comparable to each other.

Development

Additional ledger queries for the tracing system have been implemented and merged to master. Those capture the amount of, and the number of existing delegations to, DReps as trace output - and thus enable creating a metric on top of it, which can then be monitored.

The (in our case) non-deterministic nature of shutting down different cluster setups - both local and cloud-based - carries the possibility that our transaction generation service occasionally misclassifies a regular shutdown as an error. Furthermore, in the case of network malfunctions, the service's errors are too unspecific. By implementing thread labels for submission threads, corresponding to each submission target, and by adding custom smart signal handlers, we'll improve the generator's error reporting significantly.

The initial tests for quick queries are being developed further. We're moving towards a principled, and generalized, syntax that supports both prepared, parametrizable queries from the application code, as well as ad-hoc queries stated e.g. on the command line.

Workbench

The performance workbench now fully supports the new cardano-cli command create-test-data. We use it to inject both stake delegated to stake pools into genesis, and - recently added - stake delegated to DReps as well. It has been proven very useful and versatile so far, and will eventually replace the current create-staked command.

Work on porting our performance workbench's profile definitions to Haskell, and providing them with an appropriate test suite, is still ongoing; currently, we're integrating all new profile families that came out of the UTxO growth scenarios.

Tracing

New metrics are being implemented for the tracing system. They will also be part of Prometheus output and as such accessible to monitoring services. There'll be cardano-node's detailed build info, as well as a node's block producer status, meaning the presence of forger credentials. Those new metrics are being backported to the legacy tracing system, too.

Furthermore, we've determined the need to revisit metrics naming. There's still a divergence between naming in the legacy and the new system. While this could be mitigated by passing in extra config options, we think that a transition to the new system should not impose any unnecessary effort for node operators. A design to fully harmonize the existing naming schemata is currently being set up.

UTxO Growth

The UTxO Growth benchmarking series has been finalized. We've finished analyses and reports for all scenarios that were tested and explored.

The overarching questions were, given a network of 32GB host systems, how large can the UTxO set grow in general, how large can it grow before the nodes have to operate close to the RAM limit over extended periods of time, and how does scaling the UTxO set size affect network metrics, such as block diffusion.

A dedicated "UTxO Scaling Squad" was set up, who was driving the entire process, and we enjoyed a very focused and productive collaboration with them.

UTxO-HD / LMDB

Last not least, we were able to benchmark UTxO-HD's on-disk backend on a network of block producing nodes, on a recent 8.9.1 version of cardano-node. The setup allowed of using a direct access SSD device for performance critical disk I/O, whereas the bulk of ChainDB and ledger snapshots remained on a standard AWS EBS volume.

The benchmarks comprised both optimistic and pessimistic RAM assumptions for the host OS to further optimize I/O via page cache, as well as medium and large UTxO set sizes - the latter almost tripling current mainnet's size. The results were promising; the LMDB backend has proven to be able to accomodate large UTxO sets using significantly less RAM than the default all-in-memory node - and with a more than reasonable trade-off performance-wise. Furthermore, running with pessimistic assumptions, the performance impact on LMDB was very moderate only.