45 posts tagged with "performance-tracing"

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarking for node 8.7.0. Also, we performed the first-ever Conway benchmarks.
  • Development: Conway capability of our workload generator has been implemented and merged to master.
  • Infrastructure: Changes to our workbench facilitating easy access and archiving of raw benchmarking data.
  • Tracing: Quality-of-life improvements to tracing output and addition of a test suite.
  • Nomad cluster: Expanding the list of benchmarking profiles that can be run on Nomad; generalizing cluster topology generation.

Low level overview

Benchmarking

A full set of benchmarks for node 8.7.0 has been performed, with a focus on enabling the next mainnet release. We've measured slight performance improvements of 8.7.0 over 8.6.0, and can confirm that no regressions have been introduced.

Furthermore, we've run system integration level benchmarks in the Conway era for the first time, on the same node version. Only Babbage-compatible workloads entered the comparison, so as to ascertain the performance consequences of changing only the ledger version, and nothing else. The results are very promising, as we could show that switching ledger versions for existing workloads does not come with a performance penalty.

Development

Our transaction generator has been extended to be able to submit all present benchmarking workflows in the Conway era. Currently, we're looking into adding Conway-exclusive features, such as DRep registration. Those would be submitted at the very beginning of a run, as we're interested in seeing potential performance implications of maintaining DRep sets of varying size in the ledger. Furthermore, this will serve as the basis for future development of Conway-exclusive workloads, such as governance actions or vote tallying.
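
To make the idea concrete, here's a minimal sketch - not the actual tx-generator API: the DRep registrations form a bounded setup phase that is prepended to the steady-state workload, parameterized by the intended DRep set size. All names below are hypothetical.

```haskell
module DRepPhase where

-- Hypothetical placeholder transaction type; the real generator
-- models transactions in far more detail.
data Tx = RegisterDRep Int | PlainPayment Int
  deriving Show

-- One registration per DRep; the DRep set size is the parameter
-- whose ledger-side maintenance cost we want to observe.
dRepSetupPhase :: Int -> [Tx]
dRepSetupPhase nDReps = [ RegisterDRep i | i <- [1 .. nDReps] ]

-- Prepend the bounded setup phase to the benchmarking workload proper.
withDRepSetup :: Int -> [Tx] -> [Tx]
withDRepSetup nDReps workload = dRepSetupPhase nDReps ++ workload
```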

Infrastructure

As our workbench will be pivotal in orchestrating and organizing benchmarking runs on the Nomad cloud backend, we've improved how raw benchmark data is tagged and which metadata is documented in an automated manner. This enhances both access to existing run sets and the maintenance of an archive of benchmarking data.

Tracing

The new tracing system is currently receiving usability improvements as we rework the output of several trace messages. Additionally, we're setting up a rigorous test suite to provide safety for future development of, and component integration into, the system.

Nomad backend

We've been working on adapting various benchmarking workloads, which are defined by our workbench's profiles, to run on the new infrastructure. This mainly concerns a workload utilizing Plutus, as well as peer-to-peer flavoured workloads. Furthermore, we're implementing a solution to create all possible cluster topologies algorithmically, instead of relying on fixed literal definitions for some cases.
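
As a sketch of what algorithmic generation buys us (illustrative only, not the workbench's actual code), a uniform topology can be computed for any cluster size by connecting each node to a fixed set of clockwise offsets on a ring, instead of spelling the connections out literally:

```haskell
module Topology where

-- Connect every node i to a fixed set of clockwise offsets on a ring,
-- yielding the same out-degree for every node regardless of cluster size.
ringTopology :: Int -> [Int] -> [(Int, [Int])]
ringTopology n offsets =
  [ (i, [ (i + o) `mod` n | o <- offsets ]) | i <- [0 .. n - 1] ]

-- ghci> ringTopology 6 [1,2]
-- [(0,[1,2]),(1,[2,3]),(2,[3,4]),(3,[4,5]),(4,[5,0]),(5,[0,1])]
```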

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarking for node 8.6.0 as well as benchmarks scrutinizing GHC versions and the new tracing system.
  • Development: PlutusV3 capability of our workload generator has been implemented.
  • Tracing: First round of optimization of the cardano-tracer service has completed, awaiting validation.
  • Nomad backend: A significant PR has landed addressing automation features and debugging capabilities.
  • Workbench: Configurable remote environments and improvements to run documentation have been merged to master.

Low level overview

Benchmarking

We've performed and analyzed a full set of benchmarks for node 8.6.0, in comparison both to recent release tags and to mainnet version 8.1.2. A lot of development work has entered the system since then, so it is crucial that we can rule out any potential performance risks for the next mainnet release.

Additionally, we've been benchmarking GHC9.6.3 builds of cardano-node. Overall, we've observed reliable optimization behaviour from that compiler version - much more in line with expectations than what we've seen on GHC9.2.7. Getting evidence on how predictable (and malleable, by code annotations) performance is when building with a certain compiler version is essential for settling on a version as a supported release platform.

A last set of benchmarks was dedicated to the new tracing system with node 8.6.0. We were able to show that there is no performance risk to enabling the new system, even when forwarding all trace messages to a cardano-tracer service on the receiving end. Key metrics for block forging, as well as block diffusion, did not exhibit any regression.

Development

For future benchmarks to be built around PlutusV3, we've equipped our transaction generator with basic integration and tests for the upcoming Plutus version. This enables us to target the new cost model and potential changes to the execution budgets by developing specialized workloads.

Tracing

The cardano-tracer service has received its first batch of optimizations. Profiling output is promising; to measure performance for a long service run time, we're currently equipping the service binary with the same capability to emit regular resource traces as cardano-node. Analysis of those will be the basis for validating this and possible future optimization efforts.
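
The following is a minimal sketch of that capability, using GHC's built-in RTS statistics in place of the node's actual resource tracing machinery; the trace format shown is illustrative.

```haskell
module ResourceTrace where

import Control.Concurrent (threadDelay)
import Control.Monad (forever, unless)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Emit one resource trace per second; requires running the binary
-- with `+RTS -T -RTS` so the RTS actually collects these statistics.
emitResourceTraces :: IO ()
emitResourceTraces = do
  enabled <- getRTSStatsEnabled
  unless enabled $ putStrLn "warning: RTS stats disabled; run with +RTS -T"
  forever $ do
    s <- getRTSStats
    putStrLn $  "resources:"
             <> " allocated_bytes=" <> show (allocated_bytes s)
             <> " max_live_bytes="  <> show (max_live_bytes s)
             <> " major_gcs="       <> show (major_gcs s)
    threadDelay 1000000  -- one second
```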

Nomad backend

Many improvements for the nomad backend have been implemented and merged to master. This encompasses a unified naming schema for all nomad profiles, improved internal management of cluster topology, a more fine-grained healthcheck service, more detailed automated documentation of underlying hardware, as well as lazy resource release. The latter enables our team to investigate and debug interrupted runs for the exact moment and in the exact cluster state a potential failure occurred.

Workbench

Our performance workbench has seen upgrades in documenting and reporting cardano-node builds. This ranges from capturing package versions and commit ids of key dependencies, to querying a deployed node for its build compiler. When alternating between compiler versions and benchmarking custom built branches, automating such documentation is essential.
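
For illustration, one simple way a Haskell binary can self-report the compiler it was built with is via System.Info; the workbench's actual query mechanism against a deployed node may differ.

```haskell
module BuildInfo where

import Data.Version (showVersion)
import System.Info (arch, compilerName, compilerVersion, os)

-- Report the compiler and platform this binary was built with.
buildInfo :: String
buildInfo = unwords
  [ compilerName, showVersion compilerVersion, "on", arch <> "-" <> os ]

main :: IO ()
main = putStrLn buildInfo
```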

Furthermore, the workbench is now able to access several remote deployments on all active clusters. This allows for fetching data, analyzing, comparing and reporting on all benchmarks from just one centralized workbench instance.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Continued benchmarking of UTxO-HD and performed benchmarks for the new tracing system.
  • Consensus QTAs: Our prototype approach is applied to potential regression fixes with GHC 9.2.7.
  • Development: We've developed strategies for future benchmarks of PlutusV3 and UTxO-HD's on-disk backing store.
  • Tracing: The machine-readable tracer configuration has been merged. Optimization of cardano-tracer started.
  • Nomad backend: Ongoing variance analysis and refined cluster topology.

Low level overview

Benchmarking

Performing and analyzing benchmarks for the UTxO-HD feature is an ongoing effort; we can reliably assess the performance of the in-memory backing store and evaluate possible optimizations (or regressions) for it.

Furthermore, benchmarks of our new tracing system after several rounds of optimization have been performed. The results show all key metrics now being unaffected by the choice of tracing system (legacy or new) - with the new system being able to provide more features and flexibility in comparison. The benchmarks also highlighted further points for optimization, with the focus now on the cardano-tracer service.

Consensus QTAs

The Quantitative Timeliness Agreements (QTA) prototype is being used in coordination with Consensus and DevX to validate a series of patches that address optimization opportunities which GHC8.10 seizes, but GHC9.2 misses. The feedback from this approach is much more immediate than running benchmarks at system integration level. But once we eventually do, we expect to reproduce the relevant observations - which would mean a big step towards maturing the prototype.

Development

Benchmarking UTxO-HD's on-disk backing store needs special attention: in virtualized environments, disk I/O is not a reliable metric, as it passes several layers of indirection. As this is the very metric which will influence overall performance of this UTxO-HD flavour, we developed a plan to monitor such nodes, connected to a running network, on dedicated hardware with direct SSD access. Replicating this setup for an entire benchmarking cluster of such nodes will be a future effort.

PlutusV3 will come with new builtins and a new cost model. It will take a specialized benchmark to ascertain the soundness of that model running a full cluster of nodes, possibly stressing expensive builtins. At the same time, we'd like to validate the many improvements that have gone into the Plutus evaluator.

Tracing

The focus for further optimization of the new tracing system has shifted to cardano-tracer - the service receiving and processing traces from one (or more) nodes. While it is undisputed that the code living in cardano-node is more performance-critical, the receiving service must still minimize its resource footprint. Moreover, it can generate load for a running node when querying data points from it - which calls for tight control of that mechanism and its possible configurations.
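
A hedged sketch of that kind of control, with hypothetical names: making the polling interval a first-class configuration value bounds the load the tracer can generate against a node.

```haskell
module Polling where

import Control.Concurrent (threadDelay)
import Control.Monad (forever)

-- Hypothetical configuration: the polling interval is an explicit,
-- tunable value, so the query load on a node stays bounded.
newtype PollConfig = PollConfig { pollIntervalMicros :: Int }

-- Repeatedly run a query action and hand its result to a consumer,
-- waiting at least the configured interval between queries.
pollDataPoints :: PollConfig -> IO a -> (a -> IO ()) -> IO ()
pollDataPoints cfg query consume = forever $ do
  query >>= consume
  threadDelay (pollIntervalMicros cfg)
```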

Nomad backend

Variance analysis of the new nomad backend has revealed a necessary adjustment of the cluster's topology. We repeated the same analysis and now see even better confidence in the measurements taken with nomad. This concludes the work on the backend proper for the time being. The last steps before production use will focus on the interface between the backend and our workbench, which provides all high-level benchmark definitions and analysis machinery.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: We've performed both low-level network and high-level variance analysis of our benchmarking clusters.
  • Infrastructure: Our reporting pipeline was adjusted to easily classify various workloads, reducing rework time.
  • Tracing: Work on machine-readable tracing of tracer configuration is ongoing.
  • Nomad backend: We've been able to eliminate several possible confounders on the nomad cluster.
  • Team: We're currently onboarding a new team member: Welcome to Cardano Performance & Tracing, Baldur Blöndal!

Low level overview

Benchmarking

As part of the effort to bring the Nomad backend into production use, we've been equipping both that and the existing benchmarking backend with means to measure and document network latency for each run. Furthermore, we've implemented means to capture TCP packets for a limited time window during a benchmarking run - which will allow us to spot differences in the behaviour of the underlying networking stack at OS level.
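
For illustration, a bounded capture window could be realized by shelling out to tcpdump, whose -G/-W flags end the capture after one rotation period; the interface and integration details here are illustrative, not our backend's actual setup.

```haskell
module Capture where

import System.Process (callProcess)

-- Capture one time window of packets: -G rotates the dump file every
-- <seconds>, and -W 1 makes tcpdump exit (status 0) after the first
-- rotation. Capturing requires sufficient privileges on the host.
captureWindow :: Int -> FilePath -> IO ()
captureWindow seconds pcapFile =
  callProcess "tcpdump"
    ["-i", "any", "-G", show seconds, "-W", "1", "-w", pcapFile]
```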

Additionally, we're running variance analysis in parallel on both backends to ascertain confidence in metrics originating from either. We've concluded that baseline profile runs aren't directly comparable between the two, so we decided to compare standard deviations instead to validate the measurements from nomad.
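
The reasoning in a nutshell, as a small sketch: the coefficient of variation is unit-free, so each backend's dispersion can be compared even when the baseline measurements themselves aren't directly comparable.

```haskell
module Variance where

mean, stdDev :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)
stdDev xs =
  let m = mean xs
  in  sqrt (mean [ (x - m) ^ (2 :: Int) | x <- xs ])

-- A unit-free dispersion measure: comparable across the two clusters
-- even though their baseline means differ.
coefficientOfVariation :: [Double] -> Double
coefficientOfVariation xs = stdDev xs / mean xs
```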

Infrastructure

Reporting on benchmarks requires human time and effort to rework the final document. Improvements to our reporting pipeline, now merged to master, reduce that time through various changes to the template and to the workload classification logic in analysis.

Beyond that, we've looked into issues where, under rare circumstances, services would quit with an unjustified exit failure upon shutdown. By reworking the shutdown logic of trace-dispatcher and tx-generator, we were able to address those issues.

Tracing

After the various steps of constructing a configuration upon node startup, it is vital to document which runtime configuration the node eventually arrived at. We're working on providing a machine-readable JSON/YAML trace message for that purpose.

This will facilitate hot-reloading a node's tracer configuration in the future: users will be able to take such a trace message, apply their intended change and hot-reload it immediately into the node.
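
A minimal sketch of such a message using aeson, with an illustrative, simplified shape; the real trace message follows the node's actual configuration schema.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module EffectiveConfig where

import Data.Aeson (encode, object, (.=))
import qualified Data.ByteString.Lazy.Char8 as BL

-- Emit the configuration the node actually arrived at as one JSON
-- value; a user can edit this value and hot-reload it into the node.
emitEffectiveConfig :: IO ()
emitEffectiveConfig = BL.putStrLn $ encode $ object
  [ "severity" .= ("Notice" :: String)
  , "detail"   .= ("DNormal" :: String)
  , "backends" .= (["Stdout MachineFormat", "Forwarder"] :: [String])
  ]
```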

Nomad backend

As with the existing benchmarking cluster, nomad is currently under scrutiny with regard to the reliability of metrics it produces, as well as the behaviour of its OS-level network stack. For instance, differing kernel versions can have an impact on our measurements, as we'd be basically using two different instruments to take them.

Along the way we've already been successful in eliminating some possible confounders that had been introduced by the nomad service or the slightly different system architecture of the new cluster.

New team member

Baldur Blöndal is an extremely capable and experienced Haskell developer. Also, he's an excellent fit for our existing team. So I'm very pleased to welcome him onboard with IOG, and with Performance & Tracing. He will be working on cardano-tracer, the component receiving, processing and making available node traces and metrics.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: We've performed and analysed feature benchmarks for both UTxO-HD and the current P2P stack.
  • Infrastructure: Various improvements of our analysis pipeline have been merged to master, supporting safe log truncation.
  • Tracing: Namespace consistency checks have been merged to master along with a curated configuration for benchmarking.
  • Nomad backend: We're productively using the new backend to measure new vs. legacy tracing system, adding many quality-of-life improvements.

Low level overview

Benchmarking

We've completed various runs and analyses targeting two distinct features of the node: UTxO-HD and Peer2Peer.

With our UTxO-HD benchmark we could clearly localize one point where this new way of maintaining ledger state is still costly, but at the same time confirm that in basically all other aspects UTxO-HD makes no difference in performance.

The Peer2Peer benchmarks focused on the effects that enabling this feature on a block producing node has on propagation times, as well as scrutinized a proposed change to the Peer2Peer network stack.

Infrastructure

As a result of optimizing in-memory representation of log objects, which are constructed from a node's traces, we can now analyse runs that last longer in total. For runs that exceed their expected duration, analysis now supports a truncation operation that keeps the interdependencies of block events intact.

Truncation might happen at a slightly different point in time - and therefore in its log object stream - for each node in the cluster. An additional step validating the block hash timeline of the cluster has been implemented for the pipeline. It provides early feedback on whether a specific truncation will lead to a valid full analysis, which requires much more time.
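
The core of that validation step can be sketched as follows (simplified; the real pipeline checks far more than hash equality):

```haskell
module Timeline where

import Data.List (nub)

-- Given one (already truncated) block hash sequence per node, all
-- nodes must agree on the prefix common to the whole cluster.
consistentTimeline :: [[String]] -> Bool
consistentTimeline []        = True
consistentTimeline timelines =
  let common = minimum (map length timelines)
  in  length (nub (map (take common) timelines)) == 1
```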

Tracing

Consistency checking of namespace implementation and configuration when using the new system has been completed. This feature enables feedback on when tracer implementation details in some component might have changed. It's also able to detect when a configuration used for operating a cardano-node shows inconsistencies with the namespaces the system provides - and hence needs attention.
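
The heart of such a check can be sketched with plain sets; the namespace shown in the comment is illustrative.

```haskell
module Namespaces where

import           Data.Set (Set)
import qualified Data.Set as Set

type Namespace = [String]  -- e.g. ["ChainDB", "AddBlockEvent"]

-- Namespaces referenced by a configuration but absent from the
-- implementation indicate a stale or mistyped configuration
-- that needs attention.
unknownConfigured :: Set Namespace -> Set Namespace -> Set Namespace
unknownConfigured implemented configured =
  configured `Set.difference` implemented
```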

Furthermore, we've created a fine-grained configuration of the new system that caters to benchmarking's need for very many detailed trace messages. It's aimed at mirroring the same amount of trace messages, and information, that we're seeing from our usage of the legacy system - an important step in making benchmarks between the two systems comparable.

Nomad backend

The new backend is currently being used for further validation with regard to the existing cluster. Additionally, we're using it in production mode to comparatively benchmark both tracing systems after merging the past month's optimizations - the first real-life application of the nomad cluster. Hands-on experience in that phase translates into many small improvements which can be immediately applied to enhance user experience for the new backend.