52 posts tagged with "performance-tracing"


Performance & Tracing Update

January 31, 2024 · 2 min read

Michael Karg

Performance and Tracing Team Lead

High level summary

Benchmarking: GHC 9.6.3 benchmarks for node 8.7.2 have been performed.
Development: Additional features for our reporting pipeline, while simultaneously reducing dependency footprint.
Tracing: Implementation for cardano-tracer to work on handles instead of files; work on New Tracing Quickstart document has begun.
Nomad cluster: We're preparing an upgrade to the latest Nomad version.

Low level overview

Benchmarking

We've performed a full set of GHC 9.6.3 benchmarks for node 8.7.2. From a performance perspective, we observe only one residual issue standing in the way of recommending GHC 9.6 as a default build platform for cardano-node. To address it, we've decided to create a reproduction benchmark targeting the affected component.

Development

Our reporting pipeline will be expanded to support a wider range of rendering formats, as well as report templates. As the pipeline is part of our workbench - and thus gets downloaded and built when entering the workbench shell - it's good practice to keep a small dependency footprint. When reworking the pipeline, we aim to simultaneously reduce dependencies.

Tracing

So far, cardano-tracer has internally been using files, or file names, for the purpose of logging trace messages it receives via forwarding. This is simple, but induces quite some overhead at runtime: files have to be opened and closed for each message. Using and managing open file handles inside cardano-tracer does remove that overhead, but unsurprisingly introduces some complexity into the application code. Currently, we're working on implementing that change.
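The trade-off can be sketched in a few lines. This is only an illustration in Python, not cardano-tracer's code (which is written in Haskell); `log_per_message`, `HandleCache` and `demo` are hypothetical names made up for this example. It contrasts opening the log file for every message with keeping one open handle per file, and shows where the extra bookkeeping comes from.

```python
import os
import tempfile


def log_per_message(path, message):
    """Open, append to, and close the file for every single message."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(message + "\n")


class HandleCache:
    """Keep one open handle per log file and reuse it across messages."""

    def __init__(self):
        self._handles = {}
        self.opens = 0  # how often a file actually had to be opened

    def write(self, path, message):
        handle = self._handles.get(path)
        if handle is None:
            handle = open(path, "a", encoding="utf-8")
            self._handles[path] = handle
            self.opens += 1
        handle.write(message + "\n")

    def close_all(self):
        """The added complexity: handles must be closed explicitly."""
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()


def demo(message_count=1000):
    """Write the same messages both ways; return (opens, lines_a, lines_b)."""
    with tempfile.TemporaryDirectory() as tmp:
        per_message_log = os.path.join(tmp, "per-message.log")
        cached_log = os.path.join(tmp, "cached.log")

        for i in range(message_count):
            log_per_message(per_message_log, f"trace {i}")

        cache = HandleCache()
        for i in range(message_count):
            cache.write(cached_log, f"trace {i}")
        cache.close_all()

        with open(per_message_log, encoding="utf-8") as f:
            lines_a = f.read().splitlines()
        with open(cached_log, encoding="utf-8") as f:
            lines_b = f.read().splitlines()
        return cache.opens, lines_a, lines_b
```

Both variants produce the same log content; the cached variant opens the file once instead of once per message, at the price of having to manage the lifetime of the handles.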

Furthermore, we're working on a Quickstart document for the new tracing system, with end users as its intended audience. It will contain recommended setup(s) for production use, and explain step by step how to configure and run them efficiently. Additionally, it will provide a brief but comprehensive overview of the features at the user's disposal.

Nomad backend

On the Nomad cluster, we've experienced undesired system behaviour when the heartbeat between the Nomad server and a client is interrupted temporarily - although the Nomad job itself is still 100% functional. A Nomad upgrade to the latest version promises to fix that, but in turn comes with other issues. We're currently working on adapting our automation and deployment around those known issues, before we can eventually apply the upgrade.

Performance & Tracing Update

December 11, 2023 · 4 min read

Michael Karg

Performance and Tracing Team Lead

High level summary

Benchmarking: Release benchmarking for node 8.7.2, as well as P2P benchmarks - the latter being a premiere for the Nomad cluster.
Development: Work on automating the submission of Conway transactions for registration of and stake delegation to DReps is ongoing.
Infrastructure: We're preparing to safely retire our current benchmarking cluster.
Tracing: The documentation for the new tracing system is being reworked. Additionally, we've added a feature to cardano-tracer to capture performance over a long runtime.
Nomad cluster: The Nomad backend has been successfully equipped with support for up-to-date P2P topology, as well as deployment for Plutus script data.

Low level overview

Benchmarking

We've performed a full set of release benchmarks for node 8.7.2. From a performance perspective, it has been greenlit for mainnet release. Starting with this version, our team will publish observations alongside the original comparative analysis of benchmarks, providing insight into key metrics and resource usage. Hence, for the post on version 8.7.2, see here.

Additionally, we're running P2P versus no-P2P benchmarks on the same version. We intend to establish future baseline runs using P2P topology as the default setting. All our cluster nodes being block producers, it is crucial to establish the P2P stack does not exhibit any regression regarding block forging. Furthermore, the evidence gathered from those benchmarks forms the base for a recommended setting for P2P on mainnet block producers.

It deserves special mention that those P2P benchmarks are our first production runs with the Nomad cluster - and using the new tracing system exclusively. Having finalized all optimization rounds of the latter, and having meticulously eliminated all confounding factors from the Nomad infrastructure, we're confident in the measurements being subject to extremely low variance - which we made sure of in many past validation runs on Nomad.

Development

Orchestrating DRep actions into benchmarking workflows has opened up a fairly large design space. Currently, we're focusing on having stake delegated to DReps, in order to trigger ledger pulses for calculations particular to DRep actions. We can benchmark a possible performance impact of those pulses - even if there are no actions ongoing - as a first step.

The contributing factors to the cost of those pulses are both the number of DReps registered, and the number of delegations to them. It is still under debate which values represent a probable model for mainnet, and whether we can achieve stake delegation programmatically (i.e. by submitting transactions), or if, for large numbers, we need means to inject those delegations into genesis.

Infrastructure

With the switch of all production benchmarks to the Nomad cluster, we will retire the legacy cardano-ops cluster very shortly. Currently, we're making sure that when all its resources are released, we keep an archive for all runs performed, including all raw log data - with the oldest runs dating back as far as December 2019.

Tracing

We've outfitted the cardano-tracer service with the same kind of resource tracing machinery that's used by cardano-node - as well as created a dedicated benchmarking profile for it. It puts very little pressure on the node; instead it causes 6 nodes to emit traces at a rate of >35 messages per second, putting pressure on cardano-tracer via trace forwarding at >200 messages per second for an extended period of time. Analysis of these traces will form the ground for various optimizations of cardano-tracer that are currently being worked on.
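As a quick check of these figures: 6 nodes each emitting more than 35 messages per second add up to more than 6 × 35 = 210 messages per second arriving at cardano-tracer, which is consistent with the >200 stated above. The small Python sketch below spells that out; the function names are hypothetical, and the one-hour duration in the usage note is an assumed example, not a parameter of the actual benchmarking profile.

```python
def aggregate_rate(node_count, per_node_rate):
    """Lower bound on messages per second arriving at the receiving service."""
    return node_count * per_node_rate


def messages_over(duration_seconds, rate):
    """Total messages received at a constant rate over the given duration."""
    return duration_seconds * rate
```

For instance, `aggregate_rate(6, 35)` gives 210, and sustaining that rate for an assumed hour, `messages_over(3600, 210)`, amounts to 756000 messages.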

As we aim for early 2024 to be able to recommend the new tracing system as default in production use, we're currently also reworking the comprehensive documentation to reflect all changes made over the last months.

Nomad backend

In addition to being able to deploy Plutus script data and redeemers separately (instead of inlining them as the legacy cluster did), the Nomad cluster now supports being set up with recent P2P topology. No-P2P topology format will still be supported for occasional regression benchmarking of the P2P stack, when desired. Furthermore, we've completed porting all benchmarking profiles from the legacy cluster to Nomad.

Performance & Tracing Update

December 4, 2023 · 3 min read

Michael Karg

Performance and Tracing Team Lead

High level summary

Benchmarking: Release benchmarking for node 8.7.0. Also, we performed the first-ever Conway benchmarks.
Development: Conway capability of our workload generator has been implemented and merged to master.
Infrastructure: Changes to our workbench facilitating easy access and archiving of raw benchmarking data.
Tracing: Quality-of-life improvements to tracing output and addition of a test suite.
Nomad cluster: Expand the list of benchmarking profiles that can be run on Nomad; generalize cluster topology generation.

Low level overview

Benchmarking

A full set of benchmarks for node 8.7.0 has been performed, with the focus of enabling the next mainnet release. We've measured slight performance improvements of 8.7.0 over 8.6.0, and can confirm no regressions have been introduced.

Furthermore, we've run system integration level benchmarks in the Conway era for the first time, on the same node version. Only Babbage-compatible workloads have entered the comparison, so as to ascertain the performance consequences of changing only the ledger version, and nothing else. The results are very promising, as we could show that switching ledger versions for existing workloads does not come with a performance penalty.

Development

Our transaction generator has been extended to be able to submit all present benchmarking workflows in the Conway era. Currently, we're looking into adding Conway-exclusive features, such as DRep registration. Those would be submitted at the very beginning of a run, as we're interested in seeing potential performance implications of maintaining DRep sets of varying size in the ledger. Furthermore, this will serve as the basis for future development of Conway-exclusive workloads, such as governance actions or vote tallying.

Infrastructure

As our workbench will be pivotal in orchestrating and organizing benchmarking runs on the Nomad cloud backend, we've improved how raw benchmark data is tagged, and which metadata is documented in an automated manner. This improves both access to existing run sets and the maintenance of an archive for benchmarking data.

Tracing

The new tracing system is currently receiving usability improvements as we're reworking the output of several trace messages. Additionally, we're setting up a rigorous test suite to provide safety for future development of, and component integration into, the system.

Nomad backend

We've been working on adapting various benchmarking workloads, which are defined by our workbench's profiles, to running on the new infrastructure. This mainly concerns a workload utilizing Plutus, as well as peer-to-peer flavoured workloads. Furthermore, we're implementing a solution to create all possible cluster topologies algorithmically, instead of still using fixed literal definitions for some cases.
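To give an idea of what creating a topology algorithmically (rather than from a fixed literal definition) can look like, here is a minimal Python sketch. The ring-with-chords shape, the function name `ring_with_chords` and its parameters are assumptions chosen purely for illustration; they do not describe the workbench's actual topology generator or its output format.

```python
def ring_with_chords(node_count, chord_step):
    """Return {node: sorted neighbour list} for a ring with extra chord links.

    Every node is connected to its two ring neighbours and to the nodes
    `chord_step` positions away in both directions, so links are symmetric.
    """
    if node_count < 3:
        raise ValueError("a ring needs at least 3 nodes")
    topology = {}
    for node in range(node_count):
        neighbours = {
            (node - 1) % node_count,
            (node + 1) % node_count,
            (node + chord_step) % node_count,
            (node - chord_step) % node_count,
        }
        neighbours.discard(node)  # no self-links if chord_step wraps around
        topology[node] = sorted(neighbours)
    return topology
```

Because the whole structure is derived from two numbers, the same function covers any cluster size, which is the benefit over maintaining hand-written definitions for individual cases.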

Performance & Tracing Update

November 17, 2023 · 3 min read

Michael Karg

Performance and Tracing Team Lead

High level summary

Benchmarking: Release benchmarking for node 8.6.0 as well as benchmarks scrutinizing GHC versions and the new tracing system.
Development: PlutusV3 capability of our workload generator has been implemented.
Tracing: First round of optimization of the cardano-tracer service has completed, awaiting validation.
Nomad backend: A significant PR has landed addressing automation features and debugging capabilities.
Workbench: Configurable remote environments and improvements to run documentation have been merged to master.

Low level overview

Benchmarking

We've performed and analyzed a full set of benchmarks for node 8.6.0, both in comparison to recent release tags and mainnet version 8.1.2. A lot of development work has entered the system since then, so it is crucial we can rule out any potential performance risks for the next mainnet release.

Additionally, we've been benchmarking GHC9.6.3 builds of cardano-node. Overall, we've observed reliable optimization behaviour by that compiler version - which is much more in line with expectations than what we've seen on GHC9.2.7. Getting evidence on how predictable (and malleable, by code annotations) performance is when building with a certain compiler version is essential for settling on a version as a supported release platform.

A last set of benchmarks was dedicated to the new tracing system with node 8.6.0. We were able to show that there is no performance risk to enabling the new system, even when forwarding all trace messages to a cardano-tracer service on the receiving end. Key metrics for block forging, as well as block diffusion, did not exhibit any regression.

Development

For future benchmarks to be built around PlutusV3, we've equipped our transaction generator with basic integration and tests for the upcoming Plutus version. This enables us to target the new cost model and potential changes to the execution budgets by developing specialized workloads.

Tracing

The cardano-tracer service has received its first batch of optimizations. Profiling output is promising; to measure performance for a long service run time, we're currently equipping the service binary with the same capability to emit regular resource traces as cardano-node. Analysis of those will be the basis for validating this and possible future optimization efforts.

Nomad backend

Many improvements for the Nomad backend have been implemented and merged to master. This encompasses a unified naming schema for all Nomad profiles, improved internal management of cluster topology, a more fine-grained healthcheck service, more detailed automated documentation of underlying hardware, as well as lazy resource release. The latter enables our team to investigate and debug interrupted runs at the exact moment, and in the exact cluster state, in which a potential failure occurred.

Workbench

Our performance workbench has seen upgrades in documenting and reporting cardano-node builds. This ranges from capturing package versions and commit ids of key dependencies, to querying a deployed node for its build compiler. When alternating between compiler versions and benchmarking custom built branches, automating such documentation is essential.

Furthermore, the workbench is now able to access several remote deployments on all active clusters. This allows for fetching data, analyzing, comparing and reporting on all benchmarks from just one centralized workbench instance.

Performance & tracing update

October 6, 2023 · 3 min read

Michael Karg

Performance and Tracing Team Lead

High level summary

Benchmarking: Continued benchmarking of UTxO-HD and performed benchmarks for the new tracing system.
Consensus QTAs: Our prototype approach is applied to potential regression fixes with GHC 9.2.7.
Development: We've developed strategies for future benchmarks of PlutusV3 and UTxO-HD's on-disk backing store.
Tracing: The machine-readable tracer configuration has been merged. Optimization of cardano-tracer started.
Nomad backend: Ongoing variance analysis and refined cluster topology.

Low level overview

Benchmarking

Performing and analyzing benchmarks for the UTxO-HD feature is an ongoing effort; we can reliably assess the performance of the in-memory backing store and evaluate possible optimizations (or regressions) for it.

Furthermore, benchmarks of our new tracing system after several rounds of optimization have been performed. The results show all key metrics now being unaffected by the choice of tracing system (legacy or new) - with the new system being able to provide more features and flexibility in comparison. The benchmarks also highlighted further points for optimization, with the focus now on the cardano-tracer service.

Consensus QTAs

The Quantitative Timeliness Agreements (QTA) prototype is being used in coordination with Consensus and DevX to validate a series of patches that address optimization opportunities which GHC8.10 seizes, but GHC9.2 misses. The feedback from this approach is much more immediate than running benchmarks at system integration level. Once we eventually do run those benchmarks, we expect to reproduce the relevant observations - which would mean a big step towards maturing the prototype.

Development

Benchmarking UTxO-HD's on-disk backing store needs special attention: in virtualized environments, disk I/O is not a reliable metric as it passes through several layers of indirection. As this is the very metric which will influence overall performance of this UTxO-HD flavour, we developed a plan to monitor such nodes, connected to a running network, on dedicated hardware with direct SSD access. Replicating this setup for an entire benchmarking cluster of such nodes will be a future effort.

PlutusV3 will come with new builtins and a new cost model. It will take a specialized benchmark to ascertain the soundness of that model running a full cluster of nodes, possibly stressing expensive builtins. At the same time, we'd like to validate the many improvements that have gone into the Plutus evaluator.

Tracing

The focus for further optimization of the new tracing system has shifted to cardano-tracer - the service receiving and processing traces from one (or more) nodes. Whilst it is undisputed that the code living in cardano-node is more performance critical, the receiving service must still minimize its resource footprint. Moreover, it can generate load for a running node when querying data points from it - which calls for tight control of that mechanism and its possible configurations.

Nomad backend

Variance analysis of the new Nomad backend has revealed a necessary adjustment of the cluster's topology. We repeated the same analysis and now see even better confidence in the measurements taken with Nomad. This concludes the work on the backend proper for the time being. The last steps before production use will focus on the interface between backend and our workbench, which provides all high-level benchmark definitions and analysis machinery.
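One common way to quantify run-to-run variance of a metric is the coefficient of variation (sample standard deviation divided by the mean) across repeated, identical runs. The Python sketch below shows that calculation; it is an illustration only, and whether our analysis machinery uses this exact statistic is not implied. The function name is hypothetical.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation relative to the mean of repeated measurements."""
    if len(samples) < 2:
        raise ValueError("need at least two measurements")
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("mean is zero; relative variation is undefined")
    return statistics.stdev(samples) / mean
```

A lower value for the same metric after a change to the setup (here: the topology adjustment) indicates that repeated runs agree more closely, and hence that a single measurement can be trusted more.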
