48 posts tagged with "performance-tracing"

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for 8.8.0 have been performed; we created a local repro for a residual issue.
  • Performance: We've implemented and benchmarked two candidates investigating residual issues with GHC9.6.
  • Development: Work on the reporting pipeline is ongoing; integration of DReps into benchmarking workloads has begun.
  • Workbench: Implementation of high-level profile definition is ongoing.
  • Tracing: The handle registry feature for cardano-tracer is completed; currently in testing.
  • Nomad cluster: Increased robustness of deployment and run monitoring has been merged; work on garbage collection has started.

Low level overview

Benchmarking

We've performed a full set of release benchmarks for Node 8.8.0-pre. Comparing against release 8.7.2, we could not detect any performance risks for the new version. We even saw a slight improvement in block-fetch-related metrics, which led to slightly improved block diffusion times.

Furthermore, we've managed to boil down a complex residual performance issue measured on the cluster to a local reproduction. This enables our DevX team, with its highly specialized knowledge of GHC's internals, to investigate each step in code generation and optimization, and to independently observe the effects of code changes to the affected component.

Performance

Work on the remaining performance issue with GHC9.6 led us to produce two candidates based on Node 8.7.2, benchmarking the implications that small, local changes have for GHC9.6's optimizer. Though those candidates did not uncover the issue's root cause, they allowed us to disprove a hypothesis as to its nature, and to quantify the performance impact of said small changes.

Development

Node 8.8.0 comes with the capability to inject DReps and DRep delegations into Conway genesis. We've started work on integrating those into our automations and on setting sensible values for benchmarking. As those delegations represent a new data structure in the Conway ledger, we aim to run existing workloads extended with varying sizes of that new structure, measuring the pressure they put on ledger queries and operations.
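
To illustrate the dimensions involved, here is a minimal sketch of the knobs such a workload could expose - with hypothetical names, not the actual Conway genesis schema or workbench profile fields:

```haskell
-- Hypothetical sketch only: illustrative names, not the actual
-- Conway genesis schema or workbench profile fields.
data DRepWorkload = DRepWorkload
  { nDReps             :: Int  -- DReps injected into Conway genesis
  , delegationsPerDRep :: Int  -- stake delegations injected per DRep
  } deriving Show

-- The new ledger structure grows with both dimensions; this is the
-- size we'd vary across runs to measure pressure on ledger operations.
delegationEntries :: DRepWorkload -> Int
delegationEntries w = nDReps w * delegationsPerDRep w
```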

Workbench

The performance workbench relies heavily on shell scripting and on manipulating JSON data for a great part of its features. This approach is very effective for quick experimentation, but it lacks verifiable properties and is not very accessible to new users of the workbench.

After the successful Haskell port of cluster topology creation and verification, we're currently applying the same model to porting the entirety of benchmarking profiles to Haskell. The obvious gains are widening the workbench's audience, for both users and developers, and implementing a principled approach to all workbench data structures and transformations.

At the same time, we're porting workbench's many options to create fine-tuned geneses, following the same approach.
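
As a minimal sketch of the direction this port takes - with illustrative types, not the workbench's actual schema - a profile definition in Haskell makes fields and transformations compiler-checked:

```haskell
-- Illustrative types only; not the workbench's actual profile schema.
data Profile = Profile
  { profileName  :: String
  , nodeCount    :: Int             -- cluster size
  , txsPerSecond :: Double          -- workload submission rate
  , plutusScript :: Maybe FilePath  -- Nothing = value-only workload
  } deriving Show

-- A profile transformation becomes a total, typed function
-- instead of a jq pipeline over untyped JSON.
scaleWorkload :: Double -> Profile -> Profile
scaleWorkload f p = p { txsPerSecond = txsPerSecond p * f }
```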

Tracing

We've outfitted cardano-tracer with a handle registry feature that lets the service operate on file handles internally, rather than opening and closing files for each operation. The feature is complete; at the moment, we're adding appropriate test cases to the service's test suite, both to validate its behaviour and to safeguard future development.
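
A minimal sketch of the underlying idea - hypothetical names, not cardano-tracer's actual implementation: opened handles are kept in a registry keyed by file path and reused, so writing a trace message no longer pays for an open/close cycle.

```haskell
import           Data.IORef      (IORef, newIORef, readIORef, atomicModifyIORef')
import qualified Data.Map.Strict as Map
import           System.IO       (Handle, IOMode (AppendMode), hFlush,
                                  hPutStrLn, openFile)

-- Hypothetical sketch, not cardano-tracer's actual implementation.
-- A real service would also synchronize concurrent access and close
-- handles on shutdown or log rotation.
type HandleRegistry = IORef (Map.Map FilePath Handle)

newRegistry :: IO HandleRegistry
newRegistry = newIORef Map.empty

-- Reuse an open handle if present; open and register it otherwise.
getHandle :: HandleRegistry -> FilePath -> IO Handle
getHandle registry path = do
  handles <- readIORef registry
  case Map.lookup path handles of
    Just h  -> pure h
    Nothing -> do
      h <- openFile path AppendMode
      atomicModifyIORef' registry (\m -> (Map.insert path h m, ()))
      pure h

-- Writing a message no longer opens and closes the file.
writeTrace :: HandleRegistry -> FilePath -> String -> IO ()
writeTrace registry path msg = do
  h <- getHandle registry path
  hPutStrLn h msg
  hFlush h
```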

Nomad backend

Several improvements for our cluster backend have been merged to master, increasing its overall robustness. We can now safely handle some corner cases where Nomad processes unexpectedly exited, or deployments errored out. Furthermore, an ongoing run can now reliably survive a temporary loss of heartbeat connection between Nomad client and server, without the benchmarking metrics being affected.

Currently, we're working on a reliable automation for garbage collecting old Nix store entries on the cluster machines, as they fill up disk space. The design has to balance not interfering with ongoing benchmarks against the deployment overhead caused by cleaning the store too frequently.

· 2 min read
Michael Karg

High level summary

  • Benchmarking: GHC 9.6.3 benchmarks for node 8.7.2 have been performed.
  • Development: We're adding features to our reporting pipeline, while simultaneously reducing its dependency footprint.
  • Tracing: Implementation for cardano-tracer to work on handles instead of files; work on the New Tracing Quickstart document has begun.
  • Nomad cluster: We're preparing an upgrade to the latest Nomad version.

Low level overview

Benchmarking

We've performed a full set of GHC 9.6.3 benchmarks for node 8.7.2. From a performance perspective, we observe only one residual issue standing in the way of recommending GHC9.6 as a default build platform for cardano-node. As a way to address it, we've decided to create a reproduction benchmark targeting the affected component.

Development

Our reporting pipeline will be expanded to support a wider range of rendering formats, as well as report templates. As the pipeline is part of our workbench - and thus gets downloaded and built when entering the workbench shell - it's good practice to keep a small dependency footprint. When reworking the pipeline, we aim to simultaneously reduce dependencies.

Tracing

So far, cardano-tracer has internally been using files, or file names, for the purpose of logging trace messages it receives via forwarding. This is simple, but induces considerable runtime overhead: files have to be opened and closed for each message. Using and managing open file handles inside cardano-tracer removes that overhead, but unsurprisingly introduces some complexity into the application code. Currently, we're working on implementing that change.

Furthermore, we're working on a Quickstart document for the new tracing system, with end users as its intended audience. It will describe recommended production setups, and how to efficiently configure and run them step by step. Additionally, it will provide a brief but comprehensive overview of the features at the user's disposal.

Nomad backend

On the Nomad cluster, we've experienced undesired system behaviour when the heartbeat between the Nomad server and a client is temporarily interrupted - even though the Nomad job itself remains fully functional. A Nomad upgrade to the latest version promises to fix that, but in turn comes with other issues. We're currently adapting our automation and deployment around those known issues, before we can eventually apply the upgrade.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarking for node 8.7.2, as well as P2P benchmarks - the latter being a first for the Nomad cluster.
  • Development: Work on automating the submission of Conway transactions for registration of and stake delegation to DReps is ongoing.
  • Infrastructure: We're preparing to safely retire our current benchmarking cluster.
  • Tracing: The documentation for the new tracing system is being reworked. Additionally, we've added a feature to cardano-tracer to capture performance over a long runtime.
  • Nomad cluster: The Nomad backend has been successfully equipped with support for up-to-date P2P topology, as well as deployment for Plutus script data.

Low level overview

Benchmarking

We've performed a full set of release benchmarks for node 8.7.2. From a performance perspective, it has been greenlit for mainnet release. Starting with this version, our team will publish observations alongside the original comparative analysis of benchmarks, providing insight into key metrics and resource usage; for the post on version 8.7.2, see here.

Additionally, we're running P2P versus no-P2P benchmarks on the same version. We intend to establish future baseline runs using P2P topology as the default setting. As all our cluster nodes are block producers, it is crucial to establish that the P2P stack does not exhibit any regression with regard to block forging. Furthermore, the evidence gathered from those benchmarks forms the basis of a recommended setting for P2P on mainnet block producers.

It deserves special mention that those P2P benchmarks are our first production runs with the Nomad cluster - and the first to use the new tracing system exclusively. Having finalized all optimization rounds of the latter, and having meticulously eliminated all confounding factors from the Nomad infrastructure, we're confident the measurements are subject to extremely low variance - as confirmed by many past validation runs on Nomad.

Development

Orchestrating DRep actions into benchmarking workflows has opened up a fairly large design space. Currently, we're focusing on having stake delegated to DReps, in order to trigger the ledger pulses that perform calculations particular to DRep actions. As a first step, we can benchmark the potential performance impact of those pulses - even when no governance actions are ongoing.

The cost of those pulses is driven by both the number of DReps registered and the number of delegations to them. It is still under debate which values represent a probable model for mainnet, and whether we can achieve stake delegation programmatically (i.e. by submitting transactions), or whether, for large numbers, we need a means to inject those delegations into genesis.

Infrastructure

With the switch of all production benchmarks to the Nomad cluster, we will retire the legacy cardano-ops cluster very shortly. Currently, we're making sure that, when all its resources are released, we keep an archive of all runs performed, including all raw log data - with the oldest runs dating back as far as December 2019.

Tracing

We've outfitted the cardano-tracer service with the same kind of resource tracing machinery that's used by cardano-node - and created a dedicated benchmarking profile for it. That profile puts very little pressure on the node itself; instead, it has 6 nodes emit traces at a rate of >35 messages per second each, putting pressure on cardano-tracer via trace forwarding at >200 messages per second overall for an extended period of time. Analysis of these traces will form the basis for various optimizations of cardano-tracer that are currently being worked on.

As we aim for early 2024 to be able to recommend the new tracing system as default in production use, we're currently also reworking the comprehensive documentation to reflect all changes made over the last months.

Nomad backend

In addition to being able to deploy Plutus script data and redeemers separately (instead of inlining them, as the legacy cluster did), the Nomad cluster now supports being set up with recent P2P topology. The no-P2P topology format will still be supported for occasional regression benchmarking of the P2P stack, when desired. Furthermore, we've completed porting all benchmarking profiles from the legacy cluster to Nomad.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarking for node 8.7.0. Also, we performed the first-ever Conway benchmarks.
  • Development: Conway capability of our workload generator has been implemented and merged to master.
  • Infrastructure: Changes to our workbench facilitating easy access and archiving of raw benchmarking data.
  • Tracing: Quality-of-life improvements to tracing output and addition of a test suite.
  • Nomad cluster: Expanding the list of benchmarking profiles that can be run on Nomad; generalizing cluster topology generation.

Low level overview

Benchmarking

A full set of benchmarks for node 8.7.0 has been performed, focused on enabling the next mainnet release. We've measured slight performance improvements of 8.7.0 over 8.6.0, and can confirm that no regressions have been introduced.

Furthermore, we've run system integration level benchmarks in the Conway era for the first time, on the same node version. Only Babbage-compatible workloads entered the comparison, so as to ascertain the performance consequences of changing the ledger version and nothing else. The results are very promising, as we could show that switching ledger versions for existing workloads does not come with a performance penalty.

Development

Our transaction generator has been extended to be able to submit all present benchmarking workflows in the Conway era. Currently, we're looking into adding Conway-exclusive features, such as DRep registration. Those would be submitted at the very beginning of a run, as we're interested in the potential performance implications of maintaining DRep sets of varying size in the ledger. Furthermore, this will serve as the basis for the future development of Conway-exclusive workloads, such as governance actions or vote tallying.
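
As a sketch of the intended run structure - with illustrative types, not the transaction generator's actual API - such Conway-exclusive actions would form a one-off setup phase preceding the steady-state workload:

```haskell
-- Illustrative types only; not the transaction generator's actual API.
newtype DRepCredential  = DRepCredential  String deriving Show
newtype StakeCredential = StakeCredential String deriving Show

data ConwayAction
  = RegisterDRep   DRepCredential                  -- grow the DRep set
  | DelegateToDRep StakeCredential DRepCredential  -- grow the delegation map
  deriving Show

type Tx = String  -- placeholder for the regular benchmarking workload

-- Setup actions are submitted once, at the very beginning of a run,
-- so the steady-state workload is measured against a ledger that
-- already maintains a DRep set of the desired size.
data Run = Run
  { setupPhase  :: [ConwayAction]
  , steadyState :: [Tx]
  } deriving Show
```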

Infrastructure

As our workbench will be pivotal in orchestrating and organizing benchmarking runs on the Nomad cloud backend, we've improved how raw benchmark data is tagged and which metadata is documented in an automated manner. This enhances both access to existing run sets and the maintenance of an archive of benchmarking data.

Tracing

The new tracing system is currently receiving usability improvements, as we're reworking the output of several trace messages. Additionally, we're setting up a rigorous test suite to provide safety for future development of, and component integration into, the system.

Nomad backend

We've been working on adapting various benchmarking workloads, which are defined by our workbench's profiles, to run on the new infrastructure. This mainly concerns a workload utilizing Plutus, as well as peer-to-peer flavoured workloads. Furthermore, we're implementing a solution to create all possible cluster topologies algorithmically, instead of relying on fixed literal definitions for some cases.
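
For illustration - a hedged sketch, not the workbench's actual generator - a simple ring topology, where each node peers with its next k neighbours, can be computed for any cluster size:

```haskell
-- Hedged sketch, not the workbench's actual generator: node i peers
-- with the next k nodes around a ring, yielding a connected topology
-- for any cluster size without literal, per-size definitions.
type NodeId = Int

ringTopology :: Int -> Int -> [(NodeId, [NodeId])]
ringTopology nNodes k =
  [ (i, [ (i + d) `mod` nNodes | d <- [1 .. k] ])
  | i <- [0 .. nNodes - 1]
  ]

main :: IO ()
main = mapM_ print (ringTopology 6 2)
-- (0,[1,2]) (1,[2,3]) ... each of the 6 nodes peers with its next two
```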

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarking for node 8.6.0 as well as benchmarks scrutinizing GHC versions and the new tracing system.
  • Development: PlutusV3 capability of our workload generator has been implemented.
  • Tracing: The first round of optimization of the cardano-tracer service has been completed and is awaiting validation.
  • Nomad backend: A significant PR has landed, addressing automation features and debugging capabilities.
  • Workbench: Configurable remote environments and improvements to run documentation have been merged to master.

Low level overview

Benchmarking

We've performed and analyzed a full set of benchmarks for node 8.6.0, both in comparison to recent release tags and mainnet version 8.1.2. A lot of development work has entered the system since then, so it is crucial we can rule out any potential performance risks for the next mainnet release.

Additionally, we've been benchmarking GHC9.6.3 builds of cardano-node. Overall, we've observed reliable optimization behaviour from that compiler version - much more in line with expectations than what we've seen on GHC9.2.7. Getting evidence on how predictable (and how malleable, by code annotations) performance is when building with a certain compiler version is essential for settling on that version as a supported release platform.

A last set of benchmarks was dedicated to the new tracing system with node 8.6.0. We were able to show that there is no performance risk to enabling the new system, even when forwarding all trace messages to a cardano-tracer service on the receiving end. Key metrics for block forging, as well as block diffusion, did not exhibit any regression.

Development

For future benchmarks to be built around PlutusV3, we've equipped our transaction generator with basic integration and tests for the upcoming Plutus version. This enables us to target the new cost model and potential changes to the execution budgets by developing specialized workloads.

Tracing

The cardano-tracer service has received its first batch of optimizations. Profiling output is promising; to measure performance for a long service run time, we're currently equipping the service binary with the same capability to emit regular resource traces as cardano-node. Analysis of those will be the basis for validating this and possible future optimization efforts.

Nomad backend

Many improvements for the Nomad backend have been implemented and merged to master. These encompass a unified naming schema for all Nomad profiles, improved internal management of cluster topology, a more fine-grained healthcheck service, more detailed automated documentation of the underlying hardware, and lazy resource release. The latter enables our team to investigate and debug interrupted runs at the exact moment, and in the exact cluster state, in which a potential failure occurred.

Workbench

Our performance workbench has seen upgrades in documenting and reporting cardano-node builds. This ranges from capturing package versions and commit IDs of key dependencies, to querying a deployed node for its build compiler. When alternating between compiler versions and benchmarking custom-built branches, automating such documentation is essential.

Furthermore, the workbench is now able to access several remote deployments on all active clusters. This allows for fetching data, analyzing, comparing, and reporting on all benchmarks from a single, centralized workbench instance.