Skip to main content

45 posts tagged with "performance-tracing"

View All Tags

· 2 min read
Michael Karg

High level summary

  • Benchmarking: Benchmarking node version 8.2.1 has concluded. Additionally we're developing benchmarking setups for GHC9.6 and UTxO-HD.
  • Infrastructure: Our analysis pipeline has received improvements reducing memory footprint.
  • Tracing: Another batch of optimizations for the new tracing system has been merged; work on namespace consistency guarantees is ongoing.
  • Nomad backend: We're performing and analysing various runs for validation purposes on the new hardware cluster.

Low level overview

Benchmarking

We've performed and analysed the benchmarks for the 8.2.1 version of cardano-node as part of our release benchmarking cycle.

Setting up cluster benchmarks requires completing full system integration. This applies to both supporting a new build platform, as is GHC9.6, as well as targeting a specific feature, like a UTxO-HD enabled node. Currently, we're working on respective integrations on both those paths.

Infrastructure

As cluster runs increase in duration, more and more data is accumulated for analysis. Batch analysis mode needs all data to be held in memory, which wouldn't fit anymore even on a 64GB RAM machine. Changes to the in-memory data representation improving on compactness were able to reduce the RAM requirements of our analysis pipeline.

Tracing

The next portion of optimizations has been completed and merged to master, getting rid of Haskell's native String representation on critical code paths. This concludes the optimization phase of the new tracing system for all its components used by cardano-node.

The implementation for validation of consistent naming and configuration is ongoing. We're splitting out everything that's verifiable at compile time into a seperate test case which we hope to integrate into CI - leaving only configuration constraints to be verified at or before node startup.

Nomad backend

The verification phase of the nomad cloud backend is ongoing. We're able to perform full runs on the new hardware cluster and porting profiles and configurations from the legacy one. The goal is to reproduce with confidence known regressions, or improvements, between runs performed on the legacy cluster and runs performed with the new backend.

· 2 min read
Michael Karg

High level summary

  • Benchmarking: We've concluded benchmarking node version 8.2.0.
  • Tracing: Optimization of the new tracing system has been merged; we're currently working on self-documenting tracing configuration.
  • Nomad backend: A PR that makes our backend take advantage of added flexibility of the new hardware cluster has been merged.

Low level overview

Benchmarking

As part of our release benchmarking cycle, we've completed and analysed the runs for the 8.2.0 version of cardano-node. In addition to the adjustment of sanity checks in our automation, we had to implement small changes in the analysis pipeline as well to accomodate the new version.

Tracing

A significant amount of optimizations for the new tracing system has finally been merged to master. At the moment, we're working on having a trace message self-document the final tracing configuration of a running node. Apart from adding insight into the system, this feature also aims at making future hot reloading of tracing configuration explicit and straightforward.

Furthermore, we're setting up a final round of system integration level benchmarks comparing new against legacy tracing.

Nomad backend

The new hardware cluster permits greater flexibility as far as SSH access is concerned. By using nomad for a consistent and reliable deployment, but taking advantage of direct connections for healthchecks and data transfer we believe we were able to reduce overall network latency in the nomad cluster. This improves confidence when capturing all network related measurements during our benchmarks.

A PR that adds these capabalities to our nomad backend - along with very many quality-of-life improvements - has been merged to master.

· 2 min read
Michael Karg

High level summary

  • Benchmarking: We're adjusting the benchmarking cluster to handle runs for node version 8.2.0.
  • Tracing: We've finished optimization of the new tracing system and added extra robustness with regard to namespacing.
  • Infrastructure: We've been working on making all benchmarking code compliant with the latest GHC9.6.
  • Nomad backend: The new backend has seen adjustments due to a change of underlying hardware. Additionally, we've successfully performed various benchmarking runs on it.

Low level overview

Benchmarking

The 8.2.0 version of cardano-node required adjustment of some of the sanity checks that are part of our benchmarking cluster automation. We've pinpointed the necessary changes and are currently setting up the cluster for the new node version.

Tracing

The optimization efforts for the new tracing system have been completed and have significantly reduced the resource footprint when using it as default for a running node.

A linchpin of the new system is the organization of traces into a namespace hierarchy. This affects configuration, self-documentation as well as rendering of desired trace messages. The new system is now equipped to detect any inconsistency in the whole set of tracers, defined across all components, even if they are never turned on in a running node. This feature adds another layer of robustness to the whole system.

Infrastructure

A potential switch to GHC9.6 (or higher) required some work on our code bases to make it compliant with recent compiler versions. We've future-proofed our benchmarking code.

Nomad backend

The hardware cluster that our nomad backend was accessing has been changed, and we were able to adjust our backend accordingly without touching its higher level abstractions and functionality. Moreover, with the new hardware and cluster setup, certain tasks such as retrieving run artifacts or healthcheck monitoring have become more performant.

The validation phase is ongoing. We were able to perform successful runs and analyses for various 8.x node versions, including 8.2.0-pre. With parallel runs on the current cluster, we hope to measure the same effects we've observed with the nomad backend - which will be a big step towards production use.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: The peformance investigation into the compiler switch to GHC9 is ongoing. Additionally, a roadmap for implementing Consensus QTAs has been developed.
  • Infrastructure: Our workbench has undergone some refactoring to seamlessly integrate its profiles into all available backends.
  • Tracing: Optimization of the new tracing system is ongoing and yielding good performance results.
  • Nomad backend: We developed a new feature for the nomad backend which allows pinning deployments to specific machines.

Low level overview

Benchmarking

Our analysis of the GHC9 build of cardano-node has produced several locations in the code base where the new compiler seems to miss opportunities for optimization. Our hypothesis is, that those can account for the difference in resource usage we observe when benchmarking with a full cluster run. Instructing the compiler on how to perform the optimizations which GHC8 apparently applied out of the box requires further investigation.

In an effort to define Quantitative Timeliness Agreements (QTAs) on a per-component basis, we have coordinated with the Consensus team and developed a roadmap for providing those on consensus level. Making use of the insight that system-level benchmarks allow, we intend to set up and calibrate a benchmark that can reliably predict a regression or optimization for select metrics before needing full integration into cardano-node. This will help tremendously in various ways: catching regressions much earlier, localizing them much easier, avoiding repeated component integration and much shorter feedback cycle.

Infrastructure

We have worked on seamless integration of our benchmarking profiles into the many available backends that the workbench provides. The goal was to be backend-agnostic, to guarantee that all benchmarking run artifacts be structurally identical as far as their file name, format and location are concerned. This lead to refactoring work and has already landed in master.

Tracing

Much effort went into further optimization of the new tracing system. After working on configuration to align both new and legacy tracing system with regard to their trace frequencies, we could uncover some increase in resource usage. This occurred for corner cases under very heavy load. These cases have been addressed already, and do now surpass the legacy tracing system in terms of performance.

Nomad backend

For reliable benchmarking results it is vital to introduce as few confounding factors as possible when performing runs. This includes hardware and network topology. The nomad backend has been outfitted with a mechanism to pin the nomad job for some node in our benhcmarking cluster to a specific machine instance. This greatly increases confidence in the metrics observed from a run.

Furthermore this feature will detect any change in the underlying hardware or topology so it can be taken into account. The new feature has been merged to master.

· 2 min read
Michael Karg

High level summary

  • Benchmarking: We've performed several new benchmarks and a performance investigation in preparation of switching the default compiler to GHC9.
  • Infrastructure: The first batch of refactoring and documentation for our tx-generator has been merged to master.
  • Tracing: We've looked into an issue where the tracing system's concurrency could prevent a graceful node shutdown.
  • Nomad backend: Our new cloud backend has seen various improvements regarding deployment and monitoring; validation runs for the backend are ongoing.

Low level overview

Benchmarking

The compiler switch to GHC9 as the default build platform for cardano-node and its components still has noticeable effects on system-wide performance metrics. An investigation into the different resource usage profiles of compiler versions does seem to indicate GHC9's significantly different inlining behaviour may produce those effects. We're currently locating the specific places in component code that have the most extensive effect in that regard.

Using the forge-stress approximation we set up, we could determine that above effect is not due to a range of RTS parameters, as for example the number of capabilites used by the node.

Infrastructure

The tx-generator is a crucial part of our tooling responsible for producing very specific workloads for our benchmarking cluster. In an effort to flesh out an API to make it reusable for more general use cases, a first set of refactorings has been merged to master. Additionally, this merge contained systematic documentation both for internal and for exposed areas of the code base.

Tracing

The tracing system's concurrency could under certain conditions prevent a graceful shutdown of the node. This issue did occur only after adding specific new traces on a development branch. We could localize and address that issue.

Nomad backend

With the data gathered from running the new nomad cloud backend, we've been able to address many, many small and medium-sized improvements. The deployment process has been restructured for better efficiency, and the healthcheck system could be fine-tuned to recognize severity of various conditions that might occur. Optimization of fetching all run data from the cloud for evaluation is in progress.

Additionally, we're continuing the new backend's validation by setting up test runs and looking into comparative analyses with metrics gathered from the current cluster backend.