37 posts tagged with "performance-tracing"

· 3 min read
Michael Karg

High level summary

  • Benchmarking: We've performed both low-level network and high-level variance analysis of our benchmarking clusters.
  • Infrastructure: We've adjusted our reporting pipeline to classify various workloads more easily, which reduces rework time.
  • Tracing: Work on machine-readable tracing of tracer configuration is ongoing.
  • Nomad backend: We've been able to eliminate several possible confounders on the nomad cluster.
  • Team: We're currently onboarding a new team member: Welcome to Cardano Performance & Tracing, Baldur Blöndal!

Low level overview

Benchmarking

As part of the effort to bring the Nomad backend into production use, we've been equipping both it and the existing benchmarking backend with means to measure and document network latency for each run. Furthermore, we've implemented means to capture TCP packets for a limited time window during a benchmarking run - which will allow us to spot differences in the behaviour of the underlying networking stack at OS level.

Additionally, we're running variance analysis in parallel on both backends to ascertain confidence in metrics originating from either. We've concluded that baseline profile runs aren't directly comparable between the two, so we decided to compare standard deviations instead to validate the measurements from nomad.
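To illustrate the idea of comparing spread rather than baseline values, here is a minimal sketch. It assumes each backend yields a list of samples for the same metric across repeated runs. The function names, the sample values and the `max_ratio` threshold are all hypothetical and not taken from the actual analysis pipeline.

```python
from statistics import stdev


def stdev_ratio(samples_a, samples_b):
    """Ratio of the sample standard deviations of two series (b relative to a)."""
    return stdev(samples_b) / stdev(samples_a)


def spread_is_comparable(samples_a, samples_b, max_ratio=1.5):
    """True if neither series is more than `max_ratio` times as spread out as the other.

    The threshold is an illustrative assumption, not a value from the pipeline.
    """
    ratio = stdev_ratio(samples_a, samples_b)
    return 1 / max_ratio <= ratio <= max_ratio


# Hypothetical per-run values of one metric (e.g. seconds) on each backend.
# The baselines differ, so only the spread is compared.
existing_cluster = [0.82, 0.85, 0.80, 0.84, 0.83]
nomad_cluster = [0.95, 0.99, 0.93, 0.97, 0.96]

if __name__ == "__main__":
    print(spread_is_comparable(existing_cluster, nomad_cluster))
```

The two example series have different means but a similar spread, which is the situation where a direct comparison of baselines would mislead while a comparison of standard deviations remains informative.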

Infrastructure

Reporting on benchmarks requires human time and effort to rework the final document. Improvements to the reporting pipeline have been merged to master. They reduce that time through various changes to the template and to the workload classification logic in analysis.

Beyond that, we've looked into issues where, under rare circumstances, services would quit with an unjustified exit failure upon shutdown. By reworking the shutdown logic for trace-dispatcher and tx-generator, we were able to address those issues.

Tracing

After the various steps of constructing a configuration upon node startup, it is vital to document which runtime configuration the node eventually arrived at. We're working on providing a machine-readable JSON/YAML trace message for that purpose.

This will facilitate hot-reloading a node's tracer configuration in the future: users will be able to take such a trace message, apply their intended change and hot-reload it immediately into the node.
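The intended workflow (take the trace message, apply a change, feed it back) could look roughly like the sketch below. The trace message is still being designed, so the JSON layout here (an `options` object keyed by namespace, with a `severity` field) and the helper name `apply_change` are purely hypothetical placeholders.

```python
import json


def apply_change(message_json, namespace, **settings):
    """Return a new JSON document with `settings` merged into one namespace entry.

    The "options" key and the per-namespace layout are assumptions for
    illustration only; they do not describe the real trace message.
    """
    config = json.loads(message_json)
    config.setdefault("options", {}).setdefault(namespace, {}).update(settings)
    return json.dumps(config, indent=2, sort_keys=True)


# A hypothetical configuration as it might be emitted by a running node.
traced = json.dumps({"options": {"ChainDB": {"severity": "Info"}}})

if __name__ == "__main__":
    # The result would be what a user hands back to the node for hot-reloading.
    print(apply_change(traced, "ChainDB", severity="Debug"))
```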

Nomad backend

As with the existing benchmarking cluster, nomad is currently under scrutiny with regard to the reliability of metrics it produces, as well as the behaviour of its OS-level network stack. For instance, differing kernel versions can have an impact on our measurements, as we'd be basically using two different instruments to take them.

Along the way we've already been successful in eliminating some possible confounders that had been introduced by the nomad service or the slightly different system architecture of the new cluster.

New team member

Baldur Blöndal is an extremely capable and experienced Haskell developer. Also, he's an excellent fit for our existing team. So I'm very pleased to welcome him onboard with IOG, and with Performance & Tracing. He will be working on cardano-tracer, the component receiving, processing and making available node traces and metrics.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: We've performed and analysed feature benchmarks for both UTxO-HD and the current P2P stack.
  • Infrastructure: Various improvements of our analysis pipeline have been merged to master, supporting safe log truncation.
  • Tracing: Namespace consistency checks have been merged to master along with a curated configuration for benchmarking.
  • Nomad backend: We're productively using the new backend to measure new vs. legacy tracing system, adding many quality-of-life improvements.

Low level overview

Benchmarking

We've completed various runs and analyses targeting two distinct features of the node: UTxO-HD and Peer2Peer.

With our UTxO-HD benchmark we could clearly localize one point where this new way of maintaining ledger state is still costly, but at the same time confirm that in basically all other aspects UTxO-HD makes no difference in performance.

The Peer2Peer benchmarks focused on the effects that enabling this feature on a block producing node has on propagation times, and also scrutinized a proposed change to the Peer2Peer network stack.

Infrastructure

As a result of optimizing in-memory representation of log objects, which are constructed from a node's traces, we can now analyse runs that last longer in total. For runs that exceed their expected duration, analysis now supports a truncation operation that keeps the interdependencies of block events intact.

Truncation might happen at a slightly different point in time - and therefore in its log object stream - for each node in the cluster. An additional step validating the block hash timeline of the cluster has been implemented for the pipeline. It provides early feedback on whether a specific truncation will lead to a valid full analysis, which requires much more time.
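One way to picture such a check is sketched below, under a simplifying assumption: a truncation is considered usable only if every block that some node adopted in the retained data was also forged by some node in the retained data. The event shape (`kind`, `hash`) and the function name are hypothetical and are not the pipeline's actual data model or validation rule.

```python
def dangling_blocks(streams):
    """Return, sorted, the block hashes adopted somewhere but forged nowhere.

    `streams` maps a node name to its truncated list of events. Each event is
    a dict with a "kind" ("forged" or "adopted") and a block "hash". This
    layout is an illustrative assumption.
    """
    forged = set()
    adopted = set()
    for events in streams.values():
        for event in events:
            if event["kind"] == "forged":
                forged.add(event["hash"])
            elif event["kind"] == "adopted":
                adopted.add(event["hash"])
    return sorted(adopted - forged)


# Hypothetical truncated streams: the forging event for block "b2" was cut
# away on the node that produced it, so "b2" is left dangling.
streams = {
    "node-0": [{"kind": "forged", "hash": "a1"}, {"kind": "adopted", "hash": "b2"}],
    "node-1": [{"kind": "adopted", "hash": "a1"}],
}

if __name__ == "__main__":
    print(dangling_blocks(streams))
```

A cheap set comparison of this kind can run before the full analysis starts, which is where the early feedback comes from.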

Tracing

Consistency checking of namespace implementation and configuration when using the new system has been completed. This feature enables feedback on when tracer implementation details in some component might have changed. It's also able to detect when a configuration used for operating a cardano-node shows inconsistencies with the namespaces the system provides - and hence needs attention.
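The configuration side of such a check can be sketched as follows. The sketch assumes namespaces are dot-separated paths and that a configuration entry is valid if it equals, or is a leading part of, some namespace the system provides (with the empty string standing for the root). The namespace names and the function name are hypothetical examples and not the real implementation.

```python
def unknown_config_namespaces(provided, configured):
    """Return, sorted, the configured namespaces that match nothing provided."""

    def matches(entry):
        if entry == "":
            return True  # assumed convention: the empty entry addresses the root
        parts = entry.split(".")
        # Compare whole path segments, so "Chain" does not match "ChainDB".
        return any(ns.split(".")[: len(parts)] == parts for ns in provided)

    return sorted(entry for entry in configured if not matches(entry))


# Hypothetical namespaces provided by the tracers, and a configuration
# containing one typo ("ChainDb") and one stale entry ("Mempool").
provided = {"ChainDB.AddBlockEvent", "Forge.Loop"}
configured = ["", "ChainDB", "Forge.Loop", "ChainDb.AddBlockEvent", "Mempool"]

if __name__ == "__main__":
    print(unknown_config_namespaces(provided, configured))
```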

Furthermore, we've created a fine-grained configuration of the new system that caters to benchmarking's need for a large number of detailed trace messages. It aims to mirror the amount of trace messages, and information, that we're seeing from our usage of the legacy system; this is an important step in making benchmarks of the two systems comparable.

Nomad backend

The new backend is currently being used for further validation against the existing cluster. Additionally, we're using it in production mode to comparatively benchmark both tracing systems after merging last month's optimizations - which is the first real-life application of the nomad cluster. Hands-on experience in that phase translates into many small improvements which can be immediately applied to enhance user experience for the new backend.

· 2 min read
Michael Karg

High level summary

  • Benchmarking: Benchmarking node version 8.2.1 has concluded. Additionally we're developing benchmarking setups for GHC9.6 and UTxO-HD.
  • Infrastructure: Our analysis pipeline has received improvements reducing memory footprint.
  • Tracing: Another batch of optimizations for the new tracing system has been merged; work on namespace consistency guarantees is ongoing.
  • Nomad backend: We're performing and analysing various runs for validation purposes on the new hardware cluster.

Low level overview

Benchmarking

We've performed and analysed the benchmarks for the 8.2.1 version of cardano-node as part of our release benchmarking cycle.

Setting up cluster benchmarks requires completing full system integration. This applies both to supporting a new build platform, such as GHC9.6, and to targeting a specific feature, like a UTxO-HD enabled node. Currently, we're working on the respective integrations on both those paths.

Infrastructure

As cluster runs increase in duration, more and more data is accumulated for analysis. Batch analysis mode needs all data to be held in memory, and that data no longer fit even on a machine with 64GB of RAM. Changes that make the in-memory data representation more compact have reduced the RAM requirements of our analysis pipeline.

Tracing

The next portion of optimizations has been completed and merged to master, getting rid of Haskell's native String representation on critical code paths. This concludes the optimization phase of the new tracing system for all its components used by cardano-node.

The implementation for validation of consistent naming and configuration is ongoing. We're splitting out everything that's verifiable at compile time into a separate test case which we hope to integrate into CI - leaving only configuration constraints to be verified at or before node startup.

Nomad backend

The verification phase of the nomad cloud backend is ongoing. We're able to perform full runs on the new hardware cluster and are porting profiles and configurations from the legacy one. The goal is to reproduce with confidence known regressions, or improvements, between runs performed on the legacy cluster and runs performed with the new backend.

· 2 min read
Michael Karg

High level summary

  • Benchmarking: We've concluded benchmarking node version 8.2.0.
  • Tracing: Optimization of the new tracing system has been merged; we're currently working on self-documenting tracing configuration.
  • Nomad backend: A PR that makes our backend take advantage of added flexibility of the new hardware cluster has been merged.

Low level overview

Benchmarking

As part of our release benchmarking cycle, we've completed and analysed the runs for the 8.2.0 version of cardano-node. In addition to adjusting the sanity checks in our automation, we had to implement small changes in the analysis pipeline to accommodate the new version.

Tracing

A significant number of optimizations for the new tracing system have finally been merged to master. At the moment, we're working on having a trace message self-document the final tracing configuration of a running node. Apart from adding insight into the system, this feature also aims at making future hot reloading of tracing configuration explicit and straightforward.

Furthermore, we're setting up a final round of system integration level benchmarks comparing new against legacy tracing.

Nomad backend

The new hardware cluster permits greater flexibility as far as SSH access is concerned. By using nomad for a consistent and reliable deployment, but taking advantage of direct connections for healthchecks and data transfer, we believe we were able to reduce overall network latency in the nomad cluster. This improves confidence when capturing all network related measurements during our benchmarks.

A PR that adds these capabilities to our nomad backend - along with many quality-of-life improvements - has been merged to master.

· 2 min read
Michael Karg

High level summary

  • Benchmarking: We're adjusting the benchmarking cluster to handle runs for node version 8.2.0.
  • Tracing: We've finished optimization of the new tracing system and added extra robustness with regard to namespacing.
  • Infrastructure: We've been working on making all benchmarking code compliant with the latest GHC9.6.
  • Nomad backend: The new backend has seen adjustments due to a change of underlying hardware. Additionally, we've successfully performed various benchmarking runs on it.

Low level overview

Benchmarking

The 8.2.0 version of cardano-node required adjustment of some of the sanity checks that are part of our benchmarking cluster automation. We've pinpointed the necessary changes and are currently setting up the cluster for the new node version.

Tracing

The optimization efforts for the new tracing system have been completed and have significantly reduced the resource footprint when using it as default for a running node.

A linchpin of the new system is the organization of traces into a namespace hierarchy. This affects configuration, self-documentation as well as rendering of desired trace messages. The new system is now equipped to detect any inconsistency in the whole set of tracers, defined across all components, even if they are never turned on in a running node. This feature adds another layer of robustness to the whole system.

Infrastructure

A potential switch to GHC9.6 (or higher) required some work on our code bases to make them compliant with recent compiler versions. With that work, we've future-proofed our benchmarking code.

Nomad backend

The hardware cluster that our nomad backend was accessing has been changed, and we were able to adjust our backend accordingly without touching its higher level abstractions and functionality. Moreover, with the new hardware and cluster setup, certain tasks such as retrieving run artifacts or healthcheck monitoring have become more performant.

The validation phase is ongoing. We were able to perform successful runs and analyses for various 8.x node versions, including 8.2.0-pre. With parallel runs on the current cluster, we hope to measure the same effects we've observed with the nomad backend - which will be a big step towards production use.