Skip to main content

37 posts tagged with "performance-tracing"

View All Tags

· 3 min read
Michael Karg

High level summary

  • Benchmarking: The peformance investigation into the compiler switch to GHC9 is ongoing. Additionally, a roadmap for implementing Consensus QTAs has been developed.
  • Infrastructure: Our workbench has undergone some refactoring to seamlessly integrate its profiles into all available backends.
  • Tracing: Optimization of the new tracing system is ongoing and yielding good performance results.
  • Nomad backend: We developed a new feature for the nomad backend which allows pinning deployments to specific machines.

Low level overview

Benchmarking

Our analysis of the GHC9 build of cardano-node has produced several locations in the code base where the new compiler seems to miss opportunities for optimization. Our hypothesis is, that those can account for the difference in resource usage we observe when benchmarking with a full cluster run. Instructing the compiler on how to perform the optimizations which GHC8 apparently applied out of the box requires further investigation.

In an effort to define Quantitative Timeliness Agreements (QTAs) on a per-component basis, we have coordinated with the Consensus team and developed a roadmap for providing those on consensus level. Making use of the insight that system-level benchmarks allow, we intend to set up and calibrate a benchmark that can reliably predict a regression or optimization for select metrics before needing full integration into cardano-node. This will help tremendously in various ways: catching regressions much earlier, localizing them much easier, avoiding repeated component integration and much shorter feedback cycle.

Infrastructure

We have worked on seamless integration of our benchmarking profiles into the many available backends that the workbench provides. The goal was to be backend-agnostic, to guarantee that all benchmarking run artifacts be structurally identical as far as their file name, format and location are concerned. This lead to refactoring work and has already landed in master.

Tracing

Much effort went into further optimization of the new tracing system. After working on configuration to align both new and legacy tracing system with regard to their trace frequencies, we could uncover some increase in resource usage. This occurred for corner cases under very heavy load. These cases have been addressed already, and do now surpass the legacy tracing system in terms of performance.

Nomad backend

For reliable benchmarking results it is vital to introduce as few confounding factors as possible when performing runs. This includes hardware and network topology. The nomad backend has been outfitted with a mechanism to pin the nomad job for some node in our benhcmarking cluster to a specific machine instance. This greatly increases confidence in the metrics observed from a run.

Furthermore this feature will detect any change in the underlying hardware or topology so it can be taken into account. The new feature has been merged to master.

· 2 min read
Michael Karg

High level summary

  • Benchmarking: We've performed several new benchmarks and a performance investigation in preparation of switching the default compiler to GHC9.
  • Infrastructure: The first batch of refactoring and documentation for our tx-generator has been merged to master.
  • Tracing: We've looked into an issue where the tracing system's concurrency could prevent a graceful node shutdown.
  • Nomad backend: Our new cloud backend has seen various improvements regarding deployment and monitoring; validation runs for the backend are ongoing.

Low level overview

Benchmarking

The compiler switch to GHC9 as the default build platform for cardano-node and its components still has noticeable effects on system-wide performance metrics. An investigation into the different resource usage profiles of compiler versions does seem to indicate GHC9's significantly different inlining behaviour may produce those effects. We're currently locating the specific places in component code that have the most extensive effect in that regard.

Using the forge-stress approximation we set up, we could determine that above effect is not due to a range of RTS parameters, as for example the number of capabilites used by the node.

Infrastructure

The tx-generator is a crucial part of our tooling responsible for producing very specific workloads for our benchmarking cluster. In an effort to flesh out an API to make it reusable for more general use cases, a first set of refactorings has been merged to master. Additionally, this merge contained systematic documentation both for internal and for exposed areas of the code base.

Tracing

The tracing system's concurrency could under certain conditions prevent a graceful shutdown of the node. This issue did occur only after adding specific new traces on a development branch. We could localize and address that issue.

Nomad backend

With the data gathered from running the new nomad cloud backend, we've been able to address many, many small and medium-sized improvements. The deployment process has been restructured for better efficiency, and the healthcheck system could be fine-tuned to recognize severity of various conditions that might occur. Optimization of fetching all run data from the cloud for evaluation is in progress.

Additionally, we're continuing the new backend's validation by setting up test runs and looking into comparative analyses with metrics gathered from the current cluster backend.

· 2 min read
Michael Karg

High level summary

  • Benchmarking: We've continued release benchmarking and established a new baseline for 8.0.0.
  • New tracing: Our benchmarking profile for measuring new vs. legacy tracing performance has been refined.
  • Nomad backend: The healthcheck system for the the nomad cloud has been completed. We've performed the first full runs on the new backend.

Low level overview

Benchmarking

In our release benchmarking cycle, we established a new performance baseline for 8.0.0. Additionally, we've measured performance under various workloads for 8.1.1-pre; the results look promising and validate the optimization efforts done on several system components.

In the meantime, we've finalized a build plan with GHC9.2 that matches the current one with GHC8.10; a requirement for benchmarking as a large amount of differences in the dependency graph can confound the results for the application code proper.

Tracing

The legacy and the new tracing system differ fundamentally in design, implementation and handling. So for metrics to be meaningful in a comparison, benchmarking profiles have to be tuned such that not only log line frequency but frequency of specific trace messages are closely aligned. We've found that higher granularity in this regard was necessary, and done additional work on our dedicated profiles.

Additionally, we've had a first glance of what additional traces could be valuable in the context of benchmarking UTxO-HD.

Nomad backend

As the new backend's healthcheck system in its first iteration can now serve as a guardrail to ensure sanity of a full-length run, we've performed our first 52-node cluster runs on nomad cloud. We're currently smoothing the edges around cluster deployment, and analysing the metrics gathered from those runs.

This means the backend is entering validation phase, where we systematically compare all metrics taken from the new infrastructure to the existing ones, including determining reproducibility and variance.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: We've performed and analysed first benchmarks with GHC9.2 builds. Additionally we have developed an early indicator for how build config changes might reflect on metrics from our model cluster.
  • New tracing: Collaboration with Galois led to the new tracing system to be equipped with a re-forwarding mechanism.
  • Nomad backend: Porting the 52 node model cluster to nomad cloud is ongoing, with the focus on deployment and health checks.

Low level overview

Benchmarking

The first set of runs with GHC9.2 as a build platform are in. We've discovered a significant difference in resource profile usage compared to GHC8.10. Further investigation uncovered the need for benchmarking another parameter change in the build configuration: As it stands, the ghc-bignum package is using the Haskell native-backend as a default. We strive to benchmark a build with the gmp-backend next.

A variant of our forge-stress local benchmark has been set up to serve as an early indicator for the resource usage profile we'd expect to observe on the model cluster. This provides us with a much tighter feedback loop, as local run duration is way shorter. This indicator is specific to changes in the configuration of build and the runtime systems, and will be of great support when evaluating different compiler versions or RTS flags incrementally.

Tracing

The hub of the new tracing system cardano-tracer is designed with a fixed output behaviour, which is limited to various logging options. Thanks to the contribution from Galois, that design is now extended to be able to re-forward all, or a pre-filtered portion, of traces from the node in a configurable manner. This will enable downstream applications to directly receive the set of trace values relevant to their logic, without any additional cost for the node itself at all.

Nomad backend

We're currently working out the details of efficiently deploying and monitoring a fleet of 50+ nodes, along with job definitions for tracing and transaction generation. Scaling up to those many instances, and monitoring an ongoing benchmarking run required us to fine-tune communications with the nomad server.

Related to that, the new cloud backend will provide a monitoring and health-checking mechanism which is far more flexible and offers more detailed insight than the previous iteration in cardano-ops. The backend will enable you to formulate very specific conditions for an ongoing run to be considered healthy, and offer automation of certain actions should these conditions not be met.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: We're preparing our model cluster to perform GHC9.2 benchmarks, as well as experimenting with increased dataset sizes.
  • New tracing: After optimization work on the new tracing system, another cycle of validation and documentation is due.
  • Analysis pipeline: First steps on implementing incremental analysis have been untertaken.
  • Open Sourcing: Exhaustive dataflow charts for both our analysis tool locli and our workbanch have been merged to master
  • Nomad backend: The first set of CI-centric workbench profiles have been adjusted and run on the nomad backend; currently we're porting the definition of our model cluster.
  • P&T Meetup: We had a very productive personal meetup in Lugano, Switzerland.
  • Offboarding: Sadly, we have to say goodbye to our team lead. Currently, we're busy with the handover.

Low level overview

Benchmarking

As a compiler switch to GHC 9.2.7 for cardano-node's default build environment is around the corner, we're setting up our benchmarking cluster to handle the new version. Special attention is given to the fact that we might need more flexibility in switching compiler versions in the future. This also involves choosing a reliable baseline as reference point for inter-version comparisons.

Additionally we've been working on refining our model cluster: by increasing UTxO and delegation map sizes to closer match those of current mainnet, we strive to have a more accurate model - and thus be able to make more detailed predictions regarding performance. However, this still needs to be balanced against resource demand for all our cluster's nodes.

Tracing

For our new tracing system, we're currently validating the behaviour of the system after optimizations have been applied. Furthermore, some quality-of-life details that have changed required us to revision the system documentation.

Analysis

As a mid-term goal, we aim to provide incremental analysis of our benchmarking metrics. While currently, we can only reliably process runs that have been normally (or abnormally) terminated, we see the possibility of incrementally analysing ongoing runs, or any data source yielding our key metrics, as a huge opportunity to increase our operational flexibility. All in all, this approach entails building completely new features for our pipeline. A first effort to accomodate incrementally incoming data points has been undertaken.

Open Sourcing

A very involved and exhaustive documentation and visualization effort has been undertaken to make the data flow through our key benchmarking copmonents more accessible. As a result, detailed charts for both our LogObject CLI locli and our workbench have been merged to master.

Nomad backend

While our Nomad backend is reaching completion, and hardware setup is being implemented in collaboration with SRE, we've been adjusting those profiles of our workbench that target CI-oriented workloads to the new backend. Those profiles should demonstrate the full functionality of the nomad cloud backend.

Additionally, we're porting a first deployable version of our model cluster to nomad cloud, which will form the basis for validation of our actual key metrics with regard to those from the existing cluster.

Performance & Tracing Meetup

We held a personal team meetup in Lugano, Switzerland. In an amazing location, and with a great seminar room to focus, we had 2 very productive days together. Being able to discuss live and in colour, we could effectively synchronize on where the team is at, and how we want to develop in the future. Also, it was a great opportunity to finally meet in person.

Offboarding

Last not least we regret that our team lead is leaving at the end of May. Currently, he's handing over all his obligations, which requires reorganisation of team structure, and responsabilities of team members for specific tasks. Serge, we all want to thank you for your excellent and reliable lead; we very much enjoyed the time with you, and wish you all the best for your future endeavours!