
49 posts tagged with "performance-tracing"


· 3 min read
Michael Karg

High level summary

  • Benchmarking: We've performed and analysed the first benchmarks with GHC9.2 builds. Additionally, we've developed an early indicator of how build config changes might be reflected in metrics from our model cluster.
  • New tracing: Collaboration with Galois has equipped the new tracing system with a re-forwarding mechanism.
  • Nomad backend: Porting the 52-node model cluster to nomad cloud is ongoing, with a focus on deployment and health checks.

Low level overview

Benchmarking

The first set of runs with GHC9.2 as a build platform are in. We've discovered a significant difference in the resource usage profile compared to GHC8.10. Further investigation uncovered the need to benchmark another parameter change in the build configuration: as it stands, the ghc-bignum package is using the Haskell native backend by default. We aim to benchmark a build with the gmp backend next.

A variant of our forge-stress local benchmark has been set up to serve as an early indicator of the resource usage profile we'd expect to observe on the model cluster. This provides us with a much tighter feedback loop, as local run duration is far shorter. The indicator is specific to changes in the build configuration and the runtime system, and will be of great support when evaluating different compiler versions or RTS flags incrementally.

Tracing

The hub of the new tracing system, cardano-tracer, was designed with a fixed output behaviour, limited to various logging options. Thanks to the contribution from Galois, that design has now been extended so that all traces from the node, or a pre-filtered portion of them, can be re-forwarded in a configurable manner. This will enable downstream applications to directly receive the set of trace values relevant to their logic, without any additional cost to the node itself.
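To illustrate the idea, here is a minimal sketch of pre-filtering traces by namespace prefix before re-forwarding. The type and function names are hypothetical and do not reflect cardano-tracer's actual API.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Text (Text)
import qualified Data.Text as T

-- Simplified stand-in for a forwarded trace message; the real
-- cardano-tracer types differ.
data TraceMessage = TraceMessage
  { tmNamespace :: Text   -- e.g. "ChainDB.AddBlockEvent"
  , tmPayload   :: Text
  } deriving Show

-- Keep only traces whose namespace matches one of the configured prefixes.
reforwardFilter :: [Text] -> [TraceMessage] -> [TraceMessage]
reforwardFilter prefixes =
  filter (\msg -> any (`T.isPrefixOf` tmNamespace msg) prefixes)

main :: IO ()
main = mapM_ print $
  reforwardFilter ["ChainDB"]
    [ TraceMessage "ChainDB.AddBlockEvent" "block added"
    , TraceMessage "Mempool.AddedTx"       "tx added"
    ]
```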

Nomad backend

We're currently working out the details of efficiently deploying and monitoring a fleet of 50+ nodes, along with job definitions for tracing and transaction generation. Scaling up to that many instances, and monitoring an ongoing benchmarking run, required us to fine-tune communication with the nomad server.

Related to that, the new cloud backend will provide a monitoring and health-checking mechanism which is far more flexible and offers more detailed insight than the previous iteration in cardano-ops. The backend will enable you to formulate very specific conditions for an ongoing run to be considered healthy, and offer automation of certain actions should these conditions not be met.
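As a rough sketch of what such health conditions could look like, consider the following. All types and names here are hypothetical; the backend's actual implementation differs.

```haskell
-- Hypothetical sketch of run-health conditions; the actual nomad backend
-- encodes these differently.
data NodeStatus = NodeStatus
  { nsSlotHeight   :: Int    -- latest slot seen by the node
  , nsBlocksMinted :: Int
  , nsAlive        :: Bool
  } deriving Show

data HealthCondition
  = MinSlotHeight Int        -- every node must have reached this slot
  | MustBeAlive              -- every node process must still be running
  | MinBlocksMinted Int      -- forgers must have minted at least this many blocks

holds :: HealthCondition -> NodeStatus -> Bool
holds (MinSlotHeight s)   st = nsSlotHeight st >= s
holds MustBeAlive         st = nsAlive st
holds (MinBlocksMinted n) st = nsBlocksMinted st >= n

-- A run is considered healthy while all conditions hold for all nodes.
runHealthy :: [HealthCondition] -> [NodeStatus] -> Bool
runHealthy conds nodes = and [ holds c n | c <- conds, n <- nodes ]

main :: IO ()
main = print $ runHealthy
  [MinSlotHeight 100, MustBeAlive]
  [NodeStatus 120 3 True, NodeStatus 95 0 True]   -- False: the second node lags behind
```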

· 4 min read
Michael Karg

High level summary

  • Benchmarking: We're preparing our model cluster to perform GHC9.2 benchmarks, as well as experimenting with increased dataset sizes.
  • New tracing: After optimization work on the new tracing system, another cycle of validation and documentation is due.
  • Analysis pipeline: First steps on implementing incremental analysis have been undertaken.
  • Open Sourcing: Exhaustive dataflow charts for both our analysis tool locli and our workbench have been merged to master.
  • Nomad backend: The first set of CI-centric workbench profiles have been adjusted and run on the nomad backend; currently we're porting the definition of our model cluster.
  • P&T Meetup: We had a very productive personal meetup in Lugano, Switzerland.
  • Offboarding: Sadly, we have to say goodbye to our team lead. Currently, we're busy with the handover.

Low level overview

Benchmarking

As a compiler switch to GHC 9.2.7 for cardano-node's default build environment is around the corner, we're setting up our benchmarking cluster to handle the new version. Special attention is given to the fact that we might need more flexibility in switching compiler versions in the future. This also involves choosing a reliable baseline as reference point for inter-version comparisons.

Additionally, we've been working on refining our model cluster: by increasing UTxO and delegation map sizes to more closely match those of current mainnet, we strive to have a more accurate model - and thus be able to make more detailed predictions regarding performance. However, this still needs to be balanced against the resource demand of all our cluster's nodes.

Tracing

For our new tracing system, we're currently validating the behaviour of the system after the optimizations have been applied. Furthermore, some quality-of-life details that have changed required us to revise the system documentation.

Analysis

As a mid-term goal, we aim to provide incremental analysis of our benchmarking metrics. While we can currently only reliably process runs that have terminated (normally or abnormally), we see the possibility of incrementally analysing ongoing runs, or any data source yielding our key metrics, as a huge opportunity to increase our operational flexibility. All in all, this approach entails building completely new features for our pipeline. A first effort to accommodate incrementally incoming data points has been undertaken.
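To sketch what "incremental" means here (this is not the actual locli implementation), summary statistics can be folded over data points as they arrive, in the spirit of Welford's online algorithm, without ever revisiting earlier points:

```haskell
-- Minimal sketch of online (incremental) statistics; not the actual locli code.
data Running = Running
  { rCount :: !Int
  , rMean  :: !Double
  , rM2    :: !Double   -- sum of squared deviations from the running mean
  } deriving Show

emptyStats :: Running
emptyStats = Running 0 0 0

-- Fold in one new data point without revisiting earlier ones.
step :: Running -> Double -> Running
step (Running n mean m2) x =
  let n'    = n + 1
      delta = x - mean
      mean' = mean + delta / fromIntegral n'
      m2'   = m2 + delta * (x - mean')
  in Running n' mean' m2'

variance :: Running -> Double
variance r
  | rCount r < 2 = 0
  | otherwise    = rM2 r / fromIntegral (rCount r - 1)

main :: IO ()
main = do
  -- e.g. block adoption times (seconds) arriving one by one
  let r = foldl step emptyStats [0.82, 0.91, 0.78, 1.10]
  print (rMean r, variance r)
```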

Open Sourcing

A very involved and exhaustive documentation and visualization effort has been undertaken to make the data flow through our key benchmarking components more accessible. As a result, detailed charts for both our LogObject CLI locli and our workbench have been merged to master.

Nomad backend

While our Nomad backend is reaching completion, and hardware setup is being implemented in collaboration with SRE, we've been adjusting those profiles of our workbench that target CI-oriented workloads to the new backend. Those profiles should demonstrate the full functionality of the nomad cloud backend.

Additionally, we're porting a first deployable version of our model cluster to nomad cloud, which will form the basis for validation of our actual key metrics with regard to those from the existing cluster.

Performance & Tracing Meetup

We held a personal team meetup in Lugano, Switzerland. In an amazing location, and with a great seminar room to focus in, we had two very productive days together. Being able to discuss live and in colour, we could effectively synchronize on where the team is at, and how we want to develop in the future. Also, it was a great opportunity to finally meet in person.

Offboarding

Last but not least, we regret that our team lead is leaving at the end of May. Currently, he's handing over all his obligations, which requires reorganising the team structure and the responsibilities of team members for specific tasks. Serge, we all want to thank you for your excellent and reliable lead; we very much enjoyed the time with you, and wish you all the best for your future endeavours!

· 2 min read
Michael Karg
  • Benchmarking: The benchmarks and performance investigations for the new 8.0 release branch are ongoing.
  • New tracing: Performance optimization of the new tracing system is paying off, and we were able to notably shrink its resource footprint.
  • Analysis pipeline: An exhaustive documentation and dataflow diagram for our analyses is being worked on.
  • Infrastructure: The plutus-apps flake input for cardano-node has finally been removed.
  • Nomad backend: A PR implementing placement of benchmarking clusters has been merged.

Benchmarking

The performance investigations on the 8.0 release branch have led to pinpointing and addressing inconsistent behaviour. For that, we created yet another local reproduction with the workbench's forge-stress benchmark.

Currently we're working on scaling up the dataset size (UTxO and delegations) on the AWS cluster to gain further insight into 8.0 and subsequent releases.

Additionally, we've refined the trace-bench family of profiles that target benchmarking our own new tracing system.

Tracing

Optimization of the tracing system has identified several locations where inefficient serializations were used; those were not originally intended to run on a performance-critical codepath. We've worked on improving those, as well as eliminating cases of redundant conversion between different serialization formats. This has brought down both memory and CPU impact of the tracing system.
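A minimal illustration of the kind of fix involved (this is not the actual node code): assembling the final value in one pass via a Builder, instead of round-tripping through intermediate String values and repeated concatenation.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as TB
import qualified Data.Text.Lazy.Builder.Int as TB

-- Wasteful: every field is converted through String before concatenation.
renderSlow :: Text -> Int -> Text
renderSlow ns slot =
  T.pack (T.unpack ns ++ ": slot " ++ show slot)

-- Better: assemble once via a Builder and convert to Text at the end.
renderFast :: Text -> Int -> Text
renderFast ns slot =
  TL.toStrict . TB.toLazyText $
    TB.fromText ns <> ": slot " <> TB.decimal slot

main :: IO ()
main = do
  print (renderSlow "ChainDB.AddBlockEvent" 4242)
  print (renderFast "ChainDB.AddBlockEvent" 4242)
```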

Infrastructure & Analysis

Dataflow documentation

The LogObject CLI locli is at the heart of our analysis and reporting pipeline. To increase its accessibility and facilitate further development, we're creating a detailed and illustrated documentation of all dataflows that happen during analysis and reporting.

Remove redundant Plutus flake input

This step is the conclusion of porting Plutus benchmarking scripts to our own library. By finally removing the now unnecessary flake input, we simplify the dependency graph for cardano-node, as well as enable immediate feedback when developing Plutus benchmarks.

Nomad backend

Sophisticated placement of nodes across various regions of the globe is a cornerstone of the model cluster we use for benchmarking. This capability has now been added to the Nomad backend and can be controlled with Nomad job descriptions. A PR with this, along with various quality-of-life improvements, has been merged to master.

· 3 min read
Michael Karg
  • Benchmarking: We performed a series of benchmarks aimed at the new 8.0 release branch and built a timeline from the 1.35 releases to that branch.
  • New tracing: Work on safeguarding the performance of the new tracing system is ongoing. A practical use case for data points is being tackled together with Galois.
  • Analysis pipeline: We're working on automatically obtaining a detailed manifest for each run.
  • Infrastructure: The library for benchmarking Plutus scripts has been merged. Also, we've laid the ground for including GHC profiling data into our workbench.
  • Nomad backend: The first iteration of a distributed / multi-client Nomad cluster has been merged.

Benchmarking

We have performed various cluster runs targeting the 8.0 release branch. That way we were able to catch an inconsistency in behaviour early on. This led to the creation of a specialized workbench profile epoch-transition for local reproduction of what we observed on the benchmarking cluster.

Furthermore, we bridged the gap between the run data from the 1.35.x releases and the new 8.0.x release branch. This included walking the master branch backwards and pinpointing the order, as well as the dates and commits, of all relevant component bumps. This timeline is absolutely crucial for locating possible regressions on the new release branch, as it provides the exact points in history we would need to target with a comprehensive set of benchmarks.

Tracing

In-depth performance analysis of the new tracing system has already yielded results and helped us smooth out some rough edges. However, this work is still ongoing.

In coordination with Galois, who are developing a system assurance service by observing a number of cardano-nodes, we're working on the implementation of data points which the node provides at runtime. While the view on data points has to be expressive enough for the external service, the computational burden inside the node needs to be kept to an absolute minimum. We're currently evaluating whether cardano-tracer could be extended with a richer feature set to that end.

Infrastructure & Analysis

Detailed manifest

A run manifest documents, among other things, the component dependencies that went into the specific build a run was performed with. These dependencies come from different package sources, have different versioning policies, and an identical package version might exhibit different performance characteristics depending on the exact commit used for the build. This manifest will greatly increase insight into where changes in measured behaviour might have originated, by making all component bumps visible and accessible.
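As an illustration of the kind of information such a manifest captures, here is a hypothetical shape of one entry. The field and type names are ours for illustration only and do not reflect the workbench's actual schema.

```haskell
import Data.Text (Text)

-- Hypothetical shape of one manifest entry; the workbench's actual schema differs.
data ComponentPin = ComponentPin
  { cpName    :: Text   -- e.g. "ouroboros-network"
  , cpVersion :: Text   -- package version as resolved for the build
  , cpCommit  :: Text   -- exact commit the build used
  , cpSource  :: Text   -- package source the dependency came from
  } deriving Show

-- A run manifest is, at its core, the node commit plus the full list of such pins.
data RunManifest = RunManifest
  { rmNodeCommit :: Text
  , rmComponents :: [ComponentPin]
  } deriving Show
```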

GHC profiling inside workbench

The workbench has been equipped with a new -profnix profile flavour. This enforces a -fprof-auto build for all node-related packages. The type of profiling data generated by the GHC runtime can be customized and will enter statistical analysis. The relevant PR for this new feature has already been merged to master.
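For context on what -fprof-auto does (standard GHC behaviour, independent of our workbench): it inserts automatic cost centres on all top-level bindings, so the profiling report breaks runtime and allocation down by function. Explicit cost centres can additionally be named via SCC pragmas, as in this small sketch:

```haskell
-- Automatic cost centres from -fprof-auto cover all top-level bindings;
-- an explicit SCC annotation can still name a hot spot directly.
expensive :: Int -> Int
expensive n = {-# SCC "expensive_loop" #-} sum [ i * i | i <- [1 .. n] ]

main :: IO ()
main = print (expensive 1000000)

-- Built with profiling enabled, running the binary with `+RTS -p -RTS`
-- produces a .prof report broken down by cost centre.
```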

Nomad backend

The added feature for a multi-client Nomad cluster greatly enhances how jobs are organized by the backend and mapped within specific instances. This results in great maintainability while not giving up on flexibility. However, work on that feature is still ongoing.

· 3 min read
Michael Karg
  • Benchmarking: We worked on adjusting our infrastructure to the new 8.0 release branch and performed a (very) early run.
  • New tracing: We're profiling the new tracing system to minimize its resource footprint and guarantee high throughput.
  • Analysis pipeline: Variance analysis both for reporting and for serving as a point of comparison has been merged.
  • Infrastructure: A library for Plutus scripts will be integrated in our tooling and benchmarking profiles. Also, a profile family aimed at the tracing systems has been added.
  • Nomad backend: Various specializations of the backend are currently being implemented, along with streamlining credentials management.

Benchmarking

We have adapted our benchmarking cluster to the requirements of the 8.0 release branch. Testing runs of a very early feature branch for 8.0 helped us localize an important issue in collaboration with the other teams. We look forward to gathering preliminary metrics for 8.0 soon.

Tracing

Analysis of the resource usage profiles of both the legacy and the new tracing system, with and without trace forwarding, has led us to gather very detailed profiling data for each possible setup. This is to ensure we keep resource usage within the node to an absolute minimum, while still providing the highest possible throughput of data for forwarding to cardano-tracer.

Additionally, we've worked on a very practically-oriented document targeted at end users of the new tracing system. It provides tested step-by-step instructions for tunneling trace forwarding from a node to cardano-tracer via an easy-to-manage system service, which will match the production setup of most users.

Infrastructure & Analysis

General

Variance analysis as a full-fledged entity in our tooling has been merged. Not only is this type of analysis now part of our reporting pipeline - variance analysis can be fed back and serve as an additional point of comparison.
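As a sketch of the underlying idea (this is not locli's actual code), a metric can be summarised across several runs so that the spread itself becomes a point of comparison, e.g. via the coefficient of variation:

```haskell
-- Minimal sketch (not locli's actual implementation): summarise one metric
-- across several runs so that its spread can serve as a point of comparison.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

stdDev :: [Double] -> Double
stdDev xs =
  let m = mean xs
  in sqrt (sum [ (x - m) ^ (2 :: Int) | x <- xs ] / fromIntegral (length xs - 1))

-- Coefficient of variation: relative spread, comparable across different metrics.
coeffOfVariation :: [Double] -> Double
coeffOfVariation xs = stdDev xs / mean xs

main :: IO ()
main = do
  let blockAdoptionTimes = [0.82, 0.91, 0.78, 1.10]  -- one value per run
  print (mean blockAdoptionTimes, coeffOfVariation blockAdoptionTimes)
```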

Furthermore, we've created a profile family for the workbench that's specifically aimed at measuring and comparing tracing system configurations.

Plutus library

We opened a PR containing a new package for benchmarking - an extendable library that holds all Plutus scripts we use in our benchmarking profiles. In the future, this will enable us to iteratively customize any given script, and the way it is called, in the context of a specific profile. It is a refinement of the current state of affairs, where we have additional build inputs solely to generate a static script file tied to an external commit.

Nomad backend

The nomad backend is being specialized in three ways: using a podman driver locally, using nomad agents that support nix installables, and using nomad cloud agents. This provides a common surface independent of the actual backend driver being used. In addition, vault retrieval and management of cloud access credentials are being improved to minimize any friction for the backend user.