
· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 9.1; UTxO-HD in-memory benchmarks; typed-protocols feature benchmarks.
  • Development: Devised a fix for resource trace emission of the CPU 85% spans metric. Governance action benchmarking still under development.
  • Workbench: Preparations for bumping nixpkgs. Started removal of the container-based podman backend. Support GHC9.8 nix shells.
  • Infrastructure: Test and validate an upcoming change in node-to-node submission protocol.
  • Tracing: cardano-tracer: support for non-systemd Linux was merged; safe restart of internal monitoring servers.

Low level overview

Benchmarking

We've run and analyzed a full set of release benchmarks for Node version 9.1. Comparing with the mainnet release 9.0, we could not observe any performance regression.

Additionally, we've performed feature benchmarks for an upcoming new API for typed-protocols. Those did not exhibit any regression either in comparison with the baseline using the current API.

Furthermore, we've performed various benchmarks for the UTxO-HD in-memory backend on Node versions 9.0 and 9.1. Based on those observations, we were able to eliminate a rare race condition where block producers on occasion failed to fork off a thread for the forging loop. The overall network performance of the UTxO-HD in-memory backend shows a slight improvement over the regular node, but currently comes with slightly increased RAM usage.

Development

We've spotted an inconsistency in one of our benchmarking metrics - CPU 85% spans - which measures the average number of consecutive slots where CPU usage spikes to 85% or higher (however short the spike itself might be). The values differed between the legacy tracing system (which yielded the correct value) and the new one; a fix for the latter has already been devised.
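As a minimal sketch of the metric's intent (not the node's actual implementation), assuming a list of per-slot CPU utilisation samples:

```haskell
import Data.List (group)

-- Average length of the maximal runs of consecutive slots whose CPU
-- utilisation reaches 85% or more; 0 if no such run exists.
cpu85Spans :: [Double] -> Double
cpu85Spans samples
  | null spans = 0
  | otherwise  = fromIntegral (sum spans) / fromIntegral (length spans)
  where
    spans = map length . filter and . group . map (>= 0.85) $ samples
```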

The implementation of Conway governance action workloads for benchmarking is ongoing.

Workbench

With a nixpkgs bump on the horizon, we're adjusting and testing our usage of packages that have changed status, lost support, or need to be pinned to a specific version for the workbench.

Additionally, we'll remove a container-based backend for workbench, which ties in OCI image usage on podman with Nomad. It was a precursor to the current Nomad backend, which is containerless and can directly build Nomad jobs using nix.

Last but not least, we've merged a small PR which enables our workbench to build nix shells with GHC9.8; these shells pull in not only the compiler, but much of the Haskell development toolchain. The correct version couplings between compiler and toolchain components are now declared explicitly for GHC8.10.7 up to GHC9.8.

Infrastructure

We've tested and validated an upcoming change in ouroboros-network which requires any node-to-node submission client to hold the connection for at least one minute before being able to submit transactions. The change works as expected and does not interfere with the special functionality required by benchmarking.
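A hypothetical client-side sketch of honouring that grace period (the wrapper and connection type are illustrative, not the actual ouroboros-network API):

```haskell
import Control.Concurrent (threadDelay)

-- Connect first, then wait out the minimum connection age before
-- submitting any transactions.
submitAfterGracePeriod :: IO conn -> (conn -> IO ()) -> IO ()
submitAfterGracePeriod connect submit = do
  conn <- connect
  threadDelay (60 * 1000000)  -- at least one minute, in microseconds
  submit conn
```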

Tracing

The trace consumer service for the new tracing system used to require systemd on Linux to build and operate. There are, however, Linux environments that choose not to use systemd. It is now possible to configure the desired flavour of that service, cardano-tracer, at build time, thus adding support for those environments - cardano-node#5021.
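A minimal sketch of how such a build-time flavour switch can look, assuming a Cabal flag that sets a SYSTEMD define; all names here are hypothetical:

```haskell
{-# LANGUAGE CPP #-}
module Main where

-- Select the trace sink at build time: journald on systemd builds,
-- plain stdout otherwise.
emitTrace :: String -> IO ()
#if defined(SYSTEMD)
emitTrace = putStrLn . ("journald: " ++)  -- stand-in for a journald binding
#else
emitTrace = putStrLn
#endif

main :: IO ()
main = emitTrace "cardano-tracer started"
```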

cardano-tracer consumes not just traces, but also metrics. With the new tracing system, running a metrics server shifts from the node to the consumer process. One possible setup in the new system is operating only one consumer service and connecting multiple nodes to it. In its current design, this requires safely shutting down and restarting the monitoring server whenever the metrics store of a different connected node is requested. We're currently battle-testing the built-in behaviour of ekg (the monitoring package in use) and exploring solutions in case it does not fully meet requirements.
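For illustration, a sketch of such a restart using ekg's API (the port is arbitrary, and whether killing the server thread reliably frees the socket is exactly the behaviour under test):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent (killThread)
import System.Metrics (Store)
import System.Remote.Monitoring (Server, forkServerWith, serverThreadId)

-- Stop the running ekg server, then bring it up again bound to the
-- metrics store of the newly requested node.
restartMonitoring :: Server -> Store -> IO Server
restartMonitoring running store = do
  killThread (serverThreadId running)
  forkServerWith store "127.0.0.1" 12789
```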

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 9.0; Plutus execution budget scaling benchmarks.
  • Development: Improved shutdown behaviour for tx-generator merged to master. Work on governance action benchmarking workload is ongoing.
  • Workbench: Haskell profile content definition merged to master.
  • Tracing: Factoring out RTView was merged to master. Work on metrics naming ongoing, minimizing migration effort.
  • Consensus QTAs: Design for automating and data warehousing beacon measurements is complete.

Low level overview

Benchmarking

Runs and analyses for a full set of release benchmarks have been performed - and published - for Node version 9.0.0. Comparing with the latest mainnet release 8.12.1, we could not observe any performance regression. 9.0.0 exhibits an improvement in Block Fetch duration, which results in slightly better overall network performance.

Additionally, we've performed scaling benchmarks of Plutus execution budgets. In this series of benchmarks, we measure the performance impact of changes to those budgets in the protocol parameters. Steps (CPU) and memory budgets are scaled independently of each other, and performance testing takes place using Plutus scripts that are either heavy on allocations but light on CPU, or vice versa. These performance tests are meant to explore the headroom of those budgets, taking into account cost model changes and recent optimization capabilities of the Plutus compiler.
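Conceptually, each run scales one budget axis while holding the other fixed; a simple sketch (the types are illustrative, not the actual protocol-parameter representation):

```haskell
-- Per-transaction execution budget: CPU steps and memory units.
data ExUnits = ExUnits { exSteps :: Integer, exMem :: Integer }
  deriving Show

-- Scale the two budget axes independently; e.g. scaleBudget 1.5 1.0
-- raises the steps budget by 50% while leaving memory untouched.
scaleBudget :: Rational -> Rational -> ExUnits -> ExUnits
scaleBudget stepF memF (ExUnits s m) =
  ExUnits (floor (fromInteger s * stepF)) (floor (fromInteger m * memF))
```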

Development

Our workload submission service tx-generator has been equipped with the ability to handle POSIX signals for graceful shutdown scenarios. Furthermore, as it is highly concurrent, error reporting on a per-thread basis has been added, enhancing feedback from the service. Along with some quality-of-life improvements, these changes have landed in master.
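A minimal sketch of the signal-handling side, assuming a shutdown flag that the submission threads poll (names are illustrative, not tx-generator's actual code):

```haskell
import Control.Concurrent.MVar (MVar, newEmptyMVar, tryPutMVar)
import Control.Monad (void)
import System.Posix.Signals (Handler (Catch), installHandler, sigINT, sigTERM)

-- Install handlers for SIGINT/SIGTERM that set a shutdown flag instead
-- of killing the process outright, so threads can wind down gracefully.
installShutdownHandlers :: IO (MVar ())
installShutdownHandlers = do
  shutdownFlag <- newEmptyMVar
  let handler = Catch (void (tryPutMVar shutdownFlag ()))
  _ <- installHandler sigINT  handler Nothing
  _ <- installHandler sigTERM handler Nothing
  pure shutdownFlag
```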

The Conway governance action workloads for benchmarking have completed the design phase, and we've settled on an implementation plan. Implementation work itself has started.

Workbench

Generating the contents for any benchmarking profile has now become a dedicated Haskell tool, cardano-profile, which has landed in master. Adding type safety and a test suite to profile definitions is a major improvement over shell code that directly manipulates JSON objects. Furthermore, it makes reliably modifying, or creating, benchmarking profiles much more accessible to engineers outside of our team.
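To illustrate the gain over raw JSON manipulation in shell, a toy profile type (the fields are hypothetical, not cardano-profile's actual schema):

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson (ToJSON, encode)
import qualified Data.ByteString.Lazy.Char8 as BL
import GHC.Generics (Generic)

-- A typed profile: the compiler rejects missing or misspelled fields,
-- and a test suite can exercise the generated JSON.
data Profile = Profile
  { profileName :: String
  , nodeCount   :: Int
  , txPerSecond :: Double
  } deriving (Generic, Show)

instance ToJSON Profile

main :: IO ()
main = BL.putStrLn (encode (Profile "value-only" 52 12.0))
```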

Tracing

With factoring out RTView, and making it an opt-in component of the cardano-tracer build, we've reduced the service's dependency graph significantly. Furthermore, the service has become more lightweight on resources. We'll continue to maintain RTView, and guarantee it will remain buildable and usable in the future.

Aligning metrics naming and semantics of new and legacy tracing is ongoing work. This task is part of a larger endeavour to minimize the effort necessary for users to migrate to the new system.

Consensus QTAs

beacon, which currently measures performance of certain ledger operations on actual workload fragments, is a first step in building a benchmarking framework based on Delta-Q system design and quantitative timeliness agreements. We've finished the design of how to automate those measurements at sensible points in time, and of a storage schema which will enable access and analysis that fits with the framework.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 8.12.0; DRep benchmarks with 100k DReps.
  • Development: Merged a performance fix on 8.11; kicked off development of governance action workload.
  • Workbench: Adjusted automations to latest 8.12 Conway features and Plutus cost model; implementation of CIP-69 and CIP-117 for our tooling is in validation phase.
  • Tracing: Work on metrics naming ongoing. Factoring out RTView component is completed and has entered testing.
  • IOI Tech Meetup: Our team contributed two presentations at the meetup in Zurich; worked on community report of UTxO scaling benchmarks.

Low level overview

Benchmarking

We've run and analyzed a full set of release benchmarks for Node version 8.12.0. In comparison with the latest mainnet release 8.9.3, we could not observe any regressions. In fact, 8.12.0 was able to deliver equal network performance at a slightly reduced resource cost - both for CPU and memory.

Another benchmark of the Conway ledger with large amounts of DReps has been performed. This time, 100,000 DReps were chosen - an amount that aims to simulate a scenario where lots of self-delegation takes place. While a performance impact is observable in this instance, we can still see that the number of DReps scales well overall, and poses no concern for network performance.

Development

We have contributed and merged a performance fix on 8.11 which addresses a regressing metric in the forging loop. The regression was only observable under specific conditions. Benchmarks on 8.12 have already confirmed the fix to be successful.

We've kicked off development of governance action workloads for benchmarking. This will be an entirely new workload type for the Conway era, targeting performance measurements of its decentralized decision-making process. The workload will feature registering proposals, acting as multiple DReps to vote on active proposals, vote tallying and proposal enactment. We're very grateful for the Ledger team's helpful support so far in creating a workload design for benchmarking - one that evenly stresses the network over extended periods of time.

Workbench

The workbench automations have been upgraded to handle Node 8.12 and the corresponding integrations of Cardano API and CLI.

Furthermore, we've updated to the latest PlutusV3 cost model in our benchmarks - and implemented CIP-69 and CIP-117 for all our PlutusV3 benchmarking scripts, pending validation by the Plutus team.

Tracing

The work on aligning metrics naming and semantics between the new and legacy tracing systems is ongoing. Additionally, we're adding a handful of metrics to the new tracing system which currently exist in legacy tracing only.

Factoring out the RTView ("real-time view") component of cardano-tracer in the new tracing system has finished. This involved a considerable refactoring of cardano-tracer's codebase, so we're currently running tests on the new code. Isolating RTView was motivated by the component remaining in prototype stage for too long, and by the design decisions taken during that stage. In the short term, this makes several package dependencies optional - dependencies which had become troublesome for CI - and makes cardano-tracer more lightweight. RTView remains available as an opt-in.

IOI Tech Meetup

Our entire team traveled to Zurich, Switzerland to attend ZuriHac'24 and the IOI Tech Meetup. It was fantastic to meet everyone in person, and we all had an amazing and very productive time. A big Thank You to everyone involved in making that happen, and making it a success.

We contributed two presentations at the meetup: firstly, a thorough introduction to the new tracing system aimed at developers - as it's not tailored exclusively to cardano-node, but can be used in other (Haskell) services as well. And secondly, an overview of the benchmarking framework based on Quantitative Timeliness Agreements which we're building - as well as a show-and-tell of our prototype implementing part of said framework. We're grateful for the great interest and feedback from all the participants.

Last but not least, we worked on creating a community report of the UTxO scaling benchmarks performed during March and April - to be released soon.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Node versions 8.9.3 and 8.11.0; new PlutusV3 plus additional DRep benchmarks; re-evaluation of network latency.
  • Development: BLST workload for PlutusV3 was implemented; improved error/shutdown behaviour for tx-generator is in testing phase.
  • Workbench: UTxO-HD tracer configs harmonized. New PlutusV3 profiles supporting experimental budgets. Work on Haskell profile definition is in validation phase.
  • Tracing: New metrics and handle registry feature merged to master. Work on metrics naming ongoing. Factoring out RTView component has begun.

Low level overview

Benchmarking

Runs and analyses of full sets of release benchmarks have been performed for Node versions 8.9.3 and 8.11.0.

To compare how the Conway ledger performs with large amounts of DReps and delegations injected versus zero DReps, we've run additional configurations with existing workloads from release benchmarking. So far we've found that the number of DReps in the ledger scales well and does not lead to notable performance penalties.

Additionally, we've successfully run the baseline for the upcoming PlutusV3 benchmarks on our Nomad cluster. Those will, given the new V3 cost model, serve to determine headroom, or constraint, regarding resource usage and network metrics when operating under various execution budgets.

Last but not least, with much appreciated support and feedback from the network team, we performed a re-evaluation of the network latency matrix for our benchmarking cluster. The cluster stretches over three regions globally. Due to unknown changes in the underlying hardware infrastructure, a slight delay between the Europe and Asia/Pacific regions could be measured. We needed to adjust some existing baselines accordingly - otherwise, this delay could have been falsely attributed to a software regression.

Development

We have implemented a benchmarking workload using PlutusV3's new BLST internals. As those do little memory allocation, but require more CPU steps, this workload will allow us to focus on that particular aspect of block and transaction budgets.

The tx-generator service will now label each submission thread with its submission target. Additionally, it has been equipped with custom signal handlers. This will improve both how gracefully shutdowns can be performed, and how precisely errors are reported when losing connection to a submission target. Last but not least, the service now supports a configurable KeepAlive timeout for the NodeToNode mini-protocol - accounting for very long major GC pauses on submission targets under very specific benchmarking workloads. These features have entered the testing phase.
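For the keep-alive part, a sketch of what a configurable timeout might look like in the generator's configuration (the record and field names are hypothetical):

```haskell
import Data.Time.Clock (DiffTime)

-- Generator-side connection settings: an extended keep-alive timeout
-- tolerates long major-GC pauses on a submission target without the
-- connection being torn down.
data SubmissionConfig = SubmissionConfig
  { submissionTargets :: [String]   -- host:port of each target node
  , keepAliveTimeout  :: DiffTime   -- e.g. secondsToDiffTime 60
  }
```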

Workbench

Thanks to feedback from the consensus team, we've harmonized tracing configurations for our benchmarks between regular and UTxO-HD node. As the latter is more verbose by default, this is a confounding factor for our metrics: We're analysing north of 90 traces per second per cluster node, so all node flavours are required to be equally verbose.

The benchmarks based on the BLST workload now additionally support scaling budget components up or down at will. This means we can run a given cost model against custom execution budgets, controlling the point where the workload will exhaust it. This enables comparison of performance impact of potential changes to those budgets.

Porting our performance workbench's profile definitions to Haskell is nearly complete, and an adequate test suite has been implemented. This new component has now entered the validation phase to make sure it correctly replicates all existing profile content.

Tracing

Two new metrics for cardano-node have landed in master - both for the new and legacy tracing systems. They provide detailed build info, and indicate whether the node is a block producer or not.

We're now working on closing the gap in the metric naming schema between new and legacy tracing. The aim is to allow for a seamless interchange, without additional configuration required, so that all existing monitoring services can rely on identical metric names with identical semantics.

Furthermore, work has begun to factor out the RTView ("real-time view") component of cardano-tracer in the new tracing system. Unfortunately, the component has remained in prototype stage for over a year, and has revealed some design shortcomings. Its aim is to provide an interactive, real-time dashboard based on metrics from all nodes connected to cardano-tracer. The current design has all front-end code baked into the backend service, requiring the entire service to be rebuilt in Haskell even for simple changes to the dashboard. We decided to isolate the component in the current code base, which still allows for optionally enabling it in a build. The long-term goal, however, is to convert it into a downstream service: it will ingest metrics by reforwarding, or by querying a REST API, and will provide a clear separation of front-end facing code. Thus we, and anybody else, can use our favourite web technology for metrics visualization.

· 5 min read
Michael Karg

High level summary

  • Benchmarking: We've performed and analysed benchmarks in the Conway era, with DReps injected.
  • Development: Tracing DRep data has been implemented; improved error reporting in tx-generator and analysis quick queries are ongoing work.
  • Workbench: The workbench now fully supports the new CLI create-testnet-data command and DRep injection into Conway genesis. Haskell profile definition work is ongoing.
  • Tracing: Various additions to Node metrics are being worked on, such as build info and block producer role. Metrics naming will be further harmonized.
  • UTxO Growth: We've finalized analysis and reports of all benchmarks targeting UTxO scaling scenarios.
  • UTxO-HD / LMDB: We've performed multiple runs benchmarking the LMDB (on-disk) backend of UTxO-HD.

Low level overview

Benchmarking

We've run and analyzed a full set of benchmarks comparing the Conway ledger against the Babbage one, on Node 8.10.1-pre. For Conway, our additional goal was to measure a vanilla ledger state against one with a large amount of DReps - and delegations to those DReps - present. The benchmarks used our existing value and Plutus workloads to remain comparable to each other.

Development

Additional ledger queries for the tracing system have been implemented and merged to master. Those capture the number of DReps, and the number of existing delegations to them, as trace output - thus enabling metrics to be created on top, which can then be monitored.

The (in our case) non-deterministic nature of shutting down different cluster setups - both local and cloud-based - carries the possibility that our transaction generation service occasionally misclassifies a regular shutdown as an error. Furthermore, in the case of network malfunctions, the service's errors are too unspecific. By implementing thread labels for submission threads, corresponding to each submission target, and by adding custom smart signal handlers, we'll improve the generator's error reporting significantly.
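Thread labelling itself is a one-liner with GHC's concurrency primitives; a sketch of the idea (the wrapper is illustrative):

```haskell
import Control.Concurrent (ThreadId, forkIO, myThreadId)
import GHC.Conc (labelThread)

-- Fork a submission thread labelled with its target, so that errors
-- and eventlog output can name the peer that caused them.
forkSubmissionThread :: String -> IO () -> IO ThreadId
forkSubmissionThread target worker =
  forkIO $ do
    tid <- myThreadId
    labelThread tid ("submit:" ++ target)
    worker
```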

The initial tests for quick queries are being developed further. We're moving towards a principled, and generalized, syntax that supports both prepared, parametrizable queries from the application code, as well as ad-hoc queries stated e.g. on the command line.
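One way to picture such a unified syntax (entirely hypothetical constructors, not the actual design under development):

```haskell
-- A small query AST that both application code and a command-line
-- parser could target; parameters stay explicit, so prepared queries
-- can be instantiated at run time.
data AggFn = Mean | Max | Percentile Double

data QuickQuery
  = Metric String                       -- select a raw metric by name
  | Filter (Double -> Bool) QuickQuery  -- keep matching samples
  | Agg AggFn QuickQuery                -- aggregate the remaining samples
```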

Workbench

The performance workbench now fully supports the new cardano-cli command create-testnet-data. We use it to inject both stake delegated to stake pools and - recently added - stake delegated to DReps into genesis. It has proven very useful and versatile so far, and will eventually replace the current create-staked command.

Work on porting our performance workbench's profile definitions to Haskell, and providing them with an appropriate test suite, is still ongoing; currently, we're integrating all new profile families that came out of the UTxO growth scenarios.

Tracing

New metrics are being implemented for the tracing system. They will also be part of Prometheus output and as such accessible to monitoring services. There'll be cardano-node's detailed build info, as well as a node's block producer status, meaning the presence of forger credentials. Those new metrics are being backported to the legacy tracing system, too.

Furthermore, we've determined the need to revisit metrics naming. There's still a divergence between naming in the legacy and the new system. While this could be mitigated by passing in extra config options, we think that a transition to the new system should not impose any unnecessary effort for node operators. A design to fully harmonize the existing naming schemata is currently being set up.

UTxO Growth

The UTxO Growth benchmarking series has been finalized. We've finished analyses and reports for all scenarios that were tested and explored.

The overarching questions were, given a network of 32GB host systems, how large can the UTxO set grow in general, how large can it grow before the nodes have to operate close to the RAM limit over extended periods of time, and how does scaling the UTxO set size affect network metrics, such as block diffusion.

A dedicated "UTxO Scaling Squad" was set up, which drove the entire process, and we enjoyed a very focused and productive collaboration with them.

UTxO-HD / LMDB

Last but not least, we were able to benchmark UTxO-HD's on-disk backend on a network of block-producing nodes, on a recent 8.9.1 version of cardano-node. The setup allowed for using a direct-access SSD device for performance-critical disk I/O, whereas the bulk of ChainDB and ledger snapshots remained on a standard AWS EBS volume.

The benchmarks comprised both optimistic and pessimistic RAM assumptions for the host OS to further optimize I/O via the page cache, as well as medium and large UTxO set sizes - the latter almost tripling current mainnet's size. The results were promising: the LMDB backend has proven able to accommodate large UTxO sets using significantly less RAM than the default all-in-memory node - and with a more than reasonable performance trade-off. Furthermore, running with pessimistic assumptions, the performance impact on LMDB was only very moderate.