
· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 9.1.1; additional UTxO-HD in-memory benchmarks.
  • Development: Created a local reproduction for observed UTxO-HD RAM increase.
  • Workbench: Created a new "age of Voltaire" performance baseline. Adjusted Nomad backend has entered testing phase.
  • Infrastructure: Dropping the requirement on Vault, optimizing cluster setup.
  • Tracing: New metrics naming schema was merged. Routing to internal monitoring servers is ongoing. Dropping dependency on HsOpenSSL.

Low level overview

Benchmarking

Runs and analyses for a full set of release benchmarks have been performed for Node version 9.1.1. In comparison with the mainnet releases 9.0 and 9.1.0, we could determine that this version does not exhibit any performance regression.

Having been provided with the patch by the Consensus team targeting the increased RAM usage of the UTxO-HD in-memory backend (read below), we've performed additional benchmarks to validate the desired result on the cluster. Our measurements demonstrate that the increased memory need has now vanished. We're confident that by now we've located - and addressed - all performance risks for UTxO-HD in-memory that we can capture given the instruments at our disposal. To gain further confidence in the stability of the resource usage patterns and network metrics observed on the benchmarking cluster, we've advised running UTxO-HD nodes over extended periods under close monitoring.

Development

We succeeded in creating a local reproduction of the increase in RAM usage that was observed for the UTxO-HD in-memory backend on the cluster. That reproduction enabled the Consensus team to inspect in real-time and profile running Node processes - which led to a swift identification of the underlying cause and a patch addressing it.

Workbench

After the smooth Chang hard fork, which transitioned Cardano into the Conway era, we've created - and merged - a new performance baseline. It's intended for release benchmarks and caters to the new features of the Conway ledger. Apart from incorporating the latest protocol version and Plutus cost models, it includes DRep presence in the ledger when performing measurements.

The PR preparing our workbench for a nixpkgs upgrade and removing the container-based Nomad / podman backend is complete and has entered testing phase.

Infrastructure

Currently, our Nomad cluster uses Vault to manage access and credentials for the benchmarking cluster. As the cluster relies exclusively on static routes and fixed deployment endpoints, encoding access as a set of rules in the cloud infrastructure is a viable option. That way, we will no longer depend on the Vault service, removing the requirement of hosting and maintaining an instance of it.

Tracing

Aligning the metrics naming schema and semantics between the new and legacy tracing systems has been completed and merged. This will enable a seamless interchange in the community, as all existing configurations of monitoring services retain their validity.

As for hosting multiple EKG metrics monitors in one single service application, we ascertained that the ekg package was not built for that use case. However, we've come up with a much nicer design for cardano-tracer using dynamic routing based on the names of nodes connected to it. It has successfully passed prototype stage in that it's able to serve multiple EKG monitors without the need for any server restart; the full implementation is being worked on.
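
To illustrate the routing idea - as a minimal sketch under our own assumptions, not the actual cardano-tracer code - the snippet below serves per-node metrics pages from a single wai / warp server, picking the page by the node name in the request path, so nodes can be added at runtime without restarting the server. The `MetricsStores` type and its contents are made up for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: dynamic routing of monitoring requests keyed on node name.
import           Control.Concurrent.STM (TVar, newTVarIO, readTVarIO)
import qualified Data.Map.Strict as Map
import           Data.Text (Text)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import           Network.HTTP.Types (status200, status404)
import           Network.Wai (Application, pathInfo, responseLBS)
import           Network.Wai.Handler.Warp (run)

-- Hypothetical store: node name -> rendered metrics page.
type MetricsStores = TVar (Map.Map Text TL.Text)

app :: MetricsStores -> Application
app stores req respond = do
  stores' <- readTVarIO stores
  case pathInfo req of
    -- first path segment selects the node's metrics page
    (node : _) | Just page <- Map.lookup node stores' ->
      respond $ responseLBS status200 [("Content-Type", "text/plain")] (TLE.encodeUtf8 page)
    _ ->
      respond $ responseLBS status404 [] "unknown node"

main :: IO ()
main = do
  stores <- newTVarIO (Map.fromList [("node-0", "metrics for node-0")])
  run 3000 (app stores)  -- one server process, many node-specific monitor pages
```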

Last but not least, both existing tracing systems rely on the snap server framework, and thus, by transitive dependency, on HsOpenSSL to speak the HTTPS protocol. However, we've determined the latter package to have a risk of breaking the build, both currently and in the future (cf. HsOpenSSL#95 and HsOpenSSL#88). As a consequence, a switch to the wai / warp based framework was decided, which implements HTTPS capability differently, thus preempting the risk. This has already been carried out for the legacy system and is currently underway for cardano-tracer - a big shoutout to Erik de Castro Lopo for his support on that issue.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 9.1; UTxO-HD in-memory benchmarks; typed-protocols feature benchmarks.
  • Development: Correct resource trace emission for CPU 85% spans metric. Governance action benchmarking still under development.
  • Workbench: Preparations for bumping nixpkgs. Started removal of the container-based podman backend. Support GHC9.8 nix shells.
  • Infrastructure: Test and validate an upcoming change in node-to-node submission protocol.
  • Tracing: cardano-tracer: Support of non-systemd Linux was merged; safe restart of internal monitoring servers.

Low level overview

Benchmarking

We've run and analyzed a full set of release benchmarks for Node version 9.1. Comparing with the mainnet release 9.0, we could not observe any performance regression.

Additionally, we've performed feature benchmarks for an upcoming new API for typed-protocols. Those did not exhibit any regression either in comparison with the baseline using the current API.

Furthermore, we've performed various benchmarks for the UTxO-HD in-memory backend on Node versions 9.0 and 9.1. Based on those observations, a rare race condition could be eliminated, where block producers on occasion failed to fork off a thread for the forging loop. The overall network performance of the UTxO-HD in-memory backend shows a slight improvement over the regular node, but currently comes with a slightly increased RAM usage.

Development

We've spotted an inconsistency in one of our benchmarking metrics - CPU 85% spans - which measures the average number of consecutive slots where CPU usage spikes to 85% or higher (however short the spike itself might be). There was a difference between the legacy tracing system (which yielded the correct value) and the new one, for which a fix has already been devised.
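
For illustration, here is one way such a metric can be computed from per-slot CPU usage samples; this is a hedged sketch of the metric's definition, not the code used in our analysis pipeline.

```haskell
-- Illustrative only: a "CPU 85% spans" style metric. Each span is a run of
-- consecutive slots at or above the threshold; the metric is the average
-- span length over the observation window.
import Data.List (group)

cpuSpans85 :: [Double] -> [Int]
cpuSpans85 samples =
  [ length run | run@(True : _) <- group (map (>= 85.0) samples) ]

averageSpan :: [Double] -> Double
averageSpan samples =
  case cpuSpans85 samples of
    []    -> 0
    spans -> fromIntegral (sum spans) / fromIntegral (length spans)

main :: IO ()
main = print (averageSpan [10, 90, 95, 40, 88, 91, 92, 30])  -- spans of 2 and 3 -> 2.5
```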

The implementation of Conway governance action workloads for benchmarking is ongoing.

Workbench

With a nixpkgs bump on the horizon, we're working on adjusting, and testing, our usage of packages that change their status, lose support, or require pinning to a specific version for the workbench.

Additionally, we'll remove a container-based backend for workbench, which ties in OCI image usage on podman with Nomad. It was a precursor to the current Nomad backend, which is containerless and can directly build Nomad jobs using nix.

Last but not least, we've merged a small PR which enables our workbench to build nix shells with GHC9.8, as this not only pulls in the compiler, but much of the Haskell development toolchain. The correct version couplings between compiler and toolchain components are now declared explicitly from GHC8.10.7 up to GHC9.8.

Infrastructure

We've tested and validated an upcoming change in ouroboros-network which demands any node-to-node submission client to hold the connection for at least one minute before being able to submit transactions. The change works as expected and does not interfere with special functionality required by benchmarking.
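
As a rough sketch of the client-side behaviour this change demands - with placeholder actions instead of the real ouroboros-network APIs - a submission client now connects, holds the connection for at least a minute, and only then submits.

```haskell
-- Sketch only: hold a node-to-node connection for >= 60s before submitting.
-- `openConnection` and `submit` are stand-ins, not real cardano APIs.
import Control.Concurrent (threadDelay)

openConnection :: IO ()
openConnection = putStrLn "connection to submission target established"

submit :: IO ()
submit = putStrLn "submitting transactions"

main :: IO ()
main = do
  openConnection
  threadDelay (60 * 1000000)  -- hold the connection for 60 seconds
  submit
```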

Tracing

The trace consumer service for the new tracing system used to require systemd on Linux to build and operate. There are, however, Linux environments that choose not to use systemd. It is now possible to configure the desired flavour of that service, cardano-tracer, at build time, thus adding support for those Linuxes - cardano-node#5021.

cardano-tracer consumes not just traces, but also metrics. With the new tracing system, this shifts running a metrics server from the node to the consumer process. One possible setup in the new system is operating only one consumer service and connecting multiple nodes to it. In its current design, this requires safely shutting down and restarting the monitoring server, using the metrics store of whichever connected node has been requested. We're currently battle-testing the built-in behaviour of ekg (the monitoring package that's being used) and exploring solutions in case it does not fully meet requirements.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 9.0; Plutus execution budget scaling benchmarks.
  • Development: Improved shutdown behaviour for tx-generator merged to master. Work on governance action benchmarking workload is ongoing.
  • Workbench: Haskell profile content definition merged to master.
  • Tracing: Factoring out RTView was merged to master. Work on metrics naming ongoing, minimizing migration effort.
  • Consensus QTAs: Design for automating and data warehousing beacon measurements is complete.

Low level overview

Benchmarking

Runs and analyses for a full set of release benchmarks have been performed - and published - for Node version 9.0.0. Comparing with the latest mainnet release 8.12.1, we could not observe any performance regression. 9.0.0 exhibits an improvement in Block Fetch duration, which results in slightly better overall network performance.

Additionally, we've performed scaling benchmarks of Plutus execution budgets. In this series of benchmarks, we measure the performance impact of changes to those budgets in the protocol parameters. Steps (CPU) and memory budgets are scaled independently of each other, and performance testing takes place using Plutus scripts that either are heavy on allocations but light on CPU, or vice versa. These performance tests are meant to explore the headroom of those budgets, taking into account cost model changes and recent optimization capabilities of the Plutus compiler.
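
As a simple illustration of scaling the two budget components independently - using a stand-in record rather than the actual ledger/Plutus types, and only mainnet-like placeholder values - the sketch below scales steps and memory by separate factors.

```haskell
-- Illustrative sketch: scale the CPU steps and memory components of a Plutus
-- execution budget independently. The record is a stand-in type.
data ExBudget = ExBudget
  { exSteps  :: Integer  -- CPU steps budget
  , exMemory :: Integer  -- memory units budget
  } deriving Show

scaleBudget :: Double -> Double -> ExBudget -> ExBudget
scaleBudget stepsFactor memFactor b = b
  { exSteps  = round (fromIntegral (exSteps b)  * stepsFactor)
  , exMemory = round (fromIntegral (exMemory b) * memFactor)
  }

main :: IO ()
main = do
  -- mainnet-like per-transaction budget (placeholder values)
  let base = ExBudget { exSteps = 10000000000, exMemory = 14000000 }
  -- e.g. scale steps by 1.5x while leaving the memory budget unchanged
  print (scaleBudget 1.5 1.0 base)
```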

Development

Our workload submission service tx-generator has been equipped with the ability to handle POSIX signals for graceful shutdown scenarios. Furthermore, as it is highly concurrent, error reporting on a per-thread basis has been added, enhancing feedback from the service. Along with some quality-of-life improvements, these changes have landed in master.
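
A minimal sketch of the signal-handling pattern (not the tx-generator implementation itself): the handler merely flips a flag, and the submission loop checks it to wind down cleanly.

```haskell
-- Sketch only: handle SIGTERM / SIGINT for a graceful shutdown.
import Control.Concurrent (threadDelay)
import Control.Concurrent.STM (atomically, newTVarIO, readTVarIO, writeTVar)
import Control.Monad (unless)
import System.Posix.Signals (Handler (Catch), installHandler, sigINT, sigTERM)

main :: IO ()
main = do
  shutdown <- newTVarIO False
  let requestShutdown = atomically (writeTVar shutdown True)
  _ <- installHandler sigTERM (Catch requestShutdown) Nothing
  _ <- installHandler sigINT  (Catch requestShutdown) Nothing
  let loop = do
        stop <- readTVarIO shutdown
        unless stop $ do
          putStrLn "submitting workload ..."   -- placeholder for the real work
          threadDelay 1000000
          loop
  loop
  putStrLn "shutting down gracefully"
```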

The Conway governance action workloads for benchmarking have completed design phase, and we've settled on an implementation plan. Implementation work itself has started.

Workbench

Generating the contents of any benchmarking profile is now handled by a dedicated Haskell tool, cardano-profile, which has landed in master. Adding type safety and a test suite to profile definitions is a major improvement over shell code that directly manipulates JSON objects. Furthermore, it makes reliably modifying, or creating, benchmarking profiles much more accessible to engineers outside of our team.
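
The gist of the approach, shown as a small hedged sketch with made-up field names rather than cardano-profile's actual types: a profile is a plain Haskell record rendered to JSON via aeson, so the compiler and a test suite can check it instead of shell code assembling JSON by hand.

```haskell
{-# LANGUAGE DeriveGeneric #-}
-- Sketch only: a typed profile definition rendered to JSON.
import           Data.Aeson (ToJSON, encode)
import qualified Data.ByteString.Lazy.Char8 as BL
import           GHC.Generics (Generic)

-- Hypothetical profile fields, for illustration only.
data Profile = Profile
  { name         :: String
  , nodes        :: Int
  , txsPerSecond :: Double
  , plutus       :: Bool
  } deriving (Show, Generic)

instance ToJSON Profile

main :: IO ()
main = BL.putStrLn $ encode Profile
  { name = "value-only-baseline", nodes = 52, txsPerSecond = 12.0, plutus = False }
```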

Tracing

With factoring out RTView, and making it an opt-in component of the cardano-tracer build, we've reduced the service's dependency graph significantly. Furthermore, the service has become more lightweight on resources. We'll continue to maintain RTView, and guarantee it will remain buildable and usable in the future.

Aligning metrics naming and semantics of new and legacy tracing is ongoing work. This task is part of a larger endeavour to minimize the effort necessary for users to migrate to the new system.

Consensus QTAs

beacon, which currently measures performance of certain ledger operations on actual workload fragments, is a first step in building a benchmarking framework based on Delta-Q system design and quantitative timeliness agreements. We've finished the design of how to automate those measurements at sensible points in time, and provide a storage schema which will enable access and analysis that fit with the framework.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 8.12.0; DRep benchmarks with 100k DReps.
  • Development: Merged a performance fix on 8.11; kicked off development of governance action workload.
  • Workbench: Adjusted automations to latest 8.12 Conway features and Plutus cost model; implementation of CIP-69 and CIP-117 for our tooling is in validation phase.
  • Tracing: Work on metrics naming ongoing. Factoring out RTView component is completed and has entered testing.
  • IOI Tech Meetup: Our team contributed two presentations at the meetup in Zurich; worked on community report of UTxO scaling benchmarks.

Low level overview

Benchmarking

We've run and analyzed a full set of release benchmarks for Node version 8.12.0. In comparison with the latest mainnet release 8.9.3, we could not observe any regressions. In fact, 8.12.0 was able to deliver equal network performance at a slightly reduced resource cost - both for CPU and memory.

Another benchmark of the Conway ledger with large amounts of DReps has been performed. This time, 100,000 DReps were chosen - this amount aims to simulate a scenario where lots of self-delegation takes place. While a performance impact is observable in this instance, we can still see that the number of DReps scales well overall and poses no concern for network performance.

Development

We have contributed and merged a performance fix on 8.11 which addresses a regressing metric in the forging loop. The regression was only observable under specific conditions. Benchmarks on 8.12 have already confirmed the fix to be successful.

We've kicked off governance action workloads for benchmarking. This will be an entirely new workload type for the Conway era, targeting performance measurements of its decentralized decision-making process. The workload will feature registering proposals, acting as multiple DReps to vote on active proposals, vote tallying and proposal enactment. We're very grateful for the Ledger team's helpful support so far in creating a workload design for benchmarking - one that evenly stresses the network over extended periods of time.
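
Purely for illustration, the phases listed above could be captured as a plain enumeration - a sketch of the workload structure, not the actual implementation.

```haskell
-- Sketch only: the governance workload phases described above.
data GovPhase
  = RegisterProposals  -- submit governance action proposals
  | VoteAsDReps        -- act as multiple DReps, voting on active proposals
  | TallyVotes         -- vote tallying
  | EnactProposals     -- proposal enactment
  deriving (Show, Eq, Enum, Bounded)

workloadPlan :: [GovPhase]
workloadPlan = [minBound .. maxBound]

main :: IO ()
main = mapM_ print workloadPlan
```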

Workbench

The workbench automations have been upgraded to handle Node 8.12 and the corresponding integrations of Cardano API and CLI.

Furthermore, we've updated to the latest PlutusV3 cost model in our benchmarks - as well as implemented CIP-69 and CIP-117 for all our PlutusV3 benchmarking scripts, pending validation by the Plutus team.

Tracing

The work on aligning metrics naming and semantics of the new and legacy tracing systems is ongoing. Additionally, we're adding a handful of metrics to the new tracing system which currently exist in legacy tracing only.

Factoring out the RTView ("real-time view") component of cardano-tracer in the new tracing system has finished. This includes a considerable refactoring of cardano-tracer's codebase, so we're currently running tests on the new codebase. Isolating RTView is motivated by its having remained in prototype stage for too long, and by the design decisions taken for it. In the short term, this will make several package dependencies optional, which have become troublesome for CI, as well as making cardano-tracer more lightweight. RTView remains available as an opt-in.

IOI Tech Meetup

Our entire team traveled to Zurich, Switzerland to attend ZuriHac'24 and the IOI Tech Meetup. It was fantastic to meet everyone in person, and we all had an amazing and very productive time. A big Thank You to everyone involved in making that happen, and making it a success.

We contributed two presentations for the meetup: a thorough introduction to the new tracing system aimed at developers - as it's not tailored exclusively to cardano-node, but can be used in other (Haskell) services as well. And secondly, an overview of the benchmarking framework based on Quantitative Timeliness Agreements which we're building - as well as a show-and-tell of our prototype, implementing part of said framework. We're grateful for the great interest and feedback from all the participants.

Last but not least, we worked on creating a community report of the UTxO scaling benchmarks performed during March and April - to be released soon.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Node versions 8.9.3 and 8.11.0; new PlutusV3 plus additional DRep benchmarks; re-evaluation of network latency.
  • Development: BLST workload for PlutusV3 was implemented; improved error/shutdown behaviour for tx-generator is in testing phase.
  • Workbench: UTxO-HD tracer configs harmonized. New PlutusV3 profiles supporting experimental budgets. Work on Haskell profile definition is in validation phase.
  • Tracing: New metrics and handle registry feature merged to master. Work on metrics naming ongoing. Factoring out RTView component has begun.

Low level overview

Benchmarking

Runs and analyses of full sets of release benchmarks have been performed for Node versions 8.9.3 and 8.11.0.

To compare how the Conway ledger performs when injecting large amounts of DReps and delegations versus one with zero DReps, we've run additional configurations with existing workloads from release benchmarking. So far we've found that the number of DReps in ledger scales well and does not lead to notable performance penalties.

Additionally, we've successfully run the baseline for the upcoming PlutusV3 benchmarks on our Nomad cluster. Those will, given the new V3 cost model, serve to determine headroom, or constraint, regarding resource usage and network metrics when operating under various execution budgets.

Last but not least, with much appreciated support and feedback from the network team, we performed a re-evaluation of the network latency matrix for our benchmarking cluster. The cluster stretches over three regions globally. Due to unknown changes in the underlying hardware infrastructure, a slight delay between Europe and Asia/Pacific regions could be measured. We needed to adjust some existing baselines accordingly - otherwise, this delay could be falsely attributed to a software regression.

Development

We have implemented a benchmarking workload using PlutusV3's new BLST internals. As those do little memory allocation, but require more CPU steps, this workload will allow us to focus on that particular aspect of block and transaction budgets.

The tx-generator service will now label each submission thread with its submission target. Additionally, it has been equipped with custom signal handlers. This will improve both how gracefully shutdowns can be performed and how precisely errors are reported when losing connection to a submission target. Last but not least, the service now supports a configurable KeepAlive timeout for the NodeToNode mini-protocol - accounting for very long major GC pauses on submission targets under very specific benchmarking workloads. Those features have entered testing phase.
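
A minimal sketch of the thread-labelling part (targets and worker bodies are made up): each submission thread labels itself with its target via GHC.Conc.labelThread, so the label shows up in event logs and error reports.

```haskell
-- Sketch only: label each submission thread with its submission target.
import Control.Concurrent (forkIO, myThreadId, threadDelay)
import Control.Monad (forM_)
import GHC.Conc (labelThread)

main :: IO ()
main = do
  let targets = ["10.0.0.1:3001", "10.0.0.2:3001"]  -- placeholder targets
  forM_ targets $ \target ->
    forkIO $ do
      tid <- myThreadId
      labelThread tid ("submission:" ++ target)  -- visible in the GHC eventlog
      putStrLn ("submitting to " ++ target)
  threadDelay 100000  -- give the workers a moment before the demo exits
```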

Workbench

Thanks to feedback from the consensus team, we've harmonized tracing configurations for our benchmarks between regular and UTxO-HD node. As the latter is more verbose by default, this is a confounding factor for our metrics: We're analysing north of 90 traces per second per cluster node, so all node flavours are required to be equally verbose.

The benchmarks based on the BLST workload now additionally support scaling budget components up or down at will. This means we can run a given cost model against custom execution budgets, controlling the point where the workload will exhaust it. This enables comparison of performance impact of potential changes to those budgets.

Porting our performance workbench's profile definitions to Haskell has been nearly completed, and an adequate test suite has been implemented. This new component has now entered validation phase to make sure it correctly replicates all existing profile content.

Tracing

Two new metrics for cardano-node have landed in master - both for the new and legacy tracing systems. They provide detailed build info, and indicate whether the node is a block producer or not.

We're now working on closing the gap in the metric naming schema between new and legacy tracing. The aim is to allow for a seamless interchange, without additional configuration required, so that all existing monitoring services can rely on identical metric names with identical semantics.

Furthermore, work has begun to factor out the RTView ("real-time view") component of cardano-tracer in the new tracing system. Unfortunately, the component has remained in prototype stage for over a year, and has revealed some design shortcomings. Its aim is to provide an interactive, real-time dashboard based on metrics from all nodes connected to cardano-tracer. The current design has all front-end code baked into the backend service, requiring the entire service to be rebuilt in Haskell even for simple changes in the dashboard. We decided to isolate the component in the current code base, which still allows for optionally enabling it for a build. The long term goal, however, is to convert it into a downstream service: it will ingest metrics by reforwarding, or by querying a REST API, and will provide a clear separation of frontend-facing code. Thus we, and anybody else, can use their favourite web technology for visualization of metrics.