
· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 9.1; UTxO-HD in-memory benchmarks; typed-protocols feature benchmarks.
  • Development: Devised a fix for resource trace emission of the CPU 85% spans metric. Governance action benchmarking still under development.
  • Workbench: Preparations for bumping nixpkgs. Started removal of the container-based podman backend. Support GHC9.8 nix shells.
  • Infrastructure: Test and validate an upcoming change in node-to-node submission protocol.
  • Tracing: cardano-tracer: support for non-systemd Linux was merged; safe restart of internal monitoring servers.

Low level overview

Benchmarking

We've run and analyzed a full set of release benchmarks for Node version 9.1. Comparing with the mainnet release 9.0, we could not observe any performance regression.

Additionally, we've performed feature benchmarks for an upcoming new API for typed-protocols. Those did not exhibit any regression either in comparison with the baseline using the current API.

Furthermore, we've performed various benchmarks for the UTxO-HD in-memory backend on Node versions 9.0 and 9.1. Based on those observations, we were able to eliminate a rare race condition where block producers on occasion failed to fork off a thread for the forging loop. The overall network performance of the UTxO-HD in-memory backend shows a slight improvement over the regular node, but currently comes with slightly increased RAM usage.

Development

We've spotted an inconsistency in one of our benchmarking metrics - CPU 85% spans - which measures the average number of consecutive slots where CPU usage spikes to 85% or higher (however short the spike itself might be). The values differed between the legacy tracing system (which yielded the correct value) and the new one; a fix for the latter has already been devised.
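As a minimal sketch of the metric's intent (not the node's actual implementation), assuming a list of per-slot CPU utilisation samples:

```haskell
import Data.List (group)

-- Average length of the maximal runs of consecutive slots whose CPU
-- utilisation reaches 85% or more; 0 if no such run exists.
cpu85Spans :: [Double] -> Double
cpu85Spans samples
  | null spans = 0
  | otherwise  = fromIntegral (sum spans) / fromIntegral (length spans)
  where
    spans = map length . filter and . group . map (>= 0.85) $ samples
```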

The implementation of Conway governance action workloads for benchmarking is ongoing.

Workbench

With a nixpkgs bump on the horizon, we're adjusting and testing our usage of packages that have changed status, lost support, or need to be pinned to a specific version for the workbench.

Additionally, we'll remove a container-based backend for workbench, which ties in OCI image usage on podman with Nomad. It was a precursor to the current Nomad backend, which is containerless and can directly build Nomad jobs using nix.

Last but not least, we've merged a small PR which enables our workbench to build nix shells with GHC9.8; these shells pull in not only the compiler, but much of the Haskell development toolchain. The correct version couplings between compiler and toolchain components are now declared explicitly for GHC8.10.7 up to GHC9.8.

Infrastructure

We've tested and validated an upcoming change in ouroboros-network which requires any node-to-node submission client to hold the connection for at least one minute before being able to submit transactions. The change works as expected and does not interfere with the special functionality required by benchmarking.
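A hypothetical client-side sketch of honouring that grace period (the wrapper and connection type are illustrative, not the actual ouroboros-network API):

```haskell
import Control.Concurrent (threadDelay)

-- Connect first, then wait out the minimum connection age before
-- submitting any transactions.
submitAfterGracePeriod :: IO conn -> (conn -> IO ()) -> IO ()
submitAfterGracePeriod connect submit = do
  conn <- connect
  threadDelay (60 * 1000000)  -- at least one minute, in microseconds
  submit conn
```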

Tracing

The trace consumer service for the new tracing system used to require systemd on Linux to build and operate. There are, however, Linux environments that choose not to use systemd. It is now possible to configure the desired flavour of that service, cardano-tracer, at build time, thus adding support for those environments - cardano-node#5021.
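A minimal sketch of how such a build-time flavour switch can look, assuming a Cabal flag that sets a SYSTEMD define; all names here are hypothetical:

```haskell
{-# LANGUAGE CPP #-}
module Main where

-- Select the trace sink at build time: journald on systemd builds,
-- plain stdout otherwise.
emitTrace :: String -> IO ()
#if defined(SYSTEMD)
emitTrace = putStrLn . ("journald: " ++)  -- stand-in for a journald binding
#else
emitTrace = putStrLn
#endif

main :: IO ()
main = emitTrace "cardano-tracer started"
```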

cardano-tracer consumes not just traces, but also metrics. With the new tracing system, running a metrics server shifts from the node to the consumer process. One possible setup in the new system is operating only one consumer service and connecting multiple nodes to it. In its current design, this requires safely shutting down and restarting the monitoring server whenever the metrics store of a different connected node is requested. We're currently battle-testing the built-in behaviour of ekg (the monitoring package in use) and exploring solutions in case it does not fully meet requirements.
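For illustration, a sketch of such a restart using ekg's API (the port is arbitrary, and whether killing the server thread reliably frees the socket is exactly the behaviour under test):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent (killThread)
import System.Metrics (Store)
import System.Remote.Monitoring (Server, forkServerWith, serverThreadId)

-- Stop the running ekg server, then bring it up again bound to the
-- metrics store of the newly requested node.
restartMonitoring :: Server -> Store -> IO Server
restartMonitoring running store = do
  killThread (serverThreadId running)
  forkServerWith store "127.0.0.1" 12789
```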

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 9.0; Plutus execution budget scaling benchmarks.
  • Development: Improved shutdown behaviour for tx-generator merged to master. Work on governance action benchmarking workload is ongoing.
  • Workbench: Haskell profile content definition merged to master.
  • Tracing: Factoring out RTView was merged to master. Work on metrics naming ongoing, minimizing migration effort.
  • Consensus QTAs: Design for automating and data warehousing beacon measurements is complete.

Low level overview

Benchmarking

Runs and analyses for a full set of release benchmarks have been performed - and published - for Node version 9.0.0. Comparing with the latest mainnet release 8.12.1, we could not observe any performance regression. 9.0.0 exhibits an improvement in Block Fetch duration, which results in slightly better overall network performance.

Additionally, we've performed scaling benchmarks of Plutus execution budgets. In this series of benchmarks, we measure the performance impact of changes to those budgets in the protocol parameters. Steps (CPU) and memory budgets are scaled independently of each other, and performance testing takes place using Plutus scripts that are either heavy on allocations but light on CPU, or vice versa. These performance tests are meant to explore the headroom of those budgets, taking into account cost model changes and recent optimization capabilities of the Plutus compiler.
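Conceptually, each run scales one budget axis while holding the other fixed; a simple sketch (the types are illustrative, not the actual protocol-parameter representation):

```haskell
-- Per-transaction execution budget: CPU steps and memory units.
data ExUnits = ExUnits { exSteps :: Integer, exMem :: Integer }
  deriving Show

-- Scale the two budget axes independently; e.g. scaleBudget 1.5 1.0
-- raises the steps budget by 50% while leaving memory untouched.
scaleBudget :: Rational -> Rational -> ExUnits -> ExUnits
scaleBudget stepF memF (ExUnits s m) =
  ExUnits (floor (fromInteger s * stepF)) (floor (fromInteger m * memF))
```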

Development

Our workload submission service tx-generator has been equipped with the ability to handle POSIX signals for graceful shutdown scenarios. Furthermore, as it is highly concurrent, error reporting on a per-thread basis has been added, enhancing feedback from the service. Along with some quality-of-life improvements, these changes have landed in master.
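A minimal sketch of the signal-handling side, assuming a shutdown flag that the submission threads poll (names are illustrative, not tx-generator's actual code):

```haskell
import Control.Concurrent.MVar (MVar, newEmptyMVar, tryPutMVar)
import Control.Monad (void)
import System.Posix.Signals (Handler (Catch), installHandler, sigINT, sigTERM)

-- Install handlers for SIGINT/SIGTERM that set a shutdown flag instead
-- of killing the process outright, so threads can wind down gracefully.
installShutdownHandlers :: IO (MVar ())
installShutdownHandlers = do
  shutdownFlag <- newEmptyMVar
  let handler = Catch (void (tryPutMVar shutdownFlag ()))
  _ <- installHandler sigINT  handler Nothing
  _ <- installHandler sigTERM handler Nothing
  pure shutdownFlag
```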

The Conway governance action workloads for benchmarking have completed the design phase, and we've settled on an implementation plan. Implementation work itself has started.

Workbench

Generating the contents for any benchmarking profile has now become a dedicated Haskell tool, cardano-profile, which has landed in master. Adding type safety and a test suite to profile definitions is a major improvement over shell code that directly manipulates JSON objects. Furthermore, it makes reliably modifying, or creating, benchmarking profiles much more accessible to engineers outside of our team.
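To illustrate the gain over raw JSON manipulation in shell, a toy profile type (the fields are hypothetical, not cardano-profile's actual schema):

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson (ToJSON, encode)
import qualified Data.ByteString.Lazy.Char8 as BL
import GHC.Generics (Generic)

-- A typed profile: the compiler rejects missing or misspelled fields,
-- and a test suite can exercise the generated JSON.
data Profile = Profile
  { profileName :: String
  , nodeCount   :: Int
  , txPerSecond :: Double
  } deriving (Generic, Show)

instance ToJSON Profile

main :: IO ()
main = BL.putStrLn (encode (Profile "value-only" 52 12.0))
```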

Tracing

With factoring out RTView, and making it an opt-in component of the cardano-tracer build, we've reduced the service's dependency graph significantly. Furthermore, the service has become more lightweight on resources. We'll continue to maintain RTView, and guarantee it will remain buildable and usable in the future.

Aligning metrics naming and semantics of new and legacy tracing is ongoing work. This task is part of a larger endeavour to minimize the effort necessary for users to migrate to the new system.

Consensus QTAs

beacon, which currently measures performance of certain ledger operations on actual workload fragments, is a first step in building a benchmarking framework based on Delta-Q system design and quantitative timeliness agreements. We've finished the design of how to automate those measurements at sensible points in time, and of a storage schema which will enable access and analysis that fits with the framework.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 8.12.0; DRep benchmarks with 100k DReps.
  • Development: Merged a performance fix on 8.11; kicked off development of governance action workload.
  • Workbench: Adjusted automations to latest 8.12 Conway features and Plutus cost model; implementation of CIP-69 and CIP-117 for our tooling is in validation phase.
  • Tracing: Work on metrics naming ongoing. Factoring out RTView component is completed and has entered testing.
  • IOI Tech Meetup: Our team contributed two presentations at the meetup in Zurich; worked on community report of UTxO scaling benchmarks.

Low level overview

Benchmarking

We've run and analyzed a full set of release benchmarks for Node version 8.12.0. In comparison with the latest mainnet release 8.9.3, we could not observe any regressions. In fact, 8.12.0 was able to deliver equal network performance at a slightly reduced resource cost - both for CPU and memory.

Another benchmark of the Conway ledger with large amounts of DReps has been performed. This time, 100,000 DReps were chosen - an amount that aims to simulate a scenario where lots of self-delegation takes place. While a performance impact is observable in this instance, we can still see that the number of DReps scales well overall, and poses no concern for network performance.

Development

We have contributed and merged a performance fix on 8.11 which addresses a regressing metric in the forging loop. The regression was only observable under specific conditions. Benchmarks on 8.12 have already confirmed the fix to be successful.

We've kicked off development of governance action workloads for benchmarking. This will be an entirely new workload type for the Conway era, targeting performance measurements of its decentralized decision-making process. The workload will feature registering proposals, acting as multiple DReps to vote on active proposals, vote tallying and proposal enactment. We're very grateful for the Ledger team's helpful support so far in creating a workload design for benchmarking - one that evenly stresses the network over extended periods of time.

Workbench

The workbench automations have been upgraded to handle Node 8.12 and the corresponding integrations of Cardano API and CLI.

Furthermore, we've updated to the latest PlutusV3 cost model in our benchmarks - and implemented CIP-69 and CIP-117 for all our PlutusV3 benchmarking scripts, pending validation by the Plutus team.

Tracing

The work on aligning metrics naming and semantics between the new and legacy tracing systems is ongoing. Additionally, we're adding a handful of metrics to the new tracing system which currently exist in legacy tracing only.

Factoring out the RTView ("real-time view") component of cardano-tracer in the new tracing system has finished. This involved a considerable refactoring of cardano-tracer's codebase, so we're currently running tests on the new code. Isolating RTView was motivated by the component remaining in prototype stage for too long, and by the design decisions taken during that stage. In the short term, this makes several package dependencies optional - dependencies which had become troublesome for CI - and makes cardano-tracer more lightweight. RTView remains available as an opt-in.

IOI Tech Meetup

Our entire team traveled to Zurich, Switzerland to attend ZuriHac'24 and the IOI Tech Meetup. It was fantastic to meet everyone in person, and we all had an amazing and very productive time. A big Thank You to everyone involved in making that happen, and making it a success.

We contributed two presentations at the meetup: firstly, a thorough introduction to the new tracing system aimed at developers - as it's not tailored exclusively to cardano-node, but can be used in other (Haskell) services as well. And secondly, an overview of the benchmarking framework based on Quantitative Timeliness Agreements which we're building - as well as a show-and-tell of our prototype implementing part of said framework. We're grateful for the great interest and feedback from all the participants.

Last but not least, we worked on creating a community report of the UTxO scaling benchmarks performed during March and April - to be released soon.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Node versions 8.9.3 and 8.11.0; new PlutusV3 plus additional DRep benchmarks; re-evaluation of network latency.
  • Development: BLST workload for PlutusV3 was implemented; improved error/shutdown behaviour for tx-generator is in testing phase.
  • Workbench: UTxO-HD tracer configs harmonized. New PlutusV3 profiles supporting experimental budgets. Work on Haskell profile definition is in validation phase.
  • Tracing: New metrics and handle registry feature merged to master. Work on metrics naming ongoing. Factoring out RTView component has begun.

Low level overview

Benchmarking

Runs and analyses of full sets of release benchmarks have been performed for Node versions 8.9.3 and 8.11.0.

To compare how the Conway ledger performs with large amounts of DReps and delegations injected versus zero DReps, we've run additional configurations with existing workloads from release benchmarking. So far we've found that the number of DReps in the ledger scales well and does not lead to notable performance penalties.

Additionally, we've successfully run the baseline for the upcoming PlutusV3 benchmarks on our Nomad cluster. Those will, given the new V3 cost model, serve to determine headroom, or constraint, regarding resource usage and network metrics when operating under various execution budgets.

Last but not least, with much appreciated support and feedback from the network team, we performed a re-evaluation of the network latency matrix for our benchmarking cluster. The cluster stretches over three regions globally. Due to unknown changes in the underlying hardware infrastructure, a slight delay between the Europe and Asia/Pacific regions could be measured. We needed to adjust some existing baselines accordingly - otherwise, this delay could have been falsely attributed to a software regression.

Development

We have implemented a benchmarking workload using PlutusV3's new BLST internals. As those do little memory allocation, but require more CPU steps, this workload will allow us to focus on that particular aspect of block and transaction budgets.

The tx-generator service will now label each submission thread with its submission target. Additionally, it has been equipped with custom signal handlers. This will improve both how gracefully shutdowns can be performed, and how precisely errors are reported when losing connection to a submission target. Last but not least, the service now supports a configurable KeepAlive timeout for the NodeToNode mini-protocol - accounting for very long major GC pauses on submission targets under very specific benchmarking workloads. These features have entered the testing phase.
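For the keep-alive part, a sketch of what a configurable timeout might look like in the generator's configuration (the record and field names are hypothetical):

```haskell
import Data.Time.Clock (DiffTime)

-- Generator-side connection settings: an extended keep-alive timeout
-- tolerates long major-GC pauses on a submission target without the
-- connection being torn down.
data SubmissionConfig = SubmissionConfig
  { submissionTargets :: [String]   -- host:port of each target node
  , keepAliveTimeout  :: DiffTime   -- e.g. secondsToDiffTime 60
  }
```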

Workbench

Thanks to feedback from the consensus team, we've harmonized tracing configurations for our benchmarks between regular and UTxO-HD node. As the latter is more verbose by default, this is a confounding factor for our metrics: We're analysing north of 90 traces per second per cluster node, so all node flavours are required to be equally verbose.

The benchmarks based on the BLST workload now additionally support scaling budget components up or down at will. This means we can run a given cost model against custom execution budgets, controlling the point where the workload will exhaust it. This enables comparison of performance impact of potential changes to those budgets.

Porting our performance workbench's profile definitions to Haskell is nearly complete, and an adequate test suite has been implemented. This new component has now entered the validation phase to make sure it correctly replicates all existing profile content.

Tracing

Two new metrics for cardano-node have landed in master - both for the new and legacy tracing systems. They provide detailed build info, and indicate whether the node is a block producer or not.

We're now working on closing the gap in the metric naming schema between new and legacy tracing. The aim is to allow for a seamless interchange, without additional configuration required, so that all existing monitoring services can rely on identical metric names with identical semantics.

Furthermore, work has begun to factor out the RTView ("real-time view") component of cardano-tracer in the new tracing system. Unfortunately, the component has remained in prototype stage for over a year, and has revealed some design shortcomings. Its aim is to provide an interactive, real-time dashboard based on metrics from all nodes connected to cardano-tracer. The current design has all front-end code baked into the backend service, requiring the entire service to be rebuilt in Haskell even for simple changes to the dashboard. We decided to isolate the component in the current code base, which still allows for optionally enabling it in a build. The long-term goal, however, is to convert it into a downstream service: it will ingest metrics by reforwarding, or by querying a REST API, and will provide a clear separation of front-end facing code. Thus we, and anybody else, can use our favourite web technology for metrics visualization.

· 5 min read
Michael Karg

High level summary

  • Benchmarking: We've performed and analysed benchmarks in the Conway era, with DReps injected.
  • Development: Tracing DRep data has been implemented; improved error reporting in tx-generator and analysis quick queries are ongoing work.
  • Workbench: The workbench now fully supports the new CLI create-testnet-data command and DRep injection into Conway genesis. Haskell profile definition work is ongoing.
  • Tracing: Various additions to Node metrics are being worked on, such as build info and block producer role. Metrics naming will be further harmonized.
  • UTxO Growth: We've finalized analysis and reports of all benchmarks targeting UTxO scaling scenarios.
  • UTxO-HD / LMDB: We've performed multiple runs benchmarking the LMDB (on-disk) backend of UTxO-HD.

Low level overview

Benchmarking

We've run and analyzed a full set of benchmarks comparing the Conway ledger against the Babbage one, on Node 8.10.1-pre. For Conway, our additional goal was to measure a vanilla ledger state against one with a large amount of DReps - and delegations to those DReps - present. The benchmarks used our existing value and Plutus workloads to remain comparable to each other.

Development

Additional ledger queries for the tracing system have been implemented and merged to master. Those capture the number of DReps, and the number of existing delegations to them, as trace output - thus enabling metrics to be created on top, which can then be monitored.

The (in our case) non-deterministic nature of shutting down different cluster setups - both local and cloud-based - carries the possibility that our transaction generation service occasionally misclassifies a regular shutdown as an error. Furthermore, in the case of network malfunctions, the service's errors are too unspecific. By implementing thread labels for submission threads, corresponding to each submission target, and by adding custom smart signal handlers, we'll improve the generator's error reporting significantly.
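Thread labelling itself is a one-liner with GHC's concurrency primitives; a sketch of the idea (the wrapper is illustrative):

```haskell
import Control.Concurrent (ThreadId, forkIO, myThreadId)
import GHC.Conc (labelThread)

-- Fork a submission thread labelled with its target, so that errors
-- and eventlog output can name the peer that caused them.
forkSubmissionThread :: String -> IO () -> IO ThreadId
forkSubmissionThread target worker =
  forkIO $ do
    tid <- myThreadId
    labelThread tid ("submit:" ++ target)
    worker
```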

The initial tests for quick queries are being developed further. We're moving towards a principled, and generalized, syntax that supports both prepared, parametrizable queries from the application code, as well as ad-hoc queries stated e.g. on the command line.
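One way to picture such a unified syntax (entirely hypothetical constructors, not the actual design under development):

```haskell
-- A small query AST that both application code and a command-line
-- parser could target; parameters stay explicit, so prepared queries
-- can be instantiated at run time.
data AggFn = Mean | Max | Percentile Double

data QuickQuery
  = Metric String                       -- select a raw metric by name
  | Filter (Double -> Bool) QuickQuery  -- keep matching samples
  | Agg AggFn QuickQuery                -- aggregate the remaining samples
```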

Workbench

The performance workbench now fully supports the new cardano-cli command create-testnet-data. We use it to inject both stake delegated to stake pools and - recently added - stake delegated to DReps into genesis. It has proven very useful and versatile so far, and will eventually replace the current create-staked command.

Work on porting our performance workbench's profile definitions to Haskell, and providing them with an appropriate test suite, is still ongoing; currently, we're integrating all new profile families that came out of the UTxO growth scenarios.

Tracing

New metrics are being implemented for the tracing system. They will also be part of Prometheus output and as such accessible to monitoring services. There'll be cardano-node's detailed build info, as well as a node's block producer status, meaning the presence of forger credentials. Those new metrics are being backported to the legacy tracing system, too.

Furthermore, we've determined the need to revisit metrics naming. There's still a divergence between naming in the legacy and the new system. While this could be mitigated by passing in extra config options, we think that a transition to the new system should not impose any unnecessary effort for node operators. A design to fully harmonize the existing naming schemata is currently being set up.

UTxO Growth

The UTxO Growth benchmarking series has been finalized. We've finished analyses and reports for all scenarios that were tested and explored.

The overarching questions were, given a network of 32GB host systems, how large can the UTxO set grow in general, how large can it grow before the nodes have to operate close to the RAM limit over extended periods of time, and how does scaling the UTxO set size affect network metrics, such as block diffusion.

A dedicated "UTxO Scaling Squad" was set up, which drove the entire process, and we enjoyed a very focused and productive collaboration with them.

UTxO-HD / LMDB

Last but not least, we were able to benchmark UTxO-HD's on-disk backend on a network of block-producing nodes, on a recent 8.9.1 version of cardano-node. The setup allowed for using a direct-access SSD device for performance-critical disk I/O, whereas the bulk of ChainDB and ledger snapshots remained on a standard AWS EBS volume.

The benchmarks comprised both optimistic and pessimistic RAM assumptions for the host OS to further optimize I/O via the page cache, as well as medium and large UTxO set sizes - the latter almost tripling current mainnet's size. The results were promising: the LMDB backend has proven able to accommodate large UTxO sets using significantly less RAM than the default all-in-memory node - and with a more than reasonable performance trade-off. Furthermore, running with pessimistic assumptions, the performance impact on LMDB was only very moderate.