
39 posts tagged with "performance-tracing"


· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 8.12.0; DRep benchmarks with 100k DReps.
  • Development: Merged a performance fix on 8.11; kicked off development of governance action workload.
  • Workbench: Adjusted automations to latest 8.12 Conway features and Plutus cost model; implementation of CIP-69 and CIP-117 for our tooling is in validation phase.
  • Tracing: Work on metrics naming ongoing. Factoring out RTView component is completed and has entered testing.
  • IOI Tech Meetup: Our team contributed two presentations at the meetup in Zurich; worked on community report of UTxO scaling benchmarks.

Low level overview

Benchmarking

We've run and analyzed a full set of release benchmarks for Node version 8.12.0. In comparison with the latest mainnet release 8.9.3, we could not observe any regressions. In fact, 8.12.0 was able to deliver equal network performance at a slightly reduced resource cost - both for CPU and memory.

Another benchmark of the Conway ledger with large amounts of DReps has been performed. This time, 100,000 DReps were chosen - an amount which aims to simulate a scenario where lots of self-delegation takes place. While a performance impact is observable in this instance, we can still see that the number of DReps scales well overall, and poses no concern for network performance.

Development

We have contributed and merged a performance fix on 8.11 which addresses a regressing metric in the forging loop. The regression was only observable under specific conditions. Benchmarks on 8.12 have already confirmed the fix to be successful.

We've kicked off governance action workloads for benchmarking. This will be an entirely new workload type for the Conway era, targeting performance measurements of its decentralized decision making process. The workload will feature registering proposals, acting as multiple DReps to vote on active proposals, vote tallying, and proposal enactment. We're very grateful for the Ledger team's helpful support so far in creating a workload design for benchmarking - one that evenly stresses the network over extended periods of time.
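As a flavour of what such a workload involves, here is a minimal, purely hypothetical Haskell sketch of how its phases could be enumerated and interleaved - all names are ours for illustration, not the tx-generator's actual API:

```haskell
-- Hypothetical sketch only: phase enumeration for a governance action
-- workload; names are illustrative, not the tx-generator's actual API.
data GovPhase
  = SubmitProposals   -- register new governance action proposals
  | CastVotes         -- act as multiple DReps voting on active proposals
  | TallyVotes        -- observe on-chain vote tallying
  | AwaitEnactment    -- wait for proposal enactment at the epoch boundary
  deriving (Show, Eq, Enum, Bounded)

-- Interleave phases evenly, so the network is stressed uniformly over an
-- extended period of time rather than in bursts.
schedule :: [GovPhase]
schedule = cycle [minBound .. maxBound]
```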

Workbench

The workbench automations have been upgraded to handle Node 8.12 and the corresponding integrations of Cardano API and CLI.

Furthermore, we've updated to the latest PlutusV3 cost model in our benchmarks - and implemented CIP-69 and CIP-117 for all our PlutusV3 benchmarking scripts, pending validation by the Plutus team.

Tracing

The work on aligning metrics naming and semantics between the new and legacy tracing systems is ongoing. Additionally, we're adding a handful of metrics to the new tracing system which currently exist in legacy tracing only.

Factoring out the RTView ("real-time view") component of cardano-tracer in the new tracing system has finished. This includes a considerable refactoring of cardano-tracer's codebase, so we're currently running tests on the new codebase. We decided to isolate RTView because it has remained in the prototype stage for too long, and because of the design decisions taken back then. In the short term, this will make several package dependencies optional - dependencies which have become troublesome for CI - as well as make cardano-tracer more lightweight. RTView remains available as an opt-in.
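As a rough illustration of what opt-in isolation can look like at the code level, here's a hedged sketch assuming a build flag that defines an RTVIEW CPP macro - flag, macro, module path, and entry point are all our assumptions, not the actual cardano-tracer code:

```haskell
{-# LANGUAGE CPP #-}
-- Sketch: gate all RTView imports and calls behind a build flag, so the
-- web-frontend package dependencies only enter the build when requested.
module Cardano.Tracer.Run (runTracer) where

#if defined(RTVIEW)
import qualified Cardano.Tracer.Handlers.RTView as RTView  -- hypothetical module path
#endif

runTracer :: IO ()
runTracer = do
  -- ... regular cardano-tracer setup elided ...
#if defined(RTVIEW)
  RTView.launch   -- hypothetical entry point; only compiled in opt-in builds
#endif
  pure ()
```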

IOI Tech Meetup

Our entire team traveled to Zurich, Switzerland to attend ZuriHac'24 and the IOI Tech Meetup. It was fantastic to meet everyone in person, and we all had an amazing and very productive time. A big Thank You to everyone involved in making that happen, and making it a success.

We contributed two presentations to the meetup: firstly, a thorough introduction of the new tracing system aimed at developers - as it's not tailored exclusively to cardano-node, but can be used in other (Haskell) services as well. And secondly, an overview of the benchmarking framework based on Quantitative Timeliness Agreements which we're building - as well as a show-and-tell of our prototype implementing part of said framework. We're grateful for the great interest and feedback from all the participants.

Last but not least, we worked on creating a community report of the UTxO scaling benchmarks performed during March and April - to be released soon.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Node versions 8.9.3 and 8.11.0; new PlutusV3 plus additional DRep benchmarks; re-evaluation of network latency.
  • Development: BLST workload for PlutusV3 was implemented; improved error/shutdown behaviour for tx-generator is in testing phase.
  • Workbench: UTxO-HD tracer configs harmonized. New PlutusV3 profiles supporting experimental budgets. Work on Haskell profile definition is in validation phase.
  • Tracing: New metrics and handle registry feature merged to master. Work on metrics naming ongoing. Factoring out RTView component has begun.

Low level overview

Benchmarking

Runs and analyses of full sets of release benchmarks have been performed for Node versions 8.9.3 and 8.11.0.

To compare how the Conway ledger performs when large amounts of DReps and delegations are injected versus zero DReps, we've run additional configurations with existing workloads from release benchmarking. So far we've found that the number of DReps in the ledger scales well and does not lead to notable performance penalties.

Additionally, we've successfully run the baseline for the upcoming PlutusV3 benchmarks on our Nomad cluster. Given the new V3 cost model, those will serve to determine headroom, or constraints, regarding resource usage and network metrics when operating under various execution budgets.

Last but not least, with much appreciated support and feedback from the network team, we performed a re-evaluation of the network latency matrix for our benchmarking cluster. The cluster stretches over three regions globally. Due to unknown changes in the underlying hardware infrastructure, a slight delay between the Europe and Asia/Pacific regions could be measured. We needed to adjust some existing baselines accordingly - otherwise, this delay could have been falsely attributed to a software regression.

Development

We have implemented a benchmarking workload using PlutusV3's new BLST internals. As those do little memory allocation, but require more CPU steps, this workload will allow us to focus on that particular aspect of block and transaction budgets.
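For intuition, here's a hedged sketch of such a CPU-heavy, allocation-light script body using the BLS12-381 builtins exposed by PlutusTx.Builtins - the function and iteration scheme are illustrative, not the actual workload:

```haskell
{-# LANGUAGE NoImplicitPrelude #-}
module BlstBurn where

import PlutusTx.Builtins (BuiltinBLS12_381_G1_Element, bls12_381_G1_scalarMul)
import PlutusTx.Prelude

-- Repeated scalar multiplication on a G1 point: consumes many CPU steps
-- while allocating very little memory, stressing the steps budget.
burnSteps :: Integer -> BuiltinBLS12_381_G1_Element -> BuiltinBLS12_381_G1_Element
burnSteps n p
  | n <= 0    = p
  | otherwise = burnSteps (n - 1) (bls12_381_G1_scalarMul 2 p)
```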

The tx-generator service will now label each submission thread with its submission target. Additionally, it has been equipped with custom signal handlers. This will improve both how gracefully shutdowns can be performed, and how precisely errors are reported when losing connection to a submission target. Last but not least, the service now supports a configurable KeepAlive timeout for the NodeToNode mini-protocol - accounting for very long major GC pauses on submission targets under very specific benchmarking workloads. Those features have entered the testing phase.
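A minimal sketch of the two mechanisms involved - GHC thread labels and POSIX signal handlers - with function names of our own invention, not the service's actual internals:

```haskell
import Control.Concurrent (ThreadId, forkIO, myThreadId)
import Control.Monad (void)
import GHC.Conc (labelThread)
import System.Posix.Signals (Handler (Catch), installHandler, sigINT, sigTERM)

-- Fork a submission thread labelled with its target, so diagnostics can
-- attribute a lost connection to the precise peer.
forkSubmission :: String -> IO () -> IO ThreadId
forkSubmission target action = forkIO $ do
  tid <- myThreadId
  labelThread tid ("submit:" ++ target)
  action

-- Route SIGINT/SIGTERM through a custom handler, so a deliberate shutdown
-- is distinguished from an actual error.
installShutdownHandlers :: IO () -> IO ()
installShutdownHandlers gracefulShutdown = do
  void $ installHandler sigINT  (Catch gracefulShutdown) Nothing
  void $ installHandler sigTERM (Catch gracefulShutdown) Nothing
```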

Workbench

Thanks to feedback from the consensus team, we've harmonized tracing configurations for our benchmarks between regular and UTxO-HD node. As the latter is more verbose by default, this is a confounding factor for our metrics: We're analysing north of 90 traces per second per cluster node, so all node flavours are required to be equally verbose.

The benchmarks based on the BLST workload now additionally support scaling budget components up or down at will. This means we can run a given cost model against custom execution budgets, controlling the point where the workload will exhaust it. This enables comparison of performance impact of potential changes to those budgets.
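Conceptually, scaling a single budget component looks like this hedged sketch (type and field names invented for illustration):

```haskell
-- Illustrative only: an execution budget with independently scalable components.
data Budget = Budget
  { budgetSteps :: Integer  -- CPU steps component
  , budgetMem   :: Integer  -- memory units component
  }

-- Scale the steps component by a factor; e.g. a factor of 1/2 halves the
-- CPU budget, making the workload exhaust it twice as early.
scaleSteps :: Rational -> Budget -> Budget
scaleSteps f b = b { budgetSteps = round (fromInteger (budgetSteps b) * f) }
```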

Porting our performance workbench's profile definitions to Haskell has been nearly completed, and an adequate test suite has been implemented. This new component has now entered the validation phase to make sure it correctly replicates all existing profile content.

Tracing

Two new metrics for cardano-node have landed in master - for both the new and legacy tracing systems. They provide detailed build info, and indicate whether the node is a block producer or not.

We're now working on closing the gap in the metric naming schema between new and legacy tracing. The aim is to allow for a seamless interchange, without additional configuration required, so that all existing monitoring services can rely on identical metric names with identical semantics.
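As a hedged illustration of the kind of mechanical mapping involved (the example name is made up):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

-- Translate an internal dotted metric name into the Prometheus-style
-- underscore form, so new and legacy outputs can agree on the exposed name.
toPrometheusName :: Text -> Text
toPrometheusName = T.replace "." "_"

-- e.g. toPrometheusName "cardano.node.blockNum" == "cardano_node_blockNum"
```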

Furthermore, work has begun to factor out the RTView ("real-time view") component of cardano-tracer in the new tracing system. Unfortunately, the component has remained in prototype stage for over a year, and has revealed some design shortcomings. Its aim is to provide an interactive, real-time dashboard based on metrics from all nodes connected to cardano-tracer. The current design has all frontend code baked into the backend service, requiring the entire service to be rebuilt in Haskell even for simple changes in the dashboard. We decided to isolate the component in the current code base, which still allows for optionally enabling it in a build. The long term goal, however, is to convert it into a downstream service: it will ingest metrics by reforwarding, or by querying a REST API, and will provide a clear separation of frontend-facing code. Thus we - and anybody else - can use our favourite web technology for visualizing metrics.

· 5 min read
Michael Karg

High level summary

  • Benchmarking: We've performed and analysed benchmarks in the Conway era, with DReps injected.
  • Development: Tracing DRep data has been implemented; improved error reporting in tx-generator and analysis quick queries are ongoing work.
  • Workbench: We now fully support the new CLI create-testnet-data command and DRep injection into Conway genesis. Haskell profile definition work is ongoing.
  • Tracing: Various additions to Node metrics are being worked on, such as build info and block producer role. Metrics naming will be further harmonized.
  • UTxO Growth: We've finalized analysis and reports of all benchmarks targeting UTxO scaling scenarios.
  • UTxO-HD / LMDB: We've performed multiple runs benchmarking the LMDB (on-disk) backend of UTxO-HD.

Low level overview

Benchmarking

We've run and analyzed a full set of benchmarks comparing the Conway ledger against the Babbage one, on Node 8.10.1-pre. For Conway, our additional goal was to measure a vanilla ledger state against one with a large amount of DReps - and delegations to those DReps - present. The benchmarks used our existing value and Plutus workloads to remain comparable to each other.

Development

Additional ledger queries for the tracing system have been implemented and merged to master. Those capture the number of DReps, and the number of existing delegations to them, as trace output - and thus enable creating a metric on top of it, which can then be monitored.
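As a hedged sketch of the pattern - a trace value from which metrics are derived - assuming trace-dispatcher's Cardano.Logging exports with its LogFormatting class and IntM metric constructor; all record and metric names here are invented:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Cardano.Logging (LogFormatting (..), Metric (IntM))

-- Sketch only: a trace value carrying DRep statistics, from which the
-- tracing system can derive monitorable metrics.
data TraceDRepStats = TraceDRepStats
  { drepCount       :: !Int   -- number of registered DReps
  , drepDelegations :: !Int   -- number of delegations to DReps
  }

instance LogFormatting TraceDRepStats where
  -- forMachine/forHuman elided for brevity
  asMetrics s =
    [ IntM "dreps.count"       (fromIntegral (drepCount s))
    , IntM "dreps.delegations" (fromIntegral (drepDelegations s))
    ]
```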

The (in our case) non-deterministic nature of shutting down different cluster setups - both local and cloud-based - carries the possibility that our transaction generation service occasionally misclassifies a regular shutdown as an error. Furthermore, in the case of network malfunctions, the service's errors are too unspecific. By implementing thread labels for submission threads, corresponding to each submission target, and by adding custom smart signal handlers, we'll improve the generator's error reporting significantly.

The initial tests for quick queries are being developed further. We're moving towards a principled, and generalized, syntax that supports both prepared, parametrizable queries from the application code, as well as ad-hoc queries stated e.g. on the command line.
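Purely as an illustration of the design space - every constructor below is invented, and the real syntax is still being designed - a quick query could be modelled along these lines:

```haskell
import Data.Text (Text)

type Field = Text

data Predicate
  = FieldEquals Field Text
  | FieldAbove  Field Double

data Aggregate
  = Count
  | Mean Field
  | Percentile Double Field

-- A query restricts the trace stream, projects fields from each matching
-- record, and optionally folds the stream into a single summary value.
data QuickQuery = QuickQuery
  { qFilter    :: [Predicate]
  , qProject   :: [Field]
  , qAggregate :: Maybe Aggregate
  }
```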

Workbench

The performance workbench now fully supports the new cardano-cli command create-testnet-data. We use it to inject both stake delegated to stake pools and - recently added - stake delegated to DReps into genesis. It has proven very useful and versatile so far, and will eventually replace the current create-staked command.

Work on porting our performance workbench's profile definitions to Haskell, and providing them with an appropriate test suite, is still ongoing; currently, we're integrating all new profile families that came out of the UTxO growth scenarios.

Tracing

New metrics are being implemented for the tracing system. They will also be part of Prometheus output and as such accessible to monitoring services. There'll be cardano-node's detailed build info, as well as a node's block producer status, meaning the presence of forger credentials. Those new metrics are being backported to the legacy tracing system, too.

Furthermore, we've determined the need to revisit metrics naming. There's still a divergence between naming in the legacy and the new system. While this could be mitigated by passing in extra config options, we think that a transition to the new system should not impose any unnecessary effort for node operators. A design to fully harmonize the existing naming schemata is currently being set up.

UTxO Growth

The UTxO Growth benchmarking series has been finalized. We've finished analyses and reports for all scenarios that were tested and explored.

The overarching questions were, given a network of 32GB host systems, how large can the UTxO set grow in general, how large can it grow before the nodes have to operate close to the RAM limit over extended periods of time, and how does scaling the UTxO set size affect network metrics, such as block diffusion.

A dedicated "UTxO Scaling Squad" was set up, which drove the entire process, and we enjoyed a very focused and productive collaboration with them.

UTxO-HD / LMDB

Last but not least, we were able to benchmark UTxO-HD's on-disk backend on a network of block producing nodes, on a recent 8.9.1 version of cardano-node. The setup allowed for using a direct access SSD device for performance-critical disk I/O, whereas the bulk of ChainDB and ledger snapshots remained on a standard AWS EBS volume.

The benchmarks comprised both optimistic and pessimistic RAM assumptions for the host OS to further optimize I/O via the page cache, as well as medium and large UTxO set sizes - the latter almost tripling current mainnet's size. The results were promising; the LMDB backend has proven able to accommodate large UTxO sets using significantly less RAM than the default all-in-memory node - and with a more than reasonable performance trade-off. Furthermore, when running with pessimistic assumptions, the performance impact on LMDB was only very moderate.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: We've performed benchmarks and analyses for Node versions 8.9.2 and 8.10.0.
  • Development: Design phase for implementing quick queries in the analysis pipeline has begun.
  • Workbench: We're finishing up the new features for the reporting pipeline; Haskell profile definition work is ongoing.
  • Tracing: Improving the Prometheus output is ongoing; the node's build info will be accessible as a Prometheus label.
  • UTxO Growth: Our tooling has been augmented to support benchmarks starting with a non-empty chain.

Low level overview

Benchmarking

We've performed a full set of release benchmarks for Node 8.9.2. Comparing with release 8.9.1, we could not detect any performance risks for that version.

The benchmarks for 8.10.0 have shown a slight improvement in the time the block forging loop needs to evaluate, while resource usage of the cardano-node process was also slightly reduced - a nice performance improvement.

Development

Our analysis pipeline is based on batch analysis of data from over 50 cluster nodes; it consumes very large amounts of trace output ex post facto, when the actual benchmark has terminated. This is very time-intensive, and not viable for observing an additional metric that you only later determine might need consideration.

We're planning to add quick queries into a benchmarking run's trace data to our analysis pipeline. These will be structured such that parameterizable, ad-hoc querying is supported. Initial tests showed that evaluation speed of such queries is fast enough to merit designing a principled, and generalized, syntax for them - and a subsequent implementation.

Workbench

The reporting pipeline has been augmented with direct support for customizable, and stylable, TeX rendering - currently receiving final touches.

Porting our performance workbench's profile definitions to Haskell, and providing them with an appropriate test suite, is ongoing work. It is our goal to both increase reliable profile definition and validation, and facilitate usage by engineers less familiar with the workbench.

Tracing

The work to improve system metrics as presented to Prometheus is still ongoing. Type annotations, as well as Prometheus labels introduced for certain metrics (to convey e.g. build information), will make that interface more versatile. This also facilitates configuring monitoring services or dashboards like Grafana on top of those Prometheus metrics.

UTxO Growth

For the UTxO scaling benchmarks, we've augmented the workbench with the capability to support injection of a custom synthesized chain into the deployment, and start a benchmark only after replaying that chain - whereas our benchmarks usually start just with a genesis block.

To achieve that, different components of our tooling needed new features: distributing that chain to the node cluster, and having analysis work without necessarily providing trace evidence of each block in the chain being forged by a benchmarking node. Cluster timing had to be adjusted to account for the gap between genesis start time and the chain tip. However, this entire mechanism opens up the possibility of having a very distinct ledger state at hand for a benchmark - one that's been specifically crafted via a series of pre-defined transactions constituting the blocks during creation of the synthesized chain.

In the future, we plan to flesh out a more general design of that mechanism, which currently is tied to a very specific use case only.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for 8.9.1 have been performed and analysed.
  • Development: We've implemented a benchmarking setup for UTxO-HD's LMDB (on-disk) backend.
  • Workbench: The now modular, nix-based genesis creation has been merged to master; DRep delegation and integration of a new cardano-cli command are ongoing.
  • Tracing: Benchmarking the new handle registry feature in cardano-tracer is complete; quality-of-life improvements to Prometheus output.
  • UTxO Growth: We've adjusted our framework to support running UTxO scaling benchmarks on both a single node and a cluster.
  • Nomad cluster: New multi-cluster support with the capability to quickly adjust to changes in deployed hardware.

Low level overview

Benchmarking

We've performed a full set of release benchmarks for Node 8.9.1. Comparing with release 8.9.0, we could not detect any performance risks for that version.

Development

In the context of UTxO scaling, we want to assess the feasibility of the current on-disk solution (which is LMDB) of a UTxO-HD enabled node. Using that, the UTxO set will be kept in live tables and snapshots on disk, significantly reducing memory requirements.

We've implemented a benchmark setting, and a node service configuration, supporting direct disk access to a dedicated device which can be initialized with optimized file system and mount settings. Its purpose is to serve as storage for the highly performance-critical disk I/O of the LMDB component.

Workbench

Our automation for creating all flavours of geneses has seen cleanup and refactoring - which has been merged to master. It can now use a more principled, and rigorously checked, modular approach to define, create and cache the desired genesis files.

Working on integrating new cardano-cli functionality in our automation is ongoing. The performance workbench will support a different, and updated, CLI command which will allow injection of DRep delegations into genesis.

Tracing

Benchmarking cardano-tracer's new handle registry feature has been performed and evaluated. We're satisfied to see clear performance improvements along with cleaner code, and much better test coverage. Especially the allocation rate and the number of garbage collections (GC) could be significantly reduced, along with the CPU time required for performing GCs. This will allow for higher trace message throughput given identical system resources - plus fewer system calls issued to the OS in the process.

Furthermore, the new tracing system is getting improvements for its Prometheus output - like providing version numbers as metrics, or annotating metrics with their type - enhancing the output's overall utility.
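For illustration, a hedged sketch of the Prometheus exposition format being targeted - a metric annotated with its type and carrying information in labels; the metric name and helper are invented:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

-- Render one Prometheus metric with a TYPE annotation and optional labels,
-- e.g.:  # TYPE cardano_build_info gauge
--        cardano_build_info{version="8.9.1"} 1.0
renderMetric :: Text -> [(Text, Text)] -> Double -> Text
renderMetric name labels value = T.unlines
  [ "# TYPE " <> name <> " gauge"
  , name <> renderLabels labels <> " " <> T.pack (show value)
  ]
 where
  renderLabels [] = ""
  renderLabels ls =
    "{" <> T.intercalate "," [ k <> "=\"" <> v <> "\"" | (k, v) <- ls ] <> "}"
```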

UTxO Growth

The performance workbench now supports profiles aimed at simulating UTxO growth both for a single node and an entire cluster. Additionally, simulating different RAM sizes in combination with specific UTxO set sizes is supported. For a single block producing node, the focus is on quick turnaround when running a benchmark, gaining insight into the node's RAM usage and possible impact on the forging loop.

The cluster profiles enable capturing block diffusion metrics as well; however, they require a much longer runtime. We can now successfully benchmark the node's behaviour when dealing with UTxO set sizes 4x to 5x that of current mainnet, as well as a possible change in behaviour when operating close to the physical RAM limit due to that.

Nomad cluster

Our backend now supports allocating and deploying Nomad jobs for multiple clusters simultaneously - all while keeping existing automations operational. We've taken special precautions so that a cluster, as seen by the backend, can be efficiently and easily modified to reflect newly deployed, or changed, hardware. Additionally, we've added support for host volumes inside a Nomad allocation - which will be needed for benchmarking UTxO-HD's on-disk solution.