
60 posts tagged with "performance-tracing"


Performance & Tracing Update

· 5 min read
Michael Karg
Performance and Tracing Team Lead

High level summary

  • Benchmarking: Release benchmarks for 10.7.1; LSM-trees benchmarks; Plutus interpreter benchmarks.
  • Development: tx-centrifuge - high-pressure tx submission service moved to testing.
  • Infrastructure: Optimizing benchmark genesis caching and post-processing to be highly modular.
  • Tracing: cardano-timeseries-io now integrated with cardano-tracer, offering HTTP query API.
  • Leios: Tx validation benchmarks with beacon; Arithmetic extension of cardano-recon-framework.
  • Node Diversity: Formal trace schema definition ready.

Low level overview

Benchmarking

The release benchmarks for 10.7.1 have been an iterative process. Various changes on 10.7 caused several performance regressions, which needed to be isolated from each other, located individually, and addressed. This meant running and analyzing several full benchmarks in order to confirm each change individually. In the end, the 10.7.1 release turned out to be a small but consistent improvement as far as block production, diffusion, and adoption metrics are concerned, and a huge improvement in CPU time and usage patterns.

With a healthy baseline in place, we were able to run benchmarks on the LSM-trees on-disk backing store. Initial performance results show the on-disk backend to be on par with the in-memory one given sufficient RAM. This is the optimal outcome, as using LSM-trees instead of in-memory without changing the underlying hardware does not incur a performance penalty. We're currently running benchmarks where the underlying hardware is indeed changed, and where multiple constraints on the Haskell heap and the OS's headroom for page caching force disk I/O under low RAM conditions. We'll then assess the effect on system and network metrics.

Last but not least, we confirmed a change to the Plutus interpreter - which impacts its performance characteristics - to be healthy and ready for integration.

Development

The new tx-centrifuge project has reached the MVP stage. It's built for massive tx submission pressure and seamless scaling, so it will be able to saturate a network running Leios over extended periods of time. To confirm all its intended properties, and iron out bumps in the pipeline, we're currently running tests on the benchmarking cluster - albeit on Praos nodes, as for those we know the expected outcome exactly. For details on the tx-centrifuge architecture and design, please see cardano-node PR#6494.

Infrastructure

Our benchmarks require very large genesis files, which are costly (hours, in the worst case) to create for each and every run. This is why our automation uses caching. We're currently reworking the caching mechanism so a genesis can be highly modular. This will lead to much more flexibility in applying a specific benchmarking profile to an existing cache entry, and widen the range of parameters a profile can modify without leading to a genesis cache miss.
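The cache-key idea behind this modularity can be sketched roughly as follows: the key is derived only from the subset of profile parameters that actually shape the genesis, so changing any other parameter cannot invalidate a cache entry. The parameter names below are invented for illustration and are not the workbench's actual profile fields.

```python
# Sketch of modular genesis caching: derive the cache key only from
# genesis-shaping parameters, so non-genesis profile changes still hit
# the cache. Parameter names are hypothetical examples.
import hashlib
import json

# Assumed set of parameters that influence genesis content.
GENESIS_PARAMS = {"utxo_count", "delegator_count", "initial_funds"}

def genesis_cache_key(profile: dict) -> str:
    """Hash only the genesis-relevant subset of a benchmarking profile."""
    relevant = {k: v for k, v in profile.items() if k in GENESIS_PARAMS}
    payload = json.dumps(relevant, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# Two profiles differing only in a non-genesis parameter (tps)
# resolve to the same cache entry:
a = genesis_cache_key({"utxo_count": 10_000_000, "delegator_count": 1300, "tps": 12})
b = genesis_cache_key({"utxo_count": 10_000_000, "delegator_count": 1300, "tps": 15})
```

Widening `GENESIS_PARAMS` coverage in a fine-grained way is what widens the range of profile tweaks that avoid a cache miss.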

Tracing

Our new Haskell library cardano-timeseries-io, which builds and stores timeseries of metrics from multiple sources, has been integrated with the cardano-tracer service. This now enables arbitrary queries over those timeseries to be submitted via an HTTP API. While this does not replace existing Prometheus endpoints, cardano-tracer can now answer PromQL-like queries directly without the need to run a separate scraper (cardano-node PR#6473).

This is currently an experimental feature; the API is not yet stable. We're working on aligning the request and response schemas closely with the Prometheus HTTP API, such that a Grafana integration, or any potential community-built frontend, can reuse much of the existing glue code out there.
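For a rough idea of what Prometheus-style alignment buys a client, the sketch below builds an instant-query URL following the Prometheus HTTP API convention (`/api/v1/query`). The port and the metric name are placeholders, not cardano-tracer's actual endpoint or metric naming.

```python
# Hedged sketch: issuing an instant query against a Prometheus-compatible
# HTTP API. The /api/v1/query path follows the Prometheus convention;
# host, port, and metric name below are assumptions for illustration.
from typing import Optional
from urllib.parse import urlencode

def build_query_url(base_url: str, query: str, at_time: Optional[float] = None) -> str:
    """Build an instant-query URL in the Prometheus HTTP API style."""
    params = {"query": query}
    if at_time is not None:
        params["time"] = str(at_time)  # optional evaluation timestamp
    return f"{base_url}/api/v1/query?{urlencode(params)}"

# Hypothetical example query against a local cardano-tracer instance:
url = build_query_url("http://localhost:3200", "rate(forged_blocks[5m])")
```

Clients written against the Prometheus API shape would then need little or no adaptation to talk to cardano-tracer directly.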

Leios

We've created and performed transaction validation benchmarks for the current Cardano Ledger implementation. The benchmarks use beacon as a framework, which means they look at ledger operations only; of those, everything besides block application under different validation strategies has been factored out. As input data, several synthetic workloads are used, varying in tx content (script execution, or just moving ADA), block / tx batch size, and number of tx inputs. In the context of Leios, this allows for confirming protocol assumptions, and for determining (and validating) potentially necessary optimizations in the ledger. For a full report and discussion, please see Leios issue#553. Next steps will be scaling different workload properties systematically, as well as forcing tx inputs to be read back from disk.

The cardano-recon-framework, a Linear Temporal Logic based verifier for observed system behaviour, now has better support for existential quantification in its propositions, as well as added support for Presburger arithmetic. This arithmetic extension allows for a wider range of properties to be evaluated, which are of particular interest to Leios. Those features, along with some quality-of-life improvements, are already released on CHaP; relevant PRs are cardano-node PR#6531 and cardano-node PR#6546. Current work is narrowing down the context in the framework's output; in case of a property not being satisfied, this will be highly specific as to which piece of evidence is the root cause for it.
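To give a flavour of what a Presburger-style (linear integer arithmetic) property over trace evidence looks like, here is a small illustrative Python sketch. The event names and the property itself are invented for the example and are unrelated to cardano-recon-framework's internals.

```python
# Illustrative sketch: a linear (Presburger-style) constraint evaluated
# over per-kind counts of observed trace events. Event kinds and the
# property are hypothetical examples.
from collections import Counter

def check_linear_property(trace, prop):
    """Evaluate a linear constraint over integer counts of event kinds."""
    counts = Counter(event["kind"] for event in trace)
    return prop(counts)

trace = [
    {"kind": "SlotStart"}, {"kind": "BlockForged"},
    {"kind": "SlotStart"}, {"kind": "SlotStart"},
]

# "A node never forges more blocks than the slots it has seen":
# forged <= slots is a linear constraint over integer counts.
holds = check_linear_property(trace, lambda c: c["BlockForged"] <= c["SlotStart"])
```

Extending the logic's arithmetic fragment is what lets such counting properties be expressed and decided alongside the temporal ones.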

Node Diversity

The comprehensive formal schema definition of the Node's existing trace messages is being merged (cardano-node PR#6527). Definitions are extracted directly from the actual implementation into a fully validated JSON schema. The extracted data is automatically verified, and can be compared to past data, to capture any changes. The schema can be amended manually with comments or refinement types, and these user-provided annotations will be merged with the extracted data - with a notification if any conflict is discovered. Future work will see hardened verification, as well as rendering a human-readable document, detailing the specification exhaustively.
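As an illustration of what a single entry in such a schema could look like, here is a small, hypothetical JSON Schema fragment for one trace message. The message name and its fields are invented for the example and do not reflect the actual extracted schema.

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "TraceMessage.ChainSync.HeaderReceived (hypothetical example)",
  "type": "object",
  "properties": {
    "kind": { "type": "string", "const": "HeaderReceived" },
    "slot": { "type": "integer", "minimum": 0 },
    "peer": { "type": "string" }
  },
  "required": ["kind", "slot"]
}
```

A fully validated schema of this shape is what makes automatic parser derivation and cross-implementation conformance checks possible.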

Performance & Tracing Update

· 5 min read
Michael Karg
Performance and Tracing Team Lead

High level summary

  • Benchmarking: Compiler benchmarks on 10.6.2; Trace evaluation feature benchmarks.
  • Development: Started new project tx-centrifuge: A tx submission service generating extremely high, continuous workload.
  • Infrastructure: Small maintenance items, such as fixing profiled nix builds for local benchmarking.
  • Tracing: New tracing system now its own project: Hermod Tracing; New library cardano-timeseries-io, which accumulates metrics into queryable timeseries, released.
  • Leios: cardano-recon-framework (formerly LTL Trace Verifier) integrated and in use.
  • Node Diversity: Formal trace schema definition nearing merge; Trace forwarding in native Rust on hiatus.

Low level overview

Benchmarking

We've repeated the GHC9.12 compiler benchmarks on Node 10.6.2, which we now know to be completely free of regressions or any space leak. This confirmed our earlier findings that the code generated by GHC9.12 is on par performance-wise as far as block production, diffusion and adoption metrics go, but it exhibits unexplained increases in CPU time, allocations, and minor GCs. Several potential causes have been identified with a profiled build. However, many of those code paths will be replaced or changed in the 10.7 release, so this benchmark will have to be re-run on Node 10.7.

The feature for new tracing, which forces a lazy trace value in a controlled section of code, is slated for inclusion in Node 10.7. To that end, we backported it to Node 10.6.2 and performed feature benchmarks for it - to ensure it won't distort the upcoming 10.7 performance baseline. Indeed we found the performance impact of that feature to be negligible in all categories of observed metrics.

Development

We've started a new project - tx-centrifuge - for transaction submission (i.e. workload generation) during benchmarks and other scenarios. It is meant to be complementary to the existing tx-generator. The latter is tailored very much to our Praos benchmarking use case and the implementation is based on a rather monolithic design. tx-centrifuge's approach however is a different one. It's built for seamless scaling, both horizontally and vertically. This means it will be able to saturate a network running Leios over extended periods of time, due to its massive tx output. Furthermore, it's able to cut down the setup phase (where UTxOs are created for benchmarking) and immediately launch into the benchmark phase. This also enables it to function as a potentially long-running, configurable submission service for scenarios other than benchmarking. The implementation is currently in prototype stage.

Infrastructure

As far as infrastructure is concerned, we've addressed various small-sized maintenance tasks. This includes fixing profiled nix builds for local benchmarks, migrating benchmarking profiles and configs to the upcoming Node 10.7 release and increasing robustness of the locli analysis tool in dealing with incomplete / partial trace output.

Tracing

Our new tracing system has been set up as its own project - and named the Hermod Tracing System. As of now, we've only migrated the core package trace-dispatcher. This marks the first step of eventually moving all tracing and metrics related packages out of the cardano-node project, and bundling them with consistent branding, API and documentation. Eventually, the system will be generalized so that it can be used by any Haskell application - not just cardano-node. Seeing that the dmq-node already adopted it, we have reason to assume it might be considered by the broader community as the go-to choice for adding principled observability to an application.

We've built and released a new Haskell library cardano-timeseries-io (cardano-node PR#6495). The library builds and stores timeseries of metrics from multiple source applications, much like Prometheus. It can process queries over those timeseries in a query language quite similar to PromQL. Integration into cardano-tracer, the trace / metrics processing service, is ongoing work. It will allow for custom monitoring solutions and alerts directly from cardano-tracer, without the need to scrape metrics and maintain them externally. It is not meant to replace existing Prometheus endpoints, rather provide richer functionality out of the box if desired: cardano-node PR#6473.

Leios

We've released cardano-recon-framework, formerly known as the Linear Temporal Logic (LTL) Trace Verifier (cardano-node PR#6454). It's already seen adoption, and is used productively to verify system properties and conformance exclusively based on live trace output. We've been asked by Formal Methods Engineering to extend the LTL fragment the framework uses, such that a wider range of properties can be expressed; work on that is already ongoing.

Node Diversity

The comprehensive formal schema definition of all the Node's existing trace messages is nearing integration / merging. The initial version will be able to extract all definitions from the actual implementation into a fully validated JSON schema. Future work will address completing the automated verification suite, adding a mechanism to amend the extracted schema manually (e.g. with comments or refinement types) and a pipeline to facilitate usage, such as automatic derivation of a parser, or rendering of a human-readable specification PDF.

Due to resourcing issues, the trace / metrics forwarding mini-protocol implementation in native Rust unfortunately had to be put on hiatus for the foreseeable future.

Performance & Tracing Update

· 5 min read
Michael Karg
Performance and Tracing Team Lead

High level summary

  • Benchmarking: Release benchmarks for 10.5.4 and 10.6.2; Parallel GC benchmarks.
  • Development: Preparation of new PlutusV3 baseline.
  • Infrastructure: Performance cluster gets custom, isolated Nix cache - safe benchmarks for security-critical changes.
  • Tracing: Improving robustness by forcing lazy values in controlled sections of code.
  • Leios: LTL Trace Verifier completed, awaiting integration.
  • Node Diversity: Formal trace schema definition entering validation phase; Trace forwarding in native Rust entering testing.

Low level overview

Benchmarking

We've performed, analysed and published release benchmarks for both Node versions 10.5.4 and 10.6.2, and determined both to be free of performance regressions. The 10.6.2 release contains the new 'Defensive Mempool' feature, which is therefore also covered by our benchmark. 10.6.2 proved somewhat more efficient in its use of CPU time, but exhibited a slightly higher tendency to perform Major GC cycles.

To that end, we've reopened an old PR which changes the default / recommended GC settings for the Node process to a parallel, load-balanced GC (cardano-node PR#6222). The motivation is to update the current recommended settings (which are still tuned to GHC8.10) such that the occurrence of Major GC cycles is greatly reduced (as they may temporarily halt the Node process to complete). We found in our benchmark that, apart from being even slightly more efficient regarding CPU time, the new settings reduce the occurrence of Major GCs by a factor of almost 30.

Development

We're performing an overhaul of the plutus-scripts-bench package, a library of benchmarkable Plutus scripts targeting various aspects of the Plutus interpreter, the respective cost model and the execution budgets. The aim is to create up-to-date performance baselines by using exclusively PlutusV3 scripts that have been built with a recent version of the compiler - thus factoring in potential performance improvements in generated code. Currently, cardano-node PR#6440 is work in progress.

Infrastructure

Up to now, a benchmarking deployment required the target commit to be a public item on GitHub; the nix build (or cache retrieval from our CI) would be decentralized, with each cluster instance creating the benchmarking artifact independently. When there's a requirement to benchmark security-critical changes in an isolated, opaque fashion, this approach quickly reaches its limits. Together with SRE, we devised a way to achieve just that: an artifact can be built from a local commit on one cluster instance into its nix store, which in turn serves as a substituter (i.e. cache) in a centralized manner for all other instances (cardano-node PR#6450).
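To give a rough idea of the mechanism, the fragment below sketches what the nix configuration on a follower instance might contain when one builder instance serves its store as a binary cache over SSH. The hostname and key are placeholders; the actual cluster setup may well differ.

```
# /etc/nix/nix.conf on a follower instance (illustrative placeholders)
# The builder instance's store is consulted as an extra binary cache.
extra-substituters = ssh-ng://builder.cluster.internal
extra-trusted-public-keys = builder.cluster.internal-1:BASE64PUBLICKEY=
```

With this in place, the artifact is built once from the local commit and fetched by every other instance, without the commit ever becoming public.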

Tracing

The new tracing system highly encourages trace values to be lazy. Thus, the emitting thread has the lowest possible overhead when doing so - which is highly relevant when you're on a hot code path. Furthermore, this overhead is assumed to be a constant factor - regardless of whether those traces are consumed by any subscriber or not. We're currently exploring an approach to increase robustness, guarding against shaky implementations of trace values themselves. The burden of evaluating a lazy trace still remains with a subscriber, however, this is now decoupled from handing over the trace result (such as a log line, a metric, etc.). By forcing a lazy trace value in a controlled section of code, immediately prior to handover, the system will reliably handle even blatant implementation errors in lazy traces.
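The handover idea can be sketched in a few lines; the actual system is Haskell, so this Python version with hypothetical names is only meant to convey the control flow: the lazy trace value is forced at one controlled point, so a broken trace implementation surfaces as a handled error rather than crashing the handover path.

```python
# Python sketch of forcing a lazy trace value in a controlled section:
# the trace arrives as an unevaluated thunk; forcing it immediately
# prior to handover isolates implementation errors in the trace itself.
def handover(render_thunk, emit, on_error):
    """Force the lazy trace value, then hand the rendered result over."""
    try:
        rendered = render_thunk()   # controlled forcing point
    except Exception as exc:
        on_error(exc)               # a broken trace cannot crash handover
        return
    emit(rendered)

lines = []
handover(lambda: "slot=42 forged", lines.append,
         lambda e: lines.append("trace failed"))
handover(lambda: 1 / 0, lines.append,          # deliberately broken trace
         lambda e: lines.append("trace failed"))
```

The cost of forcing still falls on the subscriber, not the emitting thread, which preserves the low-overhead property on hot code paths.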

Leios

The Linear Temporal Logic (LTL) Trace Verifier, Cardano Trace LTL, has reached production readiness. It is able to ingest multiple streams of trace evidence - essentially multiple Node log files as they're being produced - and continuously evaluate a set of LTL propositions against them. While performant, real-time evaluation is a valuable thing to have, it required some of the LTL operators to be bounded in order to operate in constant space over a long time. We've discussed our fragment of LTL with Formal Methods to start building a collection of properties worth checking, and to ensure the bounded operators provably introduce no disjoint semantics.
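To illustrate why bounding enables constant-space evaluation, here is a small Python sketch of a bounded "eventually" (F) operator over a streaming trace; it conveys the idea only and is unrelated to the verifier's actual implementation.

```python
# Illustrative sketch: a bounded "eventually within k steps" operator.
# Because the bound is finite, evaluation needs only a step counter,
# never the whole history - hence constant space on an endless stream.
from typing import Callable, Iterable

def eventually_within(trace: Iterable[dict],
                      pred: Callable[[dict], bool], k: int) -> bool:
    """True iff `pred` holds for some event among the first k events."""
    for i, event in enumerate(trace):
        if i >= k:
            return False    # bound exhausted; no further state is kept
        if pred(event):
            return True
    return False

trace = [{"kind": "SlotStart"}, {"kind": "BlockForged"}, {"kind": "BlockAdopted"}]
ok = eventually_within(trace, lambda e: e["kind"] == "BlockForged", k=3)
```

An unbounded F, by contrast, can never return False on a live stream and may have to retain obligations indefinitely.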

The service is currently being integrated with the existing tooling in the cardano-node project, and will form a regular part of Leios setups / deployments in the future.

Node Diversity

We've reached basic viability of the comprehensive formal schema definition of all the Node's existing trace messages. We're building an automated verification suite that will ensure all definitions are fully compliant with existing JSON schema, as well as the observables implemented in (and the trace messages logged by) the Haskell Node conform to the defined schema. Further manual refinement of types in the schema will be the next step; eventually, this will serve as a basis to automatically derive parsers, and to render a human-readable reference documentation.

The implementation of our trace / metrics forwarding mini-protocol in Rust has completed and is now in testing phase. After cleanup and merge, this allows Rust projects to emit Cardano-style structured traces directly, and forward them to a running cardano-tracer for logging, processing and metrics exposition.

Performance & Tracing Update

· 6 min read
Michael Karg
Performance and Tracing Team Lead

High level summary

  • Benchmarking: Updated 10.5 and 10.6 performance baselines; Local reproduction of LSM-trees benchmark isolating space leak; merged LSM-tree profiles.
  • Development: HTTPS connections with cardano-tracer; PromQL-like query language and metrics timeseries for cardano-tracer.
  • Infrastructure: Reporting pipeline switched to new typesetting tool; 10.5.x backports to re-enable cluster benchmarks; extensive improvements to workbench's profiling capabilities and nix API.
  • Tracing: Shared traces between cardano-node and dmq-node; cardano-submit-api switched to new tracing; Small improvements to default configs.
  • Node Diversity: Comprehensive formal schema definition for trace messages; Proof-of-concept: Trace forwarding in native Rust.

Low level overview

Benchmarking

We've re-established performance baselines for both the 10.5 and 10.6 branches of Node releases. When many small changes accumulate over time, it's necessary to factor them into the baseline (unless they show regressions) - otherwise one loses the ability to efficiently localize the underlying causes of performance metric changes.

Last month's cluster benchmark of LSM-trees pointed towards a space leak in the integration. To be certain, we successfully recreated locally the network behaviour that was observed on the cluster. With this at hand, Consensus engineers managed to localize the leak and create a potential fix - validation on the cluster is still outstanding.

In the meantime, the benchmarking profiles targeting on-disk LSM-trees have been solidified, merged to master and eternalized in our encyclopedia of benchmarking profiles cardano-profile.

Development

The trace consumer and processor service cardano-tracer is currently being upgraded with optional HTTPS-enabled connections. When scraping or browsing Node metrics across public networks, it's highly recommended to encrypt traffic. So far this had to be achieved by placing cardano-tracer behind a webserver proxy which speaks HTTPS; with the planned change, cardano-tracer can be configured to do this directly when provided with the relevant certificates.

The work on aggregating timeseries of Node metrics and evaluating PromQL-like queries over them is almost complete. Currently, it's a standalone application; we're planning to integrate it into cardano-tracer as a next step. We've had SRE review the up to now unnamed query language for conciseness and utility - a sort of UAT if you will. This building block forms the foundation for alerts or monitoring dashboard data queries, which in the future can be handled directly by cardano-tracer, if so desired.
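A toy version of the underlying idea, an in-memory timeseries store answering a PromQL-like `rate` query, might look as follows. The class, method names, and query surface are assumptions for illustration, not the actual API of this work.

```python
# Minimal sketch of a metrics timeseries store with a PromQL-style
# rate() query: per-second increase of a counter over a trailing window.
# All names are hypothetical, not the real library's API.
from collections import defaultdict

class TimeSeriesStore:
    def __init__(self):
        self._series = defaultdict(list)    # metric name -> [(t, value)]

    def ingest(self, name: str, t: float, value: float) -> None:
        self._series[name].append((t, value))

    def rate(self, name: str, window: float) -> float:
        """Per-second increase of a counter over the trailing window."""
        points = self._series[name]
        if len(points) < 2:
            return 0.0
        t_end = points[-1][0]
        inside = [(t, v) for t, v in points if t >= t_end - window]
        if len(inside) < 2:
            return 0.0
        (t0, v0), (t1, v1) = inside[0], inside[-1]
        return (v1 - v0) / (t1 - t0)

store = TimeSeriesStore()
for t, v in [(0, 0), (30, 60), (60, 120)]:
    store.ingest("txs_processed", t, v)
r = store.rate("txs_processed", window=60)   # counter grows 2.0/s
```

Alert rules and dashboard panels then reduce to evaluating such queries periodically against the store.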

Infrastructure

The integration of the Typst typesetting document compiler into our reporting pipeline is completed and merged to master in cardano-node PR#6418. Not only does that enable richer document features for the future; it also compiles much faster than the previous Emacs Org mode / LaTeX based pipeline - and produces smaller and more accessible PDF files. The PR also drops a nix flake input required by the previous pipeline, which should result in smaller and faster workbench builds both locally and in CI - and de-risk future nixpkgs bumps.

We performed some necessary backports from master to re-enable the release/10.5.x branch for cluster benchmarking. As cluster maintenance has moved on and deviated over the months, we're now ready to swiftly benchmark a potential new 10.5 patch release, should it see the light of day (cardano-node PR#6421).

We also merged a large PR (cardano-node PR#6380) that brings many improvements to the performance workbench. We reworked the configuration of profiled builds (where the Haskell runtime itself profiles execution). There is now full support for info-table profiling, and much more flexibility in configuring the runtime to write out an eventlog. Our automation now grants a long enough grace period to write out all profile data before Node processes are terminated. The PR also cleaned up the workbench's nix API, leading to more straightforward dependency resolution, a cleaner separation of concerns, removal of dead code, and an improved hit/miss ratio for cached binaries.

Tracing

Establishing trace-dispatcher (the Node's new tracing system) as the default for dmq-node is nearly complete. We're currently performing a large refactoring with the goal to share code defining trace rendering between cardano-node and dmq-node; they do use the same ouroboros-network components after all. This will avoid redundant implementation, and ensure the same traces are emitted the same way by either application.

We've ported a separate package in the Node project, cardano-submit-api, to use trace-dispatcher for logging and metrics - with nearly no changes to messages and metrics at the surface: cardano-node PR#6326. This removes the legacy tracing dependency from the Submit API - a requirement to eventually retire the system from the Node as well. However, as legacy tracing is no longer supported, users are advised to adjust their configs accordingly starting with cardano-submit-api-10.2.

We've also merged a small PR (cardano-node PR#6409) that contains improvements to the new tracing system for both users and implementors. Most prominently, we've adjusted default values and configs to prevent accidental misconfiguration of the forwarding backend - which should make for a better user experience and adoption.

Node Diversity

Additionally, we've been working on building a comprehensive formal schema definition of all the Node's existing trace messages. The Haskell Node being the current reference implementation, it is a logical starting point. The goal is to provide a fully compliant JSON schema so any tooling, or verification suite, can automatically derive parsers from it. Furthermore, it should be renderable into human-readable formal documentation. We believe this to be crucial to enable unified trace semantics across diverse Node implementations; right now, we're still evaluating the most effective approach to maintain this over a long time, and guarantee consistency with the implementation.

Last but not least, we're building a proof-of-concept of how to implement our forwarding mini-protocol in Rust. This would allow for all Rust-based projects to emit Cardano-style structured traces directly, and forward them to a running cardano-tracer for logging, processing and metrics exposition.

Performance & Tracing Update

· 4 min read
Michael Karg
Performance and Tracing Team Lead

High level summary

  • Benchmarking: 10.6 benchmarks confirming heap size fix; First LSM-trees benchmarks.
  • Infrastructure: New typesetting tool for reporting pipeline.
  • Tracing: Increased robustness of the PrometheusSimple metrics backend; previous quality-of-life improvements released.
  • Leios: Linear temporal logic based trace verifier demo for Leios.

Low level overview

Benchmarking

The underlying cause of the increase in RAM usage on Node 10.6.0 has been identified and addressed. While some heap size increase is still present outside of our benchmarking environment, its extent is negligible. We've re-run cluster benchmarks to confirm the fix is successful.

Additionally, we've performed and analyzed benchmarks on several LSM-trees integration branches. As this feature has not yet been released in any Node version, it is not yet fully configurable; the benchmarks should be understood as a very early performance assessment. We've performed benchmarks for both in-memory and on-disk backing stores. Especially for the on-disk benchmarks, we observed RAM usage decreasing clearly, with only small increases in CPU usage. While there is some extra cost to block adoption, cluster diffusion metrics still remain almost identical to the in-memory benchmarks - mostly due to header pipelining. As we didn't artificially constrain memory, the benchmarks are illustrative of LSM-trees behaviour when there's no pressure from the garbage collector: given that, will on-disk LSM-trees use caching / buffering efficiently, or will they perform redundant disk I/O? The answer is the former.

Infrastructure

For convenient creation of reporting documents, we're integrating a new typesetting tool: The brilliant, open-source Typst project promises fully typesettable and scriptable documents, while maintaining a syntax that is (almost) as easy to grasp as Markdown. Typst extensions even render our gnuplots inline - and fast. Easily scriptable styling enables us to deliver an often requested feature: Colorizing individual result metrics based on how risky (or beneficial) a deviation from the baseline is deemed to be. Up to now, our reporting pipeline depended on Emacs Org mode and a medium-sized LaTeX distro as part of the Performance Workbench; we might be able to drop these heavy dependencies in favor of something more modern soon.

Tracing

The Node's internal PrometheusSimple backend to expose metrics has received several robustness improvements. All those aim to mitigate factors in the host environment which can contribute to the backend becoming unreachable. It will now reap dangling socket connections more eagerly, preventing false positives in DoS protection. Furthermore, there now is a restart/retry wrapper in place, should the backend fail unexpectedly. All start and stop events are traced in the logs, exposing a potential error cause. Merged in cardano-node PR#6396.
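The restart/retry idea described above can be sketched as a small supervision loop; this Python version with invented names only illustrates the control flow (bounded restarts, every start and stop event traced), not the backend's actual Haskell implementation.

```python
# Hedged sketch of a restart/retry wrapper for a metrics backend:
# on unexpected failure, restart up to a bound, tracing every
# start/stop event so a potential error cause is exposed in the logs.
import time

def run_with_restart(backend, max_restarts: int, log, delay: float = 0.0):
    """Run `backend`, restarting on failure up to `max_restarts` times."""
    attempts = 0
    while True:
        log(f"backend start (attempt {attempts + 1})")
        try:
            backend()
            log("backend stopped normally")
            return True
        except Exception as exc:
            log(f"backend failed: {exc}")
            attempts += 1
            if attempts > max_restarts:
                log("giving up")
                return False
            time.sleep(delay)    # brief backoff before retrying

# Simulated backend that fails twice, then runs cleanly:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("socket error")

events = []
ok = run_with_restart(flaky, max_restarts=5, log=events.append)
```

Combined with eager reaping of dangling sockets, such a wrapper keeps the metrics endpoint reachable through transient host-environment hiccups.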

The previous batch of quality-of-life improvements in cardano-node PR#6377 has also been merged and released. It includes Prometheus HTTP service discovery for cardano-tracer, more robust recovering and tracing of forwarding connection interruptions as well as stability improvements for engineers implementing tracers.

Leios

Our conformance testing framework which evaluates linear temporal logic propositions against trace output has matured. It has seen some performance and usability improvements, for instance a helpful human-readable output as to what minimal sequence of traces caused some proposition to be false - and the ability to consume traces from an arbitrary number of nodes instead of only one. We've already created several propositions targeting the well-behavedness of the block forging loop; diffusion related propositions for Praos and eventually Leios are logical next steps.

Even though this framework was built with Node Diversity in mind, we could showcase it at this month's Leios event, and demonstrate what it could deliver for this project as well - and we were very satisfied with the reception it got.

Performance & Tracing wishes you Happy Holidays...

...and a Joyful New Year!