51 posts tagged with "performance-tracing"

· 3 min read
Michael Karg

High level summary

  • Benchmarking: Governance action / voting benchmarks on Node 10.0; performed PlutusV3 RIPEMD-160 benchmarks.
  • Development: Governance action workload fully implemented; generator-based submission is ongoing work.
  • Tracing: New tracing system production ready - cardano-tracer-0.3 released; work advancing on typed-protocols-0.3 bump and metrics naming.

Low level overview

Benchmarking

We've run and analyzed the new voting workload on Node 10.0. This workload is a stream of voting transactions submitted on top of the existing value workload from release benchmarking. The delta in the comparison can thus be read as the "performance cost of voting" in the Conway ledger era. The workload itself puppeteers 10,000 DReps overall, who vote on up to 5 governance actions simultaneously. We made sure these are mutually independent proposals, that vote tallying occurs, and that the actions get ratified and enacted (and hence removed from the ledger). Then, voting moves on to the next actions - keeping the number of actions needing vote tallying stable over the benchmark. We could observe a very slight performance cost of voting; it's deemed to be a reasonable one given the stress we've put the system under.
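The steady-state idea above (a fixed number of open actions, each replaced once it is ratified and enacted) can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the names `simulate_voting`, `open_slots` and `votes_to_ratify` are hypothetical, ratification is modelled as a plain vote count rather than stake-weighted tallying, and nothing here reflects the actual workload implementation.

```python
def simulate_voting(num_votes, open_slots=5, votes_to_ratify=3):
    """Toy model: cast votes round-robin over a fixed set of open actions.

    Once an action collects `votes_to_ratify` votes (a hypothetical,
    simplified threshold) it is treated as ratified and enacted, removed,
    and replaced by a fresh action, so the number of open actions stays
    constant for the whole run.
    """
    open_actions = {i: 0 for i in range(open_slots)}  # action id -> votes so far
    next_id = open_slots
    enacted = []

    for n in range(num_votes):
        ids = sorted(open_actions)
        target = ids[n % len(ids)]          # round-robin over open actions
        open_actions[target] += 1

        if open_actions[target] >= votes_to_ratify:
            del open_actions[target]        # enacted: leaves the open set
            enacted.append(target)
            open_actions[next_id] = 0       # voting moves on to a new action
            next_id += 1

    return enacted, open_actions


if __name__ == "__main__":
    enacted, still_open = simulate_voting(15)
    print("enacted:", enacted)
    print("still open:", sorted(still_open))
```

The point of the sketch is the invariant, not the numbers: however many votes are cast, the set of actions that need tallying keeps the same size.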

The results can be found here along with those from release benchmarks.

Furthermore, we've developed and run a new Plutus benchmark targeting the RIPEMD-160 internal. We've compared the resulting performance observations against other Plutus workloads - both memory-constrained and (same as RIPEMD-160) CPU-constrained. We have concluded that there are no performance risks to that algorithm in PlutusV3, given existing execution budgets, and that it's consistently priced with respect to other CPU-intensive internals.

Development

The voting workload is currently implemented using decentralized submission via cardano-cli on each of our cluster machines. It has proven reliable - and scalable, at least to some extent. We're already working on improvements that reduce the (very slight) overhead of using the CLI for submission. Additionally, we're aiming for a linear performance comparison when submitting twice the number of votes per transaction at the same TPS rate - forcing double the work for vote tallying.

Implementation of that workload using the centralized (and far more scalable) tx-generator submission service is still ongoing.

Tracing

Metrics naming is currently receiving a last round of consistency checking, so that it's aligned as closely as possible between the legacy and new tracing systems. In the process, we're addressing aspects of documentation, and incorporating feedback to define a few metrics in the new system that previously were present in the legacy one only.

For migrating to the new typed-protocols-0.3 API, two of the new tracing system's packages are affected. The work for ekg-forward-0.7 is completed and merged to master - yet to be released on CHaP. Work on the second package, trace-forward, is ongoing.

We've finally released cardano-tracer-0.3, which incorporates all features, enhancements and fixes that have been reported on here over the past months. This release marks 100% production readiness of the new tracing system. We're focusing now on making documentation and example scripts and configs yet more user-friendly for community rollout. We're very much looking forward to receiving feedback - and have time and space reserved to address it, as well as to provide initial support for the migration away from the legacy system.

· 2 min read
Michael Karg

High level summary

  • Benchmarking: Started release benchmarks for Node 10.0.
  • Development: Governance action workload - alternative tx submission method built, passes tests.
  • Tracing: Preparing the bump to typed-protocols-0.3.

Low level overview

Benchmarking

We've started the benchmarking process for the freshly tagged, fully Chang 2 capable Node version 10.0 pre-release.

Development

Calibrating a governance action / voting workload within our submission service tx-generator is more involved than anticipated.

As measurements for performance impact of voting are required very shortly, we have - in parallel - created a nix / bash based solution. That one uses cardano-cli for creating and submitting proposals and voting transactions, while the generator can run any other known workload simultaneously. Thus, we expect to get a clear performance delta between voting vs. no voting going on. This setup has already been deployed, and is passing testing - soon to be used for the first real-world voting benchmarks.

The implementation however is less flexible, much less parametrizable, and in its design tied to the very specific, fixed topology of the Nomad cluster. The workload definition inside tx-generator will thus continue, and eventually be used as the standard for benchmarks targeting voting / governance.

Tracing

The new tracing system - more specifically, the components that forward metrics and traces to cardano-tracer - contains well-defined peers in the sense of the typed-protocols package. The upcoming bump to the recently released version 0.3 brings breaking changes in the package API. We've begun the necessary downstream adjustments in our packages, re-defining the aforementioned peers using the new API.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: New GHC9 benchmarks for Node 9.2.
  • Development: Progress on Governance action workload.
  • Workbench: Switch to Haskell-based profile content generation is imminent, along with significant code cleanup.
  • Tracing: New major release: cardano-tracer-0.3; metrics alignment is ongoing.
  • Consensus QTAs: Automation setup and implementation for beacon is complete, entering testing.

Low level overview

Benchmarking

The GHC team has been busy investigating the optimization behaviour of GHC9 vs GHC8 on the Cardano code base. It appears that speculative evaluation (a feature absent from GHC8) can, in some cases, lead to overeager evaluation - and hence to an unnecessary performance impact on the system. We've created a build of the Node with a patched version of GHC9.6 which disables speculative evaluation for these cases only - and then we've run our release benchmark workloads on a cluster of those nodes. The raw data is still under analysis, but initial results look promising.

Development

The new governance action / voting workload for cluster-wide benchmarks is still in the works. Our submission service tx-generator is now able to assume the identity of all registered DReps to sign votes. Currently, we're working on defining (and correctly throttling) a stream of those votes, such that a constant number of open proposals keep receiving votes throughout the entire benchmarking run.

Workbench

The Haskell service to create benchmarking profile content, cardano-profile, has been in beta use for some time now. It has proven to be much more maintainable, and its approach to declare profile content is much more principled than the existing implementation with jq / bash. We've decided to switch to that service for good, including a final validation of all possible profiles between implementations. In the same go, we'll take the opportunity to remove some bulky parts of workbench which were motivated by the jq implementation, as well as simplify the corresponding nix evaluations (and redundant shell invocations).

Tracing

Further adjustments to the metrics naming schema in the new tracing system are ongoing.

For cardano-tracer, several PRs have successfully landed. EKG monitoring is now capable of serving many metrics stores from just one process. Prometheus exposition has been made OpenMetrics compliant. The space leaks in the forwarding backend have been verifiably closed, and the log rotation issue has been resolved. Thus, cardano-tracer will see (alongside the next Node release) a new major release 0.3. This release is considered to be 100% production-ready.

Consensus QTAs

beacon - a first step in building a benchmarking framework based on Delta-Q system design - has received a fully functional automation. According to the design we settled on 3 months ago, it's now possible to test, manage, and deploy a nix service for a self-hosted GitHub runner which performs beacon benchmarks on pre-defined workload fragments. The runner can be triggered automatically or manually. As the nix service will likely share a host with other, potentially resource intensive tasks, a locking mechanism is implemented to prevent distortion of the measurements. The automation is now entering testing phase.
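The post does not say how the locking mechanism is implemented. One common way to keep a benchmark from overlapping with other jobs on a shared host is an advisory file lock, sketched below. Everything here is an assumption for illustration: the helper name `benchmark_lock`, the lock file path, and the choice of `flock` (Unix only) are not taken from beacon or its nix service.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def benchmark_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of a run.

    Raises BlockingIOError immediately if another holder already owns the
    lock, instead of waiting, so the caller can decide to skip or retry.
    The lock file path is a hypothetical convention shared by all jobs.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes the call fail fast rather than block.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


if __name__ == "__main__":
    try:
        with benchmark_lock("/tmp/example-benchmark.lock"):
            print("lock held: safe to start measuring")
    except BlockingIOError:
        print("host is busy: skipping this run")
```

Such a lock only helps if every resource-intensive job on the host cooperates by taking the same lock; it does not stop unrelated processes from using the machine.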

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 9.2.0. Validating the new "age of Voltaire" performance baseline.
  • Development - New Tracing System: A space leak in the forwarding mechanism was fixed; a log rotation bug is being investigated.
  • Workbench: Large refactoring of workbench, optimizing nix closure size and adding profile flake outputs. Adjusted Nomad backend was merged.
  • Infrastructure: Dropping Vault for the Nomad cluster was tested and merged.
  • Tracing: Further metrics names alignment; OpenMetrics spec compliance; adding annotations to Prometheus metrics; internal monitoring servers routing has entered testing.

Low level overview

Benchmarking

We've run and analyzed a full set of release benchmarks for Node version 9.2.0. In comparison with Mainnet release 9.1.1, we could not observe any performance regression.

Moreover, we've validated the stability of our new "age of Voltaire" performance baseline on 9.1.1. Currently, we're running a cross-comparison between baselines and Node versions 9.1.1 and 9.2.0 to ascertain that the new baseline arrives - at scale - at the same performance observations and predictions as the previous one.

Development - New Tracing System

Forwarding traces and metrics in the new system exhibited a tiny space leak. Under conventional operation, this leak would only become noticeable after running uninterrupted for days or even weeks. It took very hard pressure on the system, and additional profiling, to make it visible. It could be fixed by avoiding unnecessary allocations of continuations: The buffer of objects to forward inherently carries the position of the next object to process, such that a fully evaluated closure can trivially be reused to handle any subsequent forwarding request. This has led to new versions of packages trace-forward-2.2.7 and ekg-forward-0.6. Huge thanks to John Lotoski and Javier Sagredo, whose meticulous information helped to swiftly address the issue.
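The shape of that fix can be shown with a loose analogy. The sketch below is not the trace-forward code and does not reproduce Haskell's evaluation model; `ForwardBuffer` and `make_handler` are hypothetical names. It only illustrates the design point: when the buffer itself tracks what comes next, one handler can serve every request, and nothing new has to be allocated per request to remember the position.

```python
from collections import deque


class ForwardBuffer:
    """Queue of objects waiting to be forwarded.

    The buffer itself knows which object comes next, so no caller has to
    carry a position around between requests.
    """

    def __init__(self):
        self._items = deque()

    def push(self, obj):
        self._items.append(obj)

    def take(self, n):
        """Remove and return up to `n` objects, oldest first."""
        out = []
        while self._items and len(out) < n:
            out.append(self._items.popleft())
        return out


def make_handler(buffer):
    """Build the request handler once; it is reused for every request."""

    def handle(request_size):
        return buffer.take(request_size)

    return handle


if __name__ == "__main__":
    buf = ForwardBuffer()
    handler = make_handler(buf)   # created a single time
    for i in range(5):
        buf.push(f"trace-{i}")
    print(handler(2))
    print(handler(2))
    print(handler(2))
```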

On the benchmarking cluster, we've observed cardano-tracer's log rotation to occasionally misbehave: under certain circumstances, the service leaks handles by not redirecting output to the latest log file in the rotation. We've located the issue and are working towards a fix.

Workbench

We've been working on a major refactoring of workbench code. The main benefit of this endeavour is being able to pull in a very heavy dependency only when required, when building and running the workbench shell. This will especially facilitate runs on CI machines after garbage collections, but also building a local shell on individual developer machines. Additionally, benchmarking profiles designed for the cluster are now provided as nix flake outputs. This allows for building a more versatile automation in the future, where workbench and cardano-node commits won't need to be tied to each other. Last but not least, the refactoring simplified the way the shell commands are evaluated, doing away with nested calls in many instances. The refactoring PR has been thoroughly tested and merged.

Furthermore, the workbench is now prepared for a nixpkgs upgrade and has dropped the container-based Nomad / podman backend - the respective PR was merged successfully.

Infrastructure

Removal of the Vault service for managing benchmarking cluster credentials has been successfully tested and merged. The service is scheduled for final shutdown end of month, reducing hardware cost and maintenance effort.

Tracing

We've received initial feedback regarding the alignment of metrics names between new and legacy tracing systems. Based upon that feedback, we're currently working on some further adjustments to the naming schema.

The implementation for hosting multiple EKG monitors in one single service has been finished and is currently in the testing phase. The dynamic routing to monitoring data, now used both for EKG and Prometheus, reflects the nodes that are connected to cardano-tracer. We've also added a JSON response format, which makes it easier to query and scrape existing routes as part of automations. Finally, this PR also removes the dependency on the snap server framework and transitively on HsOpenSSL (which is prone to cause build issues in the future).

Currently, we're working on various improvements to the Prometheus metric expositions in cardano-tracer. We aim to implement full compliance with the OpenMetrics specification, which should greatly enhance integration processes. Furthermore, metrics will be augmented with # TYPE and # HELP annotations, as tracked in issue cardano-node#5021.
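For readers unfamiliar with those annotations, the sketch below renders a couple of gauges in an OpenMetrics-style text exposition, with `# TYPE` and `# HELP` lines per metric and the terminating `# EOF` marker the specification requires. It is an illustration only: the function name `render_openmetrics` and the metric names are hypothetical, not actual cardano-tracer metrics, and only gauges are covered (other metric types carry additional naming rules).

```python
def render_openmetrics(metrics):
    """Render gauges as OpenMetrics-style exposition text.

    `metrics` is a list of (name, help_text, value) tuples. Each metric is
    preceded by its `# TYPE` and `# HELP` metadata lines, and the whole
    exposition is terminated by `# EOF`.
    """
    lines = []
    for name, help_text, value in metrics:
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"{name} {value}")
    lines.append("# EOF")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    print(render_openmetrics([
        ("example_block_height", "Height of the current chain tip.", 42),
        ("example_peers_connected", "Number of connected peers.", 7),
    ]), end="")
```

The metadata lines are what let a scraper or dashboard know how to interpret a value without out-of-band knowledge of the metric.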

Last but not least, we've closed off issue cardano-node#3988. For adding an optional prefix to metrics names, the Node config option TraceOptionMetricsPrefix can now be used.

· 4 min read
Michael Karg

High level summary

  • Benchmarking: Release benchmarks for Node 9.1.1; additional UTxO-HD in-memory benchmarks.
  • Development: Created a local reproduction for observed UTxO-HD RAM increase.
  • Workbench: Created a new "age of Voltaire" performance baseline. Adjusted Nomad backend has entered testing phase.
  • Infrastructure: Dropping the requirement on Vault, optimizing cluster setup.
  • Tracing: New metrics naming schema was merged. Routing to internal monitoring servers is ongoing. Dropping dependency on HsOpenSSL.

Low level overview

Benchmarking

Runs and analyses for a full set of release benchmarks have been performed for Node version 9.1.1. In comparison with Mainnet releases 9.0 and 9.1.0, we could determine this version does not exhibit any performance regression.

Having been provided with the patch by Consensus targeting the increased RAM usage of the UTxO-HD in-memory backend (read below), we've performed additional benchmarks to validate the desired result on the cluster. Our measurements demonstrate the increased memory need has now vanished. We're confident that by now we've located - and addressed - all performance risks for UTxO-HD in-memory that we can capture given the instruments at our disposal. To gain further confidence in the stability of resource usage pattern and network metrics observed on the benchmarking cluster, we've advised long-running UTxO-HD nodes under close monitoring.

Development

We succeeded in creating a local reproduction of the increase in RAM usage that was observed for the UTxO-HD in-memory backend on the cluster. That reproduction enabled the Consensus team to inspect in real-time and profile running Node processes - which led to a swift identification of the underlying cause and a patch addressing it.

Workbench

After the smooth Chang hard fork which transitioned Cardano into the Conway era, we've created - and merged - a new performance baseline. It's intended for release benchmarks and caters to the new features of the Conway ledger. Apart from incorporating the latest protocol version and Plutus cost models, it includes DRep presence in ledger when performing measurements.

The PR preparing our workbench for a nixpkgs upgrade and removing the container-based Nomad / podman backend is complete and has entered testing phase.

Infrastructure

Currently, our Nomad cluster uses Vault to manage access and credentials for the benchmarking cluster. As the cluster exclusively relies on static routes, and fixed deployment endpoints, encoding access as a set of rules into the cloud infrastructure is a viable option. That way, we will no longer depend on the Vault service, removing the requirement of hosting, and maintaining, an instance of it.

Tracing

Aligning the metrics naming schema and semantics between new and legacy tracing systems has been completed and merged. This will enable a seamless interchange in the community, as all existing configurations of monitoring services retain their validity.

As for hosting multiple EKG metrics monitors in one single service application, we ascertained that the ekg package was not built for that use case. However, we've come up with a much nicer design for cardano-tracer using dynamic routing based on the names of nodes connected to it. It has successfully passed prototype stage in that it's able to serve multiple EKG monitors without the need for any server restart; the full implementation is being worked on.

Last but not least, both existing tracing systems rely on the snap server framework, and thus by transitive dependency, on HsOpenSSL to speak the HTTPS protocol. However, we've determined the latter package to have a risk of breaking the build, both currently and in the future (cf. HsOpenSSL#95 and HsOpenSSL#88). As a consequence, we decided to switch to the wai / warp based framework, which implements HTTPS capability differently, thus preempting the risk. This has already been carried out for the legacy system, and is currently underway for cardano-tracer - a big shoutout to Erik de Castro Lopo for his support on that issue.