
45 posts tagged with "performance-tracing"


Serge Kosyrev

High level summary

Since our last update, we focused on infrastructure work: benchmark enablement, the tracing system, the benchmark environment merge, and open-source support:

  1. SECP benchmarking enablement is underway: we are still working on enabling SECP runs in our cardano-ops benchmarking environment.
  2. The new tracing system: the improved API of the new tracing system was implemented, and we're now porting the tracing integration layer over.
  3. Infrastructure: the mainnet protocol parameter history is now encoded in the workbench profile machinery at epoch-level granularity, which gives us a systematic approach towards description of past and future benchmarks.
  4. New benchmark deployment infrastructure: we've made some progress on the Nomad deployment backend, which is shared by the data publishing and benchmarking needs.
  5. Legacy benchmarking: we've started merging the legacy benchmark deployment infrastructure into the workbench.
  6. Open sourcing: the benchmarking data publishing tool was adapted to the Nomad execution environment provided by SRE, pending final deployment.

Performance

The AWS cluster infrastructure necessary for SECP benchmarking is still being worked on.

Tracing

The improved tracing internals were implemented, and we're now in the phase of updating the tracing integration layer, which is also mostly done.

Infrastructure

Thanks to collaboration with the DevX team, we have identified and pursued a design that would enable our Nomad workbench backend to execute deployments of both the benchmarking cluster and our data publishing components.

On the benchmark parametrisation front, we have eliminated a long-standing weakness in the way we were specifying protocol parameters. We now have a clear and granular method to track protocol parameter evolution -- e.g. the mainnet parameter history is now tracked at epoch granularity, while also allowing for systematically described change overlays. This makes benchmark profile definitions much clearer and more robust against mistakes.
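
To make the approach concrete, here is a minimal sketch of the idea (hypothetical types and field names; the actual workbench encodes this in its profile machinery, not in standalone Haskell): the history is a genesis parameter set plus epoch-indexed change overlays, and the parameters in force at any epoch are obtained by folding the overlays up to that epoch.

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Maybe (fromMaybe)

type Epoch = Word

-- Only two fields for illustration; the real parameter set is much larger.
data ProtocolParams = ProtocolParams
  { maxBlockSize :: Word
  , maxTxSize    :: Word
  } deriving Show

-- An overlay changes only the fields it mentions.
data Overlay = Overlay
  { ovMaxBlockSize :: Maybe Word
  , ovMaxTxSize    :: Maybe Word
  }

apply :: Overlay -> ProtocolParams -> ProtocolParams
apply o p = ProtocolParams
  { maxBlockSize = fromMaybe (maxBlockSize p) (ovMaxBlockSize o)
  , maxTxSize    = fromMaybe (maxTxSize p)    (ovMaxTxSize o)
  }

-- Parameters in force at a given epoch: fold every overlay up to and
-- including that epoch over the genesis parameter set, in epoch order.
paramsAt :: ProtocolParams -> Map Epoch Overlay -> Epoch -> ProtocolParams
paramsAt genesis history e =
  foldl (flip apply) genesis
        (Map.elems (Map.filterWithKey (\k _ -> k <= e) history))

main :: IO ()
main = do
  let genesis = ProtocolParams 65536 8192
      history = Map.fromList
        [ (100, Overlay (Just 73728) Nothing)  -- block size raised at epoch 100
        , (250, Overlay Nothing (Just 16384))  -- tx size raised at epoch 250
        ]
  print (paramsAt genesis history 300)
```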

We also started merging the legacy benchmarking environment (based on cardano-ops) into the workbench. The separation between the environments was too costly, forcing us to implement every benchmarking change twice: first in the workbench, during development, and then in cardano-ops. In addition, maintaining compatibility code incurred extra costs and slowed down benchmark data analysis development. Once complete, this merge will let us sharply cut the benchmark development cycle and its overheads.

Serge Kosyrev

High level summary

  1. SECP benchmarking enablement was completed: we are now able to do local runs of the SECP workloads. The next step is to port this to the AWS environment.
  2. A new workstream for Plutus cost modeling improvement: we've planned and started implementing the smart contract call overhead measurement machinery.
  3. The new tracing system: after doing more benchmarking to address inter-run variance, we discovered that the regression, while still there, is small enough not to be release critical. Nevertheless, we're continuing with the further performance-oriented rework of the internals.
  4. Infrastructure: a significant refactoring of the workbench internals was merged. We also started improving the notation for the ever-evolving protocol parameters. Implementation of comparative analysis for multi-run batches has started.
  5. Open sourcing: our plans matured sufficiently so that we now expect actual deployment work to start this week.

Performance

The SECP benchmarking workload has been fully implemented in the workbench. We are now porting it over to AWS, and after that we'll be running the model cluster workload.

We've also started implementing mechanics for the upcoming investigation of the Plutus smart contract call overhead, which is expected to lead us to improved Plutus cost modeling.

Tracing

After the initial model-scale performance data caused us to panic, we ran more benchmarks, among other things, and it turned out that increased inter-run variance was the culprit. The actual regression averages out to a barely noticeable 1-2% in key metrics -- which is certainly not release critical.

To understand the impact of the new tracing system, we have to bear in mind the extra functionality it provides:

  1. We are now processing all messages generated by the system, without taking the shortcuts the old system had to resort to. That causes the new tracing system to do more work, but it is more useful for all users and developers involved, since it leads to a simple, non-confusing configuration. Incidentally, that's also the area where we are reworking the internals: deducing and enabling the optimisations implied by a particular configuration.
  2. The new tracing system is benchmarked with remote tracing as the default backend (whereas the old one was using the local, built-in log storage mechanism). In some sense it's the fairer benchmark, because that's the way we expect SPOs to set up tracing. That, however, also causes it to do more work.

All that said, since we've established the performance of the new system to be adequate for the release, we won't be delaying it much further.

In addition, we're still pursuing our performance-enhancing rework of the new tracing internals.

Infrastructure

After implementing the multi-backend capability in the workbench, we got the opportunity to reassess the generic/backend boundaries and perform some long-awaited cleanups and simplifications in that area. The results of this work have been merged and will serve as a solid foundation for the CI and cloud backends.

Moving on to analysis, we've also improved the provenance of the raw data by collecting more identifying information and statistics about it. For example, we now record checksums, message frequencies, and timestamps from the log files entering analysis. This will enable us to spot more data anomalies earlier, and to lift that information directly into the generated reports.
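
A minimal sketch of the kind of per-file provenance record this implies (hypothetical names, not the actual analysis pipeline types):

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Time.Calendar (fromGregorian)
import           Data.Time.Clock (UTCTime (..))

-- One parsed log line: its timestamp and trace namespace
-- (the real log format carries much more).
data LogMessage = LogMessage
  { lmTimestamp :: UTCTime
  , lmNamespace :: String
  }

-- Provenance recorded per input log file.
data LogProvenance = LogProvenance
  { lpChecksum    :: String         -- e.g. a SHA-256 of the raw file
  , lpFirstAt     :: UTCTime        -- earliest message timestamp
  , lpLastAt      :: UTCTime        -- latest message timestamp
  , lpFrequencies :: Map String Int -- message count per namespace
  } deriving Show

provenance :: String -> [LogMessage] -> Maybe LogProvenance
provenance _        []   = Nothing
provenance checksum msgs = Just LogProvenance
  { lpChecksum    = checksum
  , lpFirstAt     = minimum (map lmTimestamp msgs)
  , lpLastAt      = maximum (map lmTimestamp msgs)
  , lpFrequencies = Map.fromListWith (+) [ (lmNamespace m, 1) | m <- msgs ]
  }

main :: IO ()
main = do
  let at h = UTCTime (fromGregorian 2022 11 1) (fromIntegral (h * 3600 :: Int))
      msgs = [ LogMessage (at 0) "ChainDB.AddBlock"
             , LogMessage (at 1) "Forge.Forged"
             , LogMessage (at 2) "ChainDB.AddBlock" ]
  print (provenance "sha256:deadbeef" msgs)
```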

A new feature is now under implementation -- the ability to provide comparative analysis of multi-run batches. Previously, we only had automation for two aspects separately, so we could either:

  • compare individual runs (used for different node configurations / versions)
  • collect variance statistics from a batch of runs (used to enhance statistical confidence for a single node configuration / version)

Naturally, combining these two capabilities was a long-desired feature of our analysis pipeline; a minimal sketch of the combined analysis follows.
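
This sketch is hypothetical (not the actual analysis code) and reduces each run to a single metric: summarise each batch into variance statistics, then compare the summaries across batches.

```haskell
-- One benchmark run reduced to a single metric,
-- e.g. mean block adoption time in seconds.
type Metric = Double

data BatchSummary = BatchSummary
  { bsMean   :: Metric
  , bsStddev :: Metric
  , bsRuns   :: Int
  } deriving Show

-- Variance statistics for one batch (assumes at least two runs).
summarise :: [Metric] -> BatchSummary
summarise xs = BatchSummary mean (sqrt var) n
  where
    n    = length xs
    mean = sum xs / fromIntegral n
    var  = sum [ (x - mean) ^ (2 :: Int) | x <- xs ] / fromIntegral (n - 1)

-- Relative change of the mean, with both summaries carried along so the
-- reader can judge whether the difference exceeds the run-to-run noise.
compareBatches :: BatchSummary -> BatchSummary -> (Double, BatchSummary, BatchSummary)
compareBatches base new =
  ((bsMean new - bsMean base) / bsMean base, base, new)

main :: IO ()
main = do
  let baseline  = summarise [1.02, 0.98, 1.01, 0.99, 1.00]  -- e.g. old node version
      candidate = summarise [1.04, 1.01, 1.03, 1.02, 1.05]  -- e.g. new node version
  print (compareBatches baseline candidate)
```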

Serge Kosyrev

High level summary

  1. Benchmarks for the first pre-release component bump of the upcoming 1.36 release have been delivered, and the data shows the bump is clear for release.
  2. SECP benchmarking enablement is underway: the necessary generator features have been implemented, and are now being integrated into the workbench.
  3. The new tracing system: in response to the performance regression we previously discovered, we are working on pre-planned implementation improvements and running more benchmarks.
  4. Infrastructure: the Nomad-based workbench backend has been made closer to a cloud deployment scenario. Cleanup in preparation for Cicero CI/CD integration started.
  5. Open sourcing: ongoing SRE collaboration on production deployment of performance data publishing.

Performance

We ran benchmarks for the first component bump of the upcoming 1.36 release, and we don't see any significant performance changes. The component bumps are therefore clear for release.

Tracing

As for the tracing system regression we spotted: we already had plans for further efficiency improvements even before that, and we are now actively pursuing them. The idea is to collect more statically available information, enabling us to shift more tracing decisions from message delivery time to configuration time.
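
A minimal sketch of that shift (hypothetical names; the real trace-dispatcher API differs): instead of consulting the configuration on every message, resolve a per-namespace severity threshold once, at configuration time, and leave only a cheap comparison on the delivery hot path.

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

data Severity = Debug | Info | Notice | Warning | Error
  deriving (Show, Eq, Ord)

type Namespace = [String]

-- Configuration: a severity threshold per namespace prefix.
type TraceConfig = Map Namespace Severity

-- Resolve the threshold for a namespace once, then return a cheap
-- predicate that closes over the precomputed threshold.
mkFilter :: TraceConfig -> Namespace -> (Severity -> Bool)
mkFilter config ns = \sev -> sev >= threshold
  where
    threshold = resolve ns
    -- Walk up the namespace until a configured prefix is found.
    resolve n = case Map.lookup n config of
      Just s              -> s
      Nothing | null n    -> Info          -- hypothetical global default
              | otherwise -> resolve (init n)

main :: IO ()
main = do
  let config = Map.fromList [ (["ChainDB"], Notice) ]
      admit  = mkFilter config ["ChainDB", "AddBlock"]  -- resolved once
  print (admit Info)     -- False: below the ChainDB threshold
  print (admit Warning)  -- True: at or above the threshold
```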

To support this effort, we also started running more benchmarks and enhanced data analysis with relevant metrics.

Infrastructure

Generation support for Plutus V2 has been implemented, so, with the help of the previously developed looped signature-verifying script, the generator is now capable of producing two SECP workloads: verifying either ECDSA or Schnorr signatures. This is now being integrated into the infrastructure: the generator parametrisation API is being enhanced, and the workbench is being extended to handle the new parametrisation.
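
A minimal sketch of how such a parametrisation could look (hypothetical types and script names; the real parametrisation lives in the tx-generator and workbench APIs): a single workload parameter selects which looped verification script the generator drives.

```haskell
-- Which SECP256k1 signature scheme the looped script verifies.
data SecpWorkload = VerifyEcdsa | VerifySchnorr
  deriving (Show, Eq)

-- Hypothetical generator parameters for a Plutus V2 SECP run.
data GeneratorParams = GeneratorParams
  { gpWorkload   :: SecpWorkload  -- which verification workload to run
  , gpTxCount    :: Int           -- script transactions to submit
  , gpLoopsPerTx :: Int           -- signature verifications per transaction
  } deriving Show

-- The workbench would map the workload onto the corresponding script.
scriptFor :: SecpWorkload -> String
scriptFor VerifyEcdsa   = "loop-secp256k1-ecdsa"    -- hypothetical name
scriptFor VerifySchnorr = "loop-secp256k1-schnorr"  -- hypothetical name

main :: IO ()
main = print (scriptFor (gpWorkload (GeneratorParams VerifySchnorr 1000 10)))
```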

In addition, the workbench is now being enhanced to handle protocol-version-based choices for the Plutus cost model.

The intermediate cloud-compatibility iteration of the workbench cloud enablement effort was merged. We are now doing some cleanup work in preparation for starting the Cicero backend, which will take us nearly all the way to CI/CD integration.

We continue collaboration with SRE on production deployment of data publishing. We now have a gradual rollout plan, which respects the plans for SRE infrastructure feature availability.

We are working on recovering the software dependency manifest feature that was lost with the organisation-wide transition to CHaP.

As usual, a number of smaller workbench, data analysis & reporting improvements have been made.

Serge Kosyrev

High level summary

  1. P2P performance investigation is ongoing, in support of the networking team.
  2. SECP benchmarking enablement is underway: we already have the script and are working on Plutus V2 generation support.
  3. An unexpected setback in the new tracing system: full-scale benchmarks have shown a performance regression, even though local chain syncing benchmarks had shown an improvement over legacy tracing.
  4. On the open sourcing front, we added an integrated data dictionary, which is necessary for explaining ourselves to the world. SRE collaboration on production deployment of performance data publishing has started.
  5. We have started bringing the Nomad-based workbench backend closer to a cloud deployment scenario.

Performance

We are supporting the networking team in the P2P performance investigation. Generation support for Plutus V2 was started. We collaborated with the Plutus team to obtain a SECP benchmark script, which is now ready for use, pending Plutus V2 support. The transaction generator has also been updated to follow the cardano-api changes.

Tracing

We ran an initial round of full-scale benchmarks for the new tracing system, which uncovered a regression relative to legacy tracing -- contrary to the local chain syncing benchmarks, which had shown an improvement instead. We added tracing to cardano-tracer, fixing some minor bugs along the way. Network and disk IO metrics are now collected once again and are integrated into analysis.

Infrastructure

The first iteration of the Nomad-based local workbench backend was completed: it has reached feature parity with the existing supervisor backend. The next iteration has started, bringing it closer to the cloud scenario by deploying to separate Nomad tasks connected by a virtual network. This will serve as the basis for the CI and full cloud backends.

We designed and implemented the authoring pipeline for the performance data dictionary, which will be henceforth embedded in our performance reports. We are collaborating with SRE on production deployment of data publishing.

A number of smaller workbench, data analysis & reporting improvements have been made.

Serge Kosyrev

High level summary

On the performance side, the team ran benchmarks for the P2P feature and the 1.35.4 release. We finished a prototype for performance data publishing. We almost finished the local deployment backend for the workbench using the new SRE deployment infrastructure. We also worked on fixing and improving our data analysis pipeline.

On the tracing side, the team worked on isolating a critical issue causing message loss in the remote tracing backend. The issue was resolved and we now have proper end-to-end coverage for the scenario.

Executive summary

  • The new tracing system's public release is getting closer, as we resolve the remaining rough edges discovered in full-scale deployments. The local benchmarks we ran were already showing an improvement relative to legacy tracing, so we expect similar results at full scale.
  • The first (local deployment) iteration of benchmarking adopting the new SRE deployment infra is nearly done. We thank Michael Fellinger and Robin Stumm for their assistance. Two further phases remain: CI integration and cloud deployment.
  • The benchmarking data publishing prototype is ready. This serves as a springboard for both opening our performance assessment workflow (to support the wider Cardano developer community), and for data provision to the business community. Our next steps are to secure a permanent deployment for this mechanism and to integrate it into the benchmarking infrastructure. This requires collaboration with SRE.