High level summary
- Benchmarking: Release benchmarks for Node
9.2.0
. Validating the new "age of Voltaire" performance baseline. - Development - New Tracing System: A space leak in the forwarding mechanism was fixed; a log rotation bug is being investigated.
- Workbench: Large refactoring of
workbench
, optimizing nix closure size and adding profile flake outputs. Adjusted Nomad backend was merged. - Infrastructure: Dropping
Vault
for the Nomad cluster was tested and merged. - Tracing: Further metrics names alignment; be OpenMetrics specs compliant; adding annotations to Prometheus metrics; internal monitoring servers routing has entered testing.
Low level overview
Benchmarking
We've run and analyzed a full set of release benchmarks for Node version 9.2.0
. In comparison with Mainnet release 9.1.1
, we could not observe any performance regression.
Moreover, we've validated the stability of our new "age of Voltaire" performance baseline on 9.1.1
. Currently, we're running a cross-comparison between baselines and Node versions 9.1.1
and 9.2.0
to ascertain that
the new baseline arrives - at scale - at the same performance observations and predictions as the previous one.
Development - New Tracing System
Forwarding traces and metrics in the new system exhibited a tiny space leak. Under conventional operation, this leak would only become noticeable after running uninterrupted for days or even weeks. It took very hard pressure on the
system, and additional profiling, to make it visible. It could be fixed by avoiding unnecessary allocations of continuations: The buffer of objects to forward inherently carries the position of the next object to process, such that a fully
evaluated closure can trivially be reused to handle any subsequent forwarding request. This has led to new versions of packages trace-foward-2.2.7
and ekg-forward-0.6
. Huge thanks to John Lotoski and Javier Sagredo, whose
meticulous information helped to swiftly address the issue.
On the benchmarking cluster, we've observed cardano-tracer
's log rotation to occasionally misbehave: under certain circumstances, the service leaks handles by not redirecting output to the latest log file in the rotation. We've located the
issue and are working towards a fix.
Workbench
We've been working on a major refactoring of workbench
code. The main benefit of this endeavour is being able to pull in a very heavy dependency optionally only when required, when building and running the workbench
shell. This will especially facilitate runs on CI machines after garbage collections, but also building a local shell on individual developer machines. Additionally, benchmarking profiles designed for the cluster are now provided as nix flake outputs. This allows for building a more versatile automation in the future, where workbench
and cardano-node
commits won't need to be tied to each other. Last not least, the refactoring simplified the way the shell commands are evaluated, doing away with nested calls in many instances. The refactoring PR has been thouroughly tested and merged.
Furthermore, the workbench is now prepared for a nixpkgs
upgrade and has dropped the container-based Nomad / podman
backend - the respective PR was merged successfully.
Infrastructure
Removal of the Vault
service for managing benchmarking cluster credentials has been successfully tested and merged. The service is scheduled for final shutdown end of month, reducing hardware cost and maintenance effort.
Tracing
We've received initial feedback regarding the alignment of metrics names between new and legacy tracing systems. Based upon that feedback, we're currently working on some further adjustments to the naming schema.
The implementation for hosting multiple EKG monitors in one single service has been finished and is currently in the testing phase. The dynamic routing to monitoring data, now used both for EKG and Prometheus, reflects the nodes that are connected to cardano-tracer
. We've also added a JSON response format, which makes it easier to query and scrape existing routes as part of automations. Finally, this PR also removes the dependency on the snap
server framework and transitively on HsOpenSSL
(which is prone to cause build issues in the future).
Currently, we're working on various improvements to the Prometheus metric expositions in cardano-tracer
. We aim to implement full compliance with the OpenMetrics specification, which should greatly enhance integration processes. Furthermore, metrics
will be augmented with # TYPE
and # HELP
annotations, as tracked in issue cardano-node#5021.
Last not least, we've closed off issue cardano-node#3988. For adding an optional prefix to metrics names, the Node config option TraceOptionMetricsPrefix
can now be used.