High level summary
- Benchmarking: Release benchmarks for Node
9.1.1
; additional UTxO-HD in-memory benchmarks. - Development: Created a local reproduction for observed UTxO-HD RAM increase.
- Workbench: Created a new "age of Voltaire" performance baseline. Adjusted Nomad backend has entered testing phase.
- Infrastructure: Dropping the requirement on
Vault
, optimizing cluster setup. - Tracing: New metrics naming schema was merged. Routing to internal monitoring servers is ongoing. Dropping dependency on
HsOpenSSL
.
Low level overview
Benchmarking
Runs and analyses for a full set of release benchmarks have been performed for Node version 9.1.1
. In comparison with Mainnet releases 9.0
and 9.1.0
, we could determine this version does not exhibit any performance regression.
Having been provided with the patch by Consensus targeting the increased RAM usage of the UTxO-HD in-memory backend (read below), we've performed additional benchmarks to validate the desired result on the cluster. Our measurements demonstrate the increased memory need has now vanished. We're confident that by now we've located - and addressed - all performance risks for UTxO-HD in-memory that we can capture given the instruments at our disposal. To gain further confidence in the stability of resource usage pattern and network metrics observed on the benchmarking cluster, we've advised long-running UTxO-HD nodes under close monitoring.
Development
We succeeded in creating a local reproduction of the increase in RAM usage that was observed for the UTxO-HD in-memory backend on the cluster. That reproduction enabled the Consensus team to inspect in real-time and profile running Node processes - which led to a swift identification of the underlying cause and a patch addressing it.
Workbench
After the smooth Chang hard fork which transitioned Cardano into the Conway era, we've created - and merged - a new performance baseline. It's intended for release benchmarks and caters to the new features of the Conway ledger. Apart from incorporating the latest protocol version and Plutus cost models, it includes DRep presence in ledger when performing measurements.
The PR preparing our workbench for a nixpkgs
upgrade and removing the container-based Nomad / podman
backend is complete and has entered testing phase.
Infrastructure
Currently, our Nomad cluster uses Vault
to manage access and credentials for the benchmarking cluster. As the cluster exclusively relies on static routes, and fixed deployment endpoints, encoding access as a set of rules into the cloud infrastructure
is a viable option. That way, we will no longer depend on the Vault
service, removing the requirement of hosting, and maintaining, an instance of it.
Tracing
Aligning the metrics naming schema and semantics between new and legacy tracing systems has been completed and merged. This will enable a seamless interchange in the community, as all existing configurations of monitoring services remain their validity.
As for hosting multiple EKG metrics monitors in one single service application, we ascertained that the ekg
package was not built for that use case. However, we've come up with a much nicer design for cardano-tracer
using dynamic routing based on the names of nodes
connected to it. It has successfully passed prototype stage in that it's able to serve multiple EKG monitors without the need for any server restart; the full implementation is being worked on.
Last not least, both existing tracing systems rely on the snap
server framework, and thus by transitive dependency, on HsOpenSSL
to speak the HTTPS protocol. However, we've determined the latter package to have a risk of breaking the build, both currently and in the future (cf. HsOpenSSL#95 and HsOpenSSL#88). As a consequence, a switch to the wai
/ warp
based framework was decided, which implements HTTPS capability differently, thus preempting the risk. This has already been carried out for the legacy system, and currently is
for cardano-tracer
- a big shoutout to Erik de Castro Lopo for his support on that issue.