High level summary
- Benchmarking: Preliminary 10.3 benchmarks; GHC8/GHC9 compiler version comparison; Plutus budget scaling; runtime parameter tuning on GHC9.
- Development: Started new Plutus script calibration tool; maintenance updates to benchmarking profiles.
- Infrastructure: Adjusted tooling to latest Cardano API version; simplification of performance workbench nearing completion.
- New Tracing: Battle-tested metrics monitoring on mainnet; generalized nix service config for `cardano-tracer`.
Low level overview
Benchmarking
We've run and analyzed several benchmarks these last two weeks:
Preliminary 10.3 integration
As performance improvement is a stated goal for the 10.3 release, we became involved early in the release cycle. Benchmarking the earliest version of the 10.3 integration branch, we could already determine that the effort put in has yielded promising results, and confirm improvements in both resource usage and block production metrics. A regular release benchmark will be performed, and published, from the final release tag.
Compiler versions: GHC9.6.5 vs. GHC8.10.7
So far, code generation with GHC9.6 has resulted in a performance regression for block production under heavy load - we've established that in various past benchmarks. The optimization efforts on 10.3 also focused on removing that performance blocker. Benchmarking the integration branch with the newer compiler version has now confirmed that the regression has vanished; in fact, code generated with GHC9.6 even exhibited slightly more favourable performance characteristics. So in all likelihood, Node 10.3 will be the last release to support GHC8.10, and we will recommend GHC9.6 as the default build platform for it.
Plutus budget scaling
We've repeated several Plutus budget scaling benchmarks on Node version 10.3 / GHC9.6. By scaling execution budgets to 1.5x and 2x their current mainnet values, we can determine the performance impact on the network of potential increases of said budgets. We independently measured bumping the steps (CPU) limit with a CPU-intensive script, and bumping the memory limit with a script performing lots of allocations. We observed the performance impact to correspond linearly with the limit bump in each case. This gives certainty and predictability of the impact when suggesting changes to mainnet protocol parameters.
Our team presented those findings and the data to the Parameter Committee for discussion.
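To illustrate the kind of check behind that linearity statement, here is a minimal, self-contained Haskell sketch that fits a line through (budget multiplier, observed impact) pairs and reports the largest deviation from that line; the numbers are hypothetical placeholders, not our actual measurements.

```haskell
module BudgetScaling where

-- Hypothetical observations: (budget multiplier, observed impact on a metric,
-- e.g. percentage slowdown). Placeholder values only.
observations :: [(Double, Double)]
observations = [(1.0, 0.0), (1.5, 4.1), (2.0, 8.3)]

-- Least-squares slope and intercept of impact vs. multiplier.
fitLine :: [(Double, Double)] -> (Double, Double)
fitLine pts = (slope, intercept)
  where
    n   = fromIntegral (length pts)
    sx  = sum (map fst pts)
    sy  = sum (map snd pts)
    sxx = sum [x * x | (x, _) <- pts]
    sxy = sum [x * y | (x, y) <- pts]
    slope     = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n

-- Largest deviation from the fitted line; a small value supports the claim
-- that the impact scales linearly with the budget bump.
maxResidual :: [(Double, Double)] -> Double
maxResidual pts = maximum [abs (y - (slope * x + intercept)) | (x, y) <- pts]
  where (slope, intercept) = fitLine pts

main :: IO ()
main = print (fitLine observations, maxResidual observations)
```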
Runtime system (RTS) settings
The recommended RTS settings for `cardano-node` encompass the number of CPU cores to use, the behaviour of the allocator, and the behaviour of the garbage collector. The recommended settings so far are tuned to GHC8.10's RTS - one cannot assume the same settings are optimal for GHC9.6's RTS, too. So we've started a series of short, exploratory benchmarks comparing a matrix of promising changes, with the aim of updating our recommendation in the future.
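For context, the sketch below shows one way such a matrix of candidate settings could be enumerated. The flags used (`-N`, `-A`, `--nonmoving-gc`) are real GHC RTS options, but the concrete combinations are illustrative guesses, not our recommendation for either compiler version.

```haskell
module RtsMatrix where

-- One candidate combination of RTS settings to benchmark.
data RtsCandidate = RtsCandidate
  { capabilities :: Int     -- ^ -N: number of CPU cores to use
  , allocAreaMiB :: Int     -- ^ -A: allocation area size per capability
  , copyingGC    :: Bool    -- ^ False selects the non-moving garbage collector
  } deriving Show

-- Render a candidate as the flags passed to the node, e.g. "+RTS -N4 -A16m -RTS".
toFlags :: RtsCandidate -> String
toFlags c = unwords $
     ["+RTS", "-N" <> show (capabilities c), "-A" <> show (allocAreaMiB c) <> "m"]
  <> ["--nonmoving-gc" | not (copyingGC c)]
  <> ["-RTS"]

-- Cartesian product of the dimensions under consideration (illustrative values).
candidateMatrix :: [RtsCandidate]
candidateMatrix =
  [ RtsCandidate n a gc | n <- [2, 4], a <- [4, 16, 64], gc <- [True, False] ]
```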
Development
We've started developing a new tool that calibrates our Plutus benchmarking scripts given a range of constraints on the expected workload. These constraints entail exhausting a certain budget (block or transaction), or calibrating for a constant number of transactions per block while exhausting the available steps or memory budget(s). The result directly serves as input to our benchmarking profile definitions. This tool may also be of wider interest, as it allows for modifying various inputs, such as Plutus cost models, or script serializations generated by different compilers or compiler versions. That way, one can compare at a glance how effectively a given script makes use of the available budgets, given a specific cost model.
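To give a rough idea of what such calibration inputs can look like, here is a hypothetical Haskell sketch; the type and function names (`Target`, `Budget`, `calibrateLoops`) are invented for illustration and do not reflect the tool's actual interface.

```haskell
module Calibrate where

-- Which constraint the calibrated workload should satisfy (hypothetical).
data Target
  = ExhaustBlockBudget      -- fill the per-block budget
  | ExhaustTxBudget         -- fill the per-transaction budget
  | FixedTxPerBlock Int     -- constant tx count per block, exhaust steps/memory
  deriving Show

-- Execution budgets as (steps, memory) pairs, mirroring Plutus ExUnits.
data Budget = Budget { steps :: Integer, memory :: Integer } deriving Show

-- Given the per-iteration cost of a looping script, find the largest loop
-- count that still fits the available budget; the real tool additionally
-- accounts for cost models and script serialization.
calibrateLoops :: Budget -> Budget -> Integer
calibrateLoops perIteration available =
  min (steps available `div` steps perIteration)
      (memory available `div` memory perIteration)
```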
Additionally, our benchmarking profiles are currently undergoing a maintenance cycle: setups whose motivation has ceased to exist are removed, several are updated to use the Voltaire performance baseline, and others are being tested for conformity with the Plomin hard-fork protocol updates.
Infrastructure
The extensive work of simplifying the performance workbench is almost finished and about to enter the testing phase. In past PRs, we have been moving away from scripting towards declarative (Haskell) definitions of all benchmark profiles and workloads. The simplification work now reaps the benefits of that: we can optimize away many recursive or redundant invocations and nix evaluations, collate many nix store paths into just a few, and reduce the workbench's overall closure size and complexity. Apart from saving significant resources and time for CI runners, this will reduce the maintenance effort necessary on our end.
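As a rough illustration of what "declarative (Haskell) definitions" means in practice, consider the following sketch; the types and profile values are invented and only convey the idea of profiles as plain data that downstream tooling and nix can consume uniformly.

```haskell
module Profiles where

-- Hypothetical workload kinds a benchmark profile can exercise.
data Workload = ValueTransfer | PlutusLoop deriving Show

-- A benchmark profile as an ordinary Haskell value rather than a script.
data Profile = Profile
  { profileName :: String
  , nodeCount   :: Int        -- size of the benchmark cluster
  , durationSec :: Int        -- how long the run lasts
  , workload    :: Workload
  } deriving Show

-- All profiles live in one declarative list; no generation scripts involved.
allProfiles :: [Profile]
allProfiles =
  [ Profile "value-only-baseline" 6 3600 ValueTransfer
  , Profile "plutus-loop-heavy"   6 3600 PlutusLoop
  ]
```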
Furthermore, we've done maintenance on our tooling by adjusting to the latest changes in `cardano-api`. This included porting the `ProtocolParameters` type and its type class instances over to us, as our use case requires that we continue supporting it. However, it is considered deprecated in the API, so taking it over unblocks the API's maintainers to eventually remove it.
New Tracing
Having addressed all feature and change requests relevant for the Node 10.3 release, we performed thorough mainnet testing of the new system's metrics in a monitoring context. We relied on the extremely diligent and helpful feedback from the SRE team. This enabled us to iron out a couple of remaining inconsistencies - a big shout-out and thank you to John Lotoski.
Additionally, again with SRE, a nix service configuration (SVC) has been created for `cardano-tracer`, generalized and aligned in structure with the existing `cardano-node` SVC. It evolved from the existing SVC in our performance workbench, which however was tied tightly to our team's use case. With the generalized approach, we hope other teams, and the community, can reliably and easily set up and deploy `cardano-tracer`.