Skip to main content

Performance & tracing update

· 4 min read
Serge Kosyrev

High level summary

  1. SECP benchmarking enablement was completed: we are now able to do local runs of the SECP workloads. The next step is to port this to the AWS environment.
  2. A new workstream for Plutus cost modeling improvement: we've planned and started implementing the smart contract call overhead measurement machinery.
  3. The new tracing system: after doing more benchmarking to address inter-run variance, we discovered that the regression, while still there, is small enough not to be release critical. Nevertheless, we're continuing with the further performance-oriented rework of the internals.
  4. Infrastructure: a significant refactoring of the workbench internals was merged. We also started improving the denotation for ever-evolving protocol parameters. Comparative analysis of multi-run batches implementation started.
  5. Open sourcing: our plans matured sufficiently so that we now expect actual deployment work to start this week.

Performance

The SECP benchmarking workload has been fully implemented in the workbench. We are now porting it over to AWS, and after that we'll be running the model cluster workload.

We've also started implementing mechanics for the upcoming investigation of the Plutus smart contract call overhead, which is expected to lead us to improved Plutus cost modeling.

Tracing

After the initial model-scale performance data caused us to panic, among other things we've done more benchmarks, and it turned out that inter-run variance increase was the culprit. The actual regression averages to barely noticeable 1-2% in key metrics -- which is certainly not release critical.

To understand the impact of the new tracing system, we have to bear in mind the extra functionality it provides:

  1. We are now processing all messages generated by the system, without making any shortcuts that the old system had to resort to. That causes the new tracing to do more work, but is more useful for all users and developers involved -- since it leads to a simple, non-confusing configuration. Incidentally, that's also the area where we are reworking the internals, to deduce and enable the optimisations that are implied by the particular configuration.
  2. The new tracing system is benchmarked with remote tracing as the default backend (whereas the old one was using local, builtin log storage mechanism). In some sense it's the fair benchmark, because that's the way we expect SPO's to set up tracing. That, however also causes it to do more work.

All that said, since we've established the performance of the new system to be adequate for the release, we won't be delaying it much further.

In addition, we're still pursuing our performance-enhancing rework of the new tracing internals.

Infrastructure

After implementing the multi-backend capability in the workbench, we got the opportunity to reassess the generic/backend boundaries and perform some long-awaited cleanups and simplifications in that area. The results of this work have been merged and will serve as a solid foundation for the CI and cloud backends.

Moving to analysis, we've also improved provenance of the raw data, by collecting more identification information and statistics about it. This means, e.g. that we now record checksums, message frequencies and timestamps from the log files coming into analysis. This will be used to enable us to see more data anomalies earlier, and lift that information directly into the generated reports.

A new feature is now under implementation -- the ability to provide comparative analysis of multi-run batches. Previously we only had automation for two aspects separately, so we only could either:

  • compare individual runs (used for different node configurations / versions)
  • collect variance statistics from a batch of runs (used to enhance statistical confidence for a single node configuration / version) Naturally, combining these two capabilities was a long-desired feature of our analysis pipeline.