· 3 min read
Michael Karg

High level summary

  • SECP benchmarking: we concluded our benchmarking runs and analyses of the new SECP primitives for the Valentine hard-fork.
  • Release benchmarking: we performed a round of benchmarks for the 1.35.6 release.
  • UTxO-HD benchmarking: we performed first runs for UTxO-HD and are currently refining the benchmarking setup.
  • New tracing: for better accessibility, the new tracing system is being outfitted with introspective capabilities.
  • Infrastructure: with the Nomad cloud workbench backend we were able to perform our first test cluster runs successfully on SRE infrastructure.
  • Infrastructure: the initial NixOps workbench backend has been completed; a PR containing this work, along with many quality-of-life improvements to our tooling, has been merged.

Performance

SECP

  1. For SECP, we settled on a fixed tx count per block, while simultaneously spending as much of the block budget as possible. Thus we were able to minimize the impact of the per-SC-call overhead (see the sketch after this list).
  2. The final runs were performed with various fractions, e.g. half, of the current block budget to ascertain how these workloads would fare compared to a value-only run.
  3. The SECP machinery and profiles are currently being generalized into an approach for targeting very specific aspects of a smart contract in benchmarking.
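
As a sketch of that filling strategy, with purely illustrative numbers (the real per-block budget is given by protocol parameters): fix the tx count per block and hand each tx an equal share of the block's script execution budget, so the budget is spent as fully as possible while the per-SC-call overhead stays constant across runs.

```haskell
-- Minimal sketch of the fixed-tx-count filling strategy; the ExUnits figures
-- below are made up and stand in for the real protocol parameters.
data ExUnits = ExUnits { exSteps :: Integer, exMem :: Integer }
  deriving Show

-- Illustrative per-block script budget (not the actual mainnet values).
blockBudget :: ExUnits
blockBudget = ExUnits { exSteps = 20000000000, exMem = 62000000 }

-- With n txs fixed per block, each tx gets an equal share of the budget.
perTxBudget :: Int -> ExUnits -> ExUnits
perTxBudget n (ExUnits steps mem) =
  ExUnits (steps `div` fromIntegral n) (mem `div` fromIntegral n)

main :: IO ()
main = print (perTxBudget 8 blockBudget)
```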

UTxO-HD

  1. After analyzing the initial UTxO-HD runs, it turned out that mempool snapshotting had to be throttled for benchmarking, as it contends for a lock that UTxO-HD had to introduce into the forging loop (a sketch of the throttling idea follows this list).
  2. We're currently adapting the benchmarking setup accordingly, and will then perform a new combination of baseline and UTxO-HD runs.
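
The sketch below illustrates the throttling idea only; it is not the node's actual code path, and the bare MVar is an assumption standing in for the forging-loop lock mentioned above.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Time.Clock (NominalDiffTime, UTCTime, addUTCTime, diffUTCTime, getCurrentTime)

-- Only re-snapshot the mempool if a configured interval has elapsed, so the
-- snapshotting thread takes the (assumed) forging-loop lock less often.
throttledSnapshot
  :: NominalDiffTime   -- minimum interval between snapshots
  -> IORef UTCTime     -- time of the last snapshot taken
  -> MVar ()           -- lock shared with the forging loop (an assumption here)
  -> IO ()             -- the actual snapshotting action
  -> IO ()
throttledSnapshot minInterval lastRef lock snapshot = do
  now  <- getCurrentTime
  prev <- readIORef lastRef
  if diffUTCTime now prev < minInterval
    then pure ()                       -- too soon: skip, don't touch the lock
    else withMVar lock $ \_ -> do      -- hold the lock only while snapshotting
           snapshot
           writeIORef lastRef now

main :: IO ()
main = do
  lock <- newMVar ()
  t0   <- getCurrentTime
  ref  <- newIORef (addUTCTime (-3600) t0)  -- pretend the last snapshot was an hour ago
  throttledSnapshot 60 ref lock (putStrLn "mempool snapshot taken")
```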

1.35.6 release

Benchmarking the 1.35.6 release candidate attested to a perfectly clean bill of health.

Tracing

Work on the new tracing system's introspective capabilities is ongoing. Immediate use cases of the new API include statically validating generated tracer documentation, as well as providing information about a specific tracing setup in the node via traces themselves. These features will make the new system both more robust and more accessible.
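
As a purely hypothetical illustration of what such introspection enables (the types, functions and tracer names below are illustrative, not the actual trace-dispatcher API): a node could report its own effective tracer configuration through a trace, and generated documentation could be checked against that same configuration.

```haskell
import qualified Data.Map.Strict as Map

data Severity = Debug | Info | Notice | Warning | Error
  deriving (Show, Eq, Ord)

-- A (simplified, hypothetical) view of the effective tracer configuration.
newtype TracerConfig = TracerConfig (Map.Map String Severity)

-- Emit the effective setup through the tracing system itself (here just printed),
-- so a node can report its own tracer state.
reportSetup :: TracerConfig -> IO ()
reportSetup (TracerConfig m) =
  mapM_ (\(name, sev) -> putStrLn (name <> " -> " <> show sev)) (Map.toList m)

-- Statically check that every configured tracer has documentation;
-- returns the names of undocumented tracers.
validateDocs :: TracerConfig -> Map.Map String String -> [String]
validateDocs (TracerConfig m) docs =
  [ name | name <- Map.keys m, not (name `Map.member` docs) ]

main :: IO ()
main = do
  let cfg = TracerConfig (Map.fromList [("ChainDB", Info), ("Forge.Loop", Notice)])
  reportSetup cfg
  print (validateDocs cfg (Map.fromList [("ChainDB", "documented")]))
```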

Infrastructure

Nomad backend

  1. Work on the cloud deployment capability of the Nomad workbench backend continued; for testing we can automate multiple Nomad clients.
  2. Locality assumptions were removed and job monitoring was refactored.
  3. To facilitate directly-executable derivations, Nomad Job specification files are now self-contained, carrying the GitHub references and configs needed to run a cluster.
  4. We're currently evaluating different options for genesis distribution in said cluster.

NixOps backend

The NixOps workbench backend has reached an initial functional stage. Consequently, the relevant PR was merged. It also contained many improvements to our analysis tooling, as well as a structural overhaul of the workbench itself. We consider this an important step in future-proofing our benchmarking machinery.

· 3 min read
Serge Kosyrev

High level summary

  1. SECP benchmarking: we ran several rounds of SECP benchmarks, refining the benchmark setup as we discovered the properties of the system. After formulating an initial suggested change to the protocol parameters, we're currently running what we consider the final benchmark, to validate the underlying assumptions.
  2. Release benchmarking: we've performed a round of benchmarks for the hotfix 1.35 release update and initiated the 1.35.6 benchmarks.
  3. New tracing: the improvement in the tracing API, with the underlying restructuring, was completed and merged into the node.
  4. New tracing: before going live, we're performing the documentation update, as well as reworking the end user migration guide.
  5. Open sourcing: the benchmarking data publishing has been completed and deployed. After populating it with relevant benchmark data and providing basic user documentation we can go live.
  6. Infrastructure: the cloud workbench backend is progressing well, the networking aspects of multi-region deployment are currently being worked on.
  7. Infrastructure: the NixOps workbench backend is still being worked on, as part of migration from cardano-ops and benchmarking infrastructure unification.

Performance

We are approaching the end of a chain of SECP benchmarks, having gradually eliminated deficiencies in the setup as we discovered them and answered newly arising questions:

  • we improved the tx/block filling strategy in the generator to maximise the per-block utilisation of resources and so better approximate the worst case (see the sketch after this list),
  • after discovering what looked like significant per-SC-call overhead, we again tweaked the tx/block filling strategy,
  • finally, we're redoing all benchmarks together with a value-only run against the backdrop of Mainnet-sized datasets, to balance the suggested adjustment. That also ran into difficulties with the limitations of our benchmarking hardware.
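
The sketch below shows, with made-up numbers, the utilisation figure that filling strategy tries to maximise: the fraction of the per-block script budget actually consumed by the txs placed into a block.

```haskell
-- Hypothetical per-block budget and per-tx consumption figures; only the shape
-- of the calculation matters here.
data ExUnits = ExUnits { exSteps :: Double, exMem :: Double }

-- Fraction of the block's step and memory budget consumed by a list of txs.
utilisation :: ExUnits -> [ExUnits] -> (Double, Double)
utilisation budget txs =
  ( sum (map exSteps txs) / exSteps budget
  , sum (map exMem   txs) / exMem   budget )

main :: IO ()
main = print (utilisation (ExUnits 2.0e10 6.2e7)
                          (replicate 8 (ExUnits 2.4e9 7.5e6)))
```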

In addition, we started benchmarks of the 1.35.6 release.

Tracing

A rework of the new tracing system's internals and API was merged. It extended the system with introspection, which enabled a range of improvements, some of which were implemented along the way.

Specifically, we were able to completely short-cut the processing of messages generated by tracers that are rendered provably ineffective by the current tracing configuration. Further ongoing work enabled by the introspection facilities includes static validation of documentation and enhanced node state reporting.
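
A hedged sketch of the short-cut idea (not the actual trace-dispatcher implementation): wherever the effective configuration proves a tracer can never emit anything, a no-op tracer is installed up front, so no work happens at message delivery time. The tracer names and the reduced Tracer type below are illustrative.

```haskell
-- Tracer reduced to the bare minimum needed for this sketch.
newtype Tracer m a = Tracer { traceWith :: a -> m () }

-- A tracer that provably does nothing.
nullTracer :: Applicative m => Tracer m a
nullTracer = Tracer (\_ -> pure ())

-- Hypothetical predicate derived from the effective tracing configuration.
isEffective :: String -> Bool
isEffective name = name /= "ChainDB.Detailed"   -- e.g. switched off in the config

-- At configuration time, install the no-op tracer wherever the configuration
-- proves the real tracer could never emit anything.
mkTracer :: String -> Tracer IO String -> Tracer IO String
mkTracer name real
  | isEffective name = real
  | otherwise        = nullTracer

main :: IO ()
main = do
  let stdoutTracer = Tracer putStrLn
  traceWith (mkTracer "ChainDB.AddBlock" stdoutTracer) "this message is processed"
  traceWith (mkTracer "ChainDB.Detailed" stdoutTracer) "this one is short-cut"
```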

Infrastructure

On the opensourcing/transparency front, the benchmark data publishing machinery was finally fully assembled and put online. As resources permit, we'll work on populating it with benchmarking data, preparing basic documentation and engaging the stakeholders.

The work on the cloud deployment capability of the Nomad workbench backend continued, with focus on setting up inter-node networking and removing locality assumptions. A major step besides those was the completion of a switch-over to directly-executable derivations, which eliminates the need to create and distribute images -- thereby increasing the speed of deployment.

The NixOps workbench backend progressed steadily, reaching minimal deployment capability. The remaining parts are proper shared configuration generation and the porting of the run control functionality from cardano-ops.

· 2 min read
Serge Kosyrev

High level summary

Since our last update, we focused on infrastructure work: benchmark enablement, the tracing system, the benchmark environment merge, and open source support:

  1. SECP benchmarking enablement is underway: enabling SECP runs in our cardano-ops benchmarking environment is still in progress.
  2. The new tracing system: the improved API of the new tracing system was implemented, and we're now porting the tracing integration layer over.
  3. Infrastructure: the mainnet protocol parameter history is now encoded in the workbench profile machinery at epoch-level granularity, which gives us a systematic approach towards description of past and future benchmarks.
  4. New benchmark deployment infrastructure: we've made some progress on the Nomad deployment backend, which is shared by both the data publishing and the benchmarking needs.
  5. Legacy benchmarking: we've started merging the legacy benchmark deployment infrastructure into the workbench.
  6. Open sourcing: the benchmarking data publishing tool was adapted to the Nomad execution environment provided by SRE, pending final deployment.

Performance

The AWS cluster infrastructure necessary for SECP benchmarking is still being worked on.

Tracing

The improved tracing internals were implemented, and we are now in the phase of updating the tracing integration, which is also mostly done.

Infrastructure

Thanks to collaboration with the DevX team, we have identified and pursued a design that would enable our Nomad workbench backend to execute deployments of both the benchmarking cluster and our data publishing components.

On the benchmark parametrisation front, we have eliminated a long-standing weakness in the way we were specifying the protocol parameters. We now have a very clear and granular method of keeping track of protocol parameter evolution -- e.g. mainnet history changes are now tracked at epoch granularity, while also allowing for systematically described change overlays. This makes the benchmark profile definitions much clearer and more robust against mistakes.
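
The sketch below illustrates the idea under assumed field names and values (they are not the actual parameter record): a profile starts from a base parameter set and applies the epoch-indexed overlays recorded for mainnet up to the epoch it wants to reproduce.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical, heavily reduced protocol parameter record.
data PParams = PParams { maxBlockExSteps :: Integer, maxTxExSteps :: Integer }
  deriving Show

type Epoch   = Integer
type Overlay = PParams -> PParams

-- Epoch-indexed history of parameter changes (illustrative epochs and values).
history :: Map.Map Epoch Overlay
history = Map.fromList
  [ (290, \p -> p { maxTxExSteps    = 10000000000 })
  , (306, \p -> p { maxBlockExSteps = 20000000000 })
  ]

-- Parameters in effect at a given epoch: the base set, plus every overlay
-- recorded up to and including that epoch, applied in order.
paramsAt :: PParams -> Epoch -> PParams
paramsAt base e =
  foldl (flip ($)) base (Map.elems (fst (Map.split (e + 1) history)))

main :: IO ()
main = print (paramsAt (PParams 0 0) 300)
```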

We also started merging the legacy benchmarking environment (based on cardano-ops) into the workbench. The separation between the environments was too costly, causing us to implement any benchmarking change twice -- first in the workbench during development, then in cardano-ops. Moreover, maintaining compatibility code incurred further costs and slowed the development of benchmark data analysis. Once complete, this merge will let us sharply cut the benchmark development cycle and its overheads.

· 4 min read
Serge Kosyrev

High level summary

  1. SECP benchmarking enablement was completed: we are now able to do local runs of the SECP workloads. The next step is to port this to the AWS environment.
  2. A new workstream for Plutus cost modeling improvement: we've planned and started implementing the smart contract call overhead measurement machinery.
  3. The new tracing system: after doing more benchmarking to address inter-run variance, we discovered that the regression, while still there, is small enough not to be release critical. Nevertheless, we're continuing with the further performance-oriented rework of the internals.
  4. Infrastructure: a significant refactoring of the workbench internals was merged. We also started improving the way we denote the ever-evolving protocol parameters. Implementation of comparative analysis for multi-run batches has started.
  5. Open sourcing: our plans matured sufficiently so that we now expect actual deployment work to start this week.

Performance

The SECP benchmarking workload has been fully implemented in the workbench. We are now porting it over to AWS, and after that we'll be running the model cluster workload.

We've also started implementing mechanics for the upcoming investigation of the Plutus smart contract call overhead, which is expected to lead us to improved Plutus cost modeling.
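
One way to picture the measurement mechanics, as a sketch under our assumptions rather than the team's actual methodology: run otherwise identical txs with a varying number of identical script calls and fit the observed cost against the call count; the fitted slope, set against the modeled per-script cost, points at the per-call overhead. The data points below are made up.

```haskell
-- Ordinary least-squares fit of y = a + b * x, used to relate observed cost
-- to the number of script calls per tx.
fitLine :: [(Double, Double)] -> (Double, Double)
fitLine pts = (a, b)
  where
    n   = fromIntegral (length pts)
    sx  = sum (map fst pts)
    sy  = sum (map snd pts)
    sxx = sum (map (\(x, _) -> x * x) pts)
    sxy = sum (map (\(x, y) -> x * y) pts)
    b   = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a   = (sy - b * sx) / n

main :: IO ()
main =
  -- (script calls per tx, observed execution time in ms) -- made-up numbers
  print (fitLine [(1, 2.1), (2, 3.9), (4, 7.8), (8, 15.5)])
```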

Tracing

After the initial model-scale performance data caused us to panic, we did more benchmarks, among other things, and it turned out that an increase in inter-run variance was the culprit. The actual regression averages out to a barely noticeable 1-2% in key metrics -- which is certainly not release critical.

To understand the impact of the new tracing system, we have to bear in mind the extra functionality it provides:

  1. We are now processing all messages generated by the system, without taking any of the shortcuts that the old system had to resort to. That causes the new tracing system to do more work, but is more useful for all users and developers involved, since it leads to a simple, non-confusing configuration. Incidentally, that's also the area where we are reworking the internals: to deduce and enable the optimisations implied by the particular configuration.
  2. The new tracing system is benchmarked with remote tracing as the default backend (whereas the old one was using a local, built-in log storage mechanism). In some sense it's the fairer benchmark, because that's the way we expect SPOs to set up tracing. That, however, also causes it to do more work.

All that said, since we've established the performance of the new system to be adequate for the release, we won't be delaying it much further.

In addition, we're still pursuing our performance-enhancing rework of the new tracing internals.

Infrastructure

After implementing the multi-backend capability in the workbench, we got the opportunity to reassess the generic/backend boundaries and perform some long-awaited cleanups and simplifications in that area. The results of this work have been merged and will serve as a solid foundation for the CI and cloud backends.

Moving to analysis, we've also improved the provenance of the raw data by collecting more identifying information and statistics about it. For example, we now record checksums, message frequencies, and timestamps from the log files coming into analysis. This will enable us to spot more data anomalies earlier, and to lift that information directly into the generated reports.
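
As a small illustration of the frequency part (the message names and line layout are assumptions, not actual trace output), counting messages per kind over a run's logs looks roughly like this:

```haskell
import qualified Data.Map.Strict as Map

-- Extract a crude "message kind" from a log line; purely illustrative.
messageKind :: String -> String
messageKind = takeWhile (/= ' ')

-- Count how often each message kind occurs.
frequencies :: [String] -> Map.Map String Int
frequencies = foldr (\l -> Map.insertWith (+) (messageKind l) 1) Map.empty

main :: IO ()
main = print (frequencies sampleLines)
  where
    -- In reality these would be the lines of a run's log files.
    sampleLines =
      [ "ChainDB.AddedToCurrentChain ..."
      , "Forge.Loop.StartLeadershipCheck ..."
      , "ChainDB.AddedToCurrentChain ..."
      ]
```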

A new feature is now under implementation -- the ability to provide comparative analysis of multi-run batches. Previously, we only had automation for the two aspects separately, so we could either:

  • compare individual runs (used for different node configurations / versions)
  • collect variance statistics from a batch of runs (used to enhance statistical confidence for a single node configuration / version).

Naturally, combining these two capabilities was a long-desired feature of our analysis pipeline; a minimal sketch of the combined analysis follows.
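
The sketch below uses made-up metric values and is only meant to show the shape of the combined result: per-batch mean and standard deviation, plus the delta between the two batch means.

```haskell
-- Mean and sample standard deviation of a metric across a batch of runs.
meanSd :: [Double] -> (Double, Double)
meanSd xs = (m, sqrt (sum [(x - m) ^ 2 | x <- xs] / (n - 1)))
  where
    n = fromIntegral (length xs)
    m = sum xs / n

-- Compare two batches: variance statistics per batch, plus the delta of means.
compareBatches :: [Double] -> [Double] -> ((Double, Double), (Double, Double), Double)
compareBatches baseline candidate = (base, cand, fst cand - fst base)
  where
    base = meanSd baseline
    cand = meanSd candidate

main :: IO ()
main = print (compareBatches [81.2, 80.7, 81.9] [83.0, 82.4, 83.5])
```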

· 2 min read
Serge Kosyrev

High level summary

  1. Benchmarks for the first pre-release component bump of the upcoming 1.36 release have been delivered, and the data shows the component bump is clear for release.
  2. SECP benchmarking enablement is underway: the necessary generator features have been implemented, and are now being integrated into the workbench.
  3. The new tracing system: in response to the performance regression we previously discovered we are working on pre-planned implementation improvements, and doing more benchmarks.
  4. Infrastructure: the Nomad-based workbench backend has been made closer to a cloud deployment scenario. Cleanup in preparation for Cicero CI/CD integration started.
  5. Open sourcing: ongoing SRE collaboration on production deployment of performance data publishing.

Performance

We have run benchmarks for the first component bump of the upcoming 1.36 release, and we don't see any significant performance changes. The component bumps are therefore clear for release.

Tracing

Even before we spotted the tracing system regression, we already had plans for further efficiency improvements, and we are now actively pursuing them. The idea is to collect more statically-available information in order to shift more tracing decisions from message delivery time to configuration time.

To support this effort, we also started running more benchmarks and enhanced data analysis with relevant metrics.

Infrastructure

Generation support for Plutus V2 has been implemented and so, with the help of the previously made looped signature-verifying script, the generator is now capable of producing two SECP workloads: verifying either ECDSA or Schnorr signatures. This is now being integrated into the infrastructure -- the generator parametrisation API is being enhanced and the workbench is being extended to handle the new parametrisation.

In addition, the workbench is now being enhanced to handle protocol-version-based choices for the Plutus cost model.

The intermediate cloud compatibility iteration of the workbench cloud enablement effort was merged. We are now doing some cleanup work in preparation for starting the Cicero backend, which will take us nearly all the way to CI/CD integration.

We continue collaboration with SRE on production deployment of data publishing. We now have a gradual rollout plan, which respects the plans for SRE infrastructure feature availability.

We are working on recovering the software dependency manifest feature that was lost with the organisation-wide transition to CHaP.

As usual, a number of smaller workbench, data analysis & reporting improvements have been made.