
45 posts tagged with "performance-tracing"


Serge Kosyrev

High level summary

Since our last update, we focused on infrastructure work: benchmark enablement, the tracing system, the benchmark environment merge, and open-source support:

  1. SECP benchmarking enablement is underway: we are still working on enabling SECP runs in our cardano-ops benchmarking environment.
  2. The new tracing system: the improved API of the new tracing system was implemented, and we're now porting the tracing integration layer over.
  3. Infrastructure: the mainnet protocol parameter history is now encoded in the workbench profile machinery at epoch-level granularity, which gives us a systematic approach towards description of past and future benchmarks.
  4. New benchmark deployment infrastructure: we've made some progress on the Nomad deployment backend, which is shared by the data publishing and benchmarking needs.
  5. Legacy benchmarking: we've started merging the legacy benchmark deployment infrastructure into the workbench.
  6. Open sourcing: the benchmarking data publishing tool was adapted to the Nomad execution environment provided by SRE, pending final deployment.

Performance

The AWS cluster infrastructure necessary for SECP benchmarking is still being worked on.

Tracing

The improved tracing internals were implemented, and we're now in the phase of updating the tracing integration layer, which is also mostly done.

Infrastructure

Thanks to collaboration with the DevX team, we have identified and pursued a design that would enable our Nomad workbench backend to execute deployments of both the benchmarking cluster and our data publishing components.

On the benchmark parametrisation front, we have eliminated a long-standing weakness in the way we were specifying protocol parameters. We now have a clear and granular method to track protocol parameter evolution -- e.g. the mainnet parameter history is now tracked at epoch granularity, while also allowing for systematically described change overlays. This makes benchmark profile definitions much clearer and more robust against mistakes.
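
To make the approach concrete, here is a minimal sketch of the idea (hypothetical types and field names; the actual workbench encodes this in its profile machinery, not in standalone Haskell): the history is a genesis parameter set plus epoch-indexed change overlays, and the parameters in force at any epoch are obtained by folding the overlays up to that epoch.

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Maybe (fromMaybe)

type Epoch = Word

-- Only two fields for illustration; the real parameter set is much larger.
data ProtocolParams = ProtocolParams
  { maxBlockSize :: Word
  , maxTxSize    :: Word
  } deriving Show

-- An overlay changes only the fields it mentions.
data Overlay = Overlay
  { ovMaxBlockSize :: Maybe Word
  , ovMaxTxSize    :: Maybe Word
  }

apply :: Overlay -> ProtocolParams -> ProtocolParams
apply o p = ProtocolParams
  { maxBlockSize = fromMaybe (maxBlockSize p) (ovMaxBlockSize o)
  , maxTxSize    = fromMaybe (maxTxSize p)    (ovMaxTxSize o)
  }

-- Parameters in force at a given epoch: fold every overlay up to and
-- including that epoch over the genesis parameter set, in epoch order.
paramsAt :: ProtocolParams -> Map Epoch Overlay -> Epoch -> ProtocolParams
paramsAt genesis history e =
  foldl (flip apply) genesis
        (Map.elems (Map.filterWithKey (\k _ -> k <= e) history))

main :: IO ()
main = do
  let genesis = ProtocolParams 65536 8192
      history = Map.fromList
        [ (100, Overlay (Just 73728) Nothing)  -- block size raised at epoch 100
        , (250, Overlay Nothing (Just 16384))  -- tx size raised at epoch 250
        ]
  print (paramsAt genesis history 300)
```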

We also started merging the legacy benchmarking environment (based on cardano-ops) into the workbench. The separation between the environments was too costly, forcing us to implement every benchmarking change twice: first in the workbench, during development, and then in cardano-ops. In addition, maintaining compatibility code incurred extra costs and slowed down benchmark data analysis development. Once complete, this merge will let us sharply cut the benchmark development cycle and its overheads.

Serge Kosyrev

High level summary

  1. SECP benchmarking enablement was completed: we are now able to do local runs of the SECP workloads. The next step is to port this to the AWS environment.
  2. A new workstream for Plutus cost modeling improvement: we've planned and started implementing the smart contract call overhead measurement machinery.
  3. The new tracing system: after doing more benchmarking to address inter-run variance, we discovered that the regression, while still there, is small enough not to be release critical. Nevertheless, we're continuing with the further performance-oriented rework of the internals.
  4. Infrastructure: a significant refactoring of the workbench internals was merged. We also started improving the notation for the ever-evolving protocol parameters. Implementation of comparative analysis for multi-run batches has started.
  5. Open sourcing: our plans matured sufficiently so that we now expect actual deployment work to start this week.

Performance

The SECP benchmarking workload has been fully implemented in the workbench. We are now porting it over to AWS, and after that we'll be running the model cluster workload.

We've also started implementing mechanics for the upcoming investigation of the Plutus smart contract call overhead, which is expected to lead us to improved Plutus cost modeling.

Tracing

After the initial model-scale performance data caused us to panic, we ran more benchmarks, among other things, and it turned out that increased inter-run variance was the culprit. The actual regression averages out to a barely noticeable 1-2% in key metrics -- which is certainly not release critical.

To understand the impact of the new tracing system, we have to bear in mind the extra functionality it provides:

  1. We are now processing all messages generated by the system, without taking the shortcuts the old system had to resort to. That causes the new tracing system to do more work, but it is more useful for all users and developers involved, since it leads to a simple, non-confusing configuration. Incidentally, that's also the area where we are reworking the internals: deducing and enabling the optimisations implied by a particular configuration.
  2. The new tracing system is benchmarked with remote tracing as the default backend (whereas the old one was using the local, built-in log storage mechanism). In some sense it's the fairer benchmark, because that's the way we expect SPOs to set up tracing. That, however, also causes it to do more work.

All that said, since we've established the performance of the new system to be adequate for the release, we won't be delaying it much further.

In addition, we're still pursuing our performance-enhancing rework of the new tracing internals.

Infrastructure

After implementing the multi-backend capability in the workbench, we got the opportunity to reassess the generic/backend boundaries and perform some long-awaited cleanups and simplifications in that area. The results of this work have been merged and will serve as a solid foundation for the CI and cloud backends.

Moving on to analysis, we've also improved the provenance of the raw data by collecting more identifying information and statistics about it. For example, we now record checksums, message frequencies, and timestamps from the log files entering analysis. This will enable us to spot more data anomalies earlier, and to lift that information directly into the generated reports.
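
A minimal sketch of the kind of per-file provenance record this implies (hypothetical names, not the actual analysis pipeline types):

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Time.Calendar (fromGregorian)
import           Data.Time.Clock (UTCTime (..))

-- One parsed log line: its timestamp and trace namespace
-- (the real log format carries much more).
data LogMessage = LogMessage
  { lmTimestamp :: UTCTime
  , lmNamespace :: String
  }

-- Provenance recorded per input log file.
data LogProvenance = LogProvenance
  { lpChecksum    :: String         -- e.g. a SHA-256 of the raw file
  , lpFirstAt     :: UTCTime        -- earliest message timestamp
  , lpLastAt      :: UTCTime        -- latest message timestamp
  , lpFrequencies :: Map String Int -- message count per namespace
  } deriving Show

provenance :: String -> [LogMessage] -> Maybe LogProvenance
provenance _        []   = Nothing
provenance checksum msgs = Just LogProvenance
  { lpChecksum    = checksum
  , lpFirstAt     = minimum (map lmTimestamp msgs)
  , lpLastAt      = maximum (map lmTimestamp msgs)
  , lpFrequencies = Map.fromListWith (+) [ (lmNamespace m, 1) | m <- msgs ]
  }

main :: IO ()
main = do
  let at h = UTCTime (fromGregorian 2022 11 1) (fromIntegral (h * 3600 :: Int))
      msgs = [ LogMessage (at 0) "ChainDB.AddBlock"
             , LogMessage (at 1) "Forge.Forged"
             , LogMessage (at 2) "ChainDB.AddBlock" ]
  print (provenance "sha256:deadbeef" msgs)
```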

A new feature is now under implementation -- the ability to provide comparative analysis of multi-run batches. Previously, we only had automation for two aspects separately, so we could either:

  • compare individual runs (used for different node configurations / versions)
  • collect variance statistics from a batch of runs (used to enhance statistical confidence for a single node configuration / version)

Naturally, combining these two capabilities was a long-desired feature of our analysis pipeline; a minimal sketch of the combined analysis follows.
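
This sketch is hypothetical (not the actual analysis code) and reduces each run to a single metric: summarise each batch into variance statistics, then compare the summaries across batches.

```haskell
-- One benchmark run reduced to a single metric,
-- e.g. mean block adoption time in seconds.
type Metric = Double

data BatchSummary = BatchSummary
  { bsMean   :: Metric
  , bsStddev :: Metric
  , bsRuns   :: Int
  } deriving Show

-- Variance statistics for one batch (assumes at least two runs).
summarise :: [Metric] -> BatchSummary
summarise xs = BatchSummary mean (sqrt var) n
  where
    n    = length xs
    mean = sum xs / fromIntegral n
    var  = sum [ (x - mean) ^ (2 :: Int) | x <- xs ] / fromIntegral (n - 1)

-- Relative change of the mean, with both summaries carried along so the
-- reader can judge whether the difference exceeds the run-to-run noise.
compareBatches :: BatchSummary -> BatchSummary -> (Double, BatchSummary, BatchSummary)
compareBatches base new =
  ((bsMean new - bsMean base) / bsMean base, base, new)

main :: IO ()
main = do
  let baseline  = summarise [1.02, 0.98, 1.01, 0.99, 1.00]  -- e.g. old node version
      candidate = summarise [1.04, 1.01, 1.03, 1.02, 1.05]  -- e.g. new node version
  print (compareBatches baseline candidate)
```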

Serge Kosyrev

High level summary

  1. Benchmarks for the first pre-release component bump of the upcoming 1.36 release have been delivered, and the data shows the bump is clear for release.
  2. SECP benchmarking enablement is underway: the necessary generator features have been implemented, and are now being integrated into the workbench.
  3. The new tracing system: in response to the performance regression we previously discovered, we are working on pre-planned implementation improvements and running more benchmarks.
  4. Infrastructure: the Nomad-based workbench backend has been made closer to a cloud deployment scenario. Cleanup in preparation for Cicero CI/CD integration started.
  5. Open sourcing: ongoing SRE collaboration on production deployment of performance data publishing.

Performance

We ran benchmarks for the first component bump of the upcoming 1.36 release, and we don't see any significant performance changes. The component bumps are therefore clear for release.

Tracing

As for the tracing system regression we spotted: we already had plans for further efficiency improvements even before that, and we are now actively pursuing them. The idea is to collect more statically available information, enabling us to shift more tracing decisions from message delivery time to configuration time.
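
A minimal sketch of that shift (hypothetical names; the real trace-dispatcher API differs): instead of consulting the configuration on every message, resolve a per-namespace severity threshold once, at configuration time, and leave only a cheap comparison on the delivery hot path.

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

data Severity = Debug | Info | Notice | Warning | Error
  deriving (Show, Eq, Ord)

type Namespace = [String]

-- Configuration: a severity threshold per namespace prefix.
type TraceConfig = Map Namespace Severity

-- Resolve the threshold for a namespace once, then return a cheap
-- predicate that closes over the precomputed threshold.
mkFilter :: TraceConfig -> Namespace -> (Severity -> Bool)
mkFilter config ns = \sev -> sev >= threshold
  where
    threshold = resolve ns
    -- Walk up the namespace until a configured prefix is found.
    resolve n = case Map.lookup n config of
      Just s              -> s
      Nothing | null n    -> Info          -- hypothetical global default
              | otherwise -> resolve (init n)

main :: IO ()
main = do
  let config = Map.fromList [ (["ChainDB"], Notice) ]
      admit  = mkFilter config ["ChainDB", "AddBlock"]  -- resolved once
  print (admit Info)     -- False: below the ChainDB threshold
  print (admit Warning)  -- True: at or above the threshold
```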

To support this effort, we also started running more benchmarks and enhanced data analysis with relevant metrics.

Infrastructure

Generation support for Plutus V2 has been implemented, so, with the help of the previously developed looped signature-verifying script, the generator is now capable of producing two SECP workloads: verifying either ECDSA or Schnorr signatures. This is now being integrated into the infrastructure: the generator parametrisation API is being enhanced, and the workbench is being extended to handle the new parametrisation.
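
A minimal sketch of how such a parametrisation could look (hypothetical types and script names; the real parametrisation lives in the tx-generator and workbench APIs): a single workload parameter selects which looped verification script the generator drives.

```haskell
-- Which SECP256k1 signature scheme the looped script verifies.
data SecpWorkload = VerifyEcdsa | VerifySchnorr
  deriving (Show, Eq)

-- Hypothetical generator parameters for a Plutus V2 SECP run.
data GeneratorParams = GeneratorParams
  { gpWorkload   :: SecpWorkload  -- which verification workload to run
  , gpTxCount    :: Int           -- script transactions to submit
  , gpLoopsPerTx :: Int           -- signature verifications per transaction
  } deriving Show

-- The workbench would map the workload onto the corresponding script.
scriptFor :: SecpWorkload -> String
scriptFor VerifyEcdsa   = "loop-secp256k1-ecdsa"    -- hypothetical name
scriptFor VerifySchnorr = "loop-secp256k1-schnorr"  -- hypothetical name

main :: IO ()
main = print (scriptFor (gpWorkload (GeneratorParams VerifySchnorr 1000 10)))
```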

In addition, the workbench is now being enhanced to handle protocol-version-based choices for the Plutus cost model.

The intermediate cloud-compatibility iteration of the workbench cloud enablement effort was merged. We are now doing some cleanup work in preparation for starting the Cicero backend, which will take us nearly all the way to CI/CD integration.

We continue collaboration with SRE on production deployment of data publishing. We now have a gradual rollout plan, which respects the plans for SRE infrastructure feature availability.

We are working on recovering the software dependency manifest feature that was lost with the organisation-wide transition to CHaP.

As usual, a number of smaller workbench, data analysis & reporting improvements have been made.

Serge Kosyrev

High level summary

  1. P2P performance investigation is ongoing, in support of the networking team.
  2. SECP benchmarking enablement is underway: we already have the script and are working on Plutus V2 generation support.
  3. An unexpected setback in the new tracing system: full-scale benchmarks have shown a performance regression, even though local chain syncing benchmarks had shown an improvement over legacy tracing.
  4. On the open sourcing front, we added an integrated data dictionary, which is necessary for explaining ourselves to the world. SRE collaboration on production deployment of performance data publishing has started.
  5. We have started bringing the Nomad-based workbench backend closer to a cloud deployment scenario.

Performance

We are supporting the networking team in the P2P performance investigation. Generation support for Plutus V2 was started. We collaborated with the Plutus team to obtain a SECP benchmark script, which is now ready for use, pending Plutus V2 support. The transaction generator has also been updated to follow the cardano-api changes.

Tracing

We ran an initial round of full-scale benchmarks for the new tracing system, which uncovered a regression relative to legacy tracing -- contrary to the local chain syncing benchmarks, which had shown an improvement instead. We added tracing to cardano-tracer, fixing some minor bugs along the way. Network and disk IO metrics are now collected once again and are integrated into analysis.

Infrastructure

The first iteration of the Nomad-based local workbench backend was completed: it has reached feature parity with the existing supervisor backend. The next iteration has started, bringing it closer to the cloud scenario by deploying to separate Nomad tasks connected by a virtual network. This will serve as the basis for the CI and full cloud backends.

We designed and implemented the authoring pipeline for the performance data dictionary, which will be henceforth embedded in our performance reports. We are collaborating with SRE on production deployment of data publishing.

A number of smaller workbench, data analysis & reporting improvements have been made.

Serge Kosyrev

High level summary

On the performance side, the team ran benchmarks for the P2P feature and the 1.35.4 release. We finished a prototype for performance data publishing. We almost finished the local deployment backend for the workbench using the new SRE deployment infrastructure. We also worked on fixing and improving our data analysis pipeline.

On the tracing side, the team worked on isolating a critical issue causing message loss in the remote tracing backend. The issue was resolved and we now have proper end-to-end coverage for the scenario.

Executive summary

  • The new tracing system's public release is getting closer, as we resolve the remaining rough edges discovered in full-scale deployments. The local benchmarks we ran were already showing an improvement relative to legacy tracing, so we expect similar results at full scale.
  • The first (local deployment) iteration of benchmarking adopting the new SRE deployment infra is nearly done. We thank Michael Fellinger and Robin Stumm for their assistance. Two further phases remain: CI integration and cloud deployment.
  • The benchmarking data publishing prototype is ready. This serves as a springboard for both opening our performance assessment workflow (to support the wider Cardano developer community), and for data provision to the business community. Our next steps are to secure a permanent deployment for this mechanism and to integrate it into the benchmarking infrastructure. This requires collaboration with SRE.