· 3 min read
Michael Karg

High level summary

  • Benchmarking: We've performed both low-level network and high-level variance analysis of our benchmarking clusters.
  • Infrastructure: Our reporting pipeline was adjusted to classify various workloads more easily, reducing report rework time.
  • Tracing: Work on machine-readable tracing of tracer configuration is ongoing.
  • Nomad backend: We've been able to eliminate several possible confounders on the Nomad cluster.
  • Team: We're currently onboarding a new team member: Welcome to Cardano Performance & Tracing, Baldur Blöndal!

Low level overview

Benchmarking

As part of the effort to bring the Nomad backend into production use, we've been equipping both it and the existing benchmarking backend with the means to measure and document network latency for each run. Furthermore, we've implemented the means to capture TCP packets during a limited time window of a benchmarking run, which will allow us to spot differences in the behaviour of the underlying networking stack at the OS level.
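To illustrate the mechanism, here is a minimal sketch of a time-boxed capture; the tcpdump invocation, interface, port, and window length below are placeholder assumptions, not our cluster's actual settings:

```haskell
module Main (main) where

-- Sketch of a time-boxed packet capture around a benchmark run.
-- Assumes tcpdump is installed and we have sufficient privileges.

import System.Process (callProcess)
import System.Timeout (timeout)

captureWindowSecs :: Int
captureWindowSecs = 60   -- placeholder window length

main :: IO ()
main = do
  -- tcpdump writes raw packets to a pcap file; 'timeout' delivers an
  -- asynchronous exception once the window elapses, and callProcess's
  -- cleanup then terminates the capture process.
  _ <- timeout (captureWindowSecs * 1000000) $
         callProcess "tcpdump" ["-i", "any", "-w", "run.pcap", "port", "3001"]
  putStrLn "capture window closed"
```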

Additionally, we're running variance analyses in parallel on both backends to establish confidence in the metrics originating from either. We've concluded that baseline profile runs aren't directly comparable between the two, so we decided to compare standard deviations instead to validate the measurements from Nomad.
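As a toy illustration of that comparison (the numbers below are made up; they are not actual cluster metrics):

```haskell
module Main (main) where

-- Compare the dispersion of a metric across two backends instead of the
-- raw values: if Nomad's measurements are trustworthy, its standard
-- deviation should be of the same order as the legacy backend's.

mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

stdDev :: [Double] -> Double
stdDev xs = sqrt (mean [(x - m) ** 2 | x <- xs])
  where m = mean xs

main :: IO ()
main = do
  let legacy = [0.82, 0.84, 0.81, 0.85, 0.83]  -- hypothetical samples
      nomad  = [0.86, 0.83, 0.85, 0.88, 0.84]
  print (stdDev legacy, stdDev nomad, stdDev nomad / stdDev legacy)
```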

Infrastructure

Reporting on benchmarks still requires human time and effort to rework the final document. Improvements to the reporting pipeline, now merged to master, reduce that time through various changes to the template and to the workload classification logic in analysis.

Beyond that, we've looked into issues where, under rare circumstances, services would quit with an unjustified exit failure upon shutdown. By reworking the shutdown logic of trace-dispatcher and tx-generator, we were able to address those issues.
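The gist of the rework, as a minimal sketch (using the async package; this is not the components' actual code): distinguish an expected cancellation from a genuine failure when deciding the exit code.

```haskell
module Main (main) where

-- Treat a requested shutdown (cancellation) as a clean exit instead of
-- letting it surface as an unjustified exit failure.

import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (AsyncCancelled (..), cancel, waitCatch, withAsync)
import Control.Exception (fromException)
import System.Exit (exitFailure, exitSuccess)

serviceLoop :: IO ()
serviceLoop = threadDelay 1000000 >> serviceLoop  -- placeholder for real work

main :: IO ()
main =
  withAsync serviceLoop $ \svc -> do
    threadDelay 3000000   -- pretend the service ran for a while
    cancel svc            -- request shutdown
    res <- waitCatch svc
    case res of
      Right () -> exitSuccess
      Left e
        | Just AsyncCancelled <- fromException e -> exitSuccess  -- expected shutdown
        | otherwise                              -> exitFailure  -- genuine error
```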

Tracing

After the various steps of constructing a configuration at node startup, it is vital to document which runtime configuration the node eventually arrived at. We're working on providing a machine-readable JSON/YAML trace message for that purpose.

This will facilitate hot-reloading a node's tracer configuration in the future: users will be able to take such a trace message, apply their intended change and hot-reload it immediately into the node.
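To sketch the idea (field names and overall shape here are illustrative assumptions, not the node's actual schema), such a message could be a JSON-serialisable record of each tracer namespace's resolved options; since the type is both serialisable and parseable, the same value can be dumped, edited, and fed back:

```haskell
{-# LANGUAGE DeriveGeneric #-}

module TracerConfigTrace where

import Data.Aeson (FromJSON, ToJSON)
import GHC.Generics (Generic)

data Severity = Debug | Info | Notice | Warning | Error
  deriving (Show, Generic)

instance ToJSON Severity
instance FromJSON Severity

-- Options the node resolved for one tracer namespace.
data TracerOptions = TracerOptions
  { severity :: Severity   -- minimum severity for this tracer
  , detail   :: String     -- e.g. "DNormal" or "DDetailed"
  , backends :: [String]   -- e.g. ["Stdout", "Forwarder"]
  } deriving (Show, Generic)

instance ToJSON TracerOptions
instance FromJSON TracerOptions

-- The full trace message: one entry per tracer namespace, as resolved
-- at startup.
newtype EffectiveTracerConfig = EffectiveTracerConfig
  { tracers :: [(String, TracerOptions)] }
  deriving (Show, Generic)

instance ToJSON EffectiveTracerConfig
instance FromJSON EffectiveTracerConfig
```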

Nomad backend

As with the existing benchmarking cluster, the Nomad cluster is currently under scrutiny with regard to the reliability of the metrics it produces, as well as the behaviour of its OS-level network stack. For instance, differing kernel versions can have an impact on our measurements, as we'd effectively be taking them with two different instruments.

Along the way, we've already been successful in eliminating some possible confounders introduced by the Nomad service or by the slightly different system architecture of the new cluster.

New team member

Baldur Blöndal is an extremely capable and experienced Haskell developer, and an excellent fit for our existing team. I'm very pleased to welcome him aboard with IOG, and with Performance & Tracing. He will be working on cardano-tracer, the component that receives, processes, and makes available node traces and metrics.

· 2 min read
Damian Nadales

High level summary

We have a proposed fix for the mempool forging regression observed in the UTxO-HD branch, which we need to confirm by running system-level benchmarks. We are still working on a fallback mechanism for keeping the baseline performance of the Cardano node in case the performance of UTxO-HD turns out not to be enough. On the Genesis front, we confirmed with the researchers that the proposed Genesis design is satisfactory for the historical Cardano chain. We also have a proposed fix for the wrong-protocol-version bug found on SanchoNet after transitioning to Conway.

UTxO-HD

  • We optimized the mempool revalidation process, which in turn ought to solve the regression observed during system-level benchmarks in the in-memory version (349). System-level benchmark results are pending.
  • Regarding the workaround to keep the node's baseline performance if that of the in-memory backend turns out not to be enough for our stakeholders (344), we are still expanding the legacy block package so that we can at some point run the node with a legacy Cardano block. There are some loose ends to tie up before we can begin the first test run.
  • We also brought the UTxO-HD branch up to date with node version 8.4.0.

Genesis

  • We finished the discussion with the researchers on how to argue that the proposed Genesis design is satisfactory for the existing historical Cardano chain. We are now drafting the final self-contained argument. (4157)

Support

  • We debugged a bad parameter update in the Babbage to Conway transition on the SanchoNet testnet (339). A superficial patch is within reach, and we are in the process of reviewing the PRs related to this fix (340, 354, and 355). However, we are also investigating a more principled redesign of the epoch transition logic, which requires us to revisit the existing interfaces of the ConsensusProtocol type class and the HardForkBlock combinator (345 and 346). This is important to prevent this kind of error in the future, and it is an overdue step in the process of taking full ownership of the HFC: reconsidering original HFC design decisions for which, a few years later, we now have much more context.

· One min read
Jean-Philippe Raynaud

High level overview

This week, the Mithril team completed the refactoring of the Terraform deployment workflows in GitHub Actions, as well as the implementation of snapshot compression parameters in the deployments. They kept working on the refactoring and standardization of errors in the Mithril nodes. The team also completed the implementation of Cloudflare protection for the aggregator infrastructure and started working on its deployment and activation in the Mithril networks. Additionally, they worked on recording download statistics on the aggregator, which will be used to produce usage reports.

Finally, they kept working on the aggregator performance bottleneck that occurs under high client traffic and started creating a new distribution.

Low level overview

  • Completed the issue Add snapshot compression parameters in infrastructure deployments #1200
  • Completed the issue Add Cloudflare protection of infrastructure #986
  • Worked on the issue Record statistics about the downloaded snapshot in the aggregator #1127
  • Worked on the issue Error refactoring #798
  • Worked on the issue Activate Cloudflare protection of infrastructure #1230
  • Worked on the issue Release new 2337 distribution #1219
  • Completed the issue Upgrade dependencies #1238

· One min read
James Chapman

The team works on applied research and consulting in formal methods, directly applicable to evidence-based engineering in Core Tech and beyond.

High level summary

The team is currently formalising mini protocols and also further developing the performance modelling prototype.

Details

  • working on collating and open-sourcing the performance analysis prototype

  • improvements to the Ouroboros Praos specification in Isabelle

  • working on formalising the chain sync mini-protocol (a simplified sketch of its state machine follows this list)

  • reviewing an alternative semantics for DeltaQ

  • gave a seminar talk at U. Bergen on algebraic properties of timeliness
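As an illustration of the kind of structure being formalised (a simplified sketch with message payloads elided; this is not the Isabelle development itself, and the real protocol in ouroboros-network is additionally parameterised over headers, points, and tips), the chain sync mini-protocol can be rendered as a typed state machine in which every message is indexed by the state it leaves and the state it enters:

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}

module ChainSyncSketch where

-- Protocol states of the chain sync mini-protocol.
data St = StIdle | StCanAwait | StMustReply | StIntersect | StDone

-- Each message moves the protocol from one state to another; the primed
-- constructors stand in for the responses sent from the must-reply state.
data Message (from :: St) (to :: St) where
  MsgRequestNext       :: Message 'StIdle      'StCanAwait
  MsgAwaitReply        :: Message 'StCanAwait  'StMustReply
  MsgRollForward       :: Message 'StCanAwait  'StIdle
  MsgRollForward'      :: Message 'StMustReply 'StIdle
  MsgRollBackward      :: Message 'StCanAwait  'StIdle
  MsgRollBackward'     :: Message 'StMustReply 'StIdle
  MsgFindIntersect     :: Message 'StIdle      'StIntersect
  MsgIntersectFound    :: Message 'StIntersect 'StIdle
  MsgIntersectNotFound :: Message 'StIntersect 'StIdle
  MsgDone              :: Message 'StIdle      'StDone

-- A well-typed trace: consecutive messages must agree on the protocol
-- state, so ill-formed sequences are rejected by the type checker.
data Trace (from :: St) (to :: St) where
  Stop :: Trace st st
  Step :: Message a b -> Trace b c -> Trace a c

example :: Trace 'StIdle 'StDone
example = Step MsgRequestNext (Step MsgRollForward (Step MsgDone Stop))
```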

· One min read
Sasha Bogicevic
Sebastian Nagel

High-level summary

This week, most of the Hydra team was attending a Cardano scaling workshop in Nantes, France. They used this opportunity to meet their fellow Mithril team, spend some time hacking on code together and, as always, reflect on past work and find the optimal path forward for both projects. They also fixed a bug that caused the hydra-node to crash when querying L1, worked on a new network resilience proof of concept, and accepted a new ADR related to stateless transaction observation.

What did the team achieve this week

  • Cardano scaling workshop with members of the Hydra and Mithril teams
  • Accepted a user contribution for a possible new use case #1048
  • Fix for the hydra-node crash related to an internal wallet query #1053
  • Collected experimental CI findings #1070
  • Proposed a first POC for network resilience #1074

What are the goals of next week

  • Monthly review meeting & report including updates from Mithril
  • Review the POC and discuss our options for network resilience
  • Update cardano-api to version 8.20
  • Address TODOs on aiken commit validator #1072
  • Complete Hydra support in kupo kupo#117