

· One min read
Franco Testagrossa
Sasha Bogicevic

High-level summary

This week the team focused on exploring event-sourced persistence in order to improve hydra-node performance. As a result of this work, the team noticed that the snapshot emission logic needs refactoring and that the specification needs updating in light of these changes. They also took the time to revisit their goals and product plans for the next quarter, and made some security fixes related to multi-signatures.

What did the team achieve this week

  • Finished a spike on performance improvements for event-sourced persistence #963 (see the sketch below).
  • Refactored snapshot emission in the protocol logic.
  • Revisited our roadmap and goals.
  • Prepared and conducted a learning session on lean-waste.
  • Improved security of multi-signature checks; see this GitHub security advisory.
  • Implemented a cache-friendly way to version our binaries #962.
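
To illustrate the idea behind the spike, here is a minimal sketch of event-sourced persistence, assuming a plain append-only log replayed at startup; StateEvent, HeadState, aggregate, append and hydrate are illustrative names, not the hydra-node API:

```haskell
-- Each state change is recorded as an event rather than a full state dump.
data StateEvent = SnapshotEmitted Int | TxProcessed String
  deriving (Show, Read)

newtype HeadState = HeadState { lastSnapshot :: Int }
  deriving Show

-- Fold one event into the current state.
aggregate :: HeadState -> StateEvent -> HeadState
aggregate st (SnapshotEmitted n) = st { lastSnapshot = n }
aggregate st (TxProcessed _)     = st

-- Persisting a change appends one small event instead of
-- re-serialising the whole state.
append :: FilePath -> StateEvent -> IO ()
append fp ev = appendFile fp (show ev ++ "\n")

-- On restart, replay the log to rebuild the state.
hydrate :: FilePath -> IO HeadState
hydrate fp = foldl aggregate (HeadState 0) . map read . lines <$> readFile fp
```

The performance benefit comes from writes being proportional to the size of the change rather than the size of the whole state.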

What are the goals of next week

  • Implement event-sourced persistence #913.
  • Remove deprecated internal commit #954 and close #728.

· 2 min read
Jean-Philippe Raynaud

High level overview

The Mithril team created a new 2327.0 distribution. They focused on preparing the beta launch on mainnet: they tested the new production signer deployment model with the pioneer SPOs, prepared an SPO on-boarding guide, and kept working on the deployment and monitoring of the mainnet infrastructure. The team also worked on the implementation of a simple stress test tool for benchmarking the aggregator (sketched below). Additionally, they completed the refactoring of the interface to the cryptographic library.

Finally, they fixed a bug that sporadically prevented the latest signer registration of an SPO from being used in the associated signing epoch, fixed a bug in the epoch gap detection of the certificate chain in the aggregator, and worked on multiple other optimizations and bug fixes.
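
As a rough illustration of what such a stress test does, the sketch below fires concurrent HTTP requests at an aggregator route and reports the total wall-clock time. It is written in Haskell purely for illustration (the actual tool lives in the Mithril codebase), and the endpoint URL, client count and per-client request count are all assumptions:

```haskell
import Control.Concurrent.Async (replicateConcurrently_)
import Control.Monad (replicateM_)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import Network.HTTP.Client
  (defaultManagerSettings, httpNoBody, newManager, parseRequest)

main :: IO ()
main = do
  mgr <- newManager defaultManagerSettings
  -- Hypothetical aggregator route; substitute a real one.
  req <- parseRequest "http://localhost:8080/aggregator/certificates"
  let clients   = 50  -- concurrent clients (assumed)
      perClient = 20  -- requests per client (assumed)
  t0 <- getCurrentTime
  -- Each client issues its requests sequentially; clients run concurrently.
  replicateConcurrently_ clients $
    replicateM_ perClient (httpNoBody req mgr)
  t1 <- getCurrentTime
  putStrLn $ show (clients * perClient) ++ " requests in "
          ++ show (diffUTCTime t1 t0)
```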

Low level overview

  • Released the new distribution 2327.0
  • Worked on the epic that prepares the Mithril infrastructure for mainnet #767:
    • Completed the issue Add infrastructure monitoring #987
    • Worked on the issue Deploy 'mainnet' infrastructure #988
    • Worked on the issue Handle Secrets management #989
  • Worked on the epic Benchmark performances of Mithril Aggregator #904:
    • Worked on the issue Design & implement basic stress test tool for aggregator #991
  • Worked on optimizations:
    • Completed the issue Remove certificate hash from Artifact #932
    • Completed the issue Check vulnerabilities in CI #1037
    • Completed the issue Add 'created_at' in Mithril Stake Distribution messages #1030
    • Completed the issue Add a 'run-only' option in end to end test #1048
  • Worked on refactoring:
    • Completed the issue Factorize protocol crypto operations #669
    • Completed the issue Refactor aggregator dependency injection and services #1058
    • Completed the issue Build static binaries in CI #874
  • Worked on documentation:
    • Completed the issue Prepare SPO on-boarding guide #1049
    • Completed the issue Add instructions to set firewall using iptables #1040
    • Completed the issue Update ufw command to set firewall on Mithril Signer installation instructions #1041
  • Worked on bugs:
    • Completed the issue Aggregator does not detect certificate chain epoch gap #952
    • Completed the issue 'testing-preview' network does not create certificates #1015
    • Completed the issue SQLite compatibility in aggregator #837
    • Completed the issue Q&A followup fixes #1035
    • Completed the issue E2E tests are flaky in CI #1023

· One min read
Damian Nadales

High level summary

This week the team working on UTxO-HD discovered a space leak in the peer metrics code. This was communicated to the Networking team, who have proposed a fix. The ad-hoc benchmarks that the team ran using a local immutable DB server showed good memory and time performance. We still have to check the performance on a memory-constrained machine.
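
For readers unfamiliar with the failure mode, the sketch below shows the classic shape of such a leak in metrics-style accumulation code and its strictness fix. It illustrates the general pattern only and is not the actual Networking code:

```haskell
{-# LANGUAGE BangPatterns #-}

-- Leaky: the accumulators build up a chain of unevaluated (+) thunks
-- that grows with every sample until something finally forces it.
meanLeaky :: [Double] -> Double
meanLeaky = go 0 (0 :: Int)
  where
    go s n []     = s / fromIntegral n
    go s n (x:xs) = go (s + x) (n + 1) xs

-- Fixed: bang patterns force the running sum and count at each step,
-- keeping the accumulators in constant space.
meanStrict :: [Double] -> Double
meanStrict = go 0 (0 :: Int)
  where
    go !s !n []     = s / fromIntegral n
    go !s !n (x:xs) = go (s + x) (n + 1) xs
```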

The team working on the Genesis design started onboarding the team of engineers that will implement the new Genesis protocol. This team is also finalizing the statistical model for historical Genesis feasibility.

On the support front, the team drafted an information exchange requirement (IER) for the Networking team to safely and efficiently control peer load.

· 3 min read
Michael Karg

High level summary

  • Benchmarking: The performance investigation into the compiler switch to GHC9 is ongoing. Additionally, a roadmap for implementing Consensus QTAs has been developed.
  • Infrastructure: Our workbench has undergone some refactoring to seamlessly integrate its profiles into all available backends.
  • Tracing: Optimization of the new tracing system is ongoing and yielding good performance results.
  • Nomad backend: We developed a new feature for the Nomad backend which allows pinning deployments to specific machines.

Low level overview

Benchmarking

Our analysis of the GHC9 build of cardano-node has identified several locations in the code base where the new compiler seems to miss opportunities for optimization. Our hypothesis is that these account for the difference in resource usage we observe when benchmarking with a full cluster run. Instructing the compiler to perform the optimizations which GHC8 apparently applied out of the box requires further investigation.

In an effort to define Quantitative Timeliness Agreements (QTAs) on a per-component basis, we have coordinated with the Consensus team and developed a roadmap for providing those on the consensus level. Building on the insights that system-level benchmarks provide, we intend to set up and calibrate a benchmark that can reliably predict a regression or optimization for select metrics before full integration into cardano-node is needed. This will help tremendously in various ways: catching regressions much earlier, localizing them more easily, avoiding repeated component integration, and shortening the feedback cycle.
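
A minimal sketch of the gating idea: time a component-level operation in isolation and fail when it exceeds an agreed budget. Here applyBlocks is a placeholder workload and the 50 ms budget is an assumption, not actual consensus code or an agreed QTA value:

```haskell
import Control.Monad (replicateM_)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Exit (exitFailure)

-- Stand-in for the real component-level workload under measurement.
applyBlocks :: IO ()
applyBlocks = replicateM_ 100000 (pure ())

main :: IO ()
main = do
  t0 <- getCurrentTime
  applyBlocks
  t1 <- getCurrentTime
  let elapsed = realToFrac (diffUTCTime t1 t0) :: Double
      budget  = 0.050  -- hypothetical 50 ms QTA budget
  putStrLn $ "elapsed " ++ show elapsed ++ "s, budget " ++ show budget ++ "s"
  if elapsed > budget
    then putStrLn "QTA violated: likely regression" >> exitFailure
    else putStrLn "within budget"
```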

Infrastructure

We have worked on the seamless integration of our benchmarking profiles into the many available backends that the workbench provides. The goal was to be backend-agnostic, guaranteeing that all benchmarking run artifacts are structurally identical as far as their file name, format and location are concerned. This led to refactoring work which has already landed in master.

Tracing

Much effort went into further optimization of the new tracing system. After aligning the configuration of the new and legacy tracing systems with regard to their trace frequencies, we uncovered an increase in resource usage in corner cases under very heavy load. These cases have already been addressed, and the new system now surpasses the legacy tracing system in terms of performance.
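
The kind of mechanism involved can be pictured as a simple frequency limiter: trace messages arriving faster than a configured interval are dropped. The single-threaded sketch below captures the idea only and is not the trace-dispatcher API:

```haskell
import Data.IORef (newIORef, readIORef, writeIORef)
import Data.Time.Clock (NominalDiffTime, diffUTCTime, getCurrentTime)

-- Returns a wrapper that runs the given emit action only if at least
-- 'minInterval' has elapsed since the last accepted message.
mkLimiter :: NominalDiffTime -> IO (IO () -> IO ())
mkLimiter minInterval = do
  lastRef <- newIORef Nothing
  pure $ \emit -> do
    now   <- getCurrentTime
    mLast <- readIORef lastRef
    case mLast of
      Just t | diffUTCTime now t < minInterval ->
        pure ()  -- too soon: drop the trace message
      _ -> do
        writeIORef lastRef (Just now)
        emit
```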

Nomad backend

For reliable benchmarking results it is vital to introduce as few confounding factors as possible when performing runs; this includes hardware and network topology. The Nomad backend has been outfitted with a mechanism to pin the Nomad job for a given node in our benchmarking cluster to a specific machine instance. This greatly increases confidence in the metrics observed from a run.

Furthermore, this feature will detect any change in the underlying hardware or topology, so it can be taken into account. The new feature has been merged to master.