
· 5 min read
Michael Karg

High level summary

  • Benchmarking: Various maintenance to prepare for upcoming Node 10.6 changes; metrics migration guide for SPOs.
  • Development: Prototyping a PromQL-based alert system for new tracing.
  • Infrastructure: Located a space leak that was interfering with on-disk benchmarks.
  • Tracing: Equipping dmq-node with the new tracing system; cardano-tracer library and API ongoing.
  • Leios: Impact analysis and tech design; preparation for simulations hand-off.
  • Hydra: Kick-off for development of system integration level benchmarks.
  • Node diversity: Trace semantics specification ongoing; example test case ready for merging.

Low level overview

Benchmarking

We've performed various maintenance tasks to adapt our automation to several breaking changes that come with Node 10.6. There are new constraints on Plutus cost models in genesis files which make it more difficult to set up a customized testnet, like our benchmarking cluster. Furthermore, we've been assisting with integrating and debugging the upcoming release's components, such as tx validation in the cardano-api, or the implementation of new trace messages the Node emits.

As the new tracing system will be the default on Node 10.6, we've created a tool which, given a side-by-side listing of metrics names from both systems, will automatically generate an exhaustive migration guide for SPOs. As the metrics names have changed slightly, a migration for existing monitoring setups is required as a one-off effort. The migration guide will be published on the Cardano Developer Portal.
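To illustrate the idea behind such a tool, here is a minimal sketch (not the actual implementation): given a side-by-side listing of legacy and new metric names, it renders a markdown mapping table of the kind a migration guide would contain. The metric names used in the example are hypothetical.

```python
# Hypothetical sketch of a metrics migration-guide generator.
# Input: a "legacy,new" side-by-side listing; output: a markdown table.
import csv
from io import StringIO

def migration_guide(side_by_side: str) -> str:
    """Render a markdown table mapping legacy metric names to new ones."""
    rows = list(csv.reader(StringIO(side_by_side)))
    lines = ["| Legacy metric | New metric |", "| --- | --- |"]
    for legacy, new in rows:
        lines.append(f"| `{legacy.strip()}` | `{new.strip()}` |")
    return "\n".join(lines)

# Example listing with made-up metric names, for illustration only.
example = (
    "cardano.node.metrics.blockNum,blockNum\n"
    "cardano.node.metrics.slotNum,slotNum"
)
print(migration_guide(example))
```

Generating the guide from the listing keeps it exhaustive by construction: any renamed metric that appears in the listing appears in the published table.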

Whilst the legacy tracing system is still fully operational in Node 10.6, the release marks the beginning of its deprecation period - giving SPOs sufficient time to adjust their setups.

Development

While cardano-tracer logs forwarded trace messages according to detailed configs, and exposes forwarded metrics, it does not provide any built-in functionality for monitoring and alerts. The experimental RTView component, which can still be built for cardano-tracer and remains fully functional, was an attempt to provide dashboards and alerts out of the box. However, due to its restricted design and low interoperability with existing monitoring standards, it has been discontinued.

Currently, we're taking another stab at this: we're building a prototype that creates timeseries directly from observed metrics, and is able to parse and evaluate PromQL queries referring to them. Based on that prototype, we'll assess the resource usage and feasibility of fully building that feature into cardano-tracer. As most monitoring alerts can be (and are) defined as conditions on PromQL query results, and PromQL is an established industry standard, we see a low barrier to adoption. Furthermore, if built in a sufficiently modular way, it would eliminate the need to operate additional 3rd party services for scraping metrics and monitoring for alert conditions - at least in some usage scenarios.
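The core idea - an alert is a condition on the result of a query over a metrics timeseries - can be shown with a toy illustration (this is not the prototype itself; metric name, window size, and threshold are all made up):

```python
# Toy illustration: keep an in-memory timeseries per observed metric and
# evaluate an alert as a threshold condition on a windowed aggregate,
# analogous to a PromQL rule like: avg_over_time(metric[1m]) > limit
from collections import defaultdict, deque

class MetricStore:
    def __init__(self, window: int = 60):
        # One bounded sample buffer per metric name.
        self.series = defaultdict(lambda: deque(maxlen=window))

    def observe(self, name: str, value: float) -> None:
        self.series[name].append(value)

    def avg_over_time(self, name: str) -> float:
        samples = self.series[name]
        return sum(samples) / len(samples) if samples else 0.0

store = MetricStore()
for v in (80.0, 90.0, 100.0):          # hypothetical observations
    store.observe("mempool_bytes", v)

# The alert fires when the windowed average exceeds the limit.
alert = store.avg_over_time("mempool_bytes") > 85.0
print(alert)  # True: the average of the three samples is 90.0
```

A real PromQL evaluator would of course parse arbitrary expressions and track timestamps; the sketch only shows why evaluating such conditions in-process removes the need for an external scraper in simple setups.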

Infrastructure

With help and support from the Consensus team (Gracias, Javier!), we were able to locate a space leak that affected on-disk benchmarks. While the current on-disk benchmarks are representative, the space leak prevented us from scaling memory limits for the Node process with finer granularity. The fix will be merged after the 10.6 release, and will be of much use when we run comparative benchmarks of the LMDB and the new lsm-trees on-disk backing stores.

Tracing

We've been working on equipping the Network team's new dmq-node service with our new tracing system, trace-dispatcher - still a work in progress. Currently, the dmq-node uses plain contravariant tracing. Having a configurable system, with an API abstraction to define metadata and message serializations as well as process metrics, is a necessary step towards production-ready tracing. The added benefit of using trace-dispatcher is the reusability of definitions already implemented in cardano-node, and a uniform way in which cardano-node and dmq-node are configured, as well as how they expose trace evidence and metrics.

The work on cardano-tracer as a library, with a principled API and intra-process communications, is ongoing. Implementation is almost complete and will enter the testing phase soon.

Leios

We've contributed to the Leios Impact Analysis. The Performance & Tracing section summarizes how implementing, and eventually benchmarking, Leios will impact various aspects of our approach. This spans from adding new Leios-specific observables to component code, to deriving submission workloads suitable for Leios, to finding a scaling approach that allows us to correlate performance observations with exact changes in Leios protocol parameters.

Additionally, we're currently working on the Leios technical design and implementation plan, which lays out our approach and derives some very high-level steps on how to realize it, based on the impact analysis.

Hydra

We've kicked off a collaboration with the Hydra team. The goal is to build system integration level benchmarks for Hydra, which can target system pressure points, and which are able to scale operations in various directions - much akin to what we're doing for the Cardano node. Eventually, those benchmarks should provide more than just immediate feedback for Hydra engineers; they should be able to yield realistic figures of what to expect from operating one (or more) Hydra heads, and what the resource requirements are to achieve that. Currently, we're familiarizing ourselves with the system and its built-in observability to assess if it meets the requirements for the benchmarks we have in mind.

Node diversity

The work on (multi-)node conformance testing is ongoing. We're in the process of creating a specification document for the semantics of existing Node traces. While a few of them might be unique to the Haskell implementation, the majority documents how Cardano itself runs and behaves; those traces can be implemented accordingly across diverse Node projects.

Our own demonstration test case for conformance testing is fully implemented and ready to be merged after the Node 10.6 release. It validates a metric exposed by a node with respect to the trace evidence in its logs and the internal events the metric is based on; see cardano-node PR#6322.
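The shape of such a conformance check can be sketched as follows (a hypothetical illustration, not the PR's code; the log format and trace kind are invented): a counter metric exposed by a node should agree with the number of corresponding trace messages found in its logs.

```python
# Hypothetical conformance check: does a node-exposed counter metric
# match the count of corresponding trace messages in its JSON logs?
import json

def metric_matches_traces(log_lines, trace_kind, metric_value):
    """Count trace messages of the given kind and compare to the metric."""
    observed = sum(
        1 for line in log_lines
        if json.loads(line).get("kind") == trace_kind
    )
    return observed == metric_value

# Three made-up trace messages of the same kind.
logs = [json.dumps({"kind": "BlockAdopted"}) for _ in range(3)]
print(metric_matches_traces(logs, "BlockAdopted", 3))  # True
```

Because the check only relies on trace semantics rather than Haskell internals, the same test could in principle run against any node implementation that emits semantically equivalent traces.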

· 3 min read
Marcin Szamotulski
Marcin Wójtowicz

Overview of sprints 100 and 101

Cardano Incident

On November 12th, Cardano experienced a fork due to a bug in ledger (de-)serialisation. Some nodes accepted an invalid tx, while others rejected it, leading to a fork of the chain.

The network team worked closely with the other Cardano teams, Intersect, the Cardano Foundation, Emurgo, and stake pool operators to monitor and resolve the incident.

The network layer was not affected by the incident, and its resiliency played a role in the recovery of the network. We identified some areas that can further improve the network layer's robustness in such situations, and we'll be working on addressing these issues.

Churn Mitigation

We rolled out a churn mitigation included in the latest cardano-node 10.5 and 10.6 releases. This change ensures that hot peers churn at least as fast as established peers, and that established peers churn at least as fast as known peers. This way, we avoid a situation where, over a long period of time, established peers accumulate already-tried hot peers, and cold peers accumulate already-tried established peers. See #5238 for more details.
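The invariant can be stated compactly: since churn speed is the inverse of the churn interval, requiring hot ≥ established ≥ known speeds means clamping the intervals the other way. A small sketch of that invariant (not the cardano-node code; interval values are arbitrary):

```python
# Sketch of the churn-speed ordering invariant. Churn "speed" is the
# inverse of the churn interval, so enforcing
#   speed(hot) >= speed(established) >= speed(known)
# means the intervals must satisfy hot <= established <= known.
def clamp_churn_intervals(hot: float, established: float, known: float):
    """Return intervals (in seconds) satisfying hot <= established <= known,
    i.e. hot peers churn at least as fast as established peers, and
    established peers at least as fast as known peers."""
    established = max(established, hot)
    known = max(known, established)
    return hot, established, known

# A misconfigured ordering gets clamped...
print(clamp_churn_intervals(300, 200, 100))  # (300, 300, 300)
# ...while an already-valid ordering is left untouched.
print(clamp_churn_intervals(60, 120, 240))   # (60, 120, 240)
```

With this ordering in place, faster tiers cannot stagnate behind slower ones, which is exactly the accumulation of already-tried peers the mitigation avoids.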

The issue was identified using CF's cardano-ignite tool, with analysis provided by Karl Knutsson.

DMQ-Node

We are initiating a public repository for dmq-node to host its codebase.

We removed the KES evolution configuration; as a consequence, the genesis file option is no longer needed in the configuration of dmq-node, see #5244.

Ouroboros Leios development

We have contributed improvements to the Consensus team's Leios demo, which includes minimal prototypes of a few mini-protocols and a simple network emulation layer for exchanging data between a Leios server and patched Cardano nodes. Our updates addressed unrealistic results stemming from the use of the toxiproxy program for modeling bandwidth and delays. We have also improved our packet capture tooling and gathered some preliminary data for analysis. Based on this early investigation, we have identified a few areas where non-trivial changes may have to be made to the network stack to support the new protocol, as well as new features which could be introduced to deal with greatly expanded traffic requirements, while maintaining the base Praos protocol's timeliness guarantees. More details were provided at the recent November Leios demo presentation, and further work will continue along these lines.

Peer Selection Improvements

Refined peer selection for local root peers behind firewalls: instead of polling, the node now waits for incoming connections and reuses them for outbound communication. See #5241 for details.

· 2 min read
Jean-Philippe Raynaud

High level overview

This week, the Mithril team continued to implement the first phase of decentralizing the configuration parameters. They also completed enhancements to the client library and CLI, providing access to Cardano database incremental snapshots by epoch. Additionally, they kept working on the design of the snarkification of the certificates.

Finally, the team added a section about compatibility with the Cardano node in the GitHub release notes and worked on adapting to the security changes for NPM package publication.

Low level overview

Features

  • Completed the issue Provide Cardano database incremental snapshots needed for Amaru bootstrap #2704
  • Worked on the issue Decentralization of configuration parameters - Phase 1 #2692
  • Worked on the issue Integrate the Haskell DMQ node in the e2e test #2674
  • Worked on the issue Release 2543 distribution #2727
  • Worked on the design of the snarkification of the certificates

Protocol maintenance

  • Completed the issue Cardano node compatibility in GitHub release notes #2743
  • Worked on the issue Implement a common aggregator client - Phase 1 #2640
  • Worked on the issue Enhance protocol security page on website #2703
  • Worked on the issue Support NPM security changes with trusted publisher tokens #2745

· One min read
Noon van der Silk

What did the team achieve?

What's next?

  • Release 1.1.0 including deposit improvements #2180
  • Continue working on partial fanout #1468
  • Merge documentation for committing a script UTxO #1906
  • Finish working on HTLC example in Hydra #2080
  • Continue working on chain following speedups #2206
  • Continue supporting the Glacier Drop

· 2 min read
Jean-Philippe Raynaud

High level overview

This week, the Mithril team successfully integrated the Haskell DMQ node with Mithril nodes in the end-to-end tests. They continued to implement the first phase of decentralizing the configuration parameters. They also worked on some enhancements to the client library and CLI to provide access to Cardano database incremental snapshots by epoch. Additionally, they updated the signer to ensure it has a unique protocol initializer for each epoch.

Finally, the team fixed some bugs in the CI and completed the first refactoring of the STM cryptographic library that enhances its readability.

Low level overview

  • Completed the issue Enhance STM library readability - 1/2 #2568
  • Completed the issue Mithril signer should have a unique protocol initializer for an epoch #2424
  • Completed the issue DMQ integration tests are flaky in CI #2728
  • Completed the issue Backward compatibility with latest release is broken #2730
  • Worked on the issue Decentralization of configuration parameters - Phase 1 #2692
  • Worked on the issue Integrate the Haskell DMQ node in the e2e test #2674
  • Worked on the issue Implement a common aggregator client - Phase 1 #2640
  • Worked on the issue Enhance protocol security page on website #2703
  • Worked on the issue Release 2542 distribution #2727