
46 posts tagged with "network"

· 2 min read
Marcin Szamotulski

High-level overview of sprint 51

Outbound Governor Bug in cardano-node-8.7.2

In the current sprint, we received a number of reports from SPOs about nodes failing to maintain some connections when running cardano-node-8.7.2 in P2P mode. Such regressions are very important to us since they can lead to lost blocks. We were able to reproduce the issue. Every time there is a longer pause in block production (due to the statistical nature of Ouroboros), there is a chance that the bug will be armed. For this reason, cardano-node-8.7.2 needs to be closely monitored.
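To give a feel for how often "longer pauses" in block production occur, here is a back-of-the-envelope sketch. It assumes mainnet's active slot coefficient f = 0.05 and one-second slots, and simplifies by treating each slot as independently non-empty with probability f. It is a rough model for intuition, not how the node itself reasons about slots.

```python
ACTIVE_SLOT_COEFF = 0.05   # assumption: mainnet active slot coefficient f
SLOTS_PER_DAY = 86_400     # assumption: one-second slots


def prob_gap_at_least(slots: int, f: float = ACTIVE_SLOT_COEFF) -> float:
    """Probability that the `slots` slots following a block are all empty.

    Simplified model: each slot independently has at least one leader with
    probability f, so a slot is empty with probability (1 - f).
    """
    return (1.0 - f) ** slots


def expected_gaps_per_day(slots: int, f: float = ACTIVE_SLOT_COEFF) -> float:
    """Expected number of pauses of at least `slots` slots per day.

    Every non-empty slot starts a potential gap, and there are about
    SLOTS_PER_DAY * f of them per day.
    """
    blocks_per_day = SLOTS_PER_DAY * f
    return blocks_per_day * prob_gap_at_least(slots, f)


for seconds in (30, 60, 120, 180):
    print(f"pause >= {seconds:3d}s: ~{expected_gaps_per_day(seconds):7.1f} per day")
```

Under these assumptions, pauses of one to two minutes happen several times a day and pauses of a few minutes every couple of days, so a bug armed by long pauses does not need unusual conditions to show up.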

We found the bug and developed a fix, ref. Karl Knutsson (CF) has not been able to reproduce the bug with the patched version of the node for long enough (almost two weeks now) for us to believe that the fix is correct.

Advice for SPOs

We created a release branch for 8.7.3. The advice from the network team is to either downgrade to the previous release, e.g. 8.1.2, or use the above release branch (note that no benchmarks or Q&A tests have been run on it yet).

Testing plans

We were also able to reproduce the bug using IOSim, ouroboros-network#4757. However, the bug relies on a particular schedule of the two threads involved, and we needed to artificially modify the IOSim schedule in production code, something that we don't want to commit to the master branch. We also experimented with a randomised scheduler for IOSim, but it did not find the schedule which arms the bug: the search space grows exponentially with the number of steps in the threads. The partial order reduction techniques implemented in IOSimPOR are more appropriate, but unfortunately the simulation test is too large to be executed in IOSimPOR, even with large amounts of RAM. To use IOSimPOR, we need to implement a test which includes the two interacting components:

  • connection-manager
  • outbound-governor (where the bug was located)

which communicate through PeerStateActions, without including all the diffusion components as we do in our sim-net tests. This would be more in the style of the outbound-governor tests, where there is just a single outbound-governor, unlike sim-net, which runs multiple communicating diffusions.
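The scheduling problem can be illustrated with a toy example. The sketch below is not the outbound-governor bug: it is a hypothetical lost-update race between two threads, each doing some thread-local work followed by a read-modify-write of a shared counter. It enumerates all schedules that respect each thread's program order, then groups them by the order of the steps that touch shared state. That grouping is only a crude stand-in for the equivalence exploited by partial order reduction; IOSimPOR's actual algorithm is more refined.

```python
from itertools import combinations
from math import comb


def thread(name, local_steps):
    """Toy thread: `local_steps` thread-local steps, then a read-modify-write
    of one shared counter (a classic lost-update race)."""
    return [(name, "work")] * local_steps + [(name, "read"), (name, "write")]


def interleavings(a, b):
    """Every schedule of `a` and `b` that keeps each thread's own step order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]


def run(schedule):
    """Execute a schedule; returns the final counter (2 is correct, 1 is the bug)."""
    shared, local = 0, {}
    for name, op in schedule:
        if op == "read":
            local[name] = shared
        elif op == "write":
            shared = local[name] + 1
        # "work" steps touch only thread-local state
    return shared


def shared_projection(schedule):
    """Drop thread-local steps: they commute with every other step."""
    return tuple(step for step in schedule if step[1] != "work")


for k in range(5):
    schedules = list(interleavings(thread("A", k), thread("B", k)))
    assert len(schedules) == comb(2 * k + 4, k + 2)
    orderings = {shared_projection(s) for s in schedules}
    buggy = sum(run(s) == 1 for s in schedules)
    print(f"local steps={k}: schedules={len(schedules):4d} "
          f"shared-step orderings={len(orderings)} buggy schedules={buggy}")
```

The number of schedules grows combinatorially with the number of local steps (C(2k+4, k+2) here), while the number of orderings of shared-state steps stays at 6. This is the intuition for exploring equivalence classes of schedules rather than raw schedules.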

Bootstrap peers

We continued working on bootstrap peers, ouroboros-network#4555.

TxSubmission Decision Logic

We continued working on tx-submission decision logic, ouroboros-network#3311.

· 3 min read
Marcin Szamotulski

High-level overview of sprint 49 & sprint 50

Fixed PeerSelection bug

Karl Knutsson (Cardano Foundation (CF)) found a bug in the cardano-node-8.7.0 version used on the Sancho Net, which was fixed in 8.7.1. It resulted in a node not being able to reconnect to an upstream peer once it was demoted by an asynchronous exception. This bug would have been caught by Q&A in a mainnet release, but the Q&A test suite is not run for testnet releases. We also developed a test in ouroboros-network which covers the bug, and we identified a missing PeerSelection test which we need to port to our simulation network. See ouroboros-network#4734, ouroboros-network#4665.

Bootstrap Peers

Still under review, ouroboros-network#4555. The consensus team is now implementing the API we need for bootstrap peers. Once the consensus API is implemented, we will integrate the changes in an experimental branch of cardano-node.

Tx-Submission

We started working on a new implementation of the tx-submission application. No tx-submission protocol changes are foreseen, but we want to be able to download each tx from just one upstream peer and share the result between different connections. We want to distribute the bandwidth between multiple clients. We also think that this work will prepare us for the future Ouroboros-Leios changes, which will contain various tx-submission-like mini-protocols. See ouroboros-network#4701.
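The intended decision logic can be sketched as follows. This is a minimal, hypothetical model, not the design in ouroboros-network#4701: all clients share one state, each advertised txid is requested from exactly one peer, and the peer with the fewest requests in flight is picked as a simple way of spreading bandwidth. The names SharedTxState, decide and completed are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class SharedTxState:
    """Hypothetical state shared by all tx-submission clients of one node."""
    in_flight: dict = field(default_factory=dict)   # txid -> peer fetching it
    downloaded: set = field(default_factory=set)    # txids already in hand
    load: dict = field(default_factory=dict)        # peer -> txids in flight


def decide(state, advertised):
    """Assign each newly advertised txid to a single peer.

    `advertised` maps peer -> txids that peer announced. Returns
    peer -> txids to request from that peer. Among the peers that
    advertised a txid, the least-loaded one wins (ties broken by name).
    """
    decisions = {peer: [] for peer in advertised}
    candidates = {}
    for peer, txids in advertised.items():
        for txid in txids:
            candidates.setdefault(txid, []).append(peer)
    for txid, peers in candidates.items():
        if txid in state.downloaded or txid in state.in_flight:
            continue  # another connection already has it or is fetching it
        peer = min(peers, key=lambda p: (state.load.get(p, 0), p))
        state.in_flight[txid] = peer
        state.load[peer] = state.load.get(peer, 0) + 1
        decisions[peer].append(txid)
    return decisions


def completed(state, peer, txid):
    """Record a finished download; the tx is now shared by every connection."""
    del state.in_flight[txid]
    state.load[peer] -= 1
    state.downloaded.add(txid)
```

For example, if peer p1 advertises t1 and t2 and peer p2 advertises t1, t2 and t3, t1 goes to p1 and t2 and t3 go to p2. Once a tx is downloaded through one connection, later advertisements of the same txid from other peers are skipped.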

Peer Sharing

Various fixes and improvements were implemented:

  • ouroboros-network#4725

    • disabled peer sharing with initiator-only nodes: currently it's not possible to get peers from initiator-only nodes (edge nodes, e.g. wallets). In the future, we might change this, which will require such nodes to run the server side of the peer-sharing protocol. See ouroboros-network#4726.
    • fixed peer-sharing codec
    • fixed a handshake bug which returned a wrong peer-sharing option
  • ouroboros-network#4728

    • disabled peer-sharing for NodeToNodeV_11 and NodeToNodeV_12
  • Karl Knutsson (CF) has been working on additional improvements, e.g. ouroboros-network#4735
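The effect of these peer-sharing changes can be summarised as a predicate evaluated after the handshake. The sketch is hypothetical: the function, the enum and the way local and remote willingness are combined are our illustration, not the ouroboros-network code. Only the two rules (no peer sharing with initiator-only nodes, none for NodeToNodeV_11 and NodeToNodeV_12) come from the fixes listed above.

```python
from enum import Enum


class DiffusionMode(Enum):
    """Hypothetical stand-in for a node's diffusion mode."""
    INITIATOR_ONLY = "initiator-only"                 # edge nodes, e.g. wallets
    INITIATOR_AND_RESPONDER = "initiator-and-responder"


# NodeToNodeV_11 and NodeToNodeV_12: peer sharing disabled.
PEER_SHARING_DISABLED_VERSIONS = {11, 12}


def peer_sharing_enabled(version, local_wants, remote_wants, remote_mode):
    """Hypothetical check: may we ask this peer for peers?"""
    if version in PEER_SHARING_DISABLED_VERSIONS:
        return False
    if remote_mode is DiffusionMode.INITIATOR_ONLY:
        # initiator-only nodes do not run the server side of peer sharing
        return False
    return local_wants and remote_wants
```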

With these fixes, Karl Knutsson (CF) was able to see that two peers on mainnet can discover each other through peer-sharing and remain mutually useful, so that their connection survives outbound-governor churn events.

IOSim

We improved the memory footprint of IOSim in io-sim#126, see ouroboros-network#4721 for heap profile improvements on large test cases.

We are working on optimising the memory footprint of IOSimPOR. We are reimplementing VectorClocks using a trie instead of a map, which leads to significant improvements.
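A vector clock maps each thread to a logical time and is used to decide which steps are causally related. The sketch below stores one as a trie. It assumes that thread ids are sequences of integers recording the fork path (our understanding of IOSim's thread ids), so threads forked from the same parent share a key prefix that the trie stores only once. Whether prefix sharing is the source of the improvement in IOSimPOR is our assumption, and the class is illustrative, not io-sim's implementation.

```python
class VClock:
    """Vector clock stored as a trie keyed by thread-id paths.

    Assumption: thread ids are tuples of ints describing the fork path;
    the empty tuple () is the main thread and is stored at the root.
    """
    __slots__ = ("time", "children")

    def __init__(self):
        self.time = 0        # logical time of the thread whose id ends here
        self.children = {}   # next path element -> sub-trie

    def get(self, tid):
        node = self
        for step in tid:
            node = node.children.get(step)
            if node is None:
                return 0     # threads never seen have time 0
        return node.time

    def tick(self, tid):
        """Advance the clock entry of thread `tid` by one."""
        node = self
        for step in tid:
            node = node.children.setdefault(step, VClock())
        node.time += 1

    def join(self, other):
        """Pointwise maximum with `other`, merged recursively (mutates self)."""
        self.time = max(self.time, other.time)
        for step, sub in other.children.items():
            self.children.setdefault(step, VClock()).join(sub)

    def items(self, prefix=()):
        """Yield (thread id, time) for every non-zero entry."""
        if self.time:
            yield prefix, self.time
        for step, sub in self.children.items():
            yield from sub.items(prefix + (step,))

    def leq(self, other):
        """True if every entry of self is <= the matching entry of other."""
        return all(t <= other.get(tid) for tid, t in self.items())
```

Two events are concurrent when neither clock is `leq` the other; those are the pairs whose order a partial order reduction explorer has to consider.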

Cardano-Ping

cardano-ping-0.2.0.10 was released to CHaP, ouroboros-network#4746. This version exports more APIs, which turned out to be useful in the cardano-node test suite, see cardano-node#5536.

Technical Debt

We addressed some small tech-debt issues in ouroboros-network#4722:

  • fixed some typos
  • using bracket instead of onException in withSnocket
  • improved haddocks
  • organised TracePeerSelection constructors
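The difference between the two combinators can be illustrated with a Python analogue (this is not the withSnocket code itself). An onException-style helper releases the resource only when the body throws, while a bracket-style helper releases it on every exit path. Haskell's bracket additionally masks asynchronous exceptions while acquiring and releasing, which this sketch cannot express.

```python
def with_resource_on_exception(acquire, release, use):
    """onException-style: release only if `use` raises."""
    resource = acquire()
    try:
        return use(resource)
    except BaseException:
        release(resource)
        raise
    # on normal return the resource is NOT released here


def with_resource_bracket(acquire, release, use):
    """bracket-style: release on every exit path, normal or exceptional."""
    resource = acquire()
    try:
        return use(resource)
    finally:
        release(resource)
```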

We improved the memory footprint of some of our tests in ouroboros-network#4721.

· One min read
Marcin Szamotulski

High-level overview of sprint 48

Bootstrap Peers

We continued reviewing bootstrap peers, ouroboros-network#4555.

IOClasses / IOSim

We prepared slides for a Haskell meetup where we presented a talk on IOSimPOR. The recording will be available on YouTube.

We also used the opportunity to do some refactoring of the IOSim code base: io-sim#117. We released io-sim-1.3.0.0 on Hackage: io-sim#119.

We also added forkFinally to MonadFork (not included in 1.3.0.0 release): io-sim#123.
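We assume the MonadFork method mirrors Control.Concurrent.forkFinally: it runs an action in a new thread and then passes the outcome (either the exception the action threw or its result) to a handler. A rough Python analogue, with Haskell's Either encoded as a tagged tuple, looks like this. Unlike the Haskell version, it does not deal with asynchronous exceptions or masking.

```python
import threading


def fork_finally(action, and_then):
    """Run `action` in a new thread, then call `and_then` with its outcome.

    The outcome mimics Haskell's Either SomeException a:
    ("Left", exc) if `action` raised, ("Right", value) otherwise.
    """
    def runner():
        try:
            outcome = ("Right", action())
        except BaseException as exc:
            outcome = ("Left", exc)
        and_then(outcome)

    t = threading.Thread(target=runner)
    t.start()
    return t
```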

Tech debt

We refactored the Resource used by the DNS subsystem: ouroboros-network#4707. We continued reviewing the ouroboros-network#4625 PR, which refactors the RootPeersDNS module.

· 2 min read
Marcin Szamotulski

High-level overview of sprint 47

Bootstrap Peers

We continued reviewing bootstrap peers, see ouroboros-network#4555.

CI / Tests

We investigated our CI issues. We found a memory leak in a typed-protocols function used for testing codecs, which triggered the out-of-memory (OOM) killer on some platforms (typed-protocols#43); we also found a bug in the connection manager which resulted in CI timeouts (see connection-manager-fix).

KeepAlive client

We found two small issues with the keep-alive client, which were addressed by Karl Knutsson (Cardano Foundation), ouroboros-network#4689.

Galois

We merged two large PRs prepared by Galois:

Cardano Network Service Assurance (CNSA)

Galois made the following progress on CNSA:

  • a simple InfluxDB database backend has been added;
  • the documentation has been updated;
  • internal improvements to the code;
  • progress on a new "CNSA analysis" that provides, for each sampler node, the block download throughput in bytes over time.
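As an illustration of the kind of computation such an analysis involves, the sketch below buckets per-node block download sizes into fixed time windows and reports bytes per second. The input format (node, timestamp, block size), the function name and the 60-second window are assumptions for the example, not the CNSA data model.

```python
from collections import defaultdict


def throughput_by_window(samples, window_s=60.0):
    """Bytes downloaded per second, bucketed into fixed time windows.

    `samples` is an iterable of (node, timestamp_s, block_bytes) tuples,
    one per downloaded block (hypothetical input format). Returns
    {node: [(window_start_s, bytes_per_second), ...]} sorted by window;
    windows in which a node downloaded nothing are omitted.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for node, ts, size in samples:
        window_start = (ts // window_s) * window_s
        totals[node][window_start] += size
    return {
        node: [(start, total / window_s) for start, total in sorted(windows.items())]
        for node, windows in totals.items()
    }
```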

New CHaP Release

We cut a new release of ouroboros-network packages to CHaP: chap#547

More details

CI / Tests

We improved the memory footprint of some of our tests by analysing a stream of IOSim traces without retaining them, see ouroboros-network#4696.
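The idea is to consume the trace as a stream and keep only the state the property needs. The Python sketch below uses a generator as a stand-in for a lazily produced IOSim trace; the property (every connect is matched by a disconnect) and the event names are hypothetical. In Haskell, the equivalent requires that nothing holds a reference to the head of the trace while it is being folded.

```python
def simulate(n_conns):
    """Stand-in for a simulation producing a large trace lazily (hypothetical)."""
    for c in range(n_conns):
        yield ("connect", c)
        yield ("disconnect", c)


def connections_balanced(trace):
    """Streaming check: every connect is later matched by a disconnect.

    Only the set of currently open connections is kept, so memory is
    proportional to the number of concurrent connections rather than
    to the length of the trace.
    """
    open_conns = set()
    for event, conn in trace:
        if event == "connect":
            open_conns.add(conn)
        elif event == "disconnect":
            if conn not in open_conns:
                return False  # disconnect without a prior connect
            open_conns.remove(conn)
    return not open_conns
```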

As a safety measure, we introduced an upper bound on the heap memory used by test artefacts in our nix tests. We use a 200MB limit for all tests except the network-mux tests, which use a 350MB limit, see ouroboros-network#4702.

We refactored one of our tests to use ephemeral ports, thus allowing it to run concurrently, see ouroboros-network#4702.
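Ephemeral ports work by binding to port 0 and letting the operating system choose a free port, which the test then reads back, so concurrently running tests cannot collide on a hard-coded port. A minimal Python illustration of the mechanism using the standard socket module (the test in ouroboros-network#4702 is Haskell; the helper name is ours):

```python
import socket


def listen_on_ephemeral_port(host="127.0.0.1"):
    """Bind to port 0 so the OS picks a free port; return the socket and that port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen()
    return sock, sock.getsockname()[1]
```

Each call returns a different free port, so two instances of the same test can listen at the same time.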

We merged ouroboros-network#4623, which fixes a number of test failures. All of them were due to bugs in the test logic rather than in production code.

Release Process

We updated our release process & associated scripts, see ouroboros-network#4705.

· One min read
Marcin Szamotulski

High-level overview of sprint 46

Bootstrap Peers

We continued reviewing bootstrap peers, see ouroboros-network#4555.

Towards Typed Protocols 0.2.0.0

We diagnosed the performance regression of the new design. The work on typed-protocols will be postponed. For more details, see typed-protocols#3. As an outcome of the performance debugging, we prepared a PR which updates the demo-ping-pong and demo-chain-sync applications.
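The demo-ping-pong application is built around the ping-pong protocol from typed-protocols, with states StIdle (client agency), StBusy (server agency) and StDone (no agency), and messages MsgPing, MsgPong and MsgDone. typed-protocols enforces these transitions at the type level; the Python sketch below only checks them at run time, to show the state machine itself. The helper names are ours.

```python
# Who may send in each state (None marks the terminal state).
AGENCY = {"StIdle": "client", "StBusy": "server", "StDone": None}

# message -> (state it is sent from, state it leads to)
TRANSITIONS = {
    "MsgPing": ("StIdle", "StBusy"),
    "MsgPong": ("StBusy", "StIdle"),
    "MsgDone": ("StIdle", "StDone"),
}


def step(state, sender, msg):
    """Validate one message against the protocol and return the next state."""
    src, dst = TRANSITIONS[msg]
    if state != src:
        raise ValueError(f"{msg} not allowed in {state}")
    if AGENCY[state] != sender:
        raise ValueError(f"{sender} has no agency in {state}")
    return dst


def run(messages, state="StIdle"):
    """Replay a list of (sender, message) pairs from the initial state."""
    for sender, msg in messages:
        state = step(state, sender, msg)
    return state
```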

Peer Sharing

We made progress in the review of ouroboros-network#4644, which simplifies peer sharing and fixes the ouroboros-network#4642 issue.

Tech Debt

We reviewed the ouroboros-network#3836 PR, which inspects all uses of error in ouroboros-network. The PR was prepared by Galois.