High-level overview of sprint 51
Outbound Governor Bug in cardano-node-8.7.2
In the current sprint, we received a bunch of reports from SPOs about nodes not
maintaining some connection when using cardano-node-8.7.2
(running in P2P
mode). Such regressions are very important to us since they can lead to lost
blocks. We were able to reproduce this issue. Every time there's a longer
pause of block production (due to the statistical nature of Ouroboros), there
is a chance that the bug will be armed. For this reason cardano-node-8.7.2
needs to be closely monitored.
We found the bug and developed a fix, ref. Karl
Kntusson (CF) wasn't able to reproduce the bug with the patched version of
the node for long enough (almost two weeks now) for us to belive that the fix
is correct.
Advise for SPOs
We created a release branch for 8.7.3
. The advice from
the network team is to either downgrade to the previous release, e.g. 8.1.2
or use the above release branch (note that there were no benchmarks made or Q&A
tests yet).
Testing plans
We were also able to reproduce the bug using IOSim
, ouroboros-network#4757.
However, the bug relies on a particular schedule of two threads which are
involved and we needed to artificailly modify IOSim
schedule in production
code - something that we don't want to commit to the master
branch. We also
experimented with a randomised scheduler for IOSim
, but that did not lead to
finding the schedule which arms the bug: the search space grows exponentially
with the number of steps in the threads, partial order reduction techniques
implemented in IOSimPOR
are more appropriate - unfortunatelly the simulation
test is too large to be executed in IOSimPOR
even with large amounts of
RAM
. To use IOSimPOR
we need to implement a test which includes the two
interacting components:
connection-manager
outbound-governor
(where the bug was located)
which communicate through PeerStateActions
, without including all the
diffusion components as we do in our sim-net tests. More in style of
outbound-governor
tests where there is just a single outbound-governor
,
unlike in the sim-net which runs multiple communicating diffusions.
Bootstrap peers
We continued working on bootstrap peers, ouroboros-network#4555
TxSubmission Decision Logic
We continued working on tx-submission decision logic, ouroboros-network#3311