High-level overview of sprint 52
Happy New Year!
In this short sprint we analysed a failure which happened on a new large
cluster that's run by IOG. The process exhausted all file handles and was left
without any functional connections. The issues apparently is rare, and thus
doesn't impose a high risk.
We also continued working on tx-submission
: ouroboros-network-3311.
Detailed description
It turned out that the process exhausted the number of file handles leaking
multiple /proc/{PID}/stat
files open. We suspect that the bug is caused by
- using lazy IO in
iohk-monitoring-framework
, and - using a recent kernel version
With lazy IO file handles are read as long as the data is required and they are
closed only when EOF
is reached. We currently suspect that a new linux kernel
added something at the end of the /proc/{PID}/stat
which is not parsed by
iohk-monitoring-framework
, so whenever the file is read we leak it (it's
never closed) and eventually, there are no file handles to be used by the
network layer: the accept
loop doesn't return any inbound connection, neither
an outbound connection can be created. This issue will be addressed by the
profiling team (which owns the logging subsystem).
The fix will be proposed in the future release, in the meantime we suggest to
keep observing file handles used by the node.
I would like to thank John Lotoski (IOG), Karl Knutsson (CF), Neil Davies
(PNSol) and Michael Karg (IOG) who all contributed to this analysis.
While analysing the log we also found a few smaller issues in the outbound
governor which were fixed in [ouroboros-network-#4764].
The IO error indicating exhausting file handles is not currently visible. It
is not re-thrown nor logged. This needs to be fixed in a future version. See
ouroboros-network-4769.