
Network Team Update

· 2 min read
Marcin Szamotulski

High-level overview of sprint 52

Happy New Year!

In this short sprint we analysed a failure which happened on a new large cluster that's run by IOG. The process exhausted all of its file handles and was left without any functional connections. The issue appears to be rare, and thus doesn't pose a high risk.

We also continued working on tx-submission: ouroboros-network-3311.

Detailed description

It turned out that the process exhausted its file handles by leaking many open handles to /proc/{PID}/stat. We suspect that the bug is caused by:

  • using lazy IO in iohk-monitoring-framework, and
  • using a recent kernel version

With lazy IO, a file's contents are read only as they are demanded, and the handle is closed only when EOF is reached. We currently suspect that a recent Linux kernel added something at the end of /proc/{PID}/stat which is not parsed by iohk-monitoring-framework, so every time the file is read its handle is leaked (it is never closed). Eventually there are no file handles left for the network layer: the accept loop doesn't return any inbound connections, nor can any outbound connection be created. This issue will be addressed by the profiling team (which owns the logging subsystem).
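To illustrate the suspected mechanism, here is a minimal sketch. It is not the code from iohk-monitoring-framework; the function names are hypothetical and the whitespace-based field splitting is a simplification. The first function shows the leak-prone pattern: only a prefix of a lazily read file is demanded, so EOF is never reached by the read itself. The second uses a strict read from the bytestring package, which closes the handle before returning no matter how many fields are parsed afterwards.

```haskell
import qualified Data.ByteString.Char8 as BS

-- Leak-prone pattern (hypothetical): Prelude's readFile is lazy, so if only
-- the first n fields are ever demanded, the rest of the file is not read
-- and the read itself never reaches EOF to close the handle.
lazyFirstFields :: Int -> FilePath -> IO [String]
lazyFirstFields n path = take n . words <$> readFile path

-- Strict alternative (hypothetical): BS.readFile reads the whole file and
-- closes the handle before returning. Extra trailing fields added by a
-- newer kernel are simply dropped by `take`.
strictFirstFields :: Int -> FilePath -> IO [BS.ByteString]
strictFirstFields n path = take n . BS.words <$> BS.readFile path
```

With the strict variant, the lifetime of the handle no longer depends on how much of the file the parser consumes.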

A fix will be proposed in a future release; in the meantime we suggest monitoring the number of file handles used by the node.
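One possible way to watch file handle usage on Linux is to count the entries in /proc/{PID}/fd. The sketch below assumes a Linux host and the directory package; `openFdCount` and `fdWarning` are hypothetical names, and the 80% threshold is an arbitrary choice for illustration, not a recommendation from the node's documentation.

```haskell
import System.Directory (listDirectory)

-- Linux-only: every entry in /proc/<pid>/fd is one open file descriptor.
-- Pass "self" to inspect the current process; in that case the count
-- includes the descriptor opened temporarily to list the directory.
openFdCount :: String -> IO Int
openFdCount pid = length <$> listDirectory ("/proc/" ++ pid ++ "/fd")

-- Hypothetical policy: produce a warning once usage reaches 80% of the limit.
fdWarning :: Int -> Int -> Maybe String
fdWarning limit used
  | used * 5 >= limit * 4 =
      Just ("file handles: " ++ show used ++ " of " ++ show limit)
  | otherwise = Nothing
```

For example, `openFdCount "self" >>= print . fdWarning 1024` would print a warning only when the current process is close to a limit of 1024 descriptors. The actual limit should be taken from the environment the node runs in (for instance `ulimit -n`).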

I would like to thank John Lotoski (IOG), Karl Knutsson (CF), Neil Davies (PNSol) and Michael Karg (IOG) who all contributed to this analysis.

While analysing the log we also found a few smaller issues in the outbound governor which were fixed in ouroboros-network-4764.

The IO error indicating that file handles are exhausted is currently not visible: it is neither re-thrown nor logged. This needs to be fixed in a future version. See ouroboros-network-4769.
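As a rough idea of what making such an error visible could look like, the sketch below wraps an IO action so that any `IOError` is first passed to a logger and then re-thrown. This is not the change tracked in ouroboros-network-4769; `loggingIOErrors` and `describeIOError` are hypothetical names. It relies on `isFullError` from `System.IO.Error`, which in GHC's base matches the `ResourceExhausted` error type; that type covers running out of file descriptors but also other resource limits, hence the cautious wording of the message.

```haskell
import Control.Exception (catch, throwIO)
import System.IO.Error (isFullError)

-- Turn an IOError into a log message, flagging resource exhaustion.
describeIOError :: IOError -> String
describeIOError e
  | isFullError e =
      "resource exhausted (possibly out of file handles): " ++ show e
  | otherwise = "IO error: " ++ show e

-- Run an action; if it throws an IOError, log it and re-throw it so that
-- the caller still sees the failure.
loggingIOErrors :: (String -> IO ()) -> IO a -> IO a
loggingIOErrors logger action =
  action `catch` \e -> do
    logger (describeIOError e)
    throwIO e
```

A caller could wrap a single accept or connect step, for example `loggingIOErrors (hPutStrLn stderr) acceptOne`, where `acceptOne` stands for whatever action may hit the descriptor limit.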