78 posts tagged with "consensus"

· 3 min read
Damian Nadales

High level summary

During the past two weeks, the Consensus team finalized the QSM tests for the backing store and Mempool on the UTxO-HD branch, with important discoveries regarding parallel QSM testing. We also worked with the Ledger team to envisage the modifications that are required in Ledger and Consensus to accommodate the changes in the crypto VRF and KES. The db-analyser now supports benchmarking the ledger operations, which will allow us to identify, debug, and profile potential performance problems. We drafted a document that defines how to manage the versions of Consensus-related packages. The top level documentation of ouroboros-network now features a description of the consensus components and provides a hyperlinked map to the modules documentation.

Workstreams

UTxO HD prototype

Whereas we had passing sequential state-machine tests for the mempool, the parallel case proved to be more challenging than we thought. The operation of adding a list of transactions to the mempool is not atomic and, as a result, transactions from other processes can be added in between the transactions of that list. The mempool implementation handles this correctly; however, we had to redesign our parallel model to take the lack of atomicity into account.
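To illustrate why the parallel model has to allow for interleavings, here is a minimal Python sketch. It is not taken from the Haskell test suite, and all names in it are hypothetical. It enumerates every order in which two clients' transaction lists can land in the mempool when adding a list is not atomic, while preserving each client's own order. A parallel model could then accept any observed outcome that matches one of these interleavings.

```python
from typing import Iterator, List, Sequence


def interleavings(xs: Sequence[str], ys: Sequence[str]) -> Iterator[List[str]]:
    """Yield every merge of xs and ys that keeps the relative order within each list."""
    if not xs:
        yield list(ys)
        return
    if not ys:
        yield list(xs)
        return
    # Either the next transaction comes from the first client...
    for rest in interleavings(xs[1:], ys):
        yield [xs[0]] + rest
    # ...or it comes from the second one.
    for rest in interleavings(xs, ys[1:]):
        yield [ys[0]] + rest


def is_valid_outcome(observed: Sequence[str], xs: Sequence[str], ys: Sequence[str]) -> bool:
    """Check whether the observed mempool order is explained by some interleaving."""
    return any(list(observed) == candidate for candidate in interleavings(xs, ys))


# Hypothetical example: client A adds two transactions, client B adds one.
client_a = ["a1", "a2"]
client_b = ["b1"]
all_outcomes = list(interleavings(client_a, client_b))
```

An atomic model would only accept the outcomes in which `a1` and `a2` are adjacent; the non-atomic model also accepts `a1, b1, a2`.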

Backing store property tests

We finished refactoring the backing store property tests. The second review round is ongoing.

LSM tree implementation

We are working on benchmarking (in terms of time and number of IO operations) fetching/looking up data from disk.

Genesis

We worked on the design of a mechanism to prevent a DoS attack on our Genesis design related to rollbacks. This was arguably the biggest outstanding question.

During the discussions around Genesis, we noticed a design boundary that nicely delineates a fundamental component. We almost have a full Haskell prototype of it. It will be very nicely self-contained, perhaps even usable in the ultimate implementation!

New VRF and KES crypto integration

We collaborated with the Ledger team on preparing the ledger state and crypto types to avoid huge allocation on the epoch boundary when changing aspects of the crypto that will only manifest in headers, not in the ledger states.

Technical debt

We merged the pull request that adds support to db-analyser for benchmarking ledger operations. This will allow us to identify, debug, and profile potential performance problems. The benchmarks focus on the five main ledger operations that are involved in chain syncing, block forging, and block validation, namely:

  1. Forecast.
  2. Header tick.
  3. Header application.
  4. Block tick.
  5. Block application.

The following figure shows a plot of the benchmarking results for the first 65 million blocks (approximately) of the Cardano chain. The thin yellow lines under the x-axis show the epoch boundaries, whereas the thick yellow lines correspond to the era transitions.

As we can see in this figure, era and epoch boundaries require more computation time. The Ledger team is aware of this problem, and we are working to improve this situation.
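The kind of per-block measurement described above can be sketched as follows. This is a Python illustration only, not the db-analyser implementation: the operation names are taken from the list of five ledger operations, while the workloads and helper names are hypothetical placeholders.

```python
import time
from typing import Callable, Dict, List

# Names of the five ledger operations mentioned in the text.
OPERATIONS = [
    "forecast",
    "header_tick",
    "header_application",
    "block_tick",
    "block_application",
]


def time_operations(ops: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each operation once and return its wall-clock duration in seconds."""
    timings: Dict[str, float] = {}
    for name, op in ops.items():
        start = time.perf_counter()
        op()
        timings[name] = time.perf_counter() - start
    return timings


def placeholder_ops(block_no: int) -> Dict[str, Callable[[], object]]:
    """Hypothetical stand-in workloads; a real tool would call the ledger here."""
    return {name: (lambda n=block_no: sum(range(100 + n))) for name in OPERATIONS}


def benchmark_blocks(num_blocks: int) -> List[Dict[str, float]]:
    """Produce one row of timings per block, ready to be plotted against block number."""
    return [time_operations(placeholder_ops(i)) for i in range(num_blocks)]


rows = benchmark_blocks(3)
```

Keeping one row per block, with one column per operation, makes it straightforward to spot spikes at specific points of the chain, such as epoch or era boundaries.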

Fostering collaboration

We drafted a document motivating and defining how Consensus (and possibly other core teams) will/should manage our package versions. This pull-request garnered many great discussions from our team members and other teams too: Sebastian Nagel, Arnaud Bailly, Michael Peyton-Jones, Ziyang Liu, et al. We want to thank you all for your input, and we found this discussion very enlightening!

We merged the pull request that adds an overview of consensus to the top level documentation of ouroboros-network. This overview describes the consensus components and adds a hyperlinked map to the modules documentation.

· 3 min read
Damian Nadales

High level summary

During the past two weeks, the consensus team merged improvements to the monadic cursor API that was needed to implement LMDB range reads, which is in turn required for the implementation of the UTxO HD feature. We added tables to several tests for the UTxO HD feature, which increases our confidence in the correctness of the prototype. The mempool property tests are close to being completed. Also, we finished the LSM tree tuning algorithm.

On the Genesis front we started simplifying the BlockFetch logic with CSJ-specific workloads in mind.

We are also documenting the Block Diffusion Pipelining feature, and added a high-level overview of consensus to the top level documentation of ouroboros-network.

Workstreams

UTxO HD prototype

We merged the implementation of a monadic cursor API (#1), which was needed to solve a bug with LMDB range reads. After this PR was merged, we focused on bridging the gap between the lmdb-simple interface and consensus by facilitating the use of lmdb-simple's cursor API without Serialise constraints (#3).

We refactored the backing store property tests to use quickcheck-lockstep (#4081).

We added tables to the mock ledger in the UTxO-HD feature branch (#4184). Every test that used to run with SimpleBlocks now uses tables. This will enable us to exercise the UTxO HD mempool integration by leveraging the existing mempool property-tests. The new state-machine property-tests are still needed for testing the parallel behaviour of the mempool.

Our work on the mempool state-machine tests revealed the need for improvements in the quickcheck-state-machine library. Parallel testing assumed that the state machine did not have access to mutable references. However, the mempool tests require the use of such mutable references for mocking the ledger interface. As a result, our parallel tests were failing with rather obscure messages. @Jasagredo submitted a pull request (#12) that allows for new mutable references to be created at each run of the state machine.
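The problem with mutable references that are shared across runs can be illustrated with a small Python sketch. This is not the quickcheck-state-machine API; the mock ledger and helper names below are hypothetical. The point it shows is that when every run of the same command sequence reuses one mutable mock, results depend on earlier runs, whereas creating a fresh reference per run keeps runs independent.

```python
from typing import Callable, List


class MockLedger:
    """Hypothetical mutable mock that records the transactions it was asked to apply."""

    def __init__(self) -> None:
        self.applied: List[str] = []

    def apply_tx(self, tx: str) -> int:
        self.applied.append(tx)
        # The response depends on the accumulated mutable state.
        return len(self.applied)


def run_commands(ledger: MockLedger, txs: List[str]) -> List[int]:
    """Execute one run of the command sequence and collect the responses."""
    return [ledger.apply_tx(tx) for tx in txs]


def run_many(make_ledger: Callable[[], MockLedger], txs: List[str], runs: int) -> List[List[int]]:
    """Repeat the same command sequence, asking make_ledger for the state of each run."""
    return [run_commands(make_ledger(), txs) for _ in range(runs)]


shared = MockLedger()
# The same reference is reused, so state leaks from one run into the next.
leaky = run_many(lambda: shared, ["tx1", "tx2"], runs=2)
# A new reference is created for every run, so each run starts from scratch.
fresh = run_many(MockLedger, ["tx1", "tx2"], runs=2)
```

With the shared reference the second run returns different responses than the first, which is the kind of discrepancy that surfaces as an obscure failure when a test framework replays commands.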

Backing store property tests

LSM tree implementation

We finished the LSM Tree tuning algorithm. We are currently tidying up the code and gathering results (i.e., plots and their interpretation).

CSJ prototype

We started simplifying the BlockFetch logic with CSJ-specific workloads in mind.

New VRF and KES crypto integration

We started working on supporting a new version of StandardCrypto, which uses compact KES and batched VRF (#4151).

Technical debt

We reviewed the existing state of the Block Diffusion Pipelining document. We are now working on the "Implementation" section (#4020).

Fostering collaboration

We cleared up our understanding of the error dynamics of forecasting (#4146 and #4174).

We submitted a pull request that adds an overview of consensus to the top level documentation of ouroboros-network (#4197). This overview describes the consensus components and adds a hyperlinked map to the modules documentation.

https://github.com/input-output-hk/ouroboros-network/pull/4197

· 6 min read
Damian Nadales

High-level summary

During the past two weeks, the consensus team started documenting the implementation of the UTxO HD feature and continued developing tests for it. As part of our work on UTxO HD, we improved the Haskell support for LMDB. We also spent time working on the LSM tree prototype, and designed a parameter tuning algorithm for it. Regarding our work on Genesis, our investigation of the "plateaus" pointed at the TICKF slowdown on era boundaries as the culprit. This led us to develop a caching strategy that will not only remove the aforementioned "plateaus", but can also help alleviate the growing block production delay on epoch switch. We also helped review the block forge credential hotswap feature, which is intended for use in the adoption of P2P.

We also worked on paying technical debt and fostering collaboration. In particular, we improved the io-sim framework, which is crucial for testing and simulating Cardano components. We also removed thunks that appeared on era translations, and improved our diffusion pipelining feature. We are working on a presentation for explaining Praos and Genesis.

High-level status report

  • Finish the UTxO HD prototype: in progress.
    • We added documentation for this feature.
    • We developed the second version of the mempool tests.
    • We fixed benchmarks that were inflating the speedup we observed in the anti-diff implementation of sequences of differences. Speedups are now in the range of [3.33, 4.75], which remain significant.
    • We continued improving Haskell LMDB support.
    • We finished implementing a "parameter tuning algorithm" for the LSM tree prototype. This enables us to run experiments to check the correctness of the algorithm.
  • Genesis: in progress.
    • Work investigating the "plateaus" in the ChainSync jumping prototype pointed to the TICKF slowdown on era boundaries as the culprit.
  • Tech debt:
    • We improved the capabilities of our io-sim library, which is crucial for testing and simulating Cardano components.
    • We removed thunks from era translations in the ledger.
    • We added Linux CI support for lmdb-simple.
    • We got pending diffusion pipelining improvements merged.
  • Fostering collaboration:
    • We are working on an explanation of the Praos and Genesis protocols.
  • Support:
    • Investigation of CSJ "plateaus" led us to develop a caching strategy for TICKF that will not only remove these "plateaus", but can also help alleviate the growing block production delay on epoch switch.
    • We reviewed the block forge credential hotswapping feature, which is intended for use in the adoption of P2P.

Workstreams

Finish the UTxO HD prototype

We merged PR #4060, which adds a report documenting the UTxO HD feature, with an emphasis on explaining how the mempool works in combination with UTxO HD.

We opened a draft PR with the second iteration of the property tests for the mempool (#4076).

We fixed the Arbitrary instances for keys and values in DiffSeq benchmarks (#4143). The problem was that we were testing with mostly small values, which artificially boosted the performance gains we saw on benchmarks. Speedups are now in the range of [3.33, 4.75] across the different configurations.

Backing store property tests

We focused on incorporating feedback on the monadic cursor API PR (#1). This required us to make small tweaks to quickcheck-lockstep to test the new API. We also updated the backing store property tests to use the new version of the monadic cursor API.

LSM tree implementation

We worked on the LSM tree prototype. In particular, we finished implementing a "parameter tuning algorithm" that adapts the LSM tree design based on factors like:

  • workload,
  • machine specs,
  • and characteristics of the data being stored.

We are now running experiments to gather results and cross-reference them with existing experimental results from the LSM tree paper to see if the algorithm is working correctly.

Benchmarking the CSJ prototype

We focused on investigating the "plateaus" in the ChainSync tip, which turned out to be due to the TICKF bug, which we were previously only aware of in the context of the long forging times near epoch boundaries. With the most drastic patch by @nfrisby to speed up TICKF, full sync speeds up by 7%.

The following plot shows that, by caching the TICKF rule, the ChainSync tip and the VolatileDB tip progress at the same rate.

The plot below shows the speedup observed by caching the TICKF rule with respect to the baseline.

Technical debt

After addressing the PR comments, we merged PR #16, which implements the MonadCatch instance for STM. This extends the capability of our io-sim library, which is crucial for testing and simulating Cardano components. PR #16 closed #1461. This new feature was published as version 0.4.0.0 of io-sim.

We continued our work on fixing the NoThunk errors, which is required for enabling nightly tests, with the help of TVarInvariant checks in the strict-stm and nothunks libraries. We proposed fixes in cardano-ledger that took care of thunks that appeared in era translations (#3143). The fixes will be integrated back into consensus when cardano-ledger approves and publishes the changes introduced in #3143.

We added CI support for lmdb-simple (#2). We currently test the build on a Linux environment only.

We got pending diffusion pipelining PRs (#3857, #3860, #3856) merged, after rebasing and addressing feedback.

Fostering collaboration

@nfrisby finished a visualisation tool and outlined scripts for the Praos and Genesis explanation presentations. The idea is to produce a video that gives an overview of these protocols.

Support

We started working on caching the computation of the TICKF rule (#4054), since this was blocking our benchmarking work for Genesis. In addition, this issue has the Cardano community quite concerned, so we are hoping the work done in caching the computation of the TICKF rule can help alleviate the growing block production delay on epoch switch.
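The general idea of caching an expensive computation so that it is not repeated can be sketched in Python as below. This does not model the TICKF rule or the actual strategy in #4054; the workload, the choice of the epoch number as cache key, and all names are assumptions made purely for illustration.

```python
from functools import lru_cache

# Counts how often the expensive computation actually runs.
calls = {"count": 0}


def expensive_tick(epoch: int) -> int:
    """Hypothetical stand-in for a costly computation that only depends on the epoch."""
    calls["count"] += 1
    return sum(i * i for i in range(1000)) + epoch


@lru_cache(maxsize=8)
def cached_tick(epoch: int) -> int:
    """Memoized wrapper: repeated requests for the same epoch reuse the stored result."""
    return expensive_tick(epoch)


# Several requests hit the same epoch, as would happen when many
# headers or blocks from one epoch are processed in a row.
results = [cached_tick(epoch) for epoch in [5, 5, 5, 6, 6]]
```

Here five requests trigger only two real computations, one per distinct epoch. Whether such a cache is sound depends on the result truly being a function of the key alone, which is the part that needs care in a real ledger.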

We reviewed the block forge credential hotswapping PR (#3800) from the networking team, which is intended for use in the adoption of P2P.

· 4 min read
Damian Nadales

High-level summary

During the past two weeks, the consensus team continued its work on testing the UTxO HD prototype. We completed the era-transition and backing store tests, and the mempool tests are advancing at a steady pace. Regarding our work in the Genesis design, we continued our collaboration with the research and networking teams, and we continue investigating strategies for making the chain-sync jumping prototype faster.

High-level status report

  • Finish the UTxO HD prototype: on track.
    • We worked on state-machine tests for the mempool, and spotted potential bugs in the implementation. Investigation is ongoing.
    • We have a set of property tests for the backing store. We still need to incorporate the improvements to the LMDB cursor API that these tests made possible.
    • We merged the era-transition tests PR.
  • Genesis: on track.
    • Design work around Genesis continues in collaboration with researchers and the networking team.
    • We continued trying to improve the performance of the chain-sync jumping prototype. We gained additional insight on which parameters to tweak next. In spite of the baseline still being faster, the current prototype already achieves a significant speedup when compared to the naive approach of simply running full chain-sync with all peers.
  • Tech debt: on track.
    • We clarified a common source of confusion around VRF tie-breaking and cross-era chain selection.

Workstreams

Finish the UTxO HD prototype

We continued working on property-tests for the UTxO HD prototype. In particular we merged the era-transition tests PR.

Backing store property tests

The backing store property tests PR has been reviewed. The next steps are:

  • Improve error handling and command generation.
  • Add coverage testing to check that we are not failing to cover interesting test cases.

The monadic cursor API went through its first review round. The API is in a relatively stable state. This PR also unifies the cborg and serialise-based interfaces to LMDB operations. The next steps are:

  • Write quickcheck-dynamic state-machine tests for this API.
  • Adapt the changes in the serialisation interface in the backing store property tests. This will involve adding boilerplate code in consensus to make up for the removal of the cborg-based interface.

LSM tree implementation

We worked on the LSM tree prototype. In particular, we focused on tuning the LSM tree design to the different workloads that consensus has (e.g., syncing, normal node operation, etc.).

Benchmarking the CSJ prototype

Work on improving the chain-sync jumping performance is ongoing. We compared the performance of different jump intervals, which, somewhat surprisingly, do not make a significant difference. In particular, we are seeing periodic "plateaus" where the chain-sync tip does not progress, but they are much longer for the prototype. Our hypothesis is that this is due to a combination of garbage collector (GC) pauses and the actual time it takes the non-dynamo chain-sync peers to jump to the tip of the slot of the dynamo fragment.

In the coming weeks we will try to shorten these plateaus via a combination of tweaking GC options and less synchronisation in the CSJ governor.

The following plot shows the performance of the chain-sync jumping prototype using different jumping intervals. It compares the syncing progress by plotting the slots of adopted blocks against time. The baseline is still faster, however it is worth noting that the current prototype already achieves a significant speedup when compared to the naive approach of simply running full chain-sync with all peers.

The second plot shows the syncing progress sliced to a chosen ~5min interval, and includes, in addition to the slots of adopted blocks, the slots of the tip of the ChainSync fragment. This allows us to see how far ahead of the selected tip the CS dynamo is, i.e. how much room we have for BlockFetch not to get stalled. It shows periodic behaviour (due to the forecasting limit), and shows that the CS fragment tip is not progressing for significant periods ("plateaus").
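Plateaus like the ones visible in the plot can also be located programmatically from the logged tip positions. The following Python sketch assumes a hypothetical log format, a list of `(time, tip_slot)` samples sorted by time, and reports the intervals during which the tip slot did not advance for at least a given duration. It is not part of the benchmarking tooling described here.

```python
from typing import List, Tuple


def find_plateaus(
    samples: List[Tuple[float, int]], min_duration: float
) -> List[Tuple[float, float]]:
    """Return (start, end) intervals where the tip slot stayed put for >= min_duration.

    The interval ends at the last sample at which the tip was still stuck.
    """
    plateaus: List[Tuple[float, float]] = []
    if not samples:
        return plateaus
    start_time, current_slot = samples[0]
    last_time = start_time
    for t, slot in samples[1:]:
        if slot > current_slot:
            # The tip advanced: close the stretch we were tracking.
            if last_time - start_time >= min_duration:
                plateaus.append((start_time, last_time))
            start_time, current_slot = t, slot
        last_time = t
    # Close the final stretch, which may still be a plateau.
    if last_time - start_time >= min_duration:
        plateaus.append((start_time, last_time))
    return plateaus


# Hypothetical samples: the tip is stuck at slot 100, then again at slot 300.
example = [(0.0, 100), (1.0, 100), (2.0, 100), (3.0, 200), (4.0, 300), (5.0, 300), (9.0, 300)]
short_and_long = find_plateaus(example, min_duration=2.0)
only_long = find_plateaus(example, min_duration=3.0)
```

Comparing the plateau lengths found for the prototype with those of the baseline gives a number to track while tweaking GC options or the CSJ governor.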

Technical debt

We clarified a common source of confusion around VRF tie-breaking and cross-era chain selection. This PR involved correcting potentially misleading names of VRF-related functions, and providing context for why a particular VRF value is used for tie-breaking.

· 4 min read
Damian Nadales

High-level summary

During the past two weeks, the consensus team worked on adding property tests for different aspects of the UTxO HD prototype: era transitions, mempool, and backing store. Thanks to these tests we were able to uncover a bug in the prototype. On the Genesis front, we benchmarked a different version of the ChainSync jumping prototype to try to improve its performance, but this did not result in any noticeable speedup.

High-level status report

  • Finish the UTxO HD prototype: on track.
    • We focused on increasing test coverage for the UTxO-HD prototype:
      • We started implementing Cardano era transition property tests.
      • We started implementing state-machine property-tests for the mempool.
      • We merged the mempool rewrite.
      • We started working on state-machine tests for the backing store. This uncovered a bug in the range-read implementation of the LMDB backing store.
  • Genesis: on track.
    • We benchmarked a version of the Genesis ChainSync Jumping prototype that spreads out the ChainSync updates over a longer period of time. This did not result in any noticeable speedup.
    • We investigated the overhead introduced by non-ChainSync components, but no conclusions could be drawn from the benchmarks we ran.

Workstreams

Finish the UTxO HD prototype

We focused on increasing test coverage for the UTxO HD prototype. We also merged the mempool rewrite.

Era transition property tests

We started implementing Cardano era transition property tests, which are needed for making sure that the ledger tables get updated in the right way when we move from one era to the next. At the moment there are two important transitions:

  • Byron to Shelley: where all the UTxO is transferred from in-memory Byron state (which has no tables) to the ledger tables of the Shelley state.
  • Shelley to Allegra: where the AVVM addresses must be deleted.

We have tests for the Byron to Shelley transitions. We are working on adding the remaining ones.
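As an illustration of what a Byron to Shelley transition property checks, here is a simplified Python sketch. The state types and the translation function are hypothetical models, far simpler than the real ledger states: the property only says that, after translation, every UTxO entry lives in the tables and nothing is left in memory.

```python
import random
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ByronState:
    # In-memory UTxO (transaction input -> amount); Byron has no tables.
    utxo: Dict[str, int]


@dataclass
class ShelleyState:
    in_memory_utxo: Dict[str, int] = field(default_factory=dict)
    tables: Dict[str, int] = field(default_factory=dict)


def translate_byron_to_shelley(byron: ByronState) -> ShelleyState:
    """Hypothetical translation: move the whole UTxO into the ledger tables."""
    return ShelleyState(in_memory_utxo={}, tables=dict(byron.utxo))


def prop_utxo_moves_to_tables(byron: ByronState) -> bool:
    """Property: all entries end up in the tables and none remain in memory."""
    shelley = translate_byron_to_shelley(byron)
    return shelley.tables == byron.utxo and not shelley.in_memory_utxo


def random_byron_state(rng: random.Random) -> ByronState:
    """Generate a small random Byron state to feed the property."""
    size = rng.randrange(10)
    return ByronState(utxo={f"txin{i}": rng.randrange(1, 1000) for i in range(size)})


rng = random.Random(42)
all_pass = all(prop_utxo_moves_to_tables(random_byron_state(rng)) for _ in range(100))
```

A property for the Shelley to Allegra transition would follow the same shape, with the expected tables being the previous ones minus the AVVM entries.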

Mempool state-machine tests

We started implementing state-machine property tests for the mempool. The mempool is currently tested via pure property tests, which use a ledger state without tables. With the introduction of UTxO HD, testing the concurrent behavior of the mempool became of crucial importance (e.g., now we have to acquire locks to flush the backing store). In addition, we need to test a ledger state with tables. These needs led to the creation of a new set of property tests. In particular, we aim to run parallel state-machine tests that exercise the mempool in a way similar to how the node would make use of it.

Backing store property tests

We started working on state-machine tests for the backing store that UTxO HD uses. The property tests uncovered errors in the range-reads implementation of the LMDB backing store. To facilitate fixing this bug, we made changes to the Haskell LMDB bindings.
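The way a model-based test can expose range-read errors is sketched below in Python. The store is a toy in-memory stand-in, not the LMDB backing store, and the range-read semantics used here (return up to `count` entries with keys strictly greater than a given key) are an assumption made for the example. Random commands are run against both the store and a plain dictionary model, and every range read is compared.

```python
import bisect
import random
from typing import Dict, List, Optional, Tuple


class SortedStore:
    """Toy key-value store that keeps its keys sorted (hypothetical stand-in)."""

    def __init__(self) -> None:
        self._keys: List[int] = []
        self._vals: Dict[int, str] = {}

    def put(self, key: int, value: str) -> None:
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def range_read(self, after: Optional[int], count: int) -> List[Tuple[int, str]]:
        """Return up to count entries with keys strictly greater than after."""
        start = 0 if after is None else bisect.bisect_right(self._keys, after)
        return [(k, self._vals[k]) for k in self._keys[start:start + count]]


def model_range_read(
    model: Dict[int, str], after: Optional[int], count: int
) -> List[Tuple[int, str]]:
    """Reference semantics, written as simply as possible."""
    keys = sorted(k for k in model if after is None or k > after)
    return [(k, model[k]) for k in keys[:count]]


def run_lockstep(seed: int, steps: int) -> bool:
    """Run random commands against store and model; fail on the first mismatch."""
    rng = random.Random(seed)
    store, model = SortedStore(), {}
    for _ in range(steps):
        if rng.random() < 0.6:
            key, value = rng.randrange(20), f"v{rng.randrange(100)}"
            store.put(key, value)
            model[key] = value
        else:
            after = rng.choice([None] + list(range(20)))
            count = rng.randrange(5)
            if store.range_read(after, count) != model_range_read(model, after, count):
                return False
    return True


all_agree = all(run_lockstep(seed, 200) for seed in range(20))
```

An off-by-one in the starting position of the read, for example using an inclusive instead of an exclusive lower bound, would make the two sides disagree after only a few commands.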

Benchmarking the CSJ prototype

Prompted by previous benchmarks showing significant improvements in sync time by using more capabilities, we implemented a way to spread out the ChainSync updates over a larger period instead of firing them all at the same time. This didn't result in a noticeable speedup.
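The difference between firing all updates at once and spreading them out can be expressed as a schedule of per-peer offsets within one jump interval. The Python sketch below is only an illustration with hypothetical function names; it does not reflect how the prototype schedules its jumps internally.

```python
from typing import List


def clumped_offsets(num_peers: int) -> List[float]:
    """All peers perform their update at the start of the interval."""
    return [0.0] * num_peers


def spread_offsets(num_peers: int, interval: float) -> List[float]:
    """Evenly distribute the peers' update times across one interval."""
    return [i * interval / num_peers for i in range(num_peers)]


# Hypothetical example: 4 peers and an interval of 100 time units.
clumped = clumped_offsets(4)
spread = spread_offsets(4, 100.0)
```

With the spread schedule, at most one peer is active at any offset, so the work is distributed over the whole interval instead of arriving in a burst.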

We also benchmarked the prototype with CSJ disabled (such that just the dynamo peer is running ChainSync, but e.g. BlockFetch still sees all peers) to rule out/confirm overhead by non-ChainSync (mainly BlockFetch) related components. This results in era-specific behavior (speed is like the prototype in Byron, but like the baseline in Shelley). This deserves a closer look in the future.

This diagram shows the respective syncing progress, starting at Genesis and continuing a good part into Shelley (with the dashed line indicating the Byron-to-Shelley transition).

  • Red: baseline
  • Green: CSJ prototype, 10 peers, jumps every 3000/f slots, jumps in clumps.
  • Blue: like Green, jumps are spread out.
  • Orange: variant with no jumping, to measure unrelated overhead.