commits
It turns out that resetting the zstd encoder takes a significant amount of time (about as much as compressing the average compressible KV value). When using the nicer EncodeAll function, this reset must happen synchronously. By not using this helper function and switching to a pool of encoders, the resets can happen asynchronously without blocking the writes.
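The pattern above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `Encoder`, `EncoderPool`, and the cost model are stand-ins for the real zstd encoder, whose `Reset` is the expensive call.

```go
// Sketch of the async-reset pattern, assuming a hypothetical Encoder
// whose Reset is expensive (as observed with the real zstd encoder).
package main

import "fmt"

type Encoder struct{ dirty bool }

func (e *Encoder) Encode(src []byte) []byte {
	e.dirty = true
	out := make([]byte, len(src))
	copy(out, src) // stand-in for real compression
	return out
}

func (e *Encoder) Reset() { e.dirty = false } // expensive in the real encoder

type EncoderPool struct{ ch chan *Encoder }

func NewEncoderPool(n int) *EncoderPool {
	p := &EncoderPool{ch: make(chan *Encoder, n)}
	for i := 0; i < n; i++ {
		p.ch <- &Encoder{}
	}
	return p
}

func (p *EncoderPool) Get() *Encoder { return <-p.ch }

// Put hands the reset off to a goroutine, so the writer is not
// blocked while the (slow) Reset runs.
func (p *EncoderPool) Put(e *Encoder) {
	go func() {
		e.Reset()
		p.ch <- e
	}()
}

func main() {
	pool := NewEncoderPool(4)
	enc := pool.Get()
	compressed := enc.Encode([]byte("hello"))
	pool.Put(enc) // returns immediately; Reset happens in the background
	fmt.Println(len(compressed))
}
```

The channel-backed pool means a subsequent `Get` simply blocks until some encoder has finished resetting, rather than paying the reset cost inline on every write.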
Most importantly IMO, it has a pool of sha256 summers (it's a mystery how this wasn't there from the start, considering how much a merkle tree relies on hashing...)
From what I've observed so far, most of the queries we make end up not using the "fast nodes" (because we mainly read "committed" data), meaning they mainly exist to take up space and to greatly slow down snapshot import. This might be reverted if further evidence proves this was a bad idea.
- Force recreate filter after snapshot application, to avoid inconsistent state as cometbft transitions to block sync / consensus
- Don't save 0xffffffffffffffff bloom filter seq when there are no operations
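The sentinel guard in the second bullet can be sketched like this, with hypothetical names (`noSeq`, `saveBloomSeq`, the `persist` callback are illustrative): when no operations have been applied yet, the "latest seq" is an all-ones value that must not be persisted as the bloom filter's high-water mark.

```go
package main

import (
	"fmt"
	"math"
)

// noSeq is the all-ones sentinel returned when no operations exist.
const noSeq = math.MaxUint64 // 0xffffffffffffffff

// saveBloomSeq persists seq unless it is the "nothing applied yet"
// sentinel, and reports whether anything was written.
func saveBloomSeq(persist func(uint64), seq uint64) bool {
	if seq == noSeq {
		return false // empty store; writing the sentinel would corrupt state
	}
	persist(seq)
	return true
}

func main() {
	var saved []uint64
	persist := func(s uint64) { saved = append(saved, s) }
	fmt.Println(saveBloomSeq(persist, noSeq)) // false
	fmt.Println(saveBloomSeq(persist, 42))    // true
	fmt.Println(saved)                        // [42]
}
```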
Not sure what I was thinking when I moved the writes to a goroutine. I suppose I assumed the pipe would be read at roughly the same rate it would be written, which is obviously not true.
- Delete files as chunks are applied, behind cometbft's back
- Use unsafe reflection to "fix" cometbft bug where temp_dir setting was ignored
Ideally both of these would result in PRs to cometbft (one for a feature to let the ABCI application specify snapshots to delete, another to fix the temp_dir bug) but ain't nobody got time for that
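For reference, the reflect+unsafe trick used to overwrite an ignored unexported setting looks roughly like this. The `config`/`tempDir` names here are a stand-in, not cometbft's actual struct layout:

```go
// Sketch: writing an unexported struct field via reflect + unsafe,
// the kind of workaround used to force the ignored temp_dir setting.
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

type config struct {
	tempDir string // unexported: plain reflection refuses to Set it
}

func setUnexportedString(target any, field, value string) {
	v := reflect.ValueOf(target).Elem().FieldByName(field)
	// NewAt over the field's address yields a Value without the
	// read-only flag, so SetString is allowed.
	reflect.NewAt(v.Type(), unsafe.Pointer(v.UnsafeAddr())).Elem().SetString(value)
}

func main() {
	c := &config{tempDir: "/tmp/default"}
	setUnexportedString(c, "tempDir", "/data/tmp")
	fmt.Println(c.tempDir) // /data/tmp
}
```

This obviously breaks the moment upstream renames the field, which is exactly why a proper fix belongs in cometbft itself.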
- Will keep polling for, and importing, new operations now after reaching "the end"
- Tentative websocket client support (not tested, and doesn't actually reduce usage of HTTP /export when new operations are detected)
- The way we're forcing CometBFT to produce a new block is a major hack but it works. This hack will be redundant if we later decide to let create_empty_blocks be true.
- Fix major data corruption by removing Next skip from iterator adapter (a hack that was added for leveldb)
- Serialize and restore DID bloom filter
- Move stuff around, clean up test helpers
And all it took was disabling prefetching on the iterator (the bloom filter then applied an additional 2x speedup)
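The Next-skip corruption mentioned above is easy to reproduce in miniature. The sketch below assumes an iterator that is already positioned on its first entry when opened (goleveldb's, by contrast, starts before the first entry and needs an initial Next, which is presumably why the hack existed):

```go
package main

import "fmt"

// sliceIter is a stand-in iterator that is valid and positioned on
// its first key as soon as it is created.
type sliceIter struct {
	keys []string
	pos  int
}

func (it *sliceIter) Valid() bool { return it.pos < len(it.keys) }
func (it *sliceIter) Key() string { return it.keys[it.pos] }
func (it *sliceIter) Next()       { it.pos++ }

// collect drains the iterator; skipFirst reproduces the removed hack.
func collect(it *sliceIter, skipFirst bool) []string {
	if skipFirst {
		it.Next() // wrong for a pre-positioned iterator: drops a key
	}
	var out []string
	for ; it.Valid(); it.Next() {
		out = append(out, it.Key())
	}
	return out
}

func main() {
	fmt.Println(collect(&sliceIter{keys: []string{"a", "b", "c"}}, true))  // [b c] — "a" silently lost
	fmt.Println(collect(&sliceIter{keys: []string{"a", "b", "c"}}, false)) // [a b c]
}
```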
The index sadly can't be rebuilt from just the tree nodes, mainly because of the validFrom-validTo logic. If we ever confirm that historical queries (at a specific tree height) are not worth supporting, we can revisit this.
leveldb causes too much write amplification on SSDs to the point of being ridiculous, especially when paired with iavl's merkle tree updates.
badger uses _much_ more memory but if we separate the operation values (not the tree inner nodes) into its value log, it seems to write significantly fewer GBs to the disk and compactions are either less frequent or much less noticeable (i.e. they don't stall the normal operation of the application as much).
(writing to the iavl tree still causes many more GBs to be written to disk than the data that's actually being inserted into the tree, but such appears to be the cost of using merkle trees)
- Badger turned out to stall too much on compactions and ended up seemingly having a disk space disadvantage over goleveldb, even when both were set to snappy compression (which goleveldb uses by default)
- Keeping the DID->operations pointers in the IAVL tree meant writing keys in unordered fashion, which caused constant rebalancing and extra rehashing within the merkleized tree. Writes became impossibly slow as the tree grew, and the performance of the underlying KV store suffered no matter its implementation, since more reads and writes were needed to reach the same outcome. (Everything the IAVL tree was doing was still certainly O(log n), but you definitely don't want to hit the worst cases all the time.)
- Begin cleaning up the TreeVersions/Adapters/Heights mess by introducing unified transaction interfaces that manage the commits/rollbacks within the tree and the index KV store in tandem
This should allow us to once again keep the entire tree history (the diffs between tree versions should be quite small, especially now that almost no keys are ever rewritten)
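The unified transaction idea could look roughly like the sketch below. All names here are hypothetical; the point is that one object owns both sides, so a commit or rollback always applies to the tree version and the index batch as a pair.

```go
package main

import "fmt"

// Txn stages writes against both stores and settles them together.
type Txn interface {
	Set(key, value []byte) error
	Commit() error
	Rollback()
}

// pairTxn is a toy implementation counting staged writes; the real
// Commit would be tree.SaveVersion plus the index batch write.
type pairTxn struct {
	treeOps, indexOps int
	committed         bool
}

func (t *pairTxn) Set(key, value []byte) error {
	t.treeOps++  // staged write into the IAVL tree
	t.indexOps++ // matching staged write into the index KV store
	return nil
}

func (t *pairTxn) Commit() error {
	t.committed = true
	return nil
}

// Rollback discards both sides, so the stores can never diverge.
func (t *pairTxn) Rollback() { t.treeOps, t.indexOps = 0, 0 }

func main() {
	var txn Txn = &pairTxn{}
	_ = txn.Set([]byte("did:..."), []byte("op"))
	fmt.Println(txn.Commit() == nil) // true
}
```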
- Avoid fetching next seq from tree on every op store
- Misc cleanups
- Eager async fetch of entries to import
- Drop old tree history with async iavl tree pruning (improves the situation, but it's not a permanent solution; more investigation required)
- Skip executing import on proposal preparation
- Effectively enforce timeouts through proper context propagation and checking
- Iterate mutable tree with correct iterator when obtaining latest sequence value
- Avoid repeated lock acquisition when importing multiple operations