readme.md at main · microcosm.blue/star

microcosm.blue / star
fork atom
Streaming Tree ARchive format
fork atom
star / readme.md
at main 182 lines 11 kB view raw view rendered
wrap content
bad-example.com this probably can work 7w ago
2f22ac34
  1# STAR: Streaming Tree ARchive format
  2
  3_status: just thinking about it_
  4
  5STAR is an archival format for Merkle Search Trees (MSTs) as implemented in atproto. It offers efficient key-ordered streaming of repository contents and reduced archive size compared to CARs.
  6
  7- convertible to/from CAR (lossless except any out-of-tree blocks from a CAR)
  8- extra garbage strictly not allowed (unlike CAR)
  9- canonical (unlike CAR)
 10- strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR)
 11- the header simply is the commit, followed by the serialized tree.
 12- layer 0 MST nodes leave record links implicit, and content blocks in the STAR are not prefixed by their content hash: linked blocks follow in a deterministic order and recompute CID from on their contents.
 13- small spec cleanups, eg: there are no exceptions to the no-empty-MST-nodes rule: an entirely-empty MST is serialized as `data: null` in the commit header.
 14
 15the two primary motivations are
 16
 171. bounded-resource streaming readers.
 18
 19   atproto MSTs in CARs *have* to buffer and retain all record blocks, and typically buffer most MST node blocks, just to traverse the tree. even if a CAR appears to be in stream-friendly block ordering, you can only safely discard record blocks if you *know for sure* it's actually stream-friendly.
 20
 21   you also cannot reliably identify MST node blocks and record blocks in an atproto CAR without walkign the tree, so you cannot discard *any* potentially garbage blocks from the buffered data before walking. A malicious PDS can serve a cheap-to-generate endless CAR stream of garbage blocks, and you just have to keep buffering them.
 22
 23   since STAR is strictly stream-ordered, there is no node/block ambiguity, and extra garbage is not allowed. CIDs commit the contents of subtrees and records, and since reading is the same as walking the tree, it *might* be possible to reject some kinds of malicious block-generation attacks early. (haven't thought this through)
 24
 252. reduced archive size.
 26
 27   CIDs are large, compression-unfriendly, and redundant if you have access to the CID's actual content.
 28
 29   for example, my atproto repo is around 5.0MB and contains 14,673 blocks with a CID prefix plus 14,675 CID links in its MST. Each CID is 32 bytes, so `(14,673 + 14,675) * 32 = 0.9MB` just for the CIDS, almost 20%.
 30
 31   from a few more samples of various sizes from real atproto repos:
 32
 33   ```
 34     CIDs     CAR   potential savings
 35   0.53KB / 3.4KB = 16%
 36   23.2KB / 279KB = 8%
 37    0.9MB / 5.0MB = 18%
 38   25.9MB / 128MB = 20%
 39   94.8MB / 449MB = 21%
 40   ```
 41
 42   These calculations don't include the 4-bytes-per-CID prefix size, since that overhead will already typically be eliminated by compression.
 43
 44   STARs don't include content hashes before content blocks, reducing the number of CIDs immediately by half. They omit record link CIDs from layer-0 MST nodes as well, for an **overall reduction of CIDs by 80%**.
 45
 46   Note that all repository content in a STAR is still cryptographically bound to the signed root commit's CID: it's just a little more work to prove it.
 47
 48
 49### scope
 50
 51STAR is specialized for atproto MST storage, and best-suited for serializing complete trees.
 52
 53- It should work for "CAR slices" -- contiguous narrowed partial trees that may omit content before and/or after a specific key. (CIDS referencing missing nodes at the boundaries cannot be eliminated)
 54
 55  - It's desireable to be able to archive complete *subtrees*, so enforcing a well-formed atproto commit as the header might not be sufficient on its own. (subtrees could be stored as CAR slices so this may be unnecessary)
 56
 57- It *might* be interesting to allow arbitrary sparse trees. Not sure yet.
 58
 59- It's not suitable for firehose commit CARs, which need to include blocks that aren't in a strict single MST.
 60
 61STAR format does not aim to provide efficient access to random nodes or through other tree iteration patterns. Almost any kind of inspection requires a linear scan through the archive (especially if global key compression happens).
 62
 63
 64## format
 65
 66```
 67|--------- header ---------| |---------------- optional tree ----------------|
 68[ '*' | ver | len | commit ] [ node ] [ node OR record ] [ node OR record ] …
 69```
 70
 71### header
 72
 73- `*` (fixed u8): The first byte of the header is always `*` (hex `0x2A`).
 74
 75- `ver` (varint): Next is an [`LEB128` `varint`](https://en.wikipedia.org/wiki/LEB128) specifying the `version`, with a fixed value of `1` for the current format.
 76
 77- `len` (varint): The length of the proceeding atproto commit object in bytes.
 78
 79- `commit` (DAG-CBOR): An atproto commit object in `DAG-CBOR` derived from the [repo spec](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-objects):
 80
 81    - `did` (string, required): same as repo spec. may become optional for subtree archives, but it's nice to be able to inspect, for now.
 82    - `version` (integer, required): corresponding CAR repo format version, currently fixed value of `3`
 83    - `data` (hash link, **optional**): CID of the first (root) node in the MST. an empty tree is represented by the absence of this key.
 84    - `rev` (string, required): same as repo spec (may become optional)
 85    - `prev` (hash link, **optional**): same as repo spec, but optional instead of nullable. only included for lossless CAR round-tripping.
 86    - `sig` (byte array, **optional**): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the `sig` property is allowed to be omitted.
 87
 88#### verifying a commit
 89
 90The `commit` object can be converted to a repo-spec compliant commit:
 91
 92 - if `data` is absent, replace it with the CID of an empty repo-spec style MST (`bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`)
 93 - follow steps from the repo spec to resolve the identity and verify the signature.
 94
 95When `sig` is absent (typically for archived MST sub-trees), this STAR _cannot_ be converted to a repo-spec compliant CAR. however, if it's a subtree, it can be stitched back into a full MST that can be converted back to a compliant CAR, provided the complimentary sparse STAR containing it.
 96
 97
 98### optional tree
 99
100There are two kinds of blocks in the tree: `node` blocks and `record` blocks. If the tree is present, the first block must always be a `node` (the MST root node), and blocks follow in depth-first tree traversal order. Blocks may arbitrarily be omitted (for STAR slices, sparse trees, and `key -> CID`-only archives). `node`s have a flag for each link to indicate its upcoming presence (or absence) in the archive.
101
102
103#### `node`
104
105```
106|----- node -----|
107[ len | mst node ]
108```
109
110- `len` (varint): the length of the proceeding CBOR block, in bytes.
111
112- `mst node` (DAG-CBOR): object with the following schema
113  - `l` (hash link, optional): reference to a subtree at a lower depth containing only keys to the left of this node. if absent, there is no left subtree.
114  - `L` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `l` is not present.
115  - `e` (array, required): ordered array of entry objects with length of at least one, each containing:
116    - `p` (integer, required): number of bytes shared with the previous entry (TODO key compression actually)
117    - `k` (byte string, required): key suffix remaining
118    - `v` (hash link, optional): reference to the record data for this key.
119      - for MST nodes at depth=0: 
120        - `v` must be omitted when the record is included in the archive
121        - `v` mut not be omitted if the record is not included
122      - for MST nodes at depth>0:
123        - `v` is required (`V` signifies if it's in the archive)
124    - `V` (bool, optional): "archived": if `true`, the record is contained in this archive. must not be present when `v` is not present.
125    - `t` (hash link, optional): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. if absent, there is no subtree.
126    - `T` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `t` is not present.
127
128for now see the atproto repo spec for key compression (`p` and `k`)
129
130#### `record`
131
132An atproto record. Its CID can be computed over its bytes of its `block` (see below).
133
134  ```
135  |--- record --|
136  [ len | block ]
137  ```
138
139  - `len` (varint): the length of the proceeding binary record block in bytes.
140
141  - `block` (bytes): the raw bytes of the (DAG-CBOR) record.
142
143
144### verifying the whole tree
145
146Each MST node, including the root, must be verified to match its expected CID. To compute the CID of MST nodes:
147
148For nodes at depth>0 (all child CID links are included): Convert the MST node into repo-spec format, compute its CID as the sha256 hash of its DAG-CBOR serialization.
149
150For nodes at depth=0 (record CID links excluded unless omitted from archive): read all included linked records from the node into a buffer and compute their CIDs (they will immediately follow in the STAR since a depth=0 node cannot have any other children). With the record CIDs available, the MST node can be converted into repo-spec format, and its CID calculated as with depth>0 nodes.
151
152To compute the CID of included records:
153
154The required bytes for the CID calculation are the exact included record bytes (sha256 over them).
155
156
157## open questions
158
159### how far to go with implicit CIDs?
160
161there is a trade-off between going fully implicit on CIDs (possible as long as the content is present to compute the CID) vs fully explicit in MST node links (CIDS still omitted before the content blocks themselves for 50% fewer vs CAR).
162
163the problem with omitting *all* possible CIDS is that you then cannot verify the root node until you finish walking the entire MST. a consumer might have already written data somewhere only to find out they need to undo it all!
164
165as a compromise we're only omitting CIDs from MST nodes for **layer 0 record links**:
166
167- 75% reduction in record CIDs written to the STAR; 60% overall CID reduction including subtree links
168- MST nodes contain four records on average; a verifying streamer can buffer this small number of records before emitting, so it never omits content that later is found to be unverifiable.
169- no special casing required for MST subtree links, their CIDs are always included
170
171this is probably the right balance: considering the 50% initial reduction compared CARs by dropping the hash prefix in front of blocks, the all-in CID reduction is **80%**.
172
173but we could take it one step higher: have layer1 nodes do implicit CIDs for subtrees and records:
174
175- 89% overall reduction in CIDs in the STAR
176- a verifying streamer needs to buffer 16 records on average. an attacker gets a 5x space amplification benefit if trying to generate extra-wide bufferable bottom-level tree nodes.
177- layer-dependent special-casing required for CID links as well as record links, across two MST layers
178
179with the 50% initial reduction, this would be **95% total CID reduction**
180
181CIDs make up around 20% of uncompressed CAR file sizes. the first approach gets that down to 4%; second 1%. however, CIDs are uncompressible, so it's probably worth measuring the real effect of both approaches on large repos post-compression before completely committing one way or another.
182