STAR: Streaming Tree ARchive format#

status: just thinking about it

STAR is an archival format for Merkle Search Trees (MSTs) as implemented in atproto. It offers efficient key-ordered streaming of repository contents and reduced archive size compared to CARs.

convertible to/from CAR (lossless except any out-of-tree blocks from a CAR)
extra garbage strictly not allowed (unlike CAR)
canonical (unlike CAR)
strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR)
the header simply is the commit, followed by the serialized tree.
CID link are implicit for blocks included in the STAR: linked blocks follow in a deterministic order and recompute CID from on their contents.
fewer edge cases: empty MST nodes strictly disallowed: commit omits data for an empty tree

the two primary motivations are

bounded-resource streaming readers.

atproto MSTs in CARs have to buffer and retain all record blocks, and typically buffer most MST node blocks, just to traverse the tree. even if a CAR appears to be in stream-friendly block ordering, you can only safely discard record blocks if you know for sure it's actually stream-friendly.

you also cannot reliably identify MST node blocks and record blocks in an atproto CAR without walkign the tree, so you cannot discard any potentially garbage blocks from the buffered data before walking. A malicious PDS can serve a cheap-to-generate endless CAR stream of garbage blocks, and you just have to keep buffering them.

since STAR is strictly stream-ordered, there is no node/block ambiguity, and extra garbage is not allowed. CIDs commit the contents of subtrees and records, and since reading is the same as walking the tree, it might be possible to reject some kinds of malicious block-generation attacks early. (haven't thought this through)
reduced archive size.

CIDs are large, compression-unfriendly, and redundant if you are including the CID's actual content.

for example, my atproto repo is around 5.0MB and contains 14,673 blocks with a CID prefix plus 14,675 CID links in its MST. Each CID is 32 bytes, so (14,673 + 14,675) * 32 = 0.9MB just for the CIDS, almost 20%.

from a few more samples of various sizes from real atproto repos:
```
  CIDs     CAR   potential savings
0.53KB / 3.4KB = 16%
23.2KB / 279KB = 8%
 0.9MB / 5.0MB = 18%
25.9MB / 128MB = 20%
94.8MB / 449MB = 21%
```
These calculations don't include the 4-bytes-per-CID prefix size, since that overhead will already typically be eliminated by compression.

STARs retain the raw CBOR serialization of records, but may use a new MST node serialization that further reduces this overhead.

Since omitting CIDs by making them implicit removes uncompressible content from CARs, I'm optimistic that real savings for compressed STARs vs CARs will be higher.

Note that all repository content in a STAR is still cryptographically bound to the signed root commit's CID: it's just a little more work to prove it.

scope#

STAR is specialized for atproto MST storage, and best-suited for serializing complete trees.

It should work for "CAR slices" -- contiguous narrowed partial trees that may omit content before and/or after a specific key. (CIDS referencing missing nodes at the boundaries cannot be eliminated)
- It's desireable to be able to archive complete subtrees, so enforcing a well-formed atproto commit as the header might not be sufficient on its own. (subtrees could be stored as CAR slices so this may be unnecessary)
It might be interesting to allow arbitrary sparse trees. Not sure yet.
It's not suitable for firehose commit CARs, which need to include blocks that aren't in a strict single MST.

STAR format does not aim to provide efficient access to random nodes or through other tree iteration patterns. Almost any kind of inspection requires a linear scan through the archive (especially if global key compression happens).

problems#

It might be difficult to convert a STAR to stream-friendly (preorder traversal) CAR format, since the CID of each MST node block can only be computed after visiting all of its children. STAR (could)[https://bsky.app/profile/bad-example.com/post/3mcv4zxwtgs2w] require that MST nodes above a certain depth store their own CIDs which would be sad but pragmatic.

Similarly, the validity of the commit signature cannot be known until the root node's CID is calculated. a parser might emit an entire repo-worth of keys-record pairs only to find out at the very end that none of it was valid. The same boring pragmatic fix as the last problem can probably also address this: small near-leaf subtrees need to be buffered for validation; as trees get larger this looks more like streaming. (see verifying the whole tree below)

format#

|--------- header ---------| |---------------- optional tree ----------------|
[ '*' | ver | len | commit ] [ node ] [ node OR record ] [ node OR record ] …

* (fixed u8): The first byte of the header is always * (hex 0x2A).
ver (varint): Next is an LEB128 varint specifying the version, with a fixed value of 1 for the current format.
len (varint): The length of the proceeding atproto commit object in bytes.
commit (DAG-CBOR): An atproto commit object in DAG-CBOR derived from the repo spec:
- did (string, nullable): same as repo spec
- version (integer, required): corresponding CAR repo format version, currently fixed value of 3
- data (hash link, nullable): CID of the first (root) node in the MST. an empty tree is represented by the presence of a null here
- rev (string, required): same as repo spec
- prev (hash link, nullable): same as repo spec
- sig (byte array, nullable): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the sig property is allowed to be null.

verifying a commit#

The commit object can be converted to a repo-spec compliant commit:

if data is null, replace it with the CID of an empty repo-spec style MST (bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4)
follow steps from the repo spec to resolve the identity and verify the signature.

When sig is null (typically for archived MST sub-trees), this STAR cannot be converted to a repo-spec compliant commit.

optional tree#

node: TODO: we need a new node format. It must be convertible back to a repo-spec style node.
record: The atproto record. Its CID can be computed over the bytes of its block (see below).

node / base#

|----- node -----|
[ len | mst node ]

len (varint): the length of the proceeding CBOR block, in bytes.
mst node (DAG-CBOR): object with the following schema
- l (hash link, nullable)

note1: it's a bit tempting to redesign the MST nodes, because the reason (and lack of special-ness) for l being separate from the entries in e took a long time for me to understand. but the existing format definitely works so maybe sticking close to it is the move?

note2: a magic special zero hash-link is a pretty gross way to shoehorn in a sentinel! null was already taken because subtrees always are optional

(this section is very much in flux)

was thinking of making base (depth=0) nodes special (implicit cid) and then further simplifying to a simple array of entries since they can't have subtrees (l or ts).

buuuutttt it's probably simpler just to give the node a nullable cid property that's required when depth=0.

on the other track, i was thinking nodes could be rewritten as a pair of arrays

index:   [        0        ,        1        ,        2        ,        3      ]

new
entries: [   (keyA, linkA) ,   (keyB, linkB) ,   (keyC, linkC) ] xxxxxxxxxxxxxxx
trees:   [ * tree before A , * tree before B ,  <null>         , *tree after C ]

vs old repo spec
mst node:[ tree in `l`     , keyA's `t`      ,  keyB's null `t`, keyC's `t`    ]

i think most languages can handle a pair of arrays ok with zip? but the equal-or-one-shorter length of entries compared to trees seems like asking for bugs.

so let's keep it simple (similar to the repo spec), trying again:

|----- node -----|
[ len | mst node ]

len (varint): the length of the proceeding CBOR block, in bytes.
mst node (DAG-CBOR): object with the following schema
- cid (hash link, nullable): the CID of this MST node. must be null for nodes at depth=0; required to be non-null for nodes at any higher depth.
- l (hash link, nullable): reference to a subtree at a lower depth containing only keys to the left of this node. when the referenced node is included in the archive, it must be given a special zeroed-out link reference (all zero bytes (deal with hash link prefixes or whatever... probably can assume sha256 but careful for lossless reversibility back to CAR))
- e (array, required): ordered array of entry objects, each containing:
  - p (integer, required): number of bytes shared with the previous entry (TODO key compression actually)
  - k (byte string, required): key suffix remaining
  - v (hash link, nullable): reference to the record data for this key. must be null if the STAR includes the record; must not be null if the record is not included in the STAR
  - t (hash link, nullable): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. see l above.

NOTE: the option to not include v (and requiring its hash link to be present in that case) keeps the option open for key->CID-only archives, which can be nice for things like diffing a repo to handle a firehose #sync event, or perhaps to exclude large records specifically from the archive. (make this cohesive with optional vs null handling if using that)

TODO: nullable vs optional? (in general??)

tempting to do something like:

omitted means there is no subtree
null means there is a subtree and it's included (CID to-calculate)
non-null means there is a subtree and it's not included (MST slice or sparse tree)

hmmm: having separate optional and null cases might make deserializing into some languages tricky. i'm not sure if serde can handle that well? omitempty + nullable => Option<Option<T>>? should probably check other languages.

record#

|--- record --|
[ len | block ]

len (varint): the length of the proceeding binary record block in bytes.
block (bytes): the raw bytes of the (DAG-CBOR) record

order of `node`s and `record`s#

The MST must be stored in key order, which for an MST is a depth-first walk across the tree.

For each included child of a node (indicated by ?? in its entries. null for cid?), todo blah blah

excluded children (indicated by a CID link being present in entries) are not included in the series of nodes and records.

key compression#

TODO

(but basically do what the repo-spec does but apply it across the whole stream)
(but also actually run some tests and measure how much this decreases file sizes post-normal-file-compression)

verifying the whole tree#

A STAR reader must compute CIDs for all MST nodes as they are encountered, so that parent node CIDs can be computed, until eventually the root node's CID is known and can be compared againt the commit object's data hash link. If the root node's CID does not match, the commit's signature is not valid for the archive.

Clone this repository