STAR: Streaming Tree ARchive format#

status: just thinking about it

STAR is an archival format for Merkle Search Trees (MSTs) as implemented in atproto. It offers efficient key-ordered streaming of repository contents and reduced archive size compared to CARs.

convertible to/from CAR (lossless except any out-of-tree blocks from a CAR)
extra garbage strictly not allowed (unlike CAR)
canonical (unlike CAR)
strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR)
the header simply is the commit, followed by the serialized tree.
layer 0 MST nodes leave record links implicit, and content blocks in the STAR are not prefixed by their content hash: linked blocks follow in a deterministic order and recompute CID from on their contents.
small spec cleanups, eg: there are no exceptions to the no-empty-MST-nodes rule: an entirely-empty MST is serialized as data: null in the commit header.

the two primary motivations are

bounded-resource streaming readers.

atproto MSTs in CARs have to buffer and retain all record blocks, and typically buffer most MST node blocks, just to traverse the tree. even if a CAR appears to be in stream-friendly block ordering, you can only safely discard record blocks if you know for sure it's actually stream-friendly.

you also cannot reliably identify MST node blocks and record blocks in an atproto CAR without walkign the tree, so you cannot discard any potentially garbage blocks from the buffered data before walking. A malicious PDS can serve a cheap-to-generate endless CAR stream of garbage blocks, and you just have to keep buffering them.

since STAR is strictly stream-ordered, there is no node/block ambiguity, and extra garbage is not allowed. CIDs commit the contents of subtrees and records, and since reading is the same as walking the tree, it might be possible to reject some kinds of malicious block-generation attacks early. (haven't thought this through)
reduced archive size.

CIDs are large, compression-unfriendly, and redundant if you have access to the CID's actual content.

for example, my atproto repo is around 5.0MB and contains 14,673 blocks with a CID prefix plus 14,675 CID links in its MST. Each CID is 32 bytes, so (14,673 + 14,675) * 32 = 0.9MB just for the CIDS, almost 20%.

from a few more samples of various sizes from real atproto repos:
```
  CIDs     CAR   potential savings
0.53KB / 3.4KB = 16%
23.2KB / 279KB = 8%
 0.9MB / 5.0MB = 18%
25.9MB / 128MB = 20%
94.8MB / 449MB = 21%
```
These calculations don't include the 4-bytes-per-CID prefix size, since that overhead will already typically be eliminated by compression.

STARs don't include content hashes before content blocks, reducing the number of CIDs immediately by half. They omit record link CIDs from layer-0 MST nodes as well, for an overall reduction of CIDs by 80%.

Note that all repository content in a STAR is still cryptographically bound to the signed root commit's CID: it's just a little more work to prove it.

scope#

STAR is specialized for atproto MST storage, and best-suited for serializing complete trees.

It should work for "CAR slices" -- contiguous narrowed partial trees that may omit content before and/or after a specific key. (CIDS referencing missing nodes at the boundaries cannot be eliminated)
- It's desireable to be able to archive complete subtrees, so enforcing a well-formed atproto commit as the header might not be sufficient on its own. (subtrees could be stored as CAR slices so this may be unnecessary)
It might be interesting to allow arbitrary sparse trees. Not sure yet.
It's not suitable for firehose commit CARs, which need to include blocks that aren't in a strict single MST.

STAR format does not aim to provide efficient access to random nodes or through other tree iteration patterns. Almost any kind of inspection requires a linear scan through the archive (especially if global key compression happens).

format#

|--------- header ---------| |---------------- optional tree ----------------|
[ '*' | ver | len | commit ] [ node ] [ node OR record ] [ node OR record ] …

* (fixed u8): The first byte of the header is always * (hex 0x2A).
ver (varint): Next is an LEB128 varint specifying the version, with a fixed value of 1 for the current format.
len (varint): The length of the proceeding atproto commit object in bytes.
commit (DAG-CBOR): An atproto commit object in DAG-CBOR derived from the repo spec:
- did (string, required): same as repo spec. may become optional for subtree archives, but it's nice to be able to inspect, for now.
- version (integer, required): corresponding CAR repo format version, currently fixed value of 3
- data (hash link, optional): CID of the first (root) node in the MST. an empty tree is represented by the absence of this key.
- rev (string, required): same as repo spec (may become optional)
- prev (hash link, optional): same as repo spec, but optional instead of nullable. only included for lossless CAR round-tripping.
- sig (byte array, optional): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the sig property is allowed to be omitted.

verifying a commit#

The commit object can be converted to a repo-spec compliant commit:

if data is absent, replace it with the CID of an empty repo-spec style MST (bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4)
follow steps from the repo spec to resolve the identity and verify the signature.

When sig is absent (typically for archived MST sub-trees), this STAR cannot be converted to a repo-spec compliant CAR. however, if it's a subtree, it can be stitched back into a full MST that can be converted back to a compliant CAR, provided the complimentary sparse STAR containing it.

optional tree#

There are two kinds of blocks in the tree: node blocks and record blocks. If the tree is present, the first block must always be a node (the MST root node), and blocks follow in depth-first tree traversal order. Blocks may arbitrarily be omitted (for STAR slices, sparse trees, and key -> CID-only archives). nodes have a flag for each link to indicate its upcoming presence (or absence) in the archive.

`node`#

|----- node -----|
[ len | mst node ]

len (varint): the length of the proceeding CBOR block, in bytes.
mst node (DAG-CBOR): object with the following schema
- l (hash link, optional): reference to a subtree at a lower depth containing only keys to the left of this node. if absent, there is no left subtree.
- L (bool, optional): "archived": if true, the subtree is contained in this archive. must not be present when l is not present.
- e (array, required): ordered array of entry objects with length of at least one, each containing:
  - p (integer, required): number of bytes shared with the previous entry (TODO key compression actually)
  - k (byte string, required): key suffix remaining
  - v (hash link, optional): reference to the record data for this key.
    - for MST nodes at depth=0:
      - v must be omitted when the record is included in the archive
      - v mut not be omitted if the record is not included
    - for MST nodes at depth>0:
      - v is required (V signifies if it's in the archive)
  - V (bool, optional): "archived": if true, the record is contained in this archive. must not be present when v is not present.
  - t (hash link, optional): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. if absent, there is no subtree.
  - T (bool, optional): "archived": if true, the subtree is contained in this archive. must not be present when t is not present.

for now see the atproto repo spec for key compression (p and k)

`record`#

An atproto record. Its CID can be computed over its bytes of its block (see below).

|--- record --|
[ len | block ]

len (varint): the length of the proceeding binary record block in bytes.
block (bytes): the raw bytes of the (DAG-CBOR) record.

verifying the whole tree#

Each MST node, including the root, must be verified to match its expected CID. To compute the CID of MST nodes:

For nodes at depth>0 (all child CID links are included): Convert the MST node into repo-spec format, compute its CID as the sha256 hash of its DAG-CBOR serialization.

For nodes at depth=0 (record CID links excluded unless omitted from archive): read all included linked records from the node into a buffer and compute their CIDs (they will immediately follow in the STAR since a depth=0 node cannot have any other children). With the record CIDs available, the MST node can be converted into repo-spec format, and its CID calculated as with depth>0 nodes.

To compute the CID of included records:

The required bytes for the CID calculation are the exact included record bytes (sha256 over them).

open questions#

how far to go with implicit CIDs?#

there is a trade-off between going fully implicit on CIDs (possible as long as the content is present to compute the CID) vs fully explicit in MST node links (CIDS still omitted before the content blocks themselves for 50% fewer vs CAR).

the problem with omitting all possible CIDS is that you then cannot verify the root node until you finish walking the entire MST. a consumer might have already written data somewhere only to find out they need to undo it all!

as a compromise we're only omitting CIDs from MST nodes for layer 0 record links:

75% reduction in record CIDs written to the STAR; 60% overall CID reduction including subtree links
MST nodes contain four records on average; a verifying streamer can buffer this small number of records before emitting, so it never omits content that later is found to be unverifiable.
no special casing required for MST subtree links, their CIDs are always included

this is probably the right balance: considering the 50% initial reduction compared CARs by dropping the hash prefix in front of blocks, the all-in CID reduction is 80%.

but we could take it one step higher: have layer1 nodes do implicit CIDs for subtrees and records:

89% overall reduction in CIDs in the STAR
a verifying streamer needs to buffer 16 records on average. an attacker gets a 5x space amplification benefit if trying to generate extra-wide bufferable bottom-level tree nodes.
layer-dependent special-casing required for CID links as well as record links, across two MST layers

with the 50% initial reduction, this would be 95% total CID reduction

CIDs make up around 20% of uncompressed CAR file sizes. the first approach gets that down to 4%; second 1%. however, CIDs are uncompressible, so it's probably worth measuring the real effect of both approaches on large repos post-compression before completely committing one way or another.

Clone this repository