Streaming Tree ARchive format

doing the gross thing

optional and nullable

ready for the hate-mail

+12 -54
readme.md
···
110 110
111 111   - `record`: The atproto record. Its CID can be computed over the bytes of its `block` (see below).
112 112
113     - ### node / base
114 113
115     - ```
116     - |----- node -----|
117     - [ len | mst node ]
118     - ```
119     -
120     - `len` (varint): the length of the following CBOR block, in bytes.
121     -
122     - - `mst node` (DAG-CBOR): object with the following schema
123     -   - `l` (hash link, nullable)
124     -
125     - note1: it's a bit tempting to redesign the MST nodes, because the _reason_ (and lack of special-ness) for `l` being separate from the entries in `e` took a long time for me to understand. but the existing format definitely works so maybe sticking close to it is the move?
126     -
127     - note2: a magic special zero hash-link is a pretty gross way to shoehorn in a sentinel! null was already taken because subtrees always are optional
128     -
129     - (this section is very much in flux)
130     -
131     - was thinking of making base (depth=0) nodes special (implicit cid) and then further simplifying to a simple array of entries since they can't have subtrees (`l` or `t`s).
132     -
133     - buuuutttt it's probably simpler just to give the node a nullable `cid` property that's required when depth=0.
134     -
135     - on the other track, i was thinking nodes could be rewritten as a pair of arrays
136     -
137     - ```
138     - index: [ 0 , 1 , 2 , 3 ]
139     -
140     - new
141     - entries: [ (keyA, linkA) , (keyB, linkB) , (keyC, linkC) ] xxxxxxxxxxxxxxx
142     - trees: [ * tree before A , * tree before B , <null> , *tree after C ]
143     -
144     - vs old repo spec
145     - mst node:[ tree in `l` , keyA's `t` , keyB's null `t`, keyC's `t` ]
146     - ```
147     -
148     - i think most languages can handle a pair of arrays ok with zip? but the equal-or-one-shorter length of `entries` compared to `trees` seems like asking for bugs.
149     -
150     - so let's keep it simple (similar to the repo spec), trying again:
151     -
    114 + ### node
152 115
153 116   ```
154 117   |----- node -----|
···
158 121   - `len` (varint): the length of the following CBOR block, in bytes.
159 122
160 123   - `mst node` (DAG-CBOR): object with the following schema
161     -   - `cid` (hash link, nullable): the CID of this MST node. must be `null` for nodes at `depth=0`; required to be non-null for nodes at any higher `depth`.
162     -   - `l` (hash link, nullable): reference to a subtree at a lower depth containing only keys to the left of this node. when the referenced node is included in the archive, it must be given a special zeroed-out link reference (all zero bytes (deal with hash link prefixes or whatever... probably can assume sha256 but careful for lossless reversibility back to CAR))
    124 +   - `l` (hash link, **optional and nullable**): reference to a subtree at a lower depth containing only keys to the left of this node.
    125 +     - when **absent**: there is no left subtree
    126 +     - when **null**: the left subtree is present and will follow in the archive (implicit CID)
    127 +     - when **non-null**: the left subtree exists but is absent from the archive
163 128     - `e` (array, required): ordered array of entry objects, each containing:
164 129       - `p` (integer, required): number of bytes shared with the previous entry (TODO key compression actually)
165 130       - `k` (byte string, required): key suffix remaining
166     -     - `v` (hash link, **nullable**): reference to the record data for this key. must be null if the STAR includes the record; must _not_ be null if the record is not included in the STAR
167     -     - `t` (hash link, nullable): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. see `l` above.
168     -
169     - NOTE: the option to not include `v` (and requiring its hash link to be present in that case) keeps the option open for `key->CID`-only archives, which can be nice for things like diffing a repo to handle a firehose `#sync` event, or perhaps to exclude large records specifically from the archive. (make this cohesive with optional vs null handling if using that)
170     -
171     - TODO: nullable vs optional? (in general??)
172     -
173     - tempting to do something like:
174     -
175     - omitted means there is no subtree
176     - null means there is a subtree and it's included (CID to-calculate)
177     - non-null means there is a subtree and it's *not* included (MST slice or sparse tree)
178     -
179     - hmmm: having separate optional and null cases might make deserializing into some languages tricky. i'm not sure if serde can handle that well? omitempty + nullable => `Option<Option<T>>`? should probably check other languages.
    131 +     - `v` (hash link, **nullable**): reference to the record data for this key.
    132 +       - when **null**: the record is included in the archive and will follow (implicit CID)
    133 +       - when **non-null**: the record exists but is not included in the archive
    134 +     - `t` (hash link, nullable): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. same rules as `l`:
    135 +       - when **absent**: there is no subtree
    136 +       - when **null**: the subtree is present and will follow in the archive (implicit CID)
    137 +       - when **non-null**: the subtree exists but is absent from the archive
180 138
181 139
182 140   ### record
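
The absent / null / non-null rules that this change introduces for `l` and `t` (and the two-state rule for `v`) can be sketched in Rust as a mapping from the `Option<Option<T>>` shape that the removed notes worried about. This is only an illustration under assumptions: the names `SubtreeLink`, `RecordLink` and `follows_in_archive` are hypothetical, a CID is stood in for by raw bytes, and no actual DAG-CBOR decoding is shown.

```rust
/// Hypothetical three-state view of the `l` / `t` fields.
#[derive(Debug, Clone, PartialEq)]
enum SubtreeLink {
    /// Field absent: there is no subtree.
    Absent,
    /// Field present and null: the subtree follows in the archive (implicit CID).
    Included,
    /// Field present and non-null: the subtree exists but is not in the archive.
    External(Vec<u8>),
}

/// Outer `Option` = "was the field present?", inner `Option` = "was it null?".
impl From<Option<Option<Vec<u8>>>> for SubtreeLink {
    fn from(raw: Option<Option<Vec<u8>>>) -> Self {
        match raw {
            None => SubtreeLink::Absent,
            Some(None) => SubtreeLink::Included,
            Some(Some(cid)) => SubtreeLink::External(cid),
        }
    }
}

impl SubtreeLink {
    /// True when a reader should expect the subtree's node next in the stream.
    fn follows_in_archive(&self) -> bool {
        matches!(self, SubtreeLink::Included)
    }
}

/// Hypothetical two-state view of `v`, which is nullable but always present.
#[derive(Debug, Clone, PartialEq)]
enum RecordLink {
    /// Null: the record is included in the archive and will follow.
    Included,
    /// Non-null: the record exists but is not included in the archive.
    External(Vec<u8>),
}

impl From<Option<Vec<u8>>> for RecordLink {
    fn from(raw: Option<Vec<u8>>) -> Self {
        match raw {
            None => RecordLink::Included,
            Some(cid) => RecordLink::External(cid),
        }
    }
}
```

Collapsing the nested options into a named enum right after decoding keeps the "absent" and "null" cases from being confused further down the line, which was the concern raised in the removed `hmmm:` note.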