# consuming the firehose, then benchmarking it

since the last devlog (self-publishing docs), zat grew from a collection of string parsers and HTTP clients into something that can consume the full AT Protocol event stream — both jetstream (JSON) and the raw firehose (binary DAG-CBOR). then we benchmarked it against every other AT Protocol SDK and the numbers were... surprising.

## what we built

### jetstream client (0.1.3)

the easier of the two event streams. jetstream is a JSON WebSocket — you connect, receive typed events (commits, identity changes, account status updates), and process them. zat's client handles reconnection with exponential backoff, cursor tracking so you don't miss events on disconnect, and typed event parsing via the json helpers.
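the loop behind that is roughly this (a Python sketch of the pattern, not zat's API; `connect`, `handle`, and the `time_us` cursor field here are illustrative stand-ins):

```python
import time

def consume(connect, handle, max_attempts=5, base_delay=1.0, cap=30.0):
    """Sketch of the reconnect pattern: exponential backoff on failure,
    cursor tracking so a reconnect resumes where the last session stopped.
    `connect(cursor)` stands in for opening the WebSocket; it yields
    events shaped like {"time_us": ...}."""
    cursor = None
    attempt = 0
    while attempt < max_attempts:
        try:
            for event in connect(cursor):
                cursor = event["time_us"]  # remember position for resume
                handle(event)
            return cursor                   # stream ended cleanly
        except ConnectionError:
            delay = min(base_delay * (2 ** attempt), cap)
            attempt += 1
            time.sleep(delay)
    return cursor
```

a real client would also reset the backoff counter after a healthy stretch of events; this sketch keeps only the resume-from-cursor part.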
### firehose support (0.1.4)

this was the real work. the raw firehose (`com.atproto.sync.subscribeRepos`) sends binary DAG-CBOR frames over WebSocket. each frame is two concatenated CBOR objects: a header (`{op, t}`) and a payload. commit payloads contain a CAR (Content Addressable aRchive) file embedded as a byte string, which contains the actual records.
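the two-object layout means you first have to find where the header object ends before you can decode the payload. a rough Python sketch of that split (zat does this in zig; this version only handles the definite-length CBOR forms a frame actually uses):

```python
def skip_item(buf, i=0):
    """Return the offset just past one CBOR data item starting at buf[i]."""
    major, info = buf[i] >> 5, buf[i] & 0x1F
    i += 1
    # read the argument (immediate or 1/2/4/8-byte forms)
    if info < 24:
        arg = info
    elif info == 24:
        arg = buf[i]; i += 1
    elif info == 25:
        arg = int.from_bytes(buf[i:i + 2], "big"); i += 2
    elif info == 26:
        arg = int.from_bytes(buf[i:i + 4], "big"); i += 4
    elif info == 27:
        arg = int.from_bytes(buf[i:i + 8], "big"); i += 8
    else:
        raise ValueError("indefinite lengths are not valid DAG-CBOR")
    if major in (2, 3):            # byte/text string: argument is the length
        return i + arg
    if major == 4:                 # array: skip `arg` nested items
        for _ in range(arg):
            i = skip_item(buf, i)
        return i
    if major == 5:                 # map: skip `arg` key/value pairs
        for _ in range(2 * arg):
            i = skip_item(buf, i)
        return i
    if major == 6:                 # tag (e.g. tag 42 wrapping a CID)
        return skip_item(buf, i)
    return i                       # ints / simple values / floats

def split_frame(frame: bytes):
    """Split a firehose frame into (header_bytes, payload_bytes)."""
    end = skip_item(frame, 0)
    return frame[:end], frame[end:]
```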
so to decode one firehose frame you need:

1. a DAG-CBOR codec (subset of CBOR with deterministic encoding rules)
2. a CAR codec (multicodec-prefixed CID + data blocks)
3. CID parsing (version, codec, multihash)
4. the actual record extraction (match CIDs from ops to CAR blocks, decode record CBOR)

all of these are hand-rolled in zig. `firehose.decodeFrame(allocator, data)` does the full pipeline in one call — frame bytes in, typed `CommitEvent` with decoded records out.
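the CID and CAR layers (steps 2 and 3) are small once written down. a hedged Python sketch of both (illustrative, not zat's actual code; the CAR header block is skipped rather than decoded):

```python
def read_varint(buf, i):
    """Read an unsigned LEB128 varint; return (value, next_offset)."""
    value = shift = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

def cid_length(buf, i):
    """Byte length of the binary CIDv1 starting at buf[i]:
    version varint + codec varint + multihash (code, size, digest)."""
    start = i
    _version, i = read_varint(buf, i)    # 0x01 for CIDv1
    _codec, i = read_varint(buf, i)      # e.g. 0x71 = dag-cbor
    _hash_code, i = read_varint(buf, i)  # e.g. 0x12 = sha2-256
    digest_len, i = read_varint(buf, i)
    return i + digest_len - start

def iter_car_blocks(car):
    """Yield (cid_bytes, data) for each block in a CARv1 payload.
    Layout: [varint header_len][header block]([varint block_len][CID][data])*"""
    header_len, i = read_varint(car, 0)
    i += header_len                      # skip the DAG-CBOR header {version, roots}
    while i < len(car):
        block_len, i = read_varint(car, i)
        end = i + block_len
        clen = cid_length(car, i)
        yield car[i:i + clen], car[i + clen:end]
        i = end
```

record extraction (step 4) is then a lookup: match the CIDs named in the commit's ops against the blocks this yields, and run the CBOR decoder on the matching data.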
### performance work (0.1.7)

once the firehose decoder worked, we profiled and optimized:

- **slimmed `Cid` from 56 to 16 bytes** — store only the raw byte reference, parse version/codec/digest lazily. most code paths just need to compare or look up CIDs, not inspect their internals.
- **`Value` union shrunk from 64 to 24 bytes, `MapEntry` from 80 to 40 bytes** — these are the hot types in CBOR decoding. thousands per frame. smaller means better cache behavior.
- **zero-copy everywhere** — CBOR strings and byte strings are slices into the input buffer, not copies. CIDs reference the raw bytes directly. the only allocations are for array/map containers (which go into the arena).
- **inline map key reading** — CBOR map keys in DAG-CBOR are always text strings, so we inline the key read instead of going through the full `decodeAt` → `Value` union construction per key.
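the zero-copy idea translates to any language with buffer views. a small Python analogy using `memoryview` (illustrative; zat does the equivalent with zig slices):

```python
def read_text(buf: memoryview, i: int):
    """Zero-copy read of a short CBOR text string (major type 3, len < 24):
    returns a view into `buf`, not a copy, plus the next offset."""
    head = buf[i]
    assert head >> 5 == 3, "not a text string"
    length = head & 0x1F
    assert length < 24, "sketch only handles short strings"
    return buf[i + 1:i + 1 + length], i + 1 + length

frame = memoryview(b"\x63did\x01")   # text "did" followed by the int 1
s, nxt = read_text(frame, 0)
assert s.obj is frame.obj            # `s` views the original buffer, no copy
assert bytes(s) == b"did"
```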
### round-robin host rotation (0.1.6)

both clients now rotate through multiple hosts on reconnect. the firehose defaults to `bsky.network` plus three `firehose.network` regional endpoints. jetstream defaults to 12+ hosts. backoff resets when switching to a fresh host.
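a minimal sketch of that rotation logic in Python (the class name and fields are invented for illustration; zat's actual implementation is zig):

```python
class HostRotation:
    """Round-robin over hosts; backoff resets when moving to a fresh host."""

    def __init__(self, hosts, base=1.0, cap=60.0):
        self.hosts = hosts
        self.index = 0
        self.attempt = 0
        self.base = base
        self.cap = cap

    def delay(self):
        """Exponential backoff for repeated failures on the current host."""
        d = min(self.base * (2 ** self.attempt), self.cap)
        self.attempt += 1
        return d

    def next_host(self):
        """Advance to the next host and reset backoff."""
        self.index = (self.index + 1) % len(self.hosts)
        self.attempt = 0  # fresh host, fresh backoff
        return self.hosts[self.index]
```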
## the benchmarks

we built [atproto-bench](https://tangled.sh/@zzstoatzz.io/atproto-bench) — a cross-SDK benchmark that captures ~10 seconds of live firehose traffic (~2400 frames, ~12 MB), then decodes the full corpus with four SDKs. each SDK calls its real consumer API: raw frame bytes in, typed commit with decoded records out. no synthetic shortcuts.
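the measurement loop is simple enough to sketch. a Python approximation (assumptions: the corpus is held in memory and the best pass is reported; the real atproto-bench harness may aggregate passes differently):

```python
import time

def bench(decode, frames, passes=5):
    """Decode the captured corpus `passes` times; report (frames/sec, MB/s).
    `decode` is the SDK's real consumer entry point: frame bytes in,
    typed commit out."""
    total_bytes = sum(len(f) for f in frames)
    best = float("inf")
    for _ in range(passes):
        t0 = time.perf_counter()
        for f in frames:
            decode(f)
        best = min(best, time.perf_counter() - t0)
    return len(frames) / best, total_bytes / best / 1e6
```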
the results on macOS arm64, 5 measured passes over the corpus:

| SDK | frames/sec | MB/s |
|-----|-----------:|-----:|
| zig (zat, arena reuse) | 1,852k | 9,079 |
| zig (zat, alloc per frame) | 1,277k | 6,260 |
| rust (jacquard-style) | 45k | 223 |
| python (atproto) | 24k | 115 |
| go (indigo) | 11k | 52 |
the "alloc per frame" variant is the fair cross-language comparison — fresh allocator per frame, just like the other SDKs. even so, zat is 28x faster than rust, 54x faster than python, and 120x faster than go.

### why the gap

two things compound:

**zero-copy vs owned allocations.** when rust deserializes a `Commit`, serde allocates a new `String` for every string field and copies the entire CAR blob into a `Vec<u8>`. go's code-generated unmarshal does the same. zat returns slices pointing into the input buffer — the `repo` field is a pointer and a length, zero bytes copied.

**sync vs async CAR parsing.** rust's `iroh-car` is an async library. every `next_block().await` goes through tokio's poll/wake state machine to read from an in-memory buffer. zat's CAR reader is synchronous and zero-copy. you can see it in the old numbers: rust did 501k frames/sec for just the CBOR decode (no CAR), but drops to 45k when CAR parsing kicks in.
### does this matter?

for live firehose consumption: no. the network delivers ~500-1000 events/sec. any of these SDKs handle that easily. where it matters: backfill (replaying months of data), relays (fanning out to many consumers), and anything where you're processing stored firehose data as fast as possible.

for now, we ship features. the headroom is there when we need it.