atproto utils for zig (zat.dev)

fix: correct benchmark claims in devlog

address work asymmetry (zig decoded ~2.3k op-linked blocks while
rust/go decoded all ~23k), note block decode cardinality as dominant
factor over async overhead, explain python > rust result (libipld
sync vs iroh-car async), call out indigo as slowest despite being
bluesky's production relay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+17 -13
devlog/002-firehose-and-benchmarks.md
@@ -35,27 +35,31 @@
 
 ## the benchmarks
 
-we built [atproto-bench](https://tangled.sh/@zzstoatzz.io/atproto-bench) — a cross-SDK benchmark that captures ~10 seconds of live firehose traffic (~2400 frames, ~12 MB), then decodes the full corpus with four SDKs. each SDK calls its real consumer API: raw frame bytes in, typed commit with decoded records out. no synthetic shortcuts.
+we built [atproto-bench](https://tangled.sh/@zzstoatzz.io/atproto-bench) — a cross-SDK benchmark that captures ~10 seconds of live firehose traffic, then decodes the full corpus with four SDKs.
 
-the results on macOS arm64, 5 measured passes over the corpus:
+every SDK does the same work per frame: decode CBOR header → decode CBOR payload → parse CAR → decode every CAR block as DAG-CBOR. block counts, error counts, and per-pass variance (min/median/max) are reported so you can verify parity.
 
-| SDK | frames/sec | MB/s |
-|-----|--------:|-----:|
-| zig (zat, arena reuse) | 1,852k | 9,079 |
-| zig (zat, alloc per frame) | 1,277k | 6,260 |
-| rust (jacquard-style) | 45k | 223 |
-| python (atproto) | 24k | 115 |
-| go (indigo) | 11k | 52 |
+the corpus is captured with a minimal CBOR header peek (check `t == "#commit"` and `ops` is non-empty) — no SDK-specific decode in the capture path, so the corpus isn't biased toward any particular decoder's capabilities.
 
-the "alloc per frame" variant is the fair cross-language comparison — fresh allocator per frame, just like the other SDKs. even so, zat is 28x faster than rust, 54x faster than python, and 120x faster than go.
+the original version of these benchmarks had work asymmetry: zat's `firehose.decodeFrame` only decoded op-linked blocks (~2.3k per corpus), while rust and go decoded all CAR blocks (~23k). python parsed CAR structure but didn't iterate blocks. the numbers below are from the corrected version where all SDKs decode every block.
 
-### why the gap
+run `just capture && just bench` in [atproto-bench](https://tangled.sh/@zzstoatzz.io/atproto-bench) to get numbers on your machine.
 
-two things compound:
+### why zat is fast
+
+three things compound:
 
 **zero-copy vs owned allocations.** when rust deserializes a `Commit`, serde allocates a new `String` for every string field and copies the entire CAR blob into a `Vec<u8>`. go's code-generated unmarshal does the same. zat returns slices pointing into the input buffer — the `repo` field is a pointer and a length, zero bytes copied.
 
-**sync vs async CAR parsing.** rust's `iroh-car` is an async library. every `next_block().await` goes through tokio's poll/wake state machine to read from an in-memory buffer. zat's CAR reader is synchronous and zero-copy. you can see it in the old numbers: rust did 501k frames/sec for just the CBOR decode (no CAR), but drops to 45k when CAR parsing kicks in.
+**block decode cardinality.** each firehose frame contains a CAR with ~10 blocks (MST nodes + records). decoding every block as DAG-CBOR is the dominant cost — it's where most of the per-frame CPU time goes across all SDKs.
+
+**sync vs async CAR parsing.** rust's `iroh-car` is an async library. every `next_block().await` goes through tokio's poll/wake state machine to read from an in-memory buffer. this is bad enough that python (via libipld, a *different* Rust library that works synchronously) outperforms the rust benchmark. the async overhead compounds on top of the per-block decode cost — ~10 awaits per frame adds up.
+
+### the go result
+
+indigo — bluesky's own production relay implementation — is the slowest of the four. go-car is synchronous (no async overhead excuse), and cbor-gen is code-generated (no reflection). the cost is allocations and GC pressure: go copies every string, every byte slice, every block into heap-allocated objects, and the garbage collector has to clean it all up. at ~10 blocks/frame, that's a lot of short-lived allocations per decode.
+
+this doesn't mean indigo is bad software — it handles the live firehose fine at ~1k events/sec. but it explains why bluesky runs beefy relay infrastructure: the decode path has no room to spare at scale.
 
 ### does this matter?
 
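as a footnote on the capture-path change in the diff: the "minimal CBOR header peek" idea can be sketched in a few lines of stdlib-only python. this is an illustration, not the code atproto-bench actually uses — `decode_cbor_item` and `is_commit_frame` are hypothetical names, only the CBOR types a firehose frame header needs are handled, and the real capture path additionally checks that `ops` in the payload is non-empty (not shown here).

```python
def decode_cbor_item(buf: bytes, pos: int):
    """decode one CBOR item starting at pos; return (value, new_pos).
    handles only what a firehose frame header uses: unsigned ints,
    text strings, and maps with text keys."""
    major, info = buf[pos] >> 5, buf[pos] & 0x1F
    pos += 1
    if info < 24:                       # argument is inline in the initial byte
        arg = info
    elif info == 24:                    # 1-byte argument follows
        arg = buf[pos]; pos += 1
    elif info == 25:                    # 2-byte big-endian argument
        arg = int.from_bytes(buf[pos:pos + 2], "big"); pos += 2
    elif info == 26:                    # 4-byte big-endian argument
        arg = int.from_bytes(buf[pos:pos + 4], "big"); pos += 4
    else:
        raise ValueError("unsupported CBOR argument encoding")
    if major == 0:                      # unsigned int
        return arg, pos
    if major == 3:                      # text string of length arg
        return buf[pos:pos + arg].decode("utf-8"), pos + arg
    if major == 5:                      # map with arg key/value pairs
        out = {}
        for _ in range(arg):
            key, pos = decode_cbor_item(buf, pos)
            val, pos = decode_cbor_item(buf, pos)
            out[key] = val
        return out, pos
    raise ValueError(f"unsupported CBOR major type {major}")


def is_commit_frame(frame: bytes) -> bool:
    """peek only the leading header map of a firehose frame;
    the payload bytes after it are never decoded."""
    header, _end = decode_cbor_item(frame, 0)
    return isinstance(header, dict) and header.get("t") == "#commit"
```

a firehose websocket frame is two concatenated CBOR objects (header map, then payload), so decoding the first object and stopping is all the capture path needs.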
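and on the zero-copy point: a CARv1 stream is just a varint-length-prefixed CBOR header followed by varint-length-prefixed sections (raw CID bytes + block data), so a reader can hand out views into the input buffer without copying anything. a hedged stdlib-only python sketch of that layout — function names are mine, and CID parsing is deliberately skipped, so each yielded section still begins with the raw CID bytes:

```python
def read_uvarint(buf, pos: int):
    """read one LEB128 unsigned varint; return (value, new_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:             # high bit clear ends the varint
            return value, pos
        shift += 7


def iter_car_sections(buf: bytes):
    """yield every length-prefixed section of a CARv1 stream as a
    zero-copy memoryview: the CBOR header first, then each block
    (raw CID bytes + block data). no bytes are copied out of buf."""
    view = memoryview(buf)
    pos = 0
    while pos < len(view):
        length, pos = read_uvarint(view, pos)
        yield view[pos:pos + length]
        pos += length
```

the `memoryview` slices here play the role zat's zig slices play: a pointer and a length into the original frame, which is exactly what the owned-allocation SDKs pay to avoid giving you.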