atproto utils for zig zat.dev

docs: update devlog 006 — single port, backfill complete, spec compliance

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+21 -8
devlog/006-building-a-relay.md
···
  **OS threads, not goroutines.** one thread per PDS host. predictable memory, no GC pauses, but thread count scales linearly. 2,750 threads is fine — most are blocked on WebSocket reads. per-thread RSS is modest (stack pages on demand, ~1-2 MiB when active).

- **split ports.** 3000 for the WebSocket firehose, 3001 for HTTP (health, stats, metrics, admin, XRPC). indigo serves everything on 2470.
+ **single port.** everything — WebSocket firehose, HTTP API, admin endpoints — on port 3000. a second port (3001) serves only prometheus metrics. indigo does the same: 2470 for everything, 2471 for metrics. this required patching the websocket.zig fork to support HTTP fallback — when a non-WebSocket request arrives, the handshake parser routes it to an HTTP handler instead of returning an error.

  ## deployment war stories
···
  progress is tracked in postgres — cursor position and imported count per collection — so crashes resume where they left off. triggered via admin API, monitored via status endpoint.

- first backfill run: 1,269 collections discovered. the small ones (niche lexicons, alt clients) complete in seconds. the big ones — `app.bsky.feed.like`, `app.bsky.feed.post`, `app.bsky.actor.profile` — each have 20-30M+ DIDs and take hours to page through at 1,000 per request with a 100ms pause between pages.
+ first backfill run: 1,287 collections discovered. the small ones (niche lexicons, alt clients) complete in seconds. the big ones — `app.bsky.feed.like`, `app.bsky.feed.post`, `app.bsky.actor.profile` — each have 20-30M+ DIDs and take hours to page through at 1,000 per request with a 100ms pause between pages.

- as of writing: 621 collections complete, 13.6M DIDs imported, currently grinding through `feed.like` at ~250K DIDs/minute.
+ as of writing: backfill complete — 1,287 collections indexed, 61M DIDs imported.
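the HTTP-fallback patch described in the single-port hunk comes down to one routing decision in the handshake parser. a minimal sketch — `Request` and the handler names here are made up for illustration, not websocket.zig's actual API:

```zig
const std = @import("std");

// hypothetical request shape — websocket.zig's real parser differs
const Request = struct {
    headers: std.StringHashMap([]const u8),
};

// a websocket handshake requires `Upgrade: websocket` (case-insensitive);
// anything else falls through to the plain HTTP handler
fn isWebSocketUpgrade(req: *const Request) bool {
    const upgrade = req.headers.get("upgrade") orelse return false;
    return std.ascii.eqlIgnoreCase(upgrade, "websocket");
}

fn route(req: *const Request) []const u8 {
    return if (isWebSocketUpgrade(req)) "websocket" else "http";
}
```

the key design point is that a missing or non-websocket `Upgrade` header is no longer an error path — it selects the HTTP handler instead.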
  ## the build pipeline
···
  | | indigo (Go) | zlay (zig) |
  |---|---|---|
- | code | ~50k+ lines | ~6k lines |
  | dependencies | ~50 Go modules | 4 (zat, websocket, pg, rocksdb) |
- | memory | ~6 GiB (GOMEMLIMIT) | ~1.8 GiB (1,486 hosts) |
+ | memory | ~6 GiB (GOMEMLIMIT) | ~2.9 GiB (~2,750 hosts) |
  | collection index | sidecar process (pebble) | inline (RocksDB) |
  | validation | blocking (DID resolution) | optimistic (pass-through on miss) |
  | services to deploy | 2 (relay + collectiondir) | 1 |

- the memory difference isn't zig being "faster" — it's the absence of a garbage collector holding onto freed memory. Go's relay sets `GOMEMLIMIT=6GiB` to tell the runtime it's OK to return memory to the OS. zlay's threads use what they need and the OS pages the rest.
+ the first measurement (1.8 GiB at 1,486 hosts) was misleading — memory climbed to 6.6 GiB as the relay connected to all ~2,750 hosts, approaching the 8 GiB OOM limit. two fixes brought it back down:
+
+ 1. **thread stack sizes.** zig's default is 16 MB per thread. with ~2,750 subscriber threads that maps 44 GB of virtual memory. most threads just read WebSockets and decode CBOR — 2 MB is generous. all `Thread.spawn` calls now pass `.{ .stack_size = 2 * 1024 * 1024 }`.
+
+ 2. **c_allocator instead of GeneralPurposeAllocator.** GPA is actually a debug allocator (renamed `DebugAllocator` in zig 0.15) — it tracks per-allocation metadata and never returns freed small allocations to the OS. since zlay links glibc, `std.heap.c_allocator` gives glibc malloc with per-thread arenas, `madvise`-based page return, and production-grade fragmentation mitigation.

  ## what zat exercises
···
  running at ~600 events/sec sustained, zat processes roughly 50M CBOR decodes per day. that's a different kind of test than unit vectors.
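the two memory fixes in the hunk above each reduce to a couple of lines. a minimal sketch — `subscriberLoop` is a stand-in for zlay's actual per-host thread, not its real code:

```zig
const std = @import("std");

fn subscriberLoop(host_id: u64) void {
    // real loop: blocking WebSocket read + CBOR decode — needs little stack
    _ = host_id;
}

pub fn main() !void {
    // fix 2: glibc malloc via c_allocator (per-thread arenas,
    // madvise-based page return) instead of GPA's debug bookkeeping
    const alloc = std.heap.c_allocator;
    const buf = try alloc.alloc(u8, 64);
    defer alloc.free(buf);

    // fix 1: 2 MiB stacks instead of the 16 MiB default — at ~2,750
    // threads that's ~5.5 GB of mapped virtual memory instead of ~44 GB
    const t = try std.Thread.spawn(
        .{ .stack_size = 2 * 1024 * 1024 },
        subscriberLoop,
        .{@as(u64, 1)},
    );
    t.join();
}
```

note that `stack_size` caps the mapping per thread; pages are still committed on demand, so resident memory stays far below the virtual total.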
+ ## spec compliance
+
+ after the memory fixes, the next pass was checking zlay against the actual lexicon definitions for what a relay should implement. three gaps:
+
+ 1. **`getHostStatus` was missing.** the lexicon says "implemented by relays" — zlay had `listHosts` but not the single-host query. straightforward handler: look up host, count accounts, map internal status values to the lexicon's `hostStatus` enum.
+
+ 2. **admin takedowns didn't emit `#account` events.** `/admin/repo/ban` zeroed payloads on disk but never told downstream consumers the account was taken down. the spec says a relay's own takedown should produce an `#account` event. fix: build a CBOR frame (`active: false, status: "takendown"`), persist it, broadcast it.
+
+ 3. **DID migration was unvalidated.** when an account appeared from a different PDS host, zlay blindly updated the host_id. now it queues a migration check — the validator's background threads resolve the DID document, check `pdsEndpoint()`, and only update if the new host matches.
+
  ## what's next

- the backfill will finish in a few hours. after that, zlay's collection index should be at parity with bsky.network's collectiondir for the first time. the next step is a correctness audit — diff `listReposByCollection` results between zlay and bsky.network across a sample of collections and verify the sets match.
+ the backfill is complete — 1,287 collections indexed, 61M DIDs. the next step is a correctness audit — diff `listReposByCollection` results across a sample of collections against bsky.network's collectiondir and verify the sets match.

- longer term: sync 1.1 support is partially implemented (zlay already handles `#sync` frames from the firehose), but full commit diff verification via MST inversion is the remaining piece. that's where zat's `verifyCommitDiff` comes in — the primitives exist, they just need to be wired into the relay's validation pipeline.
+ longer term: full commit diff verification via MST inversion. zlay already handles `#sync` frames and validates signatures, but the inductive firehose check (`verifyCommitDiff`) isn't wired into the hot path yet. the primitives exist in zat — it's a throughput tradeoff.
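the DID-migration check in spec gap 3 boils down to a string comparison once the DID document is resolved. a sketch under assumed names — `DidDoc` here is a stand-in for the resolver's output, not zat's real type:

```zig
const std = @import("std");

// stand-in for a resolved DID document; only the service endpoint matters here
const DidDoc = struct {
    pds_endpoint: []const u8,

    fn pdsEndpoint(self: *const DidDoc) []const u8 {
        return self.pds_endpoint;
    }
};

// only rewrite host_id when the DID document actually points at the host
// the event arrived from — otherwise keep the existing binding
fn shouldMigrate(doc: *const DidDoc, claimed_host: []const u8) bool {
    return std.mem.eql(u8, doc.pdsEndpoint(), claimed_host);
}
```

running this on the validator's background threads keeps DID resolution (network I/O) out of the firehose hot path, which is why the check is queued rather than inline.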