**OS threads, not goroutines.** one thread per PDS host. predictable memory, no GC pauses, but thread count scales linearly. 2,750 threads is fine — most are blocked on WebSocket reads. per-thread RSS is modest (stack pages on demand, ~1-2 MiB when active).

**single port.** everything — WebSocket firehose, HTTP API, admin endpoints — on port 3000. a second port (3001) serves only prometheus metrics. indigo does the same: 2470 for everything, 2471 for metrics. this required patching the websocket.zig fork to support HTTP fallback — when a non-WebSocket request arrives, the handshake parser routes it to an HTTP handler instead of returning an error.
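the fallback fits in one decision: if the handshake headers don't ask for a WebSocket upgrade, hand the request to the HTTP handler. a minimal sketch of that routing step (the `Header` and `classify` names are illustrative, not the fork's actual API):

```zig
const std = @import("std");

const Header = struct { name: []const u8, value: []const u8 };
const Route = enum { websocket, http };

// the patched handshake parser makes this call: upgrade only when the
// request actually carries the WebSocket handshake header.
fn classify(headers: []const Header) Route {
    for (headers) |h| {
        if (std.ascii.eqlIgnoreCase(h.name, "upgrade") and
            std.ascii.eqlIgnoreCase(h.value, "websocket"))
        {
            return .websocket;
        }
    }
    // no Upgrade header: route to the HTTP handler instead of erroring
    return .http;
}

pub fn main() void {
    const sub = [_]Header{.{ .name = "Upgrade", .value = "websocket" }};
    const health = [_]Header{.{ .name = "Accept", .value = "application/json" }};
    std.debug.print("{s} {s}\n", .{ @tagName(classify(&sub)), @tagName(classify(&health)) });
}
```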
## deployment war stories

progress is tracked in postgres — cursor position and imported count per collection — so crashes resume where they left off. triggered via admin API, monitored via status endpoint.
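the resume property comes from ordering: commit the cursor after every page, then sleep, then fetch the next. a control-flow sketch with stubbed storage and fetch (the real loop goes through pg and the upstream `listReposByCollection` endpoint; the names and page math here are illustrative):

```zig
const std = @import("std");

const Page = struct { dids: u32, next_cursor: ?u64 };

var saved_cursor: ?u64 = null; // stand-in for the postgres row
var imported: u64 = 0;

// stand-in for one listReposByCollection request, 1,000 DIDs per page
fn fetchPage(cursor: ?u64) Page {
    const pos = cursor orelse 0;
    return if (pos >= 3000)
        .{ .dids = 0, .next_cursor = null }
    else
        .{ .dids = 1000, .next_cursor = pos + 1000 };
}

pub fn main() void {
    var cursor = saved_cursor; // resume wherever the last run committed
    while (true) {
        const page = fetchPage(cursor);
        imported += page.dids;
        cursor = page.next_cursor;
        saved_cursor = cursor; // persist progress before the next page
        if (cursor == null) break;
        std.Thread.sleep(100 * std.time.ns_per_ms); // 100ms pause between pages
    }
    std.debug.print("imported {d}\n", .{imported});
}
```

a crash anywhere in the loop loses at most the page in flight, because the committed cursor always points at the last fully imported page.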
first backfill run: 1,287 collections discovered. the small ones (niche lexicons, alt clients) complete in seconds. the big ones — `app.bsky.feed.like`, `app.bsky.feed.post`, `app.bsky.actor.profile` — each have 20-30M+ DIDs and take hours to page through at 1,000 per request with a 100ms pause between pages.

as of writing: backfill complete — 1,287 collections indexed, 61M DIDs imported.
## the build pipeline

| | indigo (Go) | zlay (zig) |
|---|---|---|
| dependencies | ~50 Go modules | 4 (zat, websocket, pg, rocksdb) |
| memory | ~6 GiB (GOMEMLIMIT) | ~2.9 GiB (~2,750 hosts) |
| collection index | sidecar process (pebble) | inline (RocksDB) |
| validation | blocking (DID resolution) | optimistic (pass-through on miss) |
| services to deploy | 2 (relay + collectiondir) | 1 |
the first measurement (1.8 GiB at 1,486 hosts) was misleading — memory climbed to 6.6 GiB as the relay connected to all ~2,750 hosts, approaching the 8 GiB OOM limit. two fixes brought it back down:

1. **thread stack sizes.** zig's default is 16 MB per thread. with ~2,750 subscriber threads that maps 44 GB of virtual memory. most threads just read WebSockets and decode CBOR — 2 MB is generous. all `Thread.spawn` calls now pass `.{ .stack_size = 2 * 1024 * 1024 }`.

2. **c_allocator instead of GeneralPurposeAllocator.** GPA is actually a debug allocator (renamed `DebugAllocator` in zig 0.15) — it tracks per-allocation metadata and never returns freed small allocations to the OS. since zlay links glibc, `std.heap.c_allocator` gives glibc malloc with per-thread arenas, `madvise`-based page return, and production-grade fragmentation mitigation.
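both fixes side by side in one sketch (illustrative sizes, not zlay's actual worker code; `c_allocator` requires linking libc, e.g. `zig build-exe -lc`):

```zig
const std = @import("std");

// glibc malloc instead of the (debug-oriented) GeneralPurposeAllocator:
// per-thread arenas, and freed pages go back to the OS via madvise.
const alloc = std.heap.c_allocator;

fn worker(buf_len: usize) void {
    const buf = alloc.alloc(u8, buf_len) catch return;
    defer alloc.free(buf); // freed memory is eligible to leave RSS
    @memset(buf, 0);
}

pub fn main() !void {
    // cap the stack reservation: 2 MB instead of SpawnConfig's 16 MB default.
    // at ~2,750 threads that's the difference between 5.5 GB and 44 GB of
    // virtual address space.
    const t = try std.Thread.spawn(
        .{ .stack_size = 2 * 1024 * 1024 },
        worker,
        .{@as(usize, 64 * 1024)},
    );
    t.join();
    std.debug.print("ok\n", .{});
}
```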
## what zat exercises

running at ~600 events/sec sustained, zat processes roughly 50M CBOR decodes per day. that's a different kind of test than unit vectors.
## spec compliance

after the memory fixes, the next pass was checking zlay against the actual lexicon definitions for what a relay should implement. three gaps:

1. **`getHostStatus` was missing.** the lexicon says "implemented by relays" — zlay had `listHosts` but not the single-host query. straightforward handler: look up host, count accounts, map internal status values to the lexicon's `hostStatus` enum.

2. **admin takedowns didn't emit `#account` events.** `/admin/repo/ban` zeroed payloads on disk but never told downstream consumers the account was taken down. the spec says a relay's own takedown should produce an `#account` event. fix: build a CBOR frame (`active: false, status: "takendown"`), persist it, broadcast it.

3. **DID migration was unvalidated.** when an account appeared from a different PDS host, zlay blindly updated the host_id. now it queues a migration check — the validator's background threads resolve the DID document, check `pdsEndpoint()`, and only update if the new host matches.
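the mapping in gap 1 is a plain switch. a sketch with a hypothetical internal status enum on one side and the lexicon's `hostStatus` known values on the other (the internal states here are invented for illustration; zlay's real ones differ):

```zig
const std = @import("std");

// hypothetical internal connection states for a subscribed host
const InternalStatus = enum { connected, backing_off, disconnected, rate_limited, banned };

// project internal state onto the lexicon's hostStatus string enum
fn hostStatus(s: InternalStatus) []const u8 {
    return switch (s) {
        .connected => "active",
        .backing_off, .disconnected => "offline",
        .rate_limited => "throttled",
        .banned => "banned",
    };
}

pub fn main() void {
    std.debug.print("{s}\n", .{hostStatus(.rate_limited)});
}
```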
## what's next

the backfill is complete — 1,287 collections indexed, 61M DIDs. the next step is a correctness audit — diff `listReposByCollection` results across a sample of collections against bsky.network's collectiondir and verify the sets match.
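the audit's core step is a set difference per collection. a sketch with hypothetical names and inline data (a real run would first page through `listReposByCollection` on both services, then compare):

```zig
const std = @import("std");

// count DIDs in `want` that are absent from `have` (names hypothetical)
fn missingFrom(
    alloc: std.mem.Allocator,
    have: []const []const u8,
    want: []const []const u8,
) !usize {
    var set = std.StringHashMap(void).init(alloc);
    defer set.deinit();
    for (have) |did| try set.put(did, {});
    var missing: usize = 0;
    for (want) |did| {
        if (!set.contains(did)) missing += 1;
    }
    return missing;
}

pub fn main() !void {
    const zlay = [_][]const u8{ "did:plc:aaa", "did:plc:bbb" };
    const network = [_][]const u8{ "did:plc:aaa", "did:plc:ccc" };
    const n = try missingFrom(std.heap.page_allocator, &zlay, &network);
    std.debug.print("missing={d}\n", .{n});
}
```

running it in both directions per collection gives the two numbers that matter: DIDs zlay missed and DIDs zlay has that the reference doesn't.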
longer term: full commit diff verification via MST inversion. zlay already handles `#sync` frames and validates signatures, but the inductive firehose check (`verifyCommitDiff`) isn't wired into the hot path yet. the primitives exist in zat — it's a throughput tradeoff.