atproto relay implementation in zig
zlay.waow.tech
zlay memory allocation audit#
linear RSS growth: ~290 MiB/hour, 0 → 3.5 GiB in 12h (~117 bytes/frame at 700 frames/sec)
all allocations use std.heap.c_allocator (glibc malloc) unless noted.
1. per-frame hot path (~700 frames/sec across 16 workers)#
each incoming firehose frame passes through: subscriber → thread pool → frame worker. the following allocations happen FOR EVERY FRAME.
1a. subscriber header decode arena#
- file: subscriber.zig:260-261
- what: `ArenaAllocator.init(sub.allocator)` - lightweight CBOR header+payload decode
- size: ~32KB chunk (default arena chunk size), actual used ~1-5KB
- freed: `defer arena.deinit()` - same function
- status: CLEAN
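The arena lifecycle in 1a (and 1c below) can be sketched as follows; a minimal illustration under assumed names (`handleFrame` is hypothetical, not the actual subscriber.zig code):

```zig
const std = @import("std");

// hypothetical sketch of the per-frame arena lifecycle: every decode
// temporary goes on a throwaway arena backed by c_allocator, and one
// deinit frees them all, so nothing on this path can leak per-frame.
fn handleFrame(data: []const u8) !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.c_allocator);
    defer arena.deinit(); // frees every arena allocation at once
    const a = arena.allocator();

    // any number of decode allocations; none needs an individual free
    const copy = try a.dupe(u8, data);
    _ = copy;
}
```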
1b. frame data dupe (subscriber → worker handoff)#
- file: subscriber.zig:341
- what: `sub.allocator.dupe(u8, data)` - copies raw frame bytes for async processing
- size: ~2-5KB per frame (full websocket message)
- freed: frame_worker.zig:34 `defer work.allocator.free(work.data)`
- status: CLEAN (also freed on backpressure: subscriber.zig:353)
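The ownership rule in 1b (dupe in the subscriber, free in exactly one of two places) can be sketched like this; names are illustrative, not the real subscriber/frame_worker code:

```zig
const std = @import("std");

// illustrative sketch of the subscriber → worker handoff: the producer
// dupes the frame, and exactly one of two paths frees it - the worker
// after processing, or the producer itself on backpressure.
const Work = struct { allocator: std.mem.Allocator, data: []u8 };

fn submit(alloc: std.mem.Allocator, data: []const u8, queue_full: bool) !void {
    const copy = try alloc.dupe(u8, data);
    if (queue_full) {
        // backpressure path: ownership never transferred, free here
        alloc.free(copy);
        return;
    }
    processWork(.{ .allocator = alloc, .data = copy });
}

fn processWork(work: Work) void {
    defer work.allocator.free(work.data); // worker owns and frees the copy
    // ... decode work.data ...
}
```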
1c. frame worker processing arena#
- file: frame_worker.zig:36-37
- what: `ArenaAllocator.init(work.allocator)` - full CBOR re-decode, multibase encode, resequence
- size: ~32KB chunk, actual used ~5-20KB (depends on frame complexity)
- freed: `defer arena.deinit()` - same function
- status: CLEAN
1d. persist data allocation#
- file: event_log.zig:592
- what: `self.allocator.alloc(u8, header_size + payload.len)` - 28-byte LE header + raw CBOR payload
- size: 28 + payload_len (~2-5KB)
- freed: event_log.zig:845-846 in `flushLocked()`: `for (self.evtbuf.items) |job| self.allocator.free(job.data)`
- flush trigger: every 400 events OR every 100ms (whichever comes first)
- status: CLEAN — but see note on outbuf/evtbuf below
1e. outbuf/evtbuf ArrayList growth#
- file: event_log.zig:99-100
- what: `outbuf: ArrayListUnmanaged(u8)` and `evtbuf: ArrayListUnmanaged(PersistJob)`
- growth: ArrayList uses a 2x growth factor. outbuf accumulates raw bytes, evtbuf accumulates job structs.
- freed: `clearRetainingCapacity()` on flush - backing memory is KEPT, only len reset to 0
- bounded: yes - max ~400 entries before flush. outbuf max ~400 × 5KB = ~2MB. backing array ~4MB max.
- status: CLEAN (bounded, retains capacity)
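The retained-capacity behaviour above can be demonstrated with a minimal sketch (stdlib `ArrayListUnmanaged`, illustrative buffer contents):

```zig
const std = @import("std");

// sketch of the bounded outbuf/evtbuf pattern: flush resets len but keeps
// the backing array, so capacity ratchets up to the high-water mark
// (~4MB in zlay's case) and then stops growing.
fn flushDemo(alloc: std.mem.Allocator) !void {
    var outbuf: std.ArrayListUnmanaged(u8) = .{};
    defer outbuf.deinit(alloc);

    var i: usize = 0;
    while (i < 3) : (i += 1) {
        try outbuf.appendSlice(alloc, "event-bytes");
        const cap_before = outbuf.capacity;
        outbuf.clearRetainingCapacity(); // len = 0, backing memory kept
        std.debug.assert(outbuf.capacity == cap_before);
        std.debug.assert(outbuf.items.len == 0);
    }
}
```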
1f. SharedFrame for broadcast#
- file: broadcaster.zig:50-59 (`SharedFrame.create`)
- what: allocates SharedFrame struct + dupes frame data
- size: sizeof(SharedFrame) (~48 bytes) + data.len (~2-5KB)
- freed: ref-counted. broadcaster releases immediately (`defer frame.release()`, line 372). each consumer releases after send (writeLoop line 231). freed when refcount hits 0 (lines 66-70).
- status: CLEAN (if no consumers, created+freed immediately)
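The ref-count scheme can be sketched as follows; field and method names are guesses, not the real broadcaster.zig:

```zig
const std = @import("std");

// hypothetical sketch of the SharedFrame scheme: create() starts the count
// at 1 for the broadcaster, each consumer retains, the last release frees.
const SharedFrame = struct {
    refs: std.atomic.Value(u32),
    allocator: std.mem.Allocator,
    data: []u8,

    fn create(alloc: std.mem.Allocator, bytes: []const u8) !*SharedFrame {
        const f = try alloc.create(SharedFrame);
        errdefer alloc.destroy(f);
        f.* = .{
            .refs = std.atomic.Value(u32).init(1),
            .allocator = alloc,
            .data = try alloc.dupe(u8, bytes),
        };
        return f;
    }

    fn retain(f: *SharedFrame) void {
        _ = f.refs.fetchAdd(1, .monotonic);
    }

    fn release(f: *SharedFrame) void {
        // release/acquire pairing so the freeing thread sees all writes
        if (f.refs.fetchSub(1, .release) == 1) {
            _ = f.refs.load(.acquire);
            f.allocator.free(f.data);
            f.allocator.destroy(f);
        }
    }
};
```

With zero consumers this reduces to create + the broadcaster's own release, matching the "created+freed immediately" note above.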
1g. ring buffer history entry#
- file: ring_buffer.zig:55 via broadcaster.zig:368
- what: `self.allocator.dupe(u8, data)` - copies frame data into history
- size: ~2-5KB per frame (resequenced CBOR)
- freed: on overwrite when buffer is full (ring_buffer.zig:60-62)
- bounded: 50,000 entries. once full, every push frees the oldest.
- steady state: 50K × ~3KB avg = ~150MB
- status: CLEAN (bounded)
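The overwrite-free discipline that bounds the history can be sketched with a toy ring (capacity shrunk from 50,000 to 4; illustrative, not ring_buffer.zig):

```zig
const std = @import("std");

// toy version of the bounded history ring: once full, every push frees
// the oldest duped frame first, so steady-state memory is
// capacity × average frame size.
const Ring = struct {
    allocator: std.mem.Allocator,
    slots: [4]?[]u8 = .{ null, null, null, null },
    next: usize = 0,

    fn push(self: *Ring, data: []const u8) !void {
        if (self.slots[self.next]) |old| self.allocator.free(old); // evict oldest
        self.slots[self.next] = try self.allocator.dupe(u8, data);
        self.next = (self.next + 1) % self.slots.len;
    }

    fn deinit(self: *Ring) void {
        for (self.slots) |s| if (s) |buf| self.allocator.free(buf);
    }
};
```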
1h. resequenceFrame temporaries#
- file: broadcaster.zig:117-156, called from frame_worker.zig:232
- what: CBOR decode + re-encode with new seq. allocates ArrayList, CBOR encode buffers, result slice.
- size: ~5-10KB temporaries
- freed: allocated on frame worker's arena (1c), freed when arena deinits
- status: CLEAN
1i. collection index keys (per commit with ops)#
- file: collection_index.zig:79-82 via trackCommitOps → addCollection
- what: `makeKey()` allocates 2 keys: `collection\0did` and `did\0collection`
- size: ~100-200 bytes per key pair
- freed: `defer self.allocator.free(rbc_key)` / `defer self.allocator.free(cbr_key)` - same function
- status: CLEAN
1j. postgres queries per frame (via pg.zig)#
each frame triggers 3-5 postgres queries. each query allocates internally:
| query | file | method | per-frame? |
|---|---|---|---|
| DID→UID lookup | event_log.zig:294 | rowUnsafe | yes (cache miss only) |
| DID→UID create | event_log.zig:306 | exec | yes (first encounter only) |
| DID→UID readback | event_log.zig:315 | rowUnsafe | yes (first encounter only) |
| isAccountActive | event_log.zig:398 | rowUnsafe | yes (commits/syncs) |
| getAccountState | event_log.zig:339 | rowUnsafe | yes (commits, for rev check) |
| updateAccountState | event_log.zig:358 | exec | yes (validated commits) |
| updateUpstreamStatus | event_log.zig:389 | exec | yes (#account events) |
| updateHostSeq | event_log.zig:458 | exec | every 4s per host |
| getAccountHostId | event_log.zig:369 | rowUnsafe | yes (uidForDidFromHost) |
pg.zig internals (from karlseguin/pg.zig):
- `Pool.rowUnsafe()`: acquires a connection from the pool, creates a `Result` with its own ArenaAllocator
- `Result.deinit()`: releases the connection back to the pool, destroys the arena
- `QueryRowUnsafe.deinit()`: calls `result.drain()` THEN `result.deinit()`
POTENTIAL ISSUE - pg.zig deinit error path:

```zig
// pg.zig QueryRowUnsafe.deinit():
pub fn deinit(self: *QueryRowUnsafe) !void {
    try self.result.drain(); // if this errors...
    self.result.deinit(); // ...this NEVER runs → arena leak + connection leak
}
```

zlay callers use `defer row.deinit() catch {};` - if `drain()` errors, the Result arena and connection are LEAKED. `drain()` errors on network failure to postgres.
- likelihood: low per-frame (postgres is local), but at 2,000+ queries/sec, even 0.01% failure = 0.2 leaks/sec
- size per leak: ~1-4KB (pg arena) + 1 connection slot
- impact at 0.01%: ~0.7 MB/hour (doesn't explain 290 MB/hour)
- impact at 0.1%: ~7 MB/hour (partial explanation?)
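The failure mode can be reproduced in isolation. A self-contained sketch (hypothetical `Resource` type, not pg.zig itself) showing why a `try` before cleanup skips the cleanup on error, and the safe shape:

```zig
const std = @import("std");

// demonstrates the pg.zig deinit hazard: a fallible step before cleanup
// must not be allowed to skip the cleanup. `Resource` is a stand-in.
const Resource = struct {
    freed: bool = false,

    fn drain(self: *Resource, fail: bool) !void {
        _ = self;
        if (fail) return error.NetworkFailure;
    }

    // leaky shape (mirrors pg.zig): `try` before cleanup skips it on error
    fn deinitLeaky(self: *Resource, fail: bool) !void {
        try self.drain(fail);
        self.freed = true;
    }

    // safe shape: cleanup runs unconditionally
    fn deinitSafe(self: *Resource, fail: bool) void {
        self.drain(fail) catch {};
        self.freed = true;
    }
};
```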
1k. validator arenas (per validated commit/sync)#
- file: validator.zig:161-163 (validateSync) and validator.zig:242-244 (verifyCommit)
- what: `ArenaAllocator.init(self.allocator)` for CAR parsing + signature verification
- size: depends on CAR size. typical commit: ~5-50KB used.
- freed: `defer arena.deinit()` - same function
- note: these were NOT changed in the page_allocator experiment
- status: CLEAN (assuming arena.deinit works correctly)
1l. getAccountState string dupes#
- file: event_log.zig:348-349
- what: `allocator.dupe(u8, rev)` and `allocator.dupe(u8, data_cid)` - copies from pg result into caller's arena
- size: ~20-50 bytes each (TID rev + multibase CID)
- freed: allocated on frame worker's arena (1c), freed when arena deinits
- status: CLEAN
2. per-connection allocations (~2,750 PDS hosts)#
2a. subscriber struct + hostname#
- file: slurper.zig:426-427, 423
- what: `allocator.create(Subscriber)` + `allocator.dupe(u8, hostname)`
- size: sizeof(Subscriber) + hostname (~20-50 bytes)
- freed: slurper.zig:468-469 (runWorker cleanup on exit)
- status: CLEAN (bounded by host count)
2b. websocket.Client per connection#
- file: subscriber.zig:220-227
- what: `websocket.Client.init(self.allocator, ...)` - TLS buffers, read buffers
- size: TLS handshake buffers (~134KB), static read buffer (4096 bytes)
- freed: `defer client.deinit()` - subscriber.zig:227
- status: CLEAN (created/freed per connection attempt)
2c. websocket dynamic read buffers#
- what: for messages > 4096 bytes, BufferProvider allocates a dynamic buffer
- freed: `reader.done()` → `restoreStatic()` → `provider.release()` → `allocator.free()`
- status: CLEAN (freed after each message)
2d. shared TLS CA bundle#
- file: slurper.zig:251-253
- what: `bundle.rescan(self.allocator)` - loads system CA certs
- size: ~100-200KB (one-time)
- freed: slurper.zig:575 `b.deinit(self.allocator)`
- status: CLEAN (one-time, shared)
2e. thread stacks#
- what: each subscriber thread gets an 8MB virtual stack (`default_stack_size`)
- count: ~2,750 subscriber threads + 16 worker threads + 4 resolver threads + 3 background threads = ~2,773 threads
- total virtual: ~22 GiB (but only touched pages count as RSS)
- status: BOUNDED (pages are returned when threads exit)
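If the virtual footprint ever needs shrinking, `std.Thread.SpawnConfig` accepts an explicit stack size; a sketch (worker body hypothetical):

```zig
const std = @import("std");

// sketch: the 8MB per-thread stack noted above is a virtual reservation;
// passing an explicit stack_size in SpawnConfig shrinks it. RSS still
// counts only touched pages either way.
fn worker() void {
    // subscriber loop would live here
}

fn spawnSmallStack() !void {
    const t = try std.Thread.spawn(.{ .stack_size = 256 * 1024 }, worker, .{});
    t.join();
}
```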
3. bounded caches and data structures#
3a. DID → UID cache (LRU)#
- file: event_log.zig:96, lru.zig
- capacity: 500,000 entries
- per entry: duped key string (~30 bytes DID) + Node struct (~80 bytes) + u64 value + HashMap entry
- steady state: ~55-65 MB
- status: BOUNDED (evicts LRU on put when full)
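The eviction discipline that bounds these caches can be sketched with a toy LRU (illustrative; zlay's lru.zig will differ, and this sketch skips move-to-front on hit):

```zig
const std = @import("std");

// toy LRU shape: a put() on a full cache evicts the oldest entry and
// frees its duped key, which is what caps entry count - and memory -
// at capacity.
const Lru = struct {
    allocator: std.mem.Allocator,
    capacity: usize,
    map: std.StringHashMapUnmanaged(u64) = .{},
    order: std.ArrayListUnmanaged([]u8) = .{}, // index 0 = oldest

    fn put(self: *Lru, key: []const u8, value: u64) !void {
        if (self.map.get(key) != null) return; // real impl: move to front
        if (self.map.count() >= self.capacity) {
            const oldest = self.order.orderedRemove(0);
            _ = self.map.remove(oldest);
            self.allocator.free(oldest); // duped key freed on eviction
        }
        const owned = try self.allocator.dupe(u8, key);
        try self.map.put(self.allocator, owned, value);
        try self.order.append(self.allocator, owned);
    }

    fn deinit(self: *Lru) void {
        for (self.order.items) |k| self.allocator.free(k);
        self.order.deinit(self.allocator);
        self.map.deinit(self.allocator);
    }
};
```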
3b. validator signing key cache (LRU)#
- file: validator.zig:50, lru.zig
- capacity: 250,000 entries
- per entry: duped key string (~30 bytes DID) + Node struct (~80 bytes) + CachedKey (fixed 42 bytes) + HashMap entry
- steady state: ~30-40 MB
- status: BOUNDED (evicts LRU on put when full)
3c. resolve queue#
- file: validator.zig:52-54
- what: `queue: ArrayListUnmanaged([]const u8)` + `queued_set: StringHashMapUnmanaged(void)`
- per entry: duped DID string (~30 bytes) + HashMap entry
- max size: 100,000 (validator.zig:63)
- freed: DID freed after resolution (validator.zig:419)
- status: BOUNDED
3d. workers map#
- file: slurper.zig:217
- what: `std.AutoHashMapUnmanaged(u64, WorkerEntry)` - host_id → thread+subscriber
- size: bounded by host count (~2,750)
- status: BOUNDED
3e. crawl request queue#
- file: slurper.zig:221
- what: `ArrayListUnmanaged([]const u8)` - duped hostnames waiting for processing
- freed: slurper.zig:522 `defer self.allocator.free(h)` after processing
- status: BOUNDED (grows slowly, processed continuously)
3f. frame pool ring buffers#
- file: thread_pool.zig
- what: pre-allocated ring buffers per worker. zero alloc per submit.
- size: 16 workers × 4096 capacity × sizeof(FrameWork) per entry
- status: FIXED SIZE (no growth)
3g. consumer list#
- file: broadcaster.zig:291
- what: `ArrayListUnmanaged(*Consumer)` - active downstream WebSocket consumers
- per consumer: Consumer struct (~66KB for 8192-entry SharedFrame pointer buffer) + write thread
- status: BOUNDED by connected consumers (typically 0-10)
4. API handler allocations (cold paths)#
4a. handleAdminListHosts — ALLOCATES ON c_allocator#
- file: api/admin.zig:89-99
- what: `persist.listAllHosts(persist.allocator)` - dupes hostname+status for each host
- freed: defer block frees all hostname/status strings + slice
- status: CLEAN
4b. handleBan — LEAKS (cold path)#
- file: api/admin.zig:70-78
- what: `buildAccountFrame(ctx.persist.allocator, did)` → allocates CBOR frame on c_allocator
- what: `broadcaster.resequenceFrame(ctx.persist.allocator, frame_bytes, relay_seq)` → allocates resequenced frame on c_allocator
- freed: NEITHER frame_bytes NOR broadcast_data is freed after broadcast
- size: ~200-500 bytes per ban
- status: LEAK — but cold path (admin-only, negligible impact)
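A fix is two defers; a self-contained sketch with the two builders stubbed (`banAndBroadcast` and the stub bodies are hypothetical, the real functions live in api/admin.zig and broadcaster.zig):

```zig
const std = @import("std");

// stand-ins for the real CBOR frame builder and resequencer
fn buildAccountFrame(alloc: std.mem.Allocator, did: []const u8) ![]u8 {
    return alloc.dupe(u8, did);
}
fn resequenceFrame(alloc: std.mem.Allocator, frame: []const u8, seq: u64) ![]u8 {
    _ = seq;
    return alloc.dupe(u8, frame);
}

// sketch of the patched ban path: both c_allocator buffers now have
// matching frees once the broadcast has copied/ref-counted the bytes.
fn banAndBroadcast(alloc: std.mem.Allocator, did: []const u8, relay_seq: u64) !void {
    const frame_bytes = try buildAccountFrame(alloc, did);
    defer alloc.free(frame_bytes); // this free was missing
    const broadcast_data = try resequenceFrame(alloc, frame_bytes, relay_seq);
    defer alloc.free(broadcast_data); // this free was missing
    // ... broadcaster.broadcast(broadcast_data) would run here ...
}
```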
4c. handleAdminBackfillStatus#
- file: api/admin.zig:197-201
- what: `backfiller.getStatus(backfiller.allocator)` → builds JSON string on c_allocator
- freed: `defer backfiller.allocator.free(body)`
- status: CLEAN
4d. xrpc handlers (listRepos, getRepoStatus, etc.)#
- what: all use stack-allocated fixed buffers (65536 bytes) or pg query iteration
- no heap allocation in the handler code itself
- pg queries: each creates/destroys internal pg arenas (same pattern as 1j)
- status: CLEAN
4e. handleRequestCrawl#
- file: api/xrpc.zig:429
- what: `std.json.parseFromSlice(...)` on `slurper.allocator` - parses JSON body
- freed: `defer parsed.deinit()`
- what: `validateHostname(slurper.allocator, ...)` - allocates normalized hostname
- freed: `defer slurper.allocator.free(hostname)`
- what: `slurper.addCrawlRequest(hostname)` → dupes hostname
- freed: by crawl processor (slurper.zig:522)
- status: CLEAN
5. one-time startup allocations#
| what | file | size |
|---|---|---|
| Broadcaster struct | main.zig:140 | ~400KB (history array) |
| Validator struct | main.zig:143 | ~100 bytes |
| DiskPersist (+ pg pool) | main.zig:149 | ~10KB + pg connections |
| CollectionIndex (RocksDB) | main.zig:169 | RocksDB internal (~50-100MB) |
| Backfiller struct | main.zig:176 | ~100 bytes |
| CA bundle | slurper.zig:251 | ~100-200KB |
| Frame pool | slurper.zig:257 | pre-allocated ring buffers |
| Error frame (CBOR) | broadcaster.zig:306 | ~50 bytes |
| MetricsServer struct | main.zig:216 | ~100 bytes |
| build_options module | (compile-time) | ~100 bytes |
6. backfill allocations (when running)#
6a. discoverCollections#
- file: backfill.zig:101-148
- what: fetches lexicon garden llms.txt + RBC scan, deduplicates
- freed: all temporaries freed via defer blocks
- status: CLEAN (only runs when admin triggers backfill)
6b. backfillCollection per-page#
- file: backfill.zig:269-318
- what: per-page: HTTP fetch → JSON parse → dupe DIDs → addCollection to RocksDB
- freed: all duped strings freed in defer blocks
- http client: reused across pages for one collection, freed after collection done
- status: CLEAN
7. dependency internal allocations#
7a. RocksDB (via rocksdb-zig)#
- internal memory managed by RocksDB C library (block cache, memtables, etc.)
- not tracked by c_allocator — uses its own allocator
- bounded by RocksDB options (write_buffer_size, max_open_files, block_cache_size)
7b. pg.zig connection pool#
- file: initialized in event_log.zig:132 with `size = 5`
- what: 5 connections, each with internal read/write buffers
- status: BOUNDED (5 connections)
7c. zat.DidResolver (per resolver thread)#
- file: validator.zig:401
- what: `zat.DidResolver.init(self.allocator)` - creates HTTP client for DID resolution
- long-lived: one per resolver thread (4 total), lives until shutdown
- internal: likely holds std.http.Client with connection pool
- potential issue: if std.http.Client pools connections to many unique hosts (PLC server, PDS endpoints), the pool could grow. but PLC is typically one host (plc.directory).
7d. zat CBOR decode#
- all CBOR decode operations are on arena allocators (subscriber arena or worker arena)
- freed when arena deinits
- status: CLEAN
8. summary of potential issues#
confirmed leak (cold path, negligible):#
- admin handleBan: leaks frame_bytes + broadcast_data (~500 bytes per ban)
potential leak under network errors:#
- pg.zig QueryRowUnsafe.deinit(): if drain() errors, Result arena + connection leak
- at 2,000+ queries/sec, even rare failures accumulate
- needs investigation: check postgres error rate in logs
fragmentation concerns:#
- ~700 frames/sec × ~10 alloc/free cycles per frame = ~7,000 alloc/free operations per second on c_allocator
- many different sizes (28-byte headers, 2-5KB frames, 100-byte keys, 1-4KB pg arenas)
- glibc with MALLOC_ARENA_MAX=2 concentrates fragmentation into 2 arenas
- mallinfo() reports only the MAIN arena, so the second arena is invisible to it (mallinfo2, glibc 2.33+, fixes the integer overflow but is likewise main-arena-only)
- malloc_trim(0) trimmed only the main arena before glibc 2.8; since then it walks all arenas, but it can only release fully free pages, so fragmented pages holding any live data stay resident
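For probing this from the process itself, glibc's coarser introspection calls do cover all arenas; a sketch (assumes a linux target with glibc and libc linked):

```zig
const std = @import("std");

// glibc-only sketch: unlike mallinfo(), malloc_stats() and malloc_info()
// report ALL arenas, so they are better probes for the
// invisible-second-arena problem described above.
const c = @cImport(@cInclude("malloc.h"));

fn dumpGlibcHeap() void {
    c.malloc_stats(); // prints per-arena in-use/system bytes to stderr
    const released = c.malloc_trim(0); // non-zero if any memory was returned
    std.debug.print("malloc_trim released memory: {}\n", .{released != 0});
}
```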
items NOT investigated:#
- zat library internals (CBOR allocator patterns, DID resolver HTTP client connection pooling)
- rocksdb-zig binding allocations (WriteBatch, Iterator internal state)
- std.http.Client internal connection/TLS buffer retention within zat.DidResolver
9. per-frame allocation count summary#
for a typical validated #commit frame, approximately:
| step | allocs | frees | net | size |
|---|---|---|---|---|
| subscriber arena init | 1 chunk | 0 | +1 | ~32KB |
| subscriber CBOR decode | ~5 | 0 | +5 | ~2KB |
| subscriber arena deinit | 0 | 1 chunk | -1 | ~32KB |
| frame data dupe | 1 | 0 | +1 | ~3KB |
| worker arena init | 1 chunk | 0 | +1 | ~32KB |
| worker CBOR re-decode | ~5 | 0 | +5 | ~2KB |
| pg: uidForDid (cache hit) | 0 | 0 | 0 | 0 |
| pg: isAccountActive | 1 arena | 1 arena | 0 | ~2KB |
| pg: getAccountState | 1 arena | 1 arena | 0 | ~2KB |
| multibase encode | 1 | 0 | +1 | ~50B |
| validator arena init | 1 chunk | 0 | +1 | ~32KB |
| verifyCommitCar | ~10 | 0 | +10 | ~20KB |
| validator arena deinit | 0 | 1 chunk | -1 | ~32KB |
| persist data alloc | 1 | 0 | +1 | ~3KB |
| resequenceFrame | ~5 | 0 | +5 | ~5KB |
| SharedFrame.create | 2 (struct+data) | 0 | +2 | ~3KB |
| ring buffer dupe | 1 | 1 (overwrite) | 0 | ~3KB |
| collection index keys | 2 | 2 | 0 | ~200B |
| worker arena deinit | 0 | 1 chunk | -1 | ~32KB |
| frame data free | 0 | 1 | -1 | ~3KB |
| persist flush (batched) | 0 | 1 | -1 | ~3KB |
| SharedFrame release | 0 | 2 | -2 | ~3KB |
| TOTAL per frame | ~37 | ~12 | 0 | 0 |
every allocation has a matching free (the arena deinits release several decode allocations in one chunk free, hence the asymmetric alloc/free counts above). yet RSS grows linearly. the remaining hypotheses are:
- glibc malloc fragmentation in the per-thread arenas (invisible to mallinfo/malloc_trim)
- a leak in a dependency (pg.zig error path, zat internals, rocksdb-zig)
- a leak we haven't found in the zig code
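One way to separate the third hypothesis from the first two: in a debug build, swap `std.heap.c_allocator` for a leak-checking allocator so zig itself reports any missed free with a stack trace on shutdown. A minimal sketch (not wired into the zlay tree):

```zig
const std = @import("std");

// sketch: GeneralPurposeAllocator tracks every live allocation and
// deinit() returns .leak if anything was not freed - fragmentation in
// glibc would show no leaks here while RSS still grows.
fn runWithLeakCheck() !bool {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const alloc = gpa.allocator();

    const buf = try alloc.alloc(u8, 64);
    alloc.free(buf); // comment this out and deinit() returns .leak

    return gpa.deinit() == .leak;
}
```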