atproto relay implementation in zig (zlay.waow.tech)

zlay memory allocation audit#

linear RSS growth: ~290 MiB/hour, 0 → 3.5 GiB in 12h (~117 bytes/frame at 700 frames/sec)

all allocations use std.heap.c_allocator (glibc malloc) unless noted.


1. per-frame hot path (~700 frames/sec across 16 workers)#

each incoming firehose frame passes through: subscriber → thread pool → frame worker. the following allocations happen FOR EVERY FRAME.

1a. subscriber header decode arena#

  • file: subscriber.zig:260-261
  • what: ArenaAllocator.init(sub.allocator) — lightweight CBOR header+payload decode
  • size: ~32KB chunk (default arena chunk size), actual used ~1-5KB
  • freed: defer arena.deinit() — same function
  • status: CLEAN
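
the pattern in 1a (and 1c) reduces to this shape. a minimal sketch, not zlay's actual decode code; the dupe stands in for the real CBOR work:

```zig
const std = @import("std");

// one arena per frame: every decode allocation lands in the arena,
// and a single deinit frees all of it on the way out.
fn handleFrame(gpa: std.mem.Allocator, data: []const u8) !usize {
    var arena = std.heap.ArenaAllocator.init(gpa);
    defer arena.deinit(); // frees all per-frame allocations at once
    const a = arena.allocator();

    // stand-in for the header+payload CBOR decode
    const header = try a.dupe(u8, data[0..@min(9, data.len)]);
    return header.len;
}
```

the point of the pattern: no matter how many temporaries the decode makes, the alloc/free count seen by the backing allocator is one chunk in, one chunk out.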

1b. frame data dupe (subscriber → worker handoff)#

  • file: subscriber.zig:341
  • what: sub.allocator.dupe(u8, data) — copies raw frame bytes for async processing
  • size: ~2-5KB per frame (full websocket message)
  • freed: frame_worker.zig:34 defer work.allocator.free(work.data)
  • status: CLEAN (also freed on backpressure: subscriber.zig:353)

1c. frame worker processing arena#

  • file: frame_worker.zig:36-37
  • what: ArenaAllocator.init(work.allocator) — full CBOR re-decode, multibase encode, resequence
  • size: ~32KB chunk, actual used ~5-20KB (depends on frame complexity)
  • freed: defer arena.deinit() — same function
  • status: CLEAN

1d. persist data allocation#

  • file: event_log.zig:592
  • what: self.allocator.alloc(u8, header_size + payload.len) — 28-byte LE header + raw CBOR payload
  • size: 28 + payload_len (~2-5KB)
  • freed: event_log.zig:845-846 in flushLocked(): for (self.evtbuf.items) |job| self.allocator.free(job.data)
  • flush trigger: every 400 events OR every 100ms (whichever comes first)
  • status: CLEAN — but see note on outbuf/evtbuf below
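
the allocation in 1d looks roughly like this sketch. only the "28-byte LE header + raw CBOR payload" layout is from the audit; the specific header fields (seq, length) are assumptions for illustration:

```zig
const std = @import("std");

// one buffer per event: 28-byte little-endian header followed by the
// raw CBOR payload. the buffer outlives the frame worker and is freed
// later in flushLocked(), so it cannot live on the per-frame arena.
fn makePersistData(gpa: std.mem.Allocator, seq: u64, payload: []const u8) ![]u8 {
    const header_size = 28;
    const buf = try gpa.alloc(u8, header_size + payload.len);
    @memset(buf[0..header_size], 0);
    // hypothetical header fields, not zlay's actual layout:
    std.mem.writeInt(u64, buf[0..8], seq, .little);
    std.mem.writeInt(u32, buf[8..12], @intCast(payload.len), .little);
    @memcpy(buf[header_size..], payload);
    return buf;
}
```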

1e. outbuf/evtbuf ArrayList growth#

  • file: event_log.zig:99-100
  • what: outbuf: ArrayListUnmanaged(u8) and evtbuf: ArrayListUnmanaged(PersistJob)
  • growth: ArrayList uses 2x growth factor. outbuf accumulates raw bytes, evtbuf accumulates job structs.
  • freed: clearRetainingCapacity() on flush — backing memory is KEPT, only len reset to 0
  • bounded: yes — max ~400 entries before flush. outbuf max ~400 × 5KB = ~2MB. backing array ~4MB max.
  • status: CLEAN (bounded, retains capacity)
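
the flush pattern in 1e, sketched with a simplified stand-in for PersistJob:

```zig
const std = @import("std");

const PersistJob = struct { data: []u8 }; // simplified stand-in

// free each job's data, then clearRetainingCapacity() resets len to 0
// but KEEPS the backing array, so list memory is bounded by the
// high-water mark (~400 entries), not by cumulative throughput.
fn flush(gpa: std.mem.Allocator, evtbuf: *std.ArrayListUnmanaged(PersistJob)) void {
    for (evtbuf.items) |job| gpa.free(job.data);
    evtbuf.clearRetainingCapacity(); // capacity kept, len = 0
}
```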

1f. SharedFrame for broadcast#

  • file: broadcaster.zig:50-59 (SharedFrame.create)
  • what: allocates SharedFrame struct + dupes frame data
  • size: sizeof(SharedFrame) (~48 bytes) + data.len (~2-5KB)
  • freed: ref-counted. broadcaster releases immediately (defer frame.release() line 372). each consumer releases after send (writeLoop line 231). freed when refcount hits 0 (line 66-70).
  • status: CLEAN (if no consumers, created+freed immediately)
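
the refcounting lifecycle in 1f, as a minimal sketch. field names and orderings are assumptions; zlay's actual SharedFrame may differ:

```zig
const std = @import("std");

const SharedFrame = struct {
    refs: std.atomic.Value(u32),
    data: []u8,
    gpa: std.mem.Allocator,

    // +1 for the broadcaster's own reference, released after fan-out;
    // with zero consumers the broadcaster's release frees immediately.
    fn create(gpa: std.mem.Allocator, data: []const u8, consumers: u32) !*SharedFrame {
        const self = try gpa.create(SharedFrame);
        errdefer gpa.destroy(self);
        self.* = .{
            .refs = std.atomic.Value(u32).init(consumers + 1),
            .data = try gpa.dupe(u8, data),
            .gpa = gpa,
        };
        return self;
    }

    fn release(self: *SharedFrame) void {
        // last release (previous value 1) frees struct + data
        if (self.refs.fetchSub(1, .release) == 1) {
            _ = self.refs.load(.acquire);
            self.gpa.free(self.data);
            self.gpa.destroy(self);
        }
    }
};
```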

1g. ring buffer history entry#

  • file: ring_buffer.zig:55 via broadcaster.zig:368
  • what: self.allocator.dupe(u8, data) — copies frame data into history
  • size: ~2-5KB per frame (resequenced CBOR)
  • freed: on overwrite when buffer is full (ring_buffer.zig:60-62)
  • bounded: 50,000 entries. once full, every push frees the oldest.
  • steady state: 50K × ~3KB avg = ~150MB
  • status: CLEAN (bounded)
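
the bounded-history mechanism in 1g, sketched: once the ring is full, every push frees the entry it overwrites, so steady state is capped at capacity × average frame size:

```zig
const std = @import("std");

const HistoryRing = struct {
    entries: [][]u8,
    head: usize = 0,
    len: usize = 0,
    gpa: std.mem.Allocator,

    fn init(gpa: std.mem.Allocator, capacity: usize) !HistoryRing {
        return .{ .entries = try gpa.alloc([]u8, capacity), .gpa = gpa };
    }

    fn deinit(self: *HistoryRing) void {
        for (self.entries[0..self.len]) |e| self.gpa.free(e);
        self.gpa.free(self.entries);
    }

    fn push(self: *HistoryRing, data: []const u8) !void {
        const copy = try self.gpa.dupe(u8, data);
        if (self.len == self.entries.len) {
            self.gpa.free(self.entries[self.head]); // full: free the oldest
        } else {
            self.len += 1;
        }
        self.entries[self.head] = copy;
        self.head = (self.head + 1) % self.entries.len;
    }
};
```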

1h. resequenceFrame temporaries#

  • file: broadcaster.zig:117-156, called from frame_worker.zig:232
  • what: CBOR decode + re-encode with new seq. allocates ArrayList, CBOR encode buffers, result slice.
  • size: ~5-10KB temporaries
  • freed: allocated on frame worker's arena (1c), freed when arena deinits
  • status: CLEAN

1i. collection index keys (per commit with ops)#

  • file: collection_index.zig:79-82 via trackCommitOps → addCollection
  • what: makeKey() allocates 2 keys: collection\0did and did\0collection
  • size: ~100-200 bytes per key pair
  • freed: defer self.allocator.free(rbc_key) / defer self.allocator.free(cbr_key) — same function
  • status: CLEAN

1j. postgres queries per frame (via pg.zig)#

each frame triggers 3-5 postgres queries. each query allocates internally:

| query | file | method | per-frame? |
| --- | --- | --- | --- |
| DID→UID lookup | event_log.zig:294 | rowUnsafe | yes (cache miss only) |
| DID→UID create | event_log.zig:306 | exec | yes (first encounter only) |
| DID→UID readback | event_log.zig:315 | rowUnsafe | yes (first encounter only) |
| isAccountActive | event_log.zig:398 | rowUnsafe | yes (commits/syncs) |
| getAccountState | event_log.zig:339 | rowUnsafe | yes (commits, for rev check) |
| updateAccountState | event_log.zig:358 | exec | yes (validated commits) |
| updateUpstreamStatus | event_log.zig:389 | exec | yes (#account events) |
| updateHostSeq | event_log.zig:458 | exec | every 4s per host |
| getAccountHostId | event_log.zig:369 | rowUnsafe | yes (uidForDidFromHost) |

pg.zig internals (from karlseguin/pg.zig):

  • Pool.rowUnsafe(): acquires connection from pool, creates a Result with its own ArenaAllocator
  • Result.deinit(): releases connection back to pool, destroys arena
  • QueryRowUnsafe.deinit(): calls result.drain() THEN result.deinit()

POTENTIAL ISSUE — pg.zig deinit error path:

```zig
// pg.zig QueryRowUnsafe.deinit():
pub fn deinit(self: *QueryRowUnsafe) !void {
    try self.result.drain();   // if this errors...
    self.result.deinit();      // ...this NEVER runs → arena leak + connection leak
}
```

zlay callers use defer row.deinit() catch {}; — if drain() errors, the Result arena and connection are LEAKED. drain() errors on network failure to postgres.

  • likelihood: low per-frame (postgres is local), but at 2,000+ queries/sec, even 0.01% failure = 0.2 leaks/sec
  • size per leak: ~1-4KB (pg arena) + 1 connection slot
  • impact at 0.01%: ~0.7 MB/hour (doesn't explain 290 MB/hour)
  • impact at 0.1%: ~7 MB/hour (partial explanation?)
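
one possible hardening on the zlay side: run drain() and deinit() as separate steps so the arena and connection are released even when drain() fails. method names follow the audit's description of pg.zig above; verify against the actual library API before use:

```zig
const std = @import("std");

// run drain() and deinit() separately instead of `deinit() catch {}`,
// so a drain error no longer skips the cleanup step.
fn closeRow(row: anytype) void {
    row.result.drain() catch |err| {
        // log so the postgres error rate becomes visible in the logs
        std.log.warn("pg drain failed: {}", .{err});
    };
    row.result.deinit(); // always destroys the arena, returns the connection
}
```

logging the drain failures also answers the "check postgres error rate" question in section 8 directly.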

1k. validator arenas (per validated commit/sync)#

  • file: validator.zig:161-163 (validateSync) and validator.zig:242-244 (verifyCommit)
  • what: ArenaAllocator.init(self.allocator) for CAR parsing + signature verification
  • size: depends on CAR size. typical commit: ~5-50KB used.
  • freed: defer arena.deinit() — same function
  • note: these were NOT changed in the page_allocator experiment
  • status: CLEAN (assuming arena.deinit works correctly)

1l. getAccountState string dupes#

  • file: event_log.zig:348-349
  • what: allocator.dupe(u8, rev) and allocator.dupe(u8, data_cid) — copies from pg result into caller's arena
  • size: ~20-50 bytes each (TID rev + multibase CID)
  • freed: allocated on frame worker's arena (1c), freed when arena deinits
  • status: CLEAN

2. per-connection allocations (~2,750 PDS hosts)#

2a. subscriber struct + hostname#

  • file: slurper.zig:426-427, 423
  • what: allocator.create(Subscriber) + allocator.dupe(u8, hostname)
  • size: sizeof(Subscriber) + hostname (~20-50 bytes)
  • freed: slurper.zig:468-469 (runWorker cleanup on exit)
  • status: CLEAN (bounded by host count)

2b. websocket.Client per connection#

  • file: subscriber.zig:220-227
  • what: websocket.Client.init(self.allocator, ...) — TLS buffers, read buffers
  • size: TLS handshake buffers (~134KB), static read buffer (4096 bytes)
  • freed: defer client.deinit() — subscriber.zig:227
  • status: CLEAN (created/freed per connection attempt)

2c. websocket dynamic read buffers#

  • what: for messages > 4096 bytes, BufferProvider allocates a dynamic buffer
  • freed: reader.done() → restoreStatic() → provider.release() → allocator.free()
  • status: CLEAN (freed after each message)

2d. shared TLS CA bundle#

  • file: slurper.zig:251-253
  • what: bundle.rescan(self.allocator) — loads system CA certs
  • size: ~100-200KB (one-time)
  • freed: slurper.zig:575 b.deinit(self.allocator)
  • status: CLEAN (one-time, shared)

2e. thread stacks#

  • what: each subscriber thread gets 8MB virtual stack (default_stack_size)
  • count: ~2,750 subscriber threads + 16 worker threads + 4 resolver threads + 3 background threads = ~2,773 threads
  • total virtual: ~22 GiB (but only touched pages count as RSS)
  • status: BOUNDED (pages are returned when threads exit)

3. bounded caches and data structures#

3a. DID → UID cache (LRU)#

  • file: event_log.zig:96, lru.zig
  • capacity: 500,000 entries
  • per entry: duped key string (~30 bytes DID) + Node struct (~80 bytes) + u64 value + HashMap entry
  • steady state: ~55-65 MB
  • status: BOUNDED (evicts LRU on put when full)

3b. validator signing key cache (LRU)#

  • file: validator.zig:50, lru.zig
  • capacity: 250,000 entries
  • per entry: duped key string (~30 bytes DID) + Node struct (~80 bytes) + CachedKey (fixed 42 bytes) + HashMap entry
  • steady state: ~30-40 MB
  • status: BOUNDED (evicts LRU on put when full)

3c. resolve queue#

  • file: validator.zig:52-54
  • what: queue: ArrayListUnmanaged([]const u8) + queued_set: StringHashMapUnmanaged(void)
  • per entry: duped DID string (~30 bytes) + HashMap entry
  • max size: 100,000 (validator.zig:63)
  • freed: DID freed after resolution (validator.zig:419)
  • status: BOUNDED

3d. workers map#

  • file: slurper.zig:217
  • what: std.AutoHashMapUnmanaged(u64, WorkerEntry) — host_id → thread+subscriber
  • size: bounded by host count (~2,750)
  • status: BOUNDED

3e. crawl request queue#

  • file: slurper.zig:221
  • what: ArrayListUnmanaged([]const u8) — duped hostnames waiting for processing
  • freed: slurper.zig:522 defer self.allocator.free(h) after processing
  • status: BOUNDED (grows slowly, processed continuously)

3f. frame pool ring buffers#

  • file: thread_pool.zig
  • what: pre-allocated ring buffers per worker. zero alloc per submit.
  • size: 16 workers × 4096 capacity × sizeof(FrameWork) per entry
  • status: FIXED SIZE (no growth)

3g. consumer list#

  • file: broadcaster.zig:291
  • what: ArrayListUnmanaged(*Consumer) — active downstream WebSocket consumers
  • per consumer: Consumer struct (~66KB for 8192-entry SharedFrame pointer buffer) + write thread
  • status: BOUNDED by connected consumers (typically 0-10)

4. API handler allocations (cold paths)#

4a. handleAdminListHosts — ALLOCATES ON c_allocator#

  • file: api/admin.zig:89-99
  • what: persist.listAllHosts(persist.allocator) — dupes hostname+status for each host
  • freed: defer block frees all hostname/status strings + slice
  • status: CLEAN

4b. handleBan — LEAKS (cold path)#

  • file: api/admin.zig:70-78
  • what: buildAccountFrame(ctx.persist.allocator, did) → allocates CBOR frame on c_allocator
  • what: broadcaster.resequenceFrame(ctx.persist.allocator, frame_bytes, relay_seq) → allocates resequenced frame on c_allocator
  • freed: NEITHER frame_bytes NOR broadcast_data is freed after broadcast
  • size: ~200-500 bytes per ban
  • status: LEAK — but cold path (admin-only, negligible impact)

4c. handleAdminBackfillStatus#

  • file: api/admin.zig:197-201
  • what: backfiller.getStatus(backfiller.allocator) → builds JSON string on c_allocator
  • freed: defer backfiller.allocator.free(body)
  • status: CLEAN

4d. xrpc handlers (listRepos, getRepoStatus, etc.)#

  • what: all use stack-allocated fixed buffers (65536 bytes) or pg query iteration
  • no heap allocation in the handler code itself
  • pg queries: each creates/destroys internal pg arenas (same pattern as 1j)
  • status: CLEAN

4e. handleRequestCrawl#

  • file: api/xrpc.zig:429
  • what: std.json.parseFromSlice(...) on slurper.allocator — parses JSON body
  • freed: defer parsed.deinit()
  • what: validateHostname(slurper.allocator, ...) — allocates normalized hostname
  • freed: defer slurper.allocator.free(hostname)
  • what: slurper.addCrawlRequest(hostname) → dupes hostname
  • freed: by crawl processor (slurper.zig:522)
  • status: CLEAN

5. one-time startup allocations#

| what | file | size |
| --- | --- | --- |
| Broadcaster struct | main.zig:140 | ~400KB (history array) |
| Validator struct | main.zig:143 | ~100 bytes |
| DiskPersist (+ pg pool) | main.zig:149 | ~10KB + pg connections |
| CollectionIndex (RocksDB) | main.zig:169 | RocksDB internal (~50-100MB) |
| Backfiller struct | main.zig:176 | ~100 bytes |
| CA bundle | slurper.zig:251 | ~100-200KB |
| Frame pool | slurper.zig:257 | pre-allocated ring buffers |
| Error frame (CBOR) | broadcaster.zig:306 | ~50 bytes |
| MetricsServer struct | main.zig:216 | ~100 bytes |
| build_options module | (compile-time) | ~100 bytes |

6. backfill allocations (when running)#

6a. discoverCollections#

  • file: backfill.zig:101-148
  • what: fetches lexicon garden llms.txt + RBC scan, deduplicates
  • freed: all temporaries freed via defer blocks
  • status: CLEAN (only runs when admin triggers backfill)

6b. backfillCollection per-page#

  • file: backfill.zig:269-318
  • what: per-page: HTTP fetch → JSON parse → dupe DIDs → addCollection to RocksDB
  • freed: all duped strings freed in defer blocks
  • http client: reused across pages for one collection, freed after collection done
  • status: CLEAN

7. dependency internal allocations#

7a. RocksDB (via rocksdb-zig)#

  • internal memory managed by RocksDB C library (block cache, memtables, etc.)
  • not tracked by c_allocator — uses its own allocator
  • bounded by RocksDB options (write_buffer_size, max_open_files, block_cache_size)

7b. pg.zig connection pool#

  • file: initialized in event_log.zig:132 with size = 5
  • 5 connections, each with internal read/write buffers
  • status: BOUNDED (5 connections)

7c. zat.DidResolver (per resolver thread)#

  • file: validator.zig:401
  • what: zat.DidResolver.init(self.allocator) — creates HTTP client for DID resolution
  • long-lived: one per resolver thread (4 total), lives until shutdown
  • internal: likely holds std.http.Client with connection pool
  • potential issue: if std.http.Client pools connections to many unique hosts (PLC server, PDS endpoints), the pool could grow. but PLC is typically one host (plc.directory).

7d. zat CBOR decode#

  • all CBOR decode operations are on arena allocators (subscriber arena or worker arena)
  • freed when arena deinits
  • status: CLEAN

8. summary of potential issues#

confirmed leak (cold path, negligible):#

  • admin handleBan: leaks frame_bytes + broadcast_data (~500 bytes per ban)

potential leak under network errors:#

  • pg.zig QueryRowUnsafe.deinit(): if drain() errors, Result arena + connection leak
  • at 2,000+ queries/sec, even rare failures accumulate
  • needs investigation: check postgres error rate in logs

fragmentation concerns:#

  • ~700 frames/sec × ~10 alloc/free cycles per frame = ~7,000 alloc/free operations per second on c_allocator
  • many different sizes (28-byte headers, 2-5KB frames, 100-byte keys, 1-4KB pg arenas)
  • glibc with MALLOC_ARENA_MAX=2 concentrates fragmentation into 2 arenas
  • mallinfo() only reports the MAIN arena — second arena is invisible
  • malloc_trim(0) only trims the main arena — second arena is untrimmed
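
glibc's malloc_info(3) does cover all arenas, unlike mallinfo(). a sketch of calling it from zig to make the second arena visible (assumes linking libc; glibc-specific, not portable):

```zig
// malloc_info(0, stream) dumps XML with one <heap> block per arena,
// including the second arena under MALLOC_ARENA_MAX=2 that mallinfo()
// and malloc_trim(0) miss.
const c = @cImport({
    @cInclude("malloc.h");
    @cInclude("stdio.h");
});

pub fn dumpMallocArenas() void {
    // options argument must be 0 per the man page
    _ = c.malloc_info(0, c.stderr);
}
```

comparing total arena size against summed in-use blocks across dumps would confirm or rule out the fragmentation hypothesis.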

items NOT investigated:#

  • zat library internals (CBOR allocator patterns, DID resolver HTTP client connection pooling)
  • rocksdb-zig binding allocations (WriteBatch, Iterator internal state)
  • std.http.Client internal connection/TLS buffer retention within zat.DidResolver

9. per-frame allocation count summary#

for a typical validated #commit frame, approximately:

| step | allocs | frees | net | size |
| --- | --- | --- | --- | --- |
| subscriber arena init | 1 chunk | 0 | +1 | ~32KB |
| subscriber CBOR decode | ~5 | 0 | +5 | ~2KB |
| subscriber arena deinit | 0 | 1 chunk | -1 | ~32KB |
| frame data dupe | 1 | 0 | +1 | ~3KB |
| worker arena init | 1 chunk | 0 | +1 | ~32KB |
| worker CBOR re-decode | ~5 | 0 | +5 | ~2KB |
| pg: uidForDid (cache hit) | 0 | 0 | 0 | 0 |
| pg: isAccountActive | 1 arena | 1 arena | 0 | ~2KB |
| pg: getAccountState | 1 arena | 1 arena | 0 | ~2KB |
| multibase encode | 1 | 0 | +1 | ~50B |
| validator arena init | 1 chunk | 0 | +1 | ~32KB |
| verifyCommitCar | ~10 | 0 | +10 | ~20KB |
| validator arena deinit | 0 | 1 chunk | -1 | ~32KB |
| persist data alloc | 1 | 0 | +1 | ~3KB |
| resequenceFrame | ~5 | 0 | +5 | ~5KB |
| SharedFrame.create | 2 (struct+data) | 0 | +2 | ~3KB |
| ring buffer dupe | 1 | 1 (overwrite) | 0 | ~3KB |
| collection index keys | 2 | 2 | 0 | ~200B |
| worker arena deinit | 0 | 1 chunk | -1 | ~32KB |
| frame data free | 0 | 1 | -1 | ~3KB |
| persist flush (batched) | 0 | 1 | -1 | ~3KB |
| SharedFrame release | 0 | 2 | -2 | ~3KB |
| TOTAL per frame | ~37 | ~12 | 0 | 0 |

(the free count is lower than the alloc count because each arena deinit frees many allocations in one operation.)

all allocations balance out. yet RSS grows linearly. the remaining hypotheses are:

  1. glibc malloc fragmentation in the per-thread arenas (invisible to mallinfo/malloc_trim)
  2. a leak in a dependency (pg.zig error path, zat internals, rocksdb-zig)
  3. a leak we haven't found in the zig code
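
hypothesis 3 is cheap to test in a debug build by swapping std.heap.c_allocator for zig's GeneralPurposeAllocator, which tracks every live allocation and reports leaks with stack traces on deinit. a sketch (api as of recent zig versions; far too slow for 700 frames/sec in production, fine for a short replay run):

```zig
const std = @import("std");

pub fn main() !void {
    // leak-detecting allocator: deinit() returns .leak if anything
    // allocated through it was never freed, with capture stack traces.
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer {
        if (gpa.deinit() == .leak) std.log.err("leaks detected", .{});
    }
    const allocator = gpa.allocator();
    _ = allocator; // thread this everywhere c_allocator is used today
}
```

note this only catches leaks in zig code going through the swapped allocator; pg.zig internals, RocksDB, and glibc fragmentation (hypotheses 1 and 2) stay invisible to it.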