atproto relay implementation in zig
zlay.waow.tech
zlay memory allocation audit#
linear RSS growth: ~290 MiB/hour, 0 → 3.5 GiB in 12h (~117 bytes/frame at 700 frames/sec)
all allocations use std.heap.c_allocator (glibc malloc) unless noted.
1. per-frame hot path (~700 frames/sec across 16 workers)#
each incoming firehose frame passes through: subscriber → thread pool → frame worker. the following allocations happen FOR EVERY FRAME.
1a. subscriber header decode arena#
- file: subscriber.zig:260-261
- what: `ArenaAllocator.init(sub.allocator)` - lightweight CBOR header+payload decode
- size: ~32KB chunk (default arena chunk size), actual used ~1-5KB
- freed: `defer arena.deinit()` - same function
- status: CLEAN
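The arena lifecycle in 1a (and 1c below) can be sketched as follows; a minimal illustration under assumed names (`handleFrame` is hypothetical, not the actual subscriber.zig code):

```zig
const std = @import("std");

// hypothetical sketch of the per-frame arena lifecycle: every decode
// temporary goes on a throwaway arena backed by c_allocator, and one
// deinit frees them all, so nothing on this path can leak per-frame.
fn handleFrame(data: []const u8) !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.c_allocator);
    defer arena.deinit(); // frees every arena allocation at once
    const a = arena.allocator();

    // any number of decode allocations; none needs an individual free
    const copy = try a.dupe(u8, data);
    _ = copy;
}
```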
1b. frame data dupe (subscriber → worker handoff)#
- file: subscriber.zig:341
- what: `sub.allocator.dupe(u8, data)` - copies raw frame bytes for async processing
- size: ~2-5KB per frame (full websocket message)
- freed: frame_worker.zig:34 `defer work.allocator.free(work.data)`
- status: CLEAN (also freed on backpressure: subscriber.zig:353)
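The ownership rule in 1b (dupe in the subscriber, free in exactly one of two places) can be sketched like this; names are illustrative, not the real subscriber/frame_worker code:

```zig
const std = @import("std");

// illustrative sketch of the subscriber → worker handoff: the producer
// dupes the frame, and exactly one of two paths frees it - the worker
// after processing, or the producer itself on backpressure.
const Work = struct { allocator: std.mem.Allocator, data: []u8 };

fn submit(alloc: std.mem.Allocator, data: []const u8, queue_full: bool) !void {
    const copy = try alloc.dupe(u8, data);
    if (queue_full) {
        // backpressure path: ownership never transferred, free here
        alloc.free(copy);
        return;
    }
    processWork(.{ .allocator = alloc, .data = copy });
}

fn processWork(work: Work) void {
    defer work.allocator.free(work.data); // worker owns and frees the copy
    // ... decode work.data ...
}
```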
1c. frame worker processing arena#
- file: frame_worker.zig:36-37
- what: `ArenaAllocator.init(work.allocator)` - full CBOR re-decode, multibase encode, resequence
- size: ~32KB chunk, actual used ~5-20KB (depends on frame complexity)
- freed: `defer arena.deinit()` - same function
- status: CLEAN
1d. persist data allocation#
- file: event_log.zig:592
- what: `self.allocator.alloc(u8, header_size + payload.len)` - 28-byte LE header + raw CBOR payload
- size: 28 + payload_len (~2-5KB)
- freed: event_log.zig:845-846 in `flushLocked()`: `for (self.evtbuf.items) |job| self.allocator.free(job.data)`
- flush trigger: every 400 events OR every 100ms (whichever comes first)
- status: CLEAN — but see note on outbuf/evtbuf below
1e. outbuf/evtbuf ArrayList growth#
- file: event_log.zig:99-100
- what: `outbuf: ArrayListUnmanaged(u8)` and `evtbuf: ArrayListUnmanaged(PersistJob)`
- growth: ArrayList uses a 2x growth factor. outbuf accumulates raw bytes, evtbuf accumulates job structs.
- freed: `clearRetainingCapacity()` on flush - backing memory is KEPT, only len reset to 0
- bounded: yes - max ~400 entries before flush. outbuf max ~400 × 5KB = ~2MB. backing array ~4MB max.
- status: CLEAN (bounded, retains capacity)
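The retained-capacity behaviour above can be demonstrated with a minimal sketch (stdlib `ArrayListUnmanaged`, illustrative buffer contents):

```zig
const std = @import("std");

// sketch of the bounded outbuf/evtbuf pattern: flush resets len but keeps
// the backing array, so capacity ratchets up to the high-water mark
// (~4MB in zlay's case) and then stops growing.
fn flushDemo(alloc: std.mem.Allocator) !void {
    var outbuf: std.ArrayListUnmanaged(u8) = .{};
    defer outbuf.deinit(alloc);

    var i: usize = 0;
    while (i < 3) : (i += 1) {
        try outbuf.appendSlice(alloc, "event-bytes");
        const cap_before = outbuf.capacity;
        outbuf.clearRetainingCapacity(); // len = 0, backing memory kept
        std.debug.assert(outbuf.capacity == cap_before);
        std.debug.assert(outbuf.items.len == 0);
    }
}
```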
1f. SharedFrame for broadcast#
- file: broadcaster.zig:50-59 (`SharedFrame.create`)
- what: allocates SharedFrame struct + dupes frame data
- size: sizeof(SharedFrame) (~48 bytes) + data.len (~2-5KB)
- freed: ref-counted. broadcaster releases immediately (`defer frame.release()`, line 372). each consumer releases after send (writeLoop line 231). freed when refcount hits 0 (lines 66-70).
- status: CLEAN (if no consumers, created+freed immediately)
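The ref-count scheme can be sketched as follows; field and method names are guesses, not the real broadcaster.zig:

```zig
const std = @import("std");

// hypothetical sketch of the SharedFrame scheme: create() starts the count
// at 1 for the broadcaster, each consumer retains, the last release frees.
const SharedFrame = struct {
    refs: std.atomic.Value(u32),
    allocator: std.mem.Allocator,
    data: []u8,

    fn create(alloc: std.mem.Allocator, bytes: []const u8) !*SharedFrame {
        const f = try alloc.create(SharedFrame);
        errdefer alloc.destroy(f);
        f.* = .{
            .refs = std.atomic.Value(u32).init(1),
            .allocator = alloc,
            .data = try alloc.dupe(u8, bytes),
        };
        return f;
    }

    fn retain(f: *SharedFrame) void {
        _ = f.refs.fetchAdd(1, .monotonic);
    }

    fn release(f: *SharedFrame) void {
        // release/acquire pairing so the freeing thread sees all writes
        if (f.refs.fetchSub(1, .release) == 1) {
            _ = f.refs.load(.acquire);
            f.allocator.free(f.data);
            f.allocator.destroy(f);
        }
    }
};
```

With zero consumers this reduces to create + the broadcaster's own release, matching the "created+freed immediately" note above.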
1g. ring buffer history entry#
- file: ring_buffer.zig:55 via broadcaster.zig:368
- what: `self.allocator.dupe(u8, data)` - copies frame data into history
- size: ~2-5KB per frame (resequenced CBOR)
- freed: on overwrite when buffer is full (ring_buffer.zig:60-62)
- bounded: 50,000 entries. once full, every push frees the oldest.
- steady state: 50K × ~3KB avg = ~150MB
- status: CLEAN (bounded)
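The overwrite-free discipline that bounds the history can be sketched with a toy ring (capacity shrunk from 50,000 to 4; illustrative, not ring_buffer.zig):

```zig
const std = @import("std");

// toy version of the bounded history ring: once full, every push frees
// the oldest duped frame first, so steady-state memory is
// capacity × average frame size.
const Ring = struct {
    allocator: std.mem.Allocator,
    slots: [4]?[]u8 = .{ null, null, null, null },
    next: usize = 0,

    fn push(self: *Ring, data: []const u8) !void {
        if (self.slots[self.next]) |old| self.allocator.free(old); // evict oldest
        self.slots[self.next] = try self.allocator.dupe(u8, data);
        self.next = (self.next + 1) % self.slots.len;
    }

    fn deinit(self: *Ring) void {
        for (self.slots) |s| if (s) |buf| self.allocator.free(buf);
    }
};
```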
1h. resequenceFrame temporaries#
- file: broadcaster.zig:117-156, called from frame_worker.zig:232
- what: CBOR decode + re-encode with new seq. allocates ArrayList, CBOR encode buffers, result slice.
- size: ~5-10KB temporaries
- freed: allocated on frame worker's arena (1c), freed when arena deinits
- status: CLEAN
1i. collection index keys (per commit with ops)#
- file: collection_index.zig:79-82 via trackCommitOps → addCollection
- what: `makeKey()` allocates 2 keys: `collection\0did` and `did\0collection`
- size: ~100-200 bytes per key pair
- freed: `defer self.allocator.free(rbc_key)` / `defer self.allocator.free(cbr_key)` - same function
- status: CLEAN
1j. postgres queries per frame (via pg.zig)#
each frame triggers 3-5 postgres queries. each query allocates internally:
| query | file | method | per-frame? |
|---|---|---|---|
| DID→UID lookup | event_log.zig:294 | rowUnsafe | yes (cache miss only) |
| DID→UID create | event_log.zig:306 | exec | yes (first encounter only) |
| DID→UID readback | event_log.zig:315 | rowUnsafe | yes (first encounter only) |
| isAccountActive | event_log.zig:398 | rowUnsafe | yes (commits/syncs) |
| getAccountState | event_log.zig:339 | rowUnsafe | yes (commits, for rev check) |
| updateAccountState | event_log.zig:358 | exec | yes (validated commits) |
| updateUpstreamStatus | event_log.zig:389 | exec | yes (#account events) |
| updateHostSeq | event_log.zig:458 | exec | every 4s per host |
| getAccountHostId | event_log.zig:369 | rowUnsafe | yes (uidForDidFromHost) |
pg.zig internals (from karlseguin/pg.zig):
- `Pool.rowUnsafe()`: acquires a connection from the pool, creates a `Result` with its own ArenaAllocator
- `Result.deinit()`: releases the connection back to the pool, destroys the arena
- `QueryRowUnsafe.deinit()`: calls `result.drain()` THEN `result.deinit()`
POTENTIAL ISSUE - pg.zig deinit error path:

```zig
// pg.zig QueryRowUnsafe.deinit():
pub fn deinit(self: *QueryRowUnsafe) !void {
    try self.result.drain(); // if this errors...
    self.result.deinit(); // ...this NEVER runs → arena leak + connection leak
}
```

zlay callers use `defer row.deinit() catch {};` - if `drain()` errors, the Result arena and connection are LEAKED. `drain()` errors on network failure to postgres.
- likelihood: low per-frame (postgres is local), but at 2,000+ queries/sec, even 0.01% failure = 0.2 leaks/sec
- size per leak: ~1-4KB (pg arena) + 1 connection slot
- impact at 0.01%: ~0.7 MB/hour (doesn't explain 290 MB/hour)
- impact at 0.1%: ~7 MB/hour (partial explanation?)
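The failure mode can be reproduced in isolation. A self-contained sketch (hypothetical `Resource` type, not pg.zig itself) showing why a `try` before cleanup skips the cleanup on error, and the safe shape:

```zig
const std = @import("std");

// demonstrates the pg.zig deinit hazard: a fallible step before cleanup
// must not be allowed to skip the cleanup. `Resource` is a stand-in.
const Resource = struct {
    freed: bool = false,

    fn drain(self: *Resource, fail: bool) !void {
        _ = self;
        if (fail) return error.NetworkFailure;
    }

    // leaky shape (mirrors pg.zig): `try` before cleanup skips it on error
    fn deinitLeaky(self: *Resource, fail: bool) !void {
        try self.drain(fail);
        self.freed = true;
    }

    // safe shape: cleanup runs unconditionally
    fn deinitSafe(self: *Resource, fail: bool) void {
        self.drain(fail) catch {};
        self.freed = true;
    }
};
```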
1k. validator arenas (per validated commit/sync)#
- file: validator.zig:161-163 (validateSync) and validator.zig:242-244 (verifyCommit)
- what: `ArenaAllocator.init(self.allocator)` for CAR parsing + signature verification
- size: depends on CAR size. typical commit: ~5-50KB used.
- freed: `defer arena.deinit()` - same function
- note: these were NOT changed in the page_allocator experiment
- status: CLEAN (assuming arena.deinit works correctly)
1l. getAccountState string dupes#
- file: event_log.zig:348-349
- what: `allocator.dupe(u8, rev)` and `allocator.dupe(u8, data_cid)` - copies from pg result into caller's arena
- size: ~20-50 bytes each (TID rev + multibase CID)
- freed: allocated on frame worker's arena (1c), freed when arena deinits
- status: CLEAN
2. per-connection allocations (~2,750 PDS hosts)#
2a. subscriber struct + hostname#
- file: slurper.zig:426-427, 423
- what: `allocator.create(Subscriber)` + `allocator.dupe(u8, hostname)`
- size: sizeof(Subscriber) + hostname (~20-50 bytes)
- freed: slurper.zig:468-469 (runWorker cleanup on exit)
- status: CLEAN (bounded by host count)
2b. websocket.Client per connection#
- file: subscriber.zig:220-227
- what: `websocket.Client.init(self.allocator, ...)` - TLS buffers, read buffers
- size: TLS handshake buffers (~134KB), static read buffer (4096 bytes)
- freed: `defer client.deinit()` - subscriber.zig:227
- status: CLEAN (created/freed per connection attempt)
2c. websocket dynamic read buffers#
- what: for messages > 4096 bytes, BufferProvider allocates a dynamic buffer
- freed: `reader.done()` → `restoreStatic()` → `provider.release()` → `allocator.free()`
- status: CLEAN (freed after each message)
2d. shared TLS CA bundle#
- file: slurper.zig:251-253
- what: `bundle.rescan(self.allocator)` - loads system CA certs
- size: ~100-200KB (one-time)
- freed: slurper.zig:575 `b.deinit(self.allocator)`
- status: CLEAN (one-time, shared)
2e. thread stacks#
- what: each subscriber thread gets an 8MB virtual stack (`default_stack_size`)
- count: ~2,750 subscriber threads + 16 worker threads + 4 resolver threads + 3 background threads = ~2,773 threads
- total virtual: ~22 GiB (but only touched pages count as RSS)
- status: BOUNDED (pages are returned when threads exit)
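If the virtual footprint ever needs shrinking, `std.Thread.SpawnConfig` accepts an explicit stack size; a sketch (worker body hypothetical):

```zig
const std = @import("std");

// sketch: the 8MB per-thread stack noted above is a virtual reservation;
// passing an explicit stack_size in SpawnConfig shrinks it. RSS still
// counts only touched pages either way.
fn worker() void {
    // subscriber loop would live here
}

fn spawnSmallStack() !void {
    const t = try std.Thread.spawn(.{ .stack_size = 256 * 1024 }, worker, .{});
    t.join();
}
```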
3. bounded caches and data structures#
3a. DID → UID cache (LRU)#
- file: event_log.zig:96, lru.zig
- capacity: 500,000 entries
- per entry: duped key string (~30 bytes DID) + Node struct (~80 bytes) + u64 value + HashMap entry
- steady state: ~55-65 MB
- status: BOUNDED (evicts LRU on put when full)
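The eviction discipline that bounds these caches can be sketched with a toy LRU (illustrative; zlay's lru.zig will differ, and this sketch skips move-to-front on hit):

```zig
const std = @import("std");

// toy LRU shape: a put() on a full cache evicts the oldest entry and
// frees its duped key, which is what caps entry count - and memory -
// at capacity.
const Lru = struct {
    allocator: std.mem.Allocator,
    capacity: usize,
    map: std.StringHashMapUnmanaged(u64) = .{},
    order: std.ArrayListUnmanaged([]u8) = .{}, // index 0 = oldest

    fn put(self: *Lru, key: []const u8, value: u64) !void {
        if (self.map.get(key) != null) return; // real impl: move to front
        if (self.map.count() >= self.capacity) {
            const oldest = self.order.orderedRemove(0);
            _ = self.map.remove(oldest);
            self.allocator.free(oldest); // duped key freed on eviction
        }
        const owned = try self.allocator.dupe(u8, key);
        try self.map.put(self.allocator, owned, value);
        try self.order.append(self.allocator, owned);
    }

    fn deinit(self: *Lru) void {
        for (self.order.items) |k| self.allocator.free(k);
        self.order.deinit(self.allocator);
        self.map.deinit(self.allocator);
    }
};
```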
3b. validator signing key cache (LRU)#
- file: validator.zig:50, lru.zig
- capacity: 250,000 entries
- per entry: duped key string (~30 bytes DID) + Node struct (~80 bytes) + CachedKey (fixed 42 bytes) + HashMap entry
- steady state: ~30-40 MB
- status: BOUNDED (evicts LRU on put when full)
3c. resolve queue#
- file: validator.zig:52-54
- what: `queue: ArrayListUnmanaged([]const u8)` + `queued_set: StringHashMapUnmanaged(void)`
- per entry: duped DID string (~30 bytes) + HashMap entry
- max size: 100,000 (validator.zig:63)
- freed: DID freed after resolution (validator.zig:419)
- status: BOUNDED
3d. workers map#
- file: slurper.zig:217
- what: `std.AutoHashMapUnmanaged(u64, WorkerEntry)` - host_id → thread+subscriber
- size: bounded by host count (~2,750)
- status: BOUNDED
3e. crawl request queue#
- file: slurper.zig:221
- what: `ArrayListUnmanaged([]const u8)` - duped hostnames waiting for processing
- freed: slurper.zig:522 `defer self.allocator.free(h)` after processing
- status: BOUNDED (grows slowly, processed continuously)
3f. frame pool ring buffers#
- file: thread_pool.zig
- what: pre-allocated ring buffers per worker. zero alloc per submit.
- size: 16 workers × 4096 capacity × sizeof(FrameWork) per entry
- status: FIXED SIZE (no growth)
3g. consumer list#
- file: broadcaster.zig:291
- what: `ArrayListUnmanaged(*Consumer)` - active downstream WebSocket consumers
- per consumer: Consumer struct (~66KB for 8192-entry SharedFrame pointer buffer) + write thread
- status: BOUNDED by connected consumers (typically 0-10)
4. API handler allocations (cold paths)#
4a. handleAdminListHosts — ALLOCATES ON c_allocator#
- file: api/admin.zig:89-99
- what: `persist.listAllHosts(persist.allocator)` - dupes hostname+status for each host
- freed: defer block frees all hostname/status strings + slice
- status: CLEAN
4b. handleBan — LEAKS (cold path)#
- file: api/admin.zig:70-78
- what: `buildAccountFrame(ctx.persist.allocator, did)` → allocates CBOR frame on c_allocator
- what: `broadcaster.resequenceFrame(ctx.persist.allocator, frame_bytes, relay_seq)` → allocates resequenced frame on c_allocator
- freed: NEITHER frame_bytes NOR broadcast_data is freed after broadcast
- size: ~200-500 bytes per ban
- status: LEAK — but cold path (admin-only, negligible impact)
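A fix is two defers; a self-contained sketch with the two builders stubbed (`banAndBroadcast` and the stub bodies are hypothetical, the real functions live in api/admin.zig and broadcaster.zig):

```zig
const std = @import("std");

// stand-ins for the real CBOR frame builder and resequencer
fn buildAccountFrame(alloc: std.mem.Allocator, did: []const u8) ![]u8 {
    return alloc.dupe(u8, did);
}
fn resequenceFrame(alloc: std.mem.Allocator, frame: []const u8, seq: u64) ![]u8 {
    _ = seq;
    return alloc.dupe(u8, frame);
}

// sketch of the patched ban path: both c_allocator buffers now have
// matching frees once the broadcast has copied/ref-counted the bytes.
fn banAndBroadcast(alloc: std.mem.Allocator, did: []const u8, relay_seq: u64) !void {
    const frame_bytes = try buildAccountFrame(alloc, did);
    defer alloc.free(frame_bytes); // this free was missing
    const broadcast_data = try resequenceFrame(alloc, frame_bytes, relay_seq);
    defer alloc.free(broadcast_data); // this free was missing
    // ... broadcaster.broadcast(broadcast_data) would run here ...
}
```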
4c. handleAdminBackfillStatus#
- file: api/admin.zig:197-201
- what: `backfiller.getStatus(backfiller.allocator)` → builds JSON string on c_allocator
- freed: `defer backfiller.allocator.free(body)`
- status: CLEAN
4d. xrpc handlers (listRepos, getRepoStatus, etc.)#
- what: all use stack-allocated fixed buffers (65536 bytes) or pg query iteration
- no heap allocation in the handler code itself
- pg queries: each creates/destroys internal pg arenas (same pattern as 1j)
- status: CLEAN
4e. handleRequestCrawl#
- file: api/xrpc.zig:429
- what: `std.json.parseFromSlice(...)` on `slurper.allocator` - parses JSON body
- freed: `defer parsed.deinit()`
- what: `validateHostname(slurper.allocator, ...)` - allocates normalized hostname
- freed: `defer slurper.allocator.free(hostname)`
- what: `slurper.addCrawlRequest(hostname)` → dupes hostname
- freed: by crawl processor (slurper.zig:522)
- status: CLEAN
5. one-time startup allocations#
| what | file | size |
|---|---|---|
| Broadcaster struct | main.zig:140 | ~400KB (history array) |
| Validator struct | main.zig:143 | ~100 bytes |
| DiskPersist (+ pg pool) | main.zig:149 | ~10KB + pg connections |
| CollectionIndex (RocksDB) | main.zig:169 | RocksDB internal (~50-100MB) |
| Backfiller struct | main.zig:176 | ~100 bytes |
| CA bundle | slurper.zig:251 | ~100-200KB |
| Frame pool | slurper.zig:257 | pre-allocated ring buffers |
| Error frame (CBOR) | broadcaster.zig:306 | ~50 bytes |
| MetricsServer struct | main.zig:216 | ~100 bytes |
| build_options module | (compile-time) | ~100 bytes |
6. backfill allocations (when running)#
6a. discoverCollections#
- file: backfill.zig:101-148
- what: fetches lexicon garden llms.txt + RBC scan, deduplicates
- freed: all temporaries freed via defer blocks
- status: CLEAN (only runs when admin triggers backfill)
6b. backfillCollection per-page#
- file: backfill.zig:269-318
- what: per-page: HTTP fetch → JSON parse → dupe DIDs → addCollection to RocksDB
- freed: all duped strings freed in defer blocks
- http client: reused across pages for one collection, freed after collection done
- status: CLEAN
7. dependency internal allocations#
7a. RocksDB (via rocksdb-zig)#
- internal memory managed by RocksDB C library (block cache, memtables, etc.)
- not tracked by c_allocator — uses its own allocator
- bounded by RocksDB options (write_buffer_size, max_open_files, block_cache_size)
7b. pg.zig connection pool#
- file: initialized in event_log.zig:132 with `size = 5`
- what: 5 connections, each with internal read/write buffers
- status: BOUNDED (5 connections)
7c. zat.DidResolver (per resolver thread)#
- file: validator.zig:401
- what: `zat.DidResolver.init(self.allocator)` - creates HTTP client for DID resolution
- long-lived: one per resolver thread (4 total), lives until shutdown
- internal: likely holds std.http.Client with connection pool
- potential issue: if std.http.Client pools connections to many unique hosts (PLC server, PDS endpoints), the pool could grow. but PLC is typically one host (plc.directory).
7d. zat CBOR decode#
- all CBOR decode operations are on arena allocators (subscriber arena or worker arena)
- freed when arena deinits
- status: CLEAN
8. summary of potential issues#
confirmed leak (cold path, negligible):#
- admin handleBan: leaks frame_bytes + broadcast_data (~500 bytes per ban)
potential leak under network errors:#
- pg.zig QueryRowUnsafe.deinit(): if drain() errors, Result arena + connection leak
- at 2,000+ queries/sec, even rare failures accumulate
- needs investigation: check postgres error rate in logs
fragmentation concerns:#
- ~700 frames/sec × ~10 alloc/free cycles per frame = ~7,000 alloc/free operations per second on c_allocator
- many different sizes (28-byte headers, 2-5KB frames, 100-byte keys, 1-4KB pg arenas)
- glibc with MALLOC_ARENA_MAX=2 concentrates fragmentation into 2 arenas
- mallinfo() reports only the MAIN arena, so the second arena is invisible to it (mallinfo2, glibc 2.33+, fixes the integer overflow but is likewise main-arena-only)
- malloc_trim(0) trimmed only the main arena before glibc 2.8; since then it walks all arenas, but it can only release fully free pages, so fragmented pages holding any live data stay resident
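For probing this from the process itself, glibc's coarser introspection calls do cover all arenas; a sketch (assumes a linux target with glibc and libc linked):

```zig
const std = @import("std");

// glibc-only sketch: unlike mallinfo(), malloc_stats() and malloc_info()
// report ALL arenas, so they are better probes for the
// invisible-second-arena problem described above.
const c = @cImport(@cInclude("malloc.h"));

fn dumpGlibcHeap() void {
    c.malloc_stats(); // prints per-arena in-use/system bytes to stderr
    const released = c.malloc_trim(0); // non-zero if any memory was returned
    std.debug.print("malloc_trim released memory: {}\n", .{released != 0});
}
```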
items NOT investigated:#
- zat library internals (CBOR allocator patterns, DID resolver HTTP client connection pooling)
- rocksdb-zig binding allocations (WriteBatch, Iterator internal state)
- std.http.Client internal connection/TLS buffer retention within zat.DidResolver
9. per-frame allocation count summary#
for a typical validated #commit frame, approximately:
| step | allocs | frees | net | size |
|---|---|---|---|---|
| subscriber arena init | 1 chunk | 0 | +1 | ~32KB |
| subscriber CBOR decode | ~5 | 0 | +5 | ~2KB |
| subscriber arena deinit | 0 | 1 chunk | -1 | ~32KB |
| frame data dupe | 1 | 0 | +1 | ~3KB |
| worker arena init | 1 chunk | 0 | +1 | ~32KB |
| worker CBOR re-decode | ~5 | 0 | +5 | ~2KB |
| pg: uidForDid (cache hit) | 0 | 0 | 0 | 0 |
| pg: isAccountActive | 1 arena | 1 arena | 0 | ~2KB |
| pg: getAccountState | 1 arena | 1 arena | 0 | ~2KB |
| multibase encode | 1 | 0 | +1 | ~50B |
| validator arena init | 1 chunk | 0 | +1 | ~32KB |
| verifyCommitCar | ~10 | 0 | +10 | ~20KB |
| validator arena deinit | 0 | 1 chunk | -1 | ~32KB |
| persist data alloc | 1 | 0 | +1 | ~3KB |
| resequenceFrame | ~5 | 0 | +5 | ~5KB |
| SharedFrame.create | 2 (struct+data) | 0 | +2 | ~3KB |
| ring buffer dupe | 1 | 1 (overwrite) | 0 | ~3KB |
| collection index keys | 2 | 2 | 0 | ~200B |
| worker arena deinit | 0 | 1 chunk | -1 | ~32KB |
| frame data free | 0 | 1 | -1 | ~3KB |
| persist flush (batched) | 0 | 1 | -1 | ~3KB |
| SharedFrame release | 0 | 2 | -2 | ~3KB |
| TOTAL per frame | ~37 | ~12 | 0 | 0 |
every allocation has a matching free (the arena deinits release several decode allocations in one chunk free, hence the asymmetric alloc/free counts above). yet RSS grows linearly. the remaining hypotheses are:
- glibc malloc fragmentation in the per-thread arenas (invisible to mallinfo/malloc_trim)
- a leak in a dependency (pg.zig error path, zat internals, rocksdb-zig)
- a leak we haven't found in the zig code
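One way to separate the third hypothesis from the first two: in a debug build, swap `std.heap.c_allocator` for a leak-checking allocator so zig itself reports any missed free with a stack trace on shutdown. A minimal sketch (not wired into the zlay tree):

```zig
const std = @import("std");

// sketch: GeneralPurposeAllocator tracks every live allocation and
// deinit() returns .leak if anything was not freed - fragmentation in
// glibc would show no leaks here while RSS still grows.
fn runWithLeakCheck() !bool {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const alloc = gpa.allocator();

    const buf = try alloc.alloc(u8, 64);
    alloc.free(buf); // comment this out and deinit() returns .leak

    return gpa.deinit() == .leak;
}
```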