# zlay memory leak investigation — status update
## Setup
zlay is a Zig 0.15 AT Protocol relay crawling ~2,700 PDS hosts. Memory grows at a steady ~240 MiB/hr even with both LRU caches (validator 250K entries, DID 500K entries) already full. All bounded data structures have been ruled out as the source — the leak is somewhere in the per-resolve allocation path.
## Experiments Run
### 1. RESOLVER_RECYCLE_INTERVAL=0 (disable periodic resolver destroy/recreate)
Steady-state slope ~313 MiB/hr — worse than baseline ~240 MiB/hr. Confirms recycling was partially containing the problem by periodically destroying accumulated state.
### 2. RESOLVER_THREADS=0 (disable DID resolution entirely)
Malloc flat at ~277 MiB after 49 minutes. The entire firehose pipeline (2,681 PDS connections, CBOR decode, RocksDB writes, broadcasting) runs without memory growth. This definitively isolates the leak to the resolver path.
### 3. RESOLVER_KEEP_ALIVE=false (current, running ~75 min)
Disables HTTP keepalive on the resolver's std.http.Client, so TLS connections are not reused across resolves. Early signal is promising — the initial slope is visibly flatter than any previous run at the same stage. Caches are still filling (DID cache at 47%, validator at 19%), so steady-state slope can't be measured yet.
## What We Think Is Happening
The leak is inside Zig's std.http.Client connection reuse path. When keepalive is enabled, the client maintains a connection pool (bounded at 32 slots) with TLS sessions. Something in the TLS connection lifecycle — possibly session tickets, certificate chain buffers, or connection metadata — accumulates and is never fully reclaimed even when connections are evicted from the pool.
Evidence:
- THREADS=0 (no HTTP at all) → flat memory
- RECYCLE=0 (never destroy the client) → faster growth than baseline
- baseline with RECYCLE=1000 (destroy client every 1000 resolves) → slower growth, but still leaking
- KEEP_ALIVE=false (no connection reuse) → early signal looks flat, pending confirmation
## What We're Waiting For
Both caches need to fill completely (~2–3 hours from deploy), then 1–2 hours of steady state to measure the residual slope. If the slope drops to near zero with keepalive disabled, the diagnosis is confirmed — the options are then to ship keepalive=false as the production config (at the cost of slightly higher resolve latency from per-request TLS handshakes) or to dig into std.http.Client and find the actual leak.
ETA for conclusive data: ~3 hours from now.