## zlay memory leak investigation — status update

### Setup
zlay is a Zig 0.15 AT Protocol relay crawling ~2,700 PDS hosts. Steady-state memory growth of ~240 MiB/hr with both LRU caches (validator 250K, DID 500K) already full. All bounded data structures have been ruled out — the leak is somewhere in the per-resolve allocation path.

---

### Experiments Run

**1. `RESOLVER_RECYCLE_INTERVAL=0` (disable periodic resolver destroy/recreate)**
Steady-state slope ~313 MiB/hr — worse than baseline ~240 MiB/hr. Confirms recycling was partially containing the problem by periodically destroying accumulated state.

**2. `RESOLVER_THREADS=0` (disable DID resolution entirely)**
Malloc flat at ~277 MiB after 49 minutes. The entire firehose pipeline (2,681 PDS connections, CBOR decode, RocksDB writes, broadcasting) runs without memory growth. **This definitively isolates the leak to the resolver path.**

**3. `RESOLVER_KEEP_ALIVE=false` (current, running ~75 min)**
Disables HTTP keepalive on the resolver's `std.http.Client`, so TLS connections are not reused across resolves. Early signal is promising — the initial slope is visibly flatter than any previous run at the same stage. Caches are still filling (DID cache at 47%, validator at 19%), so steady-state slope can't be measured yet.

---

### What We Think Is Happening

The leak is inside Zig's `std.http.Client` connection reuse path. When keepalive is enabled, the client maintains a connection pool (bounded at 32 slots) with TLS sessions. Something in the TLS connection lifecycle — possibly session tickets, certificate chain buffers, or connection metadata — accumulates and is never fully reclaimed even when connections are evicted from the pool.
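As a purely illustrative sketch of that failure mode (Python; every name here is hypothetical and nothing reflects the actual `std.http.Client` internals): a pool bounded at 32 slots stays bounded, yet total memory still grows without bound if the eviction path drops the connection handle but forgets some per-connection state, and ~2,700 distinct hosts cycling through 32 slots means nearly every resolve is a miss that allocates.

```python
# Hypothetical model of the suspected leak: a bounded connection pool whose
# eviction frees the slot but "forgets" an auxiliary per-connection buffer.
POOL_LIMIT = 32

pool = {}         # host -> connection id (bounded at POOL_LIMIT)
tls_buffers = {}  # connection id -> bytes of TLS state, never reclaimed on evict
next_id = 0

def resolve(host):
    """Simulate one keepalive resolve: reuse a pooled connection or create one."""
    global next_id
    if host in pool:
        return pool[host]               # reuse: no new allocation
    if len(pool) >= POOL_LIMIT:
        oldest = next(iter(pool))       # evict the oldest slot...
        del pool[oldest]                # ...but forget its TLS buffer: the leak
    conn = next_id
    next_id += 1
    pool[host] = conn
    tls_buffers[conn] = 16 * 1024       # per-connection TLS state (illustrative size)
    return conn

# 2,700 distinct hosts through a 32-slot pool: each host is long evicted before
# it comes around again, so every resolve allocates a fresh "buffer".
for i in range(10_000):
    resolve(f"pds-{i % 2700}.example")

print(len(pool), len(tls_buffers))  # pool bounded at 32, buffers grow to 10000
```

The point of the sketch is only that "bounded pool" and "bounded memory" are not the same claim, which is consistent with both the `RECYCLE` and `KEEP_ALIVE` observations below.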

**Evidence:**
- `THREADS=0` (no HTTP at all) → flat memory
- `RECYCLE=0` (never destroy the client) → faster growth than baseline
- baseline with `RECYCLE=1000` (destroy client every 1000 resolves) → slower growth, but still leaking
- `KEEP_ALIVE=false` (no connection reuse) → early signal looks flat, pending confirmation

---

### What We're Waiting For

Both caches need to fill completely (~2–3 hours from deploy), then 1–2 hours of steady-state to measure the residual slope. If the slope drops to near-zero with keepalive disabled, the fix is confirmed — options are either to ship `keepalive=false` as the production config (at the cost of slightly higher resolve latency from per-request TLS handshakes), or to dig into `std.http.Client` to find the actual leak.

**ETA for conclusive data: ~3 hours from now.**
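For reference, the slope numbers quoted above come from fitting memory samples over time. A minimal sketch of that measurement (Python; the sampling source and the synthetic numbers are illustrative, in production the input would be whatever exposes process RSS):

```python
# Least-squares fit of (elapsed seconds, RSS in MiB) samples, reported as MiB/hr.
def slope_mib_per_hr(samples):
    """samples: list of (seconds_since_start, rss_mib). Returns fitted slope."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return (cov / var) * 3600  # MiB/sec -> MiB/hr

# Synthetic example: exactly 240 MiB/hr growth, sampled every 5 min for 2 hours.
samples = [(i * 300, 500 + (240 / 3600) * i * 300) for i in range(25)]
print(round(slope_mib_per_hr(samples), 1))  # → 240.0
```

"Near-zero slope" over the 1–2 hour steady-state window is the pass criterion for the `KEEP_ALIVE=false` run; a fit like this is less noisy than eyeballing two endpoints.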