## zlay memory leak investigation — status update

### Setup
zlay is a Zig 0.15 AT Protocol relay crawling ~2,700 PDS hosts. Steady-state memory growth is ~240 MiB/hr even with both LRU caches (validator 250K, DID 500K) already full. All bounded data structures have been ruled out — the leak is somewhere in the per-resolve allocation path.
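For context on how the MiB/hr figures below are obtained, here is a minimal sketch of estimating a steady-state slope from periodic RSS samples with a least-squares fit. The helper name and sampling scheme are hypothetical, not zlay's actual instrumentation:

```python
def slope_mib_per_hr(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (minutes_elapsed, rss_mib) samples,
    converted to MiB/hr. Hypothetical helper, not zlay's code."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return (num / den) * 60  # MiB/min -> MiB/hr

# 120 MiB of growth every 30 min = 4 MiB/min
print(slope_mib_per_hr([(0, 1000), (30, 1120), (60, 1240), (90, 1360)]))  # -> 240.0
```

Sampling over a window after the caches are full is what makes the slope "steady-state": during cache fill, growth from cache population and growth from the leak are indistinguishable.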

---

### Experiments Run

**1. `RESOLVER_RECYCLE_INTERVAL=0` (disable periodic resolver destroy/recreate)**
Steady-state slope ~313 MiB/hr — worse than baseline ~240 MiB/hr. Confirms recycling was partially containing the problem by periodically destroying accumulated state.
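Those two slopes also quantify how much of the leak the periodic destroy/recreate was reclaiming — roughly a quarter. A quick check of that arithmetic:

```python
no_recycle = 313.0   # MiB/hr with RESOLVER_RECYCLE_INTERVAL=0
baseline   = 240.0   # MiB/hr with periodic recycling enabled

# Fraction of the leak rate that periodic destroy/recreate was reclaiming
reclaimed = (no_recycle - baseline) / no_recycle
print(f"{reclaimed:.0%}")  # -> 23%
```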

**2. `RESOLVER_THREADS=0` (disable DID resolution entirely)**
Malloc flat at ~277 MiB after 49 minutes. The entire firehose pipeline (2,681 PDS connections, CBOR decode, RocksDB writes, broadcasting) runs without memory growth. **This definitively isolates the leak to the resolver path.**

**3. `RESOLVER_KEEP_ALIVE=false` (current, running ~75 min)**
Disables HTTP keepalive on the resolver's `std.http.Client`, so TLS connections are not reused across resolves. The early signal is promising — the initial slope is visibly flatter than any previous run at the same stage. Caches are still filling (DID cache at 47%, validator at 19%), so the steady-state slope can't be measured yet.

---

### What We Think Is Happening

The leak is inside Zig's `std.http.Client` connection reuse path. When keepalive is enabled, the client maintains a connection pool (bounded at 32 slots) holding TLS sessions. Something in the TLS connection lifecycle — possibly session tickets, certificate chain buffers, or connection metadata — accumulates and is never fully reclaimed, even when connections are evicted from the pool.
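This failure mode — bounded pool, unbounded side state — is easy to model generically. A hypothetical Python sketch (not Zig's actual pool code; slot count and state names are illustrative) of a pool whose live-connection count is capped while per-connection state lands in a structure that eviction never clears:

```python
class LeakyPool:
    """Toy model of the suspected failure mode. Hypothetical, not Zig's code."""
    MAX_SLOTS = 32

    def __init__(self):
        self.slots = []          # bounded: live connections
        self.session_state = []  # unbounded: never freed on eviction

    def connect(self, host):
        if len(self.slots) >= self.MAX_SLOTS:
            self.slots.pop(0)    # evict the oldest connection...
        self.slots.append(host)
        self.session_state.append(f"ticket:{host}")  # ...but its state stays

pool = LeakyPool()
for i in range(10_000):          # 10k resolves against distinct hosts
    pool.connect(f"pds{i}.example")
print(len(pool.slots), len(pool.session_state))  # -> 32 10000
```

Under this model the pool bound makes the client *look* healthy while memory grows linearly with resolves — consistent with all four experiments below.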

**Evidence:**
- `THREADS=0` (no HTTP at all) → flat memory
- `RECYCLE=0` (never destroy the client) → faster growth than baseline
- baseline with `RECYCLE=1000` (destroy the client every 1,000 resolves) → slower growth, but still leaking
- `KEEP_ALIVE=false` (no connection reuse) → early signal looks flat, pending confirmation

---

### What We're Waiting For

Both caches need to fill completely (~2–3 hours from deploy), followed by 1–2 hours of steady state to measure the residual slope. If the slope drops to near zero with keepalive disabled, the fix is confirmed — the options are then either to ship `keepalive=false` as the production config (at the cost of slightly higher resolve latency from per-request TLS handshakes) or to dig into `std.http.Client` and find the actual leak.

**ETA for conclusive data: ~3 hours from now.**