# experiments

active experiments on the deployed relay. each entry tracks what changed, why, how to verify, and how to revert.

---

## exp-001: SmpAllocator instead of glibc malloc (2026-03-06)

**hypothesis**: zlay's linear RSS growth (~290 MiB/hour) is caused by glibc malloc fragmentation under cross-thread alloc/free patterns. ~2750 subscriber threads allocate frame data that's freed by 16 worker threads. glibc's per-thread arenas (even with MALLOC_ARENA_MAX=2) don't return these cross-thread freed pages to the OS.

**what changed**:

- `src/main.zig`: `std.heap.c_allocator` → `std.heap.smp_allocator`
- `src/main.zig`: removed `malloc_trim(0)` from the GC loop (SmpAllocator doesn't use the glibc heap)

**why SmpAllocator**:

- zig's built-in multi-threaded allocator (since 0.14)
- uses mmap/munmap directly — no glibc malloc involvement
- thread-local freelists with cross-thread reclamation (exactly our problem)
- zero new dependencies

**evidence supporting this**:

- rsky (Rust relay) uses mimalloc globally and doesn't have this problem
- indigo (Go relay) uses the Go GC, which has no per-thread arena fragmentation
- the page_allocator experiment (per-frame arenas only) didn't help — the leak is in cross-thread c_allocator paths
- malloc_trim(0) didn't help — it only trims the main glibc arena
- mallinfo() was misleading — it only reports the main arena, not per-thread arenas

**verification**:

1. build succeeds
2. deploy, pod starts, `/_health` returns 200
3. firehose streams, `listReposByCollection` works
4. watch grafana over 12-24 hours:
   - `relay_process_rss_bytes` should plateau (not climb linearly)
   - `relay_malloc_arena_bytes` should be near-zero (glibc no longer in use)
5. if RSS stabilizes under ~1.5 GiB after caches fill, the experiment succeeded

**revert**:

```zig
// src/main.zig — change allocator back:
const allocator = std.heap.c_allocator;

// src/main.zig — restore malloc_trim in gcLoop:
_ = malloc_h.malloc_trim(0);
log.info("gc: malloc_trim complete", .{});
```

**result**: FAILED — RSS grew at ~670 MiB/hour (worse than c_allocator's ~290 MiB/hour). this disproves glibc fragmentation as the root cause. the leak is genuine — memory is allocated and never freed. reverted to c_allocator.

**status**: reverted (2026-03-07)

---

## exp-002: GPA leak detection (2026-03-07)

**goal**: identify exactly which allocations are leaking by using zig's GeneralPurposeAllocator as a wrapper. GPA tracks every alloc/free and reports unfreed allocations with stack traces on clean shutdown.

**what changed**:

- `build.zig`: added a `-Duse_gpa=true` build option
- `src/main.zig`: conditional GPA wrapper — when enabled, all allocations go through GPA backed by c_allocator. on SIGTERM, after all components deinit, GPA reports leaks.

**how to use**:

```bash
# build with GPA enabled (on the server):
just zlay publish-remote ReleaseSafe --gpa
# or manually:
zig build -Doptimize=ReleaseSafe -Duse_gpa=true -Dtarget=x86_64-linux-gnu

# let it run for 10-30 minutes, then:
kubectl exec -n zlay deploy/zlay -- kill -TERM 1

# read the leak report:
kubectl logs -n zlay deploy/zlay --previous | grep -A5 "GPA"
```

**performance impact**: GPA adds a mutex plus metadata tracking per alloc/free. expect ~2-5x slower throughput. this is a diagnostic build, not for production.

**what to look for in output**:

- GPA logs to stderr on deinit. each leaked allocation shows the stack trace of where it was allocated.
- look for the most frequently repeated stack traces — those are the hot leak sites.

**revert**: just rebuild without `-Duse_gpa=true` (default is false, zero overhead).
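the conditional wrapper described above could look roughly like this — a minimal sketch, assuming `build.zig` exposes the option through a `build_options` module; the names and structure are illustrative, not the actual `src/main.zig` code:

```zig
const std = @import("std");
const build_options = @import("build_options"); // assumption: exposes `use_gpa: bool`

// GPA backed by c_allocator, so allocation behavior matches the production
// build except for the leak-tracking metadata.
var gpa = std.heap.GeneralPurposeAllocator(.{}){
    .backing_allocator = std.heap.c_allocator,
};

pub fn main() !void {
    const allocator = if (build_options.use_gpa) gpa.allocator() else std.heap.c_allocator;

    // ... run the relay with `allocator` ...
    _ = allocator;

    // on clean shutdown (after all components deinit), report every
    // allocation that was never freed, with the stack trace of its alloc site:
    if (build_options.use_gpa) {
        _ = gpa.deinit(); // logs leaks to stderr; returns .leak or .ok
    }
}
```

the point of `backing_allocator = c_allocator` is that only the bookkeeping differs from the normal build — the leak sites GPA reports should be the same ones leaking in production.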
**deployment attempt (2026-03-07)**:

- GPA's per-allocation metadata tracking consumed memory ~55x faster than the base leak (~16 GiB/hour vs ~290 MiB/hour). at ~700 frames/sec × ~37 allocs/frame ≈ ~26K tracked allocations/sec, the metadata itself dominates.
- caused a severe sawtooth pattern: ~7-8 OOM kills in ~3 hours (8 GiB limit)
- first pod: logs lost when `kubectl delete pod` was used (should have used `kubectl scale --replicas=0`)
- second pod: RocksDB lock file was stale after the first crash and had to be cleared manually
- reverted to the normal ReleaseSafe build after ~4 hours (relay was submitted for testing)

**learnings for next attempt**:

- reduce incoming load (fewer PDS hosts) to slow memory growth enough that GPA overhead doesn't OOM
- or temporarily raise the memory limit (e.g. 16 GiB) for the diagnostic window
- use `kubectl scale deployment/zlay -n zlay --replicas=0` to preserve logs (not `kubectl delete pod`)
- the container lacks a `kill` binary — need an admin endpoint, or install procps in the image
- consider adding an `/admin/shutdown` HTTP endpoint to trigger graceful shutdown without `kill`

**status**: paused — code merged (compiled out by default), needs a better deployment strategy
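the admin-endpoint idea from the learnings above could be as small as re-raising SIGTERM in-process, so the existing graceful-shutdown path (component deinit, then the GPA leak report) is reused rather than duplicated. a hypothetical sketch — `handleAdminShutdown` and the routing around it are invented names, not merged code:

```zig
const std = @import("std");

// hypothetical handler body for POST /admin/shutdown — wiring it into
// zlay's HTTP router is not shown here.
fn handleAdminShutdown() void {
    // re-raise SIGTERM so the normal shutdown path runs (components deinit,
    // GPA prints its leak report) — no `kill` binary needed in the container.
    std.posix.raise(std.posix.SIG.TERM) catch |err| {
        std.log.err("admin shutdown: raise(SIGTERM) failed: {}", .{err});
    };
}
```

the endpoint would need to be protected (admin auth or cluster-internal only), since it kills the relay.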