# experiments

active experiments on the deployed relay. each entry tracks what changed, why, how to verify, and how to revert.

---

## exp-001: SmpAllocator instead of glibc malloc (2026-03-06)

**hypothesis**: zlay's linear RSS growth (~290 MiB/hour) is caused by glibc malloc fragmentation under cross-thread alloc/free patterns. ~2750 subscriber threads allocate frame data that's freed by 16 worker threads. glibc's per-thread arenas (even with MALLOC_ARENA_MAX=2) don't return these cross-thread freed pages to the OS.

**what changed**:

- `src/main.zig`: `std.heap.c_allocator` → `std.heap.smp_allocator`
- `src/main.zig`: removed `malloc_trim(0)` from the GC loop (SmpAllocator doesn't use the glibc heap)

**why SmpAllocator**:

- zig's built-in multi-threaded allocator (since 0.14)
- uses mmap/munmap directly — no glibc malloc involvement
- thread-local freelists with cross-thread reclamation (exactly our problem)
- zero new dependencies

**evidence supporting this**:

- rsky (Rust relay) uses mimalloc globally and doesn't have this problem
- indigo (Go relay) uses the Go GC, which has no per-thread arena fragmentation
- the page_allocator experiment (per-frame arenas only) didn't help — the leak is in cross-thread c_allocator paths
- malloc_trim(0) didn't help — it only trims the main glibc arena
- mallinfo() was misleading — it only reports the main arena, not per-thread arenas

**verification**:

1. build succeeds
2. deploy, pod starts, `/_health` returns 200
3. firehose streams, `listReposByCollection` works
4. watch grafana over 12-24 hours:
   - `relay_process_rss_bytes` should plateau (not climb linearly)
   - `relay_malloc_arena_bytes` should be near-zero (glibc no longer in use)
5. if RSS stabilizes under ~1.5 GiB after caches fill, the experiment succeeded

**revert**:

```zig
// src/main.zig — change allocator back:
const allocator = std.heap.c_allocator;

// src/main.zig — restore malloc_trim in gcLoop:
_ = malloc_h.malloc_trim(0);
log.info("gc: malloc_trim complete", .{});
```

**result**: FAILED — RSS grew at ~670 MiB/hour (worse than c_allocator's ~290 MiB/hour). this disproves glibc fragmentation as the root cause. the leak is genuine — memory is allocated and never freed. reverted to c_allocator.

**status**: reverted (2026-03-07)

---

## exp-002: GPA leak detection (2026-03-07)

**goal**: identify exactly which allocations are leaking by using zig's GeneralPurposeAllocator as a wrapper. GPA tracks every alloc/free and reports unfreed allocations with stack traces on clean shutdown.

**what changed**:

- `build.zig`: added a `-Duse_gpa=true` build option
- `src/main.zig`: conditional GPA wrapper — when enabled, all allocations go through GPA backed by c_allocator. on SIGTERM, after all components deinit, GPA reports leaks.

**how to use**:

```bash
# build with GPA enabled (on the server):
just zlay publish-remote ReleaseSafe --gpa
# or manually:
zig build -Doptimize=ReleaseSafe -Duse_gpa=true -Dtarget=x86_64-linux-gnu

# let it run for 10-30 minutes, then:
kubectl exec -n zlay deploy/zlay -- kill -TERM 1

# read the leak report:
kubectl logs -n zlay deploy/zlay --previous | grep -A5 "GPA"
```

**performance impact**: GPA adds a mutex plus metadata tracking per alloc/free. expect ~2-5x slower throughput. this is a diagnostic build, not for production.

**what to look for in output**:

- GPA logs to stderr on deinit. each leaked allocation shows the stack trace of where it was allocated.
- look for the most frequently repeated stack traces — those are the hot leak sites.

**revert**: just rebuild without `-Duse_gpa=true` (default is false, zero overhead).
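the conditional wrapper described above could look roughly like this — a minimal sketch, assuming `build.zig` exposes the option through a `build_options` module; the names and structure are illustrative, not the actual `src/main.zig` code:

```zig
const std = @import("std");
const build_options = @import("build_options"); // assumption: exposes `use_gpa: bool`

// GPA backed by c_allocator, so allocation behavior matches the production
// build except for the leak-tracking metadata.
var gpa = std.heap.GeneralPurposeAllocator(.{}){
    .backing_allocator = std.heap.c_allocator,
};

pub fn main() !void {
    const allocator = if (build_options.use_gpa) gpa.allocator() else std.heap.c_allocator;

    // ... run the relay with `allocator` ...
    _ = allocator;

    // on clean shutdown (after all components deinit), report every
    // allocation that was never freed, with the stack trace of its alloc site:
    if (build_options.use_gpa) {
        _ = gpa.deinit(); // logs leaks to stderr; returns .leak or .ok
    }
}
```

the point of `backing_allocator = c_allocator` is that only the bookkeeping differs from the normal build — the leak sites GPA reports should be the same ones leaking in production.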
**deployment attempt (2026-03-07)**:

- GPA's per-allocation metadata tracking consumed memory ~55x faster than the base leak (~16 GiB/hour vs ~290 MiB/hour). at ~700 frames/sec × ~37 allocs/frame ≈ ~26K tracked allocations/sec, the metadata itself dominates.
- caused a severe sawtooth pattern: ~7-8 OOM kills in ~3 hours (8 GiB limit)
- first pod: logs lost when `kubectl delete pod` was used (should have used `kubectl scale --replicas=0`)
- second pod: RocksDB lock file was stale after the first crash and had to be cleared manually
- reverted to the normal ReleaseSafe build after ~4 hours (relay was submitted for testing)

**learnings for next attempt**:

- reduce incoming load (fewer PDS hosts) to slow memory growth enough that GPA overhead doesn't OOM
- or temporarily raise the memory limit (e.g. 16 GiB) for the diagnostic window
- use `kubectl scale deployment/zlay -n zlay --replicas=0` to preserve logs (not `kubectl delete pod`)
- the container lacks a `kill` binary — need an admin endpoint, or install procps in the image
- consider adding an `/admin/shutdown` HTTP endpoint to trigger graceful shutdown without `kill`

**status**: paused — code merged (compiled out by default), needs a better deployment strategy
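the admin-endpoint idea from the learnings above could be as small as re-raising SIGTERM in-process, so the existing graceful-shutdown path (component deinit, then the GPA leak report) is reused rather than duplicated. a hypothetical sketch — `handleAdminShutdown` and the routing around it are invented names, not merged code:

```zig
const std = @import("std");

// hypothetical handler body for POST /admin/shutdown — wiring it into
// zlay's HTTP router is not shown here.
fn handleAdminShutdown() void {
    // re-raise SIGTERM so the normal shutdown path runs (components deinit,
    // GPA prints its leak report) — no `kill` binary needed in the container.
    std.posix.raise(std.posix.SIG.TERM) catch |err| {
        std.log.err("admin shutdown: raise(SIGTERM) failed: {}", .{err});
    };
}
```

the endpoint would need to be protected (admin auth or cluster-internal only), since it kills the relay.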