atproto relay implementation in zig zlay.waow.tech

experiments#

active experiments on the deployed relay. each entry tracks what changed, why, how to verify, and how to revert.


exp-001: SmpAllocator instead of glibc malloc (2026-03-06)#

hypothesis: zlay's linear RSS growth (~290 MiB/hour) is caused by glibc malloc fragmentation under cross-thread alloc/free patterns. ~2750 subscriber threads allocate frame data that's freed by 16 worker threads. glibc's per-thread arenas (even with MALLOC_ARENA_MAX=2) don't return these cross-thread freed pages to the OS.
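
the shape of the suspect pattern, as an illustrative zig sketch (FrameQueue, frame_len, and process are hypothetical names, not zlay source):

// subscriber threads allocate each frame and hand it off; a worker thread
// frees it later — so frees land in a different thread's arena than the
// matching allocations.
fn subscriberLoop(alloc: std.mem.Allocator, queue: *FrameQueue) !void {
    while (true) {
        const frame = try alloc.alloc(u8, frame_len); // allocated on this thread…
        try queue.push(frame);
    }
}

fn workerLoop(alloc: std.mem.Allocator, queue: *FrameQueue) void {
    while (queue.pop()) |frame| {
        process(frame);
        alloc.free(frame); // …freed on another thread
    }
}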

what changed:

  • src/main.zig: std.heap.c_allocator → std.heap.smp_allocator
  • src/main.zig: removed malloc_trim(0) from GC loop (SmpAllocator doesn't use glibc heap)
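
the swap itself is a one-line change; a sketch (the surrounding declaration is illustrative, not the exact zlay source):

// before:
// const allocator = std.heap.c_allocator;

// after — std.heap.smp_allocator ships with zig since 0.14:
const allocator = std.heap.smp_allocator;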

why SmpAllocator:

  • zig's built-in multi-threaded allocator (since 0.14)
  • uses mmap/munmap directly — no glibc malloc involvement
  • thread-local freelists with cross-thread reclamation (exactly our problem)
  • zero new dependencies

evidence supporting this:

  • rsky (Rust relay) uses mimalloc globally and doesn't have this problem
  • indigo (Go relay) uses Go GC which has no per-thread arena fragmentation
  • page_allocator experiment (per-frame arenas only) didn't help — leak is in cross-thread c_allocator paths
  • malloc_trim(0) didn't help — only trims main glibc arena
  • mallinfo() was misleading — only reports main arena, not per-thread arenas

verification:

  1. build succeeds
  2. deploy, pod starts, /_health returns 200
  3. firehose streams, listReposByCollection works
  4. watch grafana over 12-24 hours:
    • relay_process_rss_bytes should plateau (not climb linearly)
    • relay_malloc_arena_bytes should be near-zero (glibc no longer in use)
  5. if RSS stabilizes under ~1.5 GiB after caches fill, experiment succeeded
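
for reference, a hedged sketch of how an RSS gauge like relay_process_rss_bytes can be sampled on linux (function name is illustrative; std.heap.pageSize() is the zig 0.14 spelling):

const std = @import("std");

fn readRssBytes() !u64 {
    // /proc/self/statm: "size resident shared text lib data dt", in pages;
    // field 2 (resident) is what RSS dashboards track.
    var buf: [128]u8 = undefined;
    const contents = try std.fs.cwd().readFile("/proc/self/statm", &buf);
    var it = std.mem.tokenizeScalar(u8, contents, ' ');
    _ = it.next(); // skip total program size
    const resident_pages = try std.fmt.parseInt(u64, it.next().?, 10);
    return resident_pages * std.heap.pageSize();
}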

revert:

// src/main.zig — change allocator back:
const allocator = std.heap.c_allocator;

// src/main.zig — restore malloc_trim in gcLoop:
_ = malloc_h.malloc_trim(0);
log.info("gc: malloc_trim complete", .{});

result: FAILED — RSS grew at ~670 MiB/hour (worse than c_allocator's ~290 MiB/hour). this disproves glibc fragmentation as the root cause. the leak is genuine — memory is allocated and never freed. reverted to c_allocator.

status: reverted (2026-03-07)


exp-002: GPA leak detection (2026-03-07)#

goal: identify exactly which allocations are leaking by using zig's GeneralPurposeAllocator as a wrapper. GPA tracks every alloc/free and reports unfreed allocations with stack traces on clean shutdown.

what changed:

  • build.zig: added -Duse_gpa=true build option
  • src/main.zig: conditional GPA wrapper — when enabled, all allocations go through GPA backed by c_allocator. on SIGTERM, after all components deinit, GPA reports leaks.
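
a sketch of the wiring (the general shape for zig 0.14; exact names in zlay's build.zig and main.zig may differ):

// build.zig — expose the flag as a comptime build option:
const use_gpa = b.option(bool, "use_gpa", "wrap allocator in GPA for leak detection") orelse false;
const opts = b.addOptions();
opts.addOption(bool, "use_gpa", use_gpa);
exe.root_module.addOptions("build_options", opts);

// src/main.zig — conditional wrapper; use_gpa is comptime-known, so the
// release build pays zero cost when the flag is off:
const std = @import("std");
const build_options = @import("build_options");

pub fn main() !void {
    var gpa: std.heap.GeneralPurposeAllocator(.{}) = .{
        .backing_allocator = std.heap.c_allocator,
    };
    defer if (build_options.use_gpa) {
        _ = gpa.deinit(); // reports every unfreed allocation with its stack trace
    };
    const allocator = if (build_options.use_gpa) gpa.allocator() else std.heap.c_allocator;
    _ = allocator; // hand to components as usual
}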

how to use:

# build with GPA enabled (on the server):
just zlay publish-remote ReleaseSafe --gpa
# or manually:
zig build -Doptimize=ReleaseSafe -Duse_gpa=true -Dtarget=x86_64-linux-gnu

# let it run for 10-30 minutes, then:
kubectl exec -n zlay deploy/zlay -- kill -TERM 1

# read the leak report:
kubectl logs -n zlay deploy/zlay --previous | grep -A5 "GPA"

performance impact: GPA adds a mutex + metadata tracking per alloc/free. expect ~2-5x slower throughput. this is a diagnostic build, not for production.

what to look for in output:

  • GPA logs to stderr on deinit. each leaked allocation shows the stack trace of where it was allocated.
  • look for the most frequently repeated stack traces — those are the hot leak sites.

revert: just rebuild without -Duse_gpa=true (default is false, zero overhead).

deployment attempt (2026-03-07):

  • GPA's per-allocation metadata tracking consumed memory ~55x faster than the base leak (~16 GiB/hour vs ~290 MiB/hour). at ~700 frames/sec × ~37 allocs/frame = ~26K tracked allocations/sec, the metadata itself dominates.
  • caused severe sawtooth pattern: ~7-8 OOM kills in ~3 hours (8 GiB limit)
  • first pod: logs lost when kubectl delete pod was used (should have used kubectl scale --replicas=0)
  • second pod: RocksDB lock file stale after first crash, had to clear manually
  • reverted to normal ReleaseSafe build after ~4 hours (relay was submitted for testing)

learnings for next attempt:

  • need to reduce incoming load (fewer PDS hosts) to slow memory growth enough that GPA overhead doesn't OOM
  • or increase memory limit temporarily (e.g. 16 GiB) for the diagnostic window
  • use kubectl scale deployment/zlay -n zlay --replicas=0 to preserve logs (not kubectl delete pod)
  • container lacks kill binary — need an admin endpoint or install procps in the image
  • consider adding /admin/shutdown HTTP endpoint to trigger graceful shutdown without kill
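
a sketch of such an endpoint against zig 0.14's std.http.Server (port, path, and the shutdown-flag wiring are assumptions, not zlay's actual code):

const std = @import("std");

fn adminLoop(shutdown_requested: *std.atomic.Value(bool)) !void {
    const addr = try std.net.Address.parseIp("127.0.0.1", 8081);
    var listener = try addr.listen(.{ .reuse_address = true });
    defer listener.deinit();

    while (!shutdown_requested.load(.acquire)) {
        const conn = try listener.accept();
        defer conn.stream.close();

        var buf: [4096]u8 = undefined;
        var http = std.http.Server.init(conn, &buf);
        var req = http.receiveHead() catch continue;

        if (std.mem.eql(u8, req.head.target, "/admin/shutdown")) {
            shutdown_requested.store(true, .release);
            try req.respond("shutting down\n", .{});
        } else {
            try req.respond("not found\n", .{ .status = .not_found });
        }
    }
}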

status: paused — code merged (compiled out by default), needs better deployment strategy