atproto relay implementation in zig (zlay.waow.tech)

cleanup: remove TODO.md and EXPERIMENTS.md from tracking

content migrated to relay repo's docs/ops-changelog.md.
memory leak is fixed (zat v0.2.14), experiments are closed,
open items tracked centrally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+4 -167
+4
.gitignore
```diff
 data/
 .env
 .env.*
+
+# local scratch (content migrated to relay repo ops-changelog)
+TODO.md
+EXPERIMENTS.md
```
-111
EXPERIMENTS.md
# experiments

active experiments on the deployed relay. each entry tracks what changed, why,
how to verify, and how to revert.

---

## exp-001: SmpAllocator instead of glibc malloc (2026-03-06)

**hypothesis**: zlay's linear RSS growth (~290 MiB/hour) is caused by glibc
malloc fragmentation under cross-thread alloc/free patterns. ~2750 subscriber
threads allocate frame data that's freed by 16 worker threads. glibc's per-thread
arenas (even with MALLOC_ARENA_MAX=2) don't return these cross-thread freed pages
to the OS.

**what changed**:
- `src/main.zig`: `std.heap.c_allocator` → `std.heap.smp_allocator`
- `src/main.zig`: removed `malloc_trim(0)` from GC loop (SmpAllocator doesn't use the glibc heap)

**why SmpAllocator**:
- zig's built-in multi-threaded allocator (since 0.14)
- uses mmap/munmap directly — no glibc malloc involvement
- thread-local freelists with cross-thread reclamation (exactly our problem)
- zero new dependencies

**evidence supporting this**:
- rsky (Rust relay) uses mimalloc globally and doesn't have this problem
- indigo (Go relay) uses the Go GC, which has no per-thread arena fragmentation
- the page_allocator experiment (per-frame arenas only) didn't help — the leak is in cross-thread c_allocator paths
- malloc_trim(0) didn't help — it only trims the main glibc arena
- mallinfo() was misleading — it only reports the main arena, not per-thread arenas

**verification**:
1. build succeeds
2. deploy, pod starts, `/_health` returns 200
3. firehose streams, `listReposByCollection` works
4. watch grafana over 12-24 hours:
   - `relay_process_rss_bytes` should plateau (not climb linearly)
   - `relay_malloc_arena_bytes` should be near-zero (glibc no longer in use)
5. if RSS stabilizes under ~1.5 GiB after caches fill, the experiment succeeded

**revert**:
```zig
// src/main.zig — change allocator back:
const allocator = std.heap.c_allocator;

// src/main.zig — restore malloc_trim in gcLoop:
_ = malloc_h.malloc_trim(0);
log.info("gc: malloc_trim complete", .{});
```

**result**: FAILED — RSS grew at ~670 MiB/hour (worse than c_allocator's ~290 MiB/hour).
this disproves glibc fragmentation as the root cause. the leak is genuine — memory is
allocated and never freed. reverted to c_allocator.

**status**: reverted (2026-03-07)

---

## exp-002: GPA leak detection (2026-03-07)

**goal**: identify exactly which allocations are leaking by using zig's
GeneralPurposeAllocator as a wrapper. GPA tracks every alloc/free and reports
unfreed allocations with stack traces on clean shutdown.

**what changed**:
- `build.zig`: added `-Duse_gpa=true` build option
- `src/main.zig`: conditional GPA wrapper — when enabled, all allocations go through
  GPA backed by c_allocator. on SIGTERM, after all components deinit, GPA reports leaks.

**how to use**:
```bash
# build with GPA enabled (on the server):
just zlay publish-remote ReleaseSafe --gpa
# or manually:
zig build -Doptimize=ReleaseSafe -Duse_gpa=true -Dtarget=x86_64-linux-gnu

# let it run for 10-30 minutes, then:
kubectl exec -n zlay deploy/zlay -- kill -TERM 1

# read the leak report:
kubectl logs -n zlay deploy/zlay --previous | grep -A5 "GPA"
```

**performance impact**: GPA adds a mutex + metadata tracking per alloc/free.
expect ~2-5x slower throughput. this is a diagnostic build, not for production.

**what to look for in output**:
- GPA logs to stderr on deinit. each leaked allocation shows the stack trace
  of where it was allocated.
- look for the most frequently repeated stack traces — those are the hot leak sites.

**revert**: just rebuild without `-Duse_gpa=true` (default is false, zero overhead).

**deployment attempt (2026-03-07)**:
- GPA's per-allocation metadata tracking consumed memory ~55x faster than the base leak
  (~16 GiB/hour vs ~290 MiB/hour). at ~700 frames/sec × ~37 allocs/frame = ~26K tracked
  allocations/sec, the metadata itself dominates.
- caused a severe sawtooth pattern: ~7-8 OOM kills in ~3 hours (8 GiB limit)
- first pod: logs lost when `kubectl delete pod` was used (should have used `kubectl scale --replicas=0`)
- second pod: RocksDB lock file was stale after the first crash, had to be cleared manually
- reverted to the normal ReleaseSafe build after ~4 hours (relay was submitted for testing)

**learnings for next attempt**:
- reduce incoming load (fewer PDS hosts) to slow memory growth enough that GPA overhead doesn't OOM
- or temporarily raise the memory limit (e.g. 16 GiB) for the diagnostic window
- use `kubectl scale deployment/zlay -n zlay --replicas=0` to preserve logs (not `kubectl delete pod`)
- the container lacks a `kill` binary — need an admin endpoint or install procps in the image
- consider adding an `/admin/shutdown` HTTP endpoint to trigger graceful shutdown without `kill`

**status**: paused — code merged (compiled out by default), needs a better deployment strategy
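a rough sketch of the conditional wrapper exp-002 describes. the `use_gpa` constant here stands in for the real `-Duse_gpa` option wired through build.zig (`@import("build_options").use_gpa`); everything else is illustrative, not zlay's actual code:

```zig
const std = @import("std");

// stand-in for `@import("build_options").use_gpa`; in the real build this
// comes from `-Duse_gpa=true` in build.zig.
const use_gpa = true;

// GPA records an allocation stack trace per live allocation, so anything
// still unfreed at deinit is reported with its origin.
var gpa = std.heap.GeneralPurposeAllocator(.{}){};

pub fn main() !void {
    // back GPA with the same heap production uses (requires linking libc)
    gpa.backing_allocator = std.heap.c_allocator;
    const allocator = if (use_gpa) gpa.allocator() else std.heap.c_allocator;

    // ... start relay components with `allocator`, deinit them on SIGTERM ...
    const scratch = try allocator.alloc(u8, 64);
    allocator.free(scratch); // freed: will not appear in the leak report

    if (use_gpa) {
        // after all components have deinited, log one stack trace per leak
        _ = gpa.deinit();
    }
}
```

build with libc linked (e.g. `zig build-exe main.zig -lc`); on a clean shutdown with leaks, GPA writes the per-allocation stack traces to stderr.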
-56
TODO.md
# zlay TODO

third-party review now scores zlay B (84/100) vs indigo A- (89/100).
all findings below checked against indigo source and atproto spec.

## now: error frame cleanup

### OutdatedCursor info frame

`replayTo` (broadcaster.zig) needs to detect when the requested cursor
is older than the replay buffer and send an OutdatedCursor error frame
before closing. currently it silently starts from the oldest available.

## done

- ~~TCP split-write fix~~ (zat v0.2.16 / websocket 395d0f4 — HTTP body now fully read when headers and body arrive in separate TCP segments)
- ~~error frame `t` field~~ (committed 9a22a23, not yet deployed)
- ~~host authority enforcement~~ (committed ac3f10a, not yet deployed)
- ~~getHostStatus 400 → 404~~ (deployed 5baf376)
- ~~listReposByCollection limit 1000 → 2000~~ (deployed 5baf376)
- ~~requestCrawl XRPC error format~~ (deployed 5baf376)
- ~~listRepos always emit head/rev~~ (deployed 5baf376)
- ~~InvalidCursor frame on bad cursor parse~~ (deployed 5baf376)

## later: architectural

### broadcast after flush

indigo's `flushLog` writes to disk first, then calls `broadcast()` for
each event in the buffer. zlay broadcasts immediately after buffering to
memory; the flush happens async in `flushLoop`.

a crash between broadcast and flush means consumers received events that
replay can't produce. indigo doesn't have this window.

fix: move broadcast into `flushLocked()`, after the disk write succeeds.

files: event_log.zig (flushLocked, persist), frame_worker.zig:227,
subscriber.zig:633

### getRepo redirect

router.zig:68 has no getRepo. it should redirect to the PDS hosting the
repo (like indigo service.go:153). needs a DID → host lookup + a new handler.
## not a gap (indigo same behavior)

### validation skip on cache miss

indigo also forwards unverified on identity resolution failure
(verify.go:95-113). not a priority unless indigo changes.

### requestCrawl error name

reviewer flagged `InvalidRequest` vs the lexicon's `HostBanned`. indigo
also diverges (it uses `DomainBan`). not a differentiator.