declarative relay deployment on hetzner relay.waow.tech
atproto

add TODO.md: chain plumbing context + memory investigation handoff

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+73
TODO.md
# zlay: next steps

## context

we just shipped inductive proof chain plumbing to zlay (commit `1ecf365` in
`@zzstoatzz.io/zlay`). this session was run from the zat repo directory — going
forward, work directly from this relay repo instead.

### what landed

- **phase 1**: fixed `extractOps`/`checkCommitStructure` in `validator.zig` to
  read the firehose `path` field instead of the nonexistent `collection`/`rkey`
  fields. `verify_commit_diff` was dead code before this fix.
- **phase 2**: added `since`/`prevData` chain continuity checks and future-rev
  rejection in `frame_worker.zig` and `subscriber.zig`. log-only + `chain_breaks`
  metric (no commits dropped). panel added to the grafana dashboard.
- **phase 3**: conditional upsert (`WHERE rev < EXCLUDED.rev`) on
  `updateAccountState` in `event_log.zig` to prevent concurrent workers from
  rolling back the rev on the same DID.

### where to find more context

- plan transcript: `~/.claude/projects/-Users-nate-tangled-sh--zzstoatzz-io-zat/5335a98d-69e2-44d6-9596-5832272df710.jsonl`
- memory file: `~/.claude/projects/-Users-nate-tangled-sh--zzstoatzz-io-zat/memory/MEMORY.md`
- zlay CLAUDE.md: `@zzstoatzz.io/zlay/CLAUDE.md` (deploy instructions, known issues)
- dashboard: `zlay/deploy/zlay-dashboard.json` (in this repo)
- grafana: `https://zlay-metrics.waow.tech`

### phase 4 (not yet implemented)

desync + resync on chain breaks. collect `chain_breaks` data from phase 2 first
to understand frequency and patterns before building the resync machinery. the
design sketch is in the plan transcript.

## investigate: memory fragmentation

**this is the main open item.** pre-existing, unrelated to the chain plumbing work.
### what we observed (3h window, grafana)

- malloc arena (claimed from the OS) climbing steadily toward ~3.5 GiB
- malloc in-use much lower (~500 MB) — the gap is fragmentation
- process RSS stairstepping upward to ~4 GiB, never returning pages to the OS
- memory limit is 8 GiB (`zlay/deploy/zlay-values.yaml:57`)

### what's already been tried

`zlay/deploy/zlay-values.yaml` already sets:

- `MALLOC_ARENA_MAX=2` (limit glibc arena count)
- `MALLOC_TRIM_THRESHOLD_=131072` (return freed pages when free > 128 KB)

fragmentation is happening despite these tunings.

### likely cause

arena-per-frame pattern: every websocket frame creates an `ArenaAllocator`,
decodes CBOR into it, processes, then frees. at ~700 frames/sec with ~2,748
concurrent PDS connections, this creates massive allocation churn. glibc
ptmalloc is known to be bad at returning pages to the OS under this pattern.

### things to investigate

- periodic `malloc_trim(0)` call (e.g. every 10s on a timer thread) to force
  page return. the cheapest fix if it works.
- `MALLOC_MMAP_THRESHOLD_` tuning — lower it so more allocations go through
  mmap (which is returned to the OS on free) instead of sbrk
- switch to jemalloc or mimalloc via `LD_PRELOAD` — both handle arena churn
  better than ptmalloc. mimalloc in particular is good at returning pages.
- zig's `std.heap.c_allocator` vs `std.heap.page_allocator` — check what the
  `ArenaAllocator` is backed by and whether switching the backing allocator helps
- profile allocation sizes: if most arenas are small and similar-sized, a
  fixed-size pool allocator that reuses arenas across frames would eliminate
  the churn entirely