# zlay: next steps

## context

we just shipped inductive proof chain plumbing to zlay (commit `1ecf365` in `@zzstoatzz.io/zlay`). this session was run from the zat repo directory — going forward, work directly from this relay repo instead.

### what landed

- **phase 1**: fixed `extractOps`/`checkCommitStructure` in `validator.zig` to read the firehose `path` field instead of the nonexistent `collection`/`rkey` fields. `verify_commit_diff` was dead code before this fix.
- **phase 2**: added `since`/`prevData` chain continuity checks and future-rev rejection in `frame_worker.zig` and `subscriber.zig`. log-only for now, with a `chain_breaks` metric (no commits are dropped). a panel was added to the grafana dashboard.
- **phase 3**: conditional upsert (`WHERE rev < EXCLUDED.rev`) on `updateAccountState` in `event_log.zig` to prevent concurrent workers from rolling back the rev for the same DID.

### where to find more context

- plan transcript: `~/.claude/projects/-Users-nate-tangled-sh--zzstoatzz-io-zat/5335a98d-69e2-44d6-9596-5832272df710.jsonl`
- memory file: `~/.claude/projects/-Users-nate-tangled-sh--zzstoatzz-io-zat/memory/MEMORY.md`
- zlay CLAUDE.md: `@zzstoatzz.io/zlay/CLAUDE.md` (deploy instructions, known issues)
- dashboard: `zlay/deploy/zlay-dashboard.json` (in this repo)
- grafana: `https://zlay-metrics.waow.tech`

### phase 4 (not yet implemented)

desync + resync on chain breaks. collect `chain_breaks` data from phase 2 first to understand break frequency and patterns before building the resync machinery. the design sketch is in the plan transcript.

## investigate: memory fragmentation

**this is the main open item.** it is pre-existing and unrelated to the chain plumbing work.
### what we observed (3h window, grafana)

- malloc arena (claimed from the OS) climbing steadily toward ~3.5 GiB
- malloc in-use much lower (~500 MB) — the gap is fragmentation
- process RSS stairstepping upward to ~4 GiB, never returning pages to the OS
- the memory limit is 8 GiB (`zlay/deploy/zlay-values.yaml:57`)

### what's already been tried

`zlay/deploy/zlay-values.yaml` already sets:

- `MALLOC_ARENA_MAX=2` (limit glibc arena count)
- `MALLOC_TRIM_THRESHOLD_=131072` (trim the heap when more than 128 KiB is free at its top)

fragmentation is happening despite these tunings.

### likely cause

arena-per-frame pattern: every websocket frame creates an `ArenaAllocator`, decodes CBOR into it, processes, then frees. at ~700 frames/sec with ~2,748 concurrent PDS connections, this creates massive allocation churn. glibc ptmalloc is known to be bad at returning pages to the OS under this pattern.

### things to investigate

- periodic `malloc_trim(0)` call (e.g. every 10s on a timer thread) to force page return. cheapest fix if it works.
- `MALLOC_MMAP_THRESHOLD_` tuning — lower it so more allocations go through mmap (which is returned to the OS on free) instead of sbrk.
- switch to jemalloc or mimalloc via `LD_PRELOAD` — both handle arena churn better than ptmalloc. mimalloc in particular is good at returning pages.
- zig's `std.heap.c_allocator` vs `std.heap.page_allocator` — check what the `ArenaAllocator` is backed by and whether switching the backing allocator helps.
- profile allocation sizes: if most arenas are small and similarly sized, a fixed-size pool allocator that reuses arenas across frames would eliminate the churn entirely.
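the periodic `malloc_trim(0)` idea from the first investigation bullet could be prototyped as below. this is a minimal sketch assuming glibc (`malloc_trim` is a GNU extension); the function names, the detached-thread setup, and the 10-second interval are illustrative choices, not zlay code.

```c
#include <malloc.h>   // malloc_trim (glibc extension)
#include <pthread.h>
#include <unistd.h>

// force-return free heap pages to the OS. glibc returns 1 if memory
// was actually released, 0 otherwise.
static int trim_once(void) {
    return malloc_trim(0);
}

// background loop: trim every 10 seconds (interval is illustrative).
static void *trim_loop(void *arg) {
    (void)arg;
    for (;;) {
        sleep(10);
        trim_once();
    }
    return NULL;
}

// spawn a detached trimmer thread; call once at startup.
// returns 0 on success (pthread_create convention).
static int start_trim_thread(void) {
    pthread_t t;
    int rc = pthread_create(&t, NULL, trim_loop, NULL);
    if (rc == 0) pthread_detach(t);
    return rc;
}
```

if this stops the RSS stairstep, it confirms the pages are genuinely free and the problem is purely ptmalloc's reluctance to release them, which also makes the jemalloc/mimalloc option more attractive.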
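for the `MALLOC_MMAP_THRESHOLD_` bullet, the same threshold can also be set programmatically at startup via `mallopt` instead of the environment variable. a hedged sketch — the 64 KiB value is a starting point to experiment with, not a recommendation:

```c
#include <malloc.h>  // mallopt, M_MMAP_THRESHOLD (glibc)
#include <stddef.h>

// lower the mmap threshold so medium-sized allocations (e.g. arena
// chunks) are served by mmap, and therefore unmapped back to the OS
// on free, instead of extending the sbrk heap. note: setting
// M_MMAP_THRESHOLD explicitly disables glibc's dynamic threshold
// adjustment, same as setting MALLOC_MMAP_THRESHOLD_ in the env.
static int set_mmap_threshold(size_t bytes) {
    return mallopt(M_MMAP_THRESHOLD, (int)bytes); // 1 on success
}
```

the tradeoff is more mmap/munmap syscalls and page-fault overhead on the hot path, so it is worth benchmarking frame throughput at ~700 frames/sec before and after.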
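the last bullet (a fixed-size pool that reuses arenas across frames) could look roughly like the following. everything here is hypothetical — `FramePool`, the slot count, and the slot size are placeholders, and the real version would hand each buffer to a zig `FixedBufferAllocator`-style arena rather than use raw bytes — but it shows how steady-state frame processing can hit zero malloc/free calls:

```c
#include <stdlib.h>

// hypothetical pool of recycled per-frame scratch buffers, replacing
// the create/destroy-an-arena-per-websocket-frame pattern.
#define POOL_SLOTS 4
#define SLOT_BYTES (64 * 1024)  // illustrative; pick from a size profile

typedef struct {
    void *slots[POOL_SLOTS];
    int top; // number of free buffers currently parked in the pool
} FramePool;

static void pool_init(FramePool *p) {
    p->top = 0;
    for (int i = 0; i < POOL_SLOTS; i++)
        p->slots[p->top++] = malloc(SLOT_BYTES);
}

// reuse a pooled buffer when one is available, so the allocator is
// not touched at all on the hot path.
static void *pool_acquire(FramePool *p) {
    if (p->top > 0) return p->slots[--p->top];
    return malloc(SLOT_BYTES); // pool exhausted: fall back to malloc
}

static void pool_release(FramePool *p, void *buf) {
    if (p->top < POOL_SLOTS) p->slots[p->top++] = buf;
    else free(buf); // pool full: return the extra buffer
}
```

this only pays off if the allocation-size profiling (previous bullet) shows most per-frame arenas are small and similarly sized; otherwise the slots either waste memory or overflow to malloc constantly.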