# zlay: next steps

## context

we just shipped inductive proof chain plumbing to zlay (commit `1ecf365` in
`@zzstoatzz.io/zlay`). this session was run from the zat repo directory — going
forward, work directly from this relay repo instead.

### what landed

- **phase 1**: fixed `extractOps`/`checkCommitStructure` in `validator.zig` to
  read the firehose `path` field instead of the nonexistent `collection`/`rkey`
  fields. `verify_commit_diff` was dead code before this fix.
- **phase 2**: added `since`/`prevData` chain continuity checks and future-rev
  rejection in `frame_worker.zig` and `subscriber.zig`. log-only plus a
  `chain_breaks` metric (no commits dropped). a panel was added to the grafana
  dashboard.
- **phase 3**: a conditional upsert (`WHERE rev < EXCLUDED.rev`) on
  `updateAccountState` in `event_log.zig` to prevent concurrent workers from
  rolling back the rev on the same DID.

### where to find more context

- plan transcript: `~/.claude/projects/-Users-nate-tangled-sh--zzstoatzz-io-zat/5335a98d-69e2-44d6-9596-5832272df710.jsonl`
- memory file: `~/.claude/projects/-Users-nate-tangled-sh--zzstoatzz-io-zat/memory/MEMORY.md`
- zlay CLAUDE.md: `@zzstoatzz.io/zlay/CLAUDE.md` (deploy instructions, known issues)
- dashboard: `zlay/deploy/zlay-dashboard.json` (in this repo)
- grafana: `https://zlay-metrics.waow.tech`

### phase 4 (not yet implemented)

desync + resync on chain breaks. collect `chain_breaks` data from phase 2 first
to understand frequency and patterns before building the resync machinery. the
design sketch is in the plan transcript.

## investigate: memory fragmentation

**this is the main open item.** it is pre-existing and unrelated to the chain
plumbing work.

### what we observed (3h window, grafana)

- malloc arena (memory claimed from the OS) climbing steadily toward ~3.5 GiB
- malloc in-use much lower (~500 MB) — the gap is fragmentation
- process RSS stairstepping upward to ~4 GiB, never returning pages to the OS
- the memory limit is 8 GiB (`zlay/deploy/zlay-values.yaml:57`)

### what's already been tried

`zlay/deploy/zlay-values.yaml` already sets:
- `MALLOC_ARENA_MAX=2` (limit glibc's arena count)
- `MALLOC_TRIM_THRESHOLD_=131072` (return freed pages to the OS when more than
  128 KiB sits free at the top of the heap)

fragmentation is happening despite these tunings.

### likely cause

the arena-per-frame pattern: every websocket frame creates an `ArenaAllocator`,
decodes CBOR into it, processes the frame, then frees everything at once. at
~700 frames/sec with ~2,748 concurrent PDS connections, this creates massive
allocation churn, and glibc ptmalloc is known to be bad at returning pages to
the OS under this pattern.

### things to investigate

- periodic `malloc_trim(0)` call (e.g. every 10s on a timer thread) to force
  page return. cheapest fix if it works.
- `MALLOC_MMAP_THRESHOLD_` tuning — lower it so more allocations go through
  mmap (which is returned to the OS on free) instead of sbrk
- switch to jemalloc or mimalloc via `LD_PRELOAD` — both handle arena churn
  better than ptmalloc. mimalloc in particular is good at returning pages.
- zig's `std.heap.c_allocator` vs `std.heap.page_allocator` — check what the
  `ArenaAllocator` is backed by and whether switching the backing allocator helps
- profile allocation sizes: if most arenas are small and similar-sized, a
  fixed-size pool allocator that reuses arenas across frames would eliminate
  the churn entirely