# zlay: next steps
## context
we just shipped inductive proof chain plumbing to zlay (commit 1ecf365 in
@zzstoatzz.io/zlay). this session was run from the zat repo directory — going
forward, work directly from this relay repo instead.
## what landed
- phase 1: fixed `extractOps`/`checkCommitStructure` in `validator.zig` to read the firehose `path` field instead of nonexistent `collection`/`rkey`. `verify_commit_diff` was dead code before this fix.
- phase 2: added `since`/`prevData` chain continuity checks and future-rev rejection in `frame_worker.zig` and `subscriber.zig`. log-only + `chain_breaks` metric (no commits dropped). panel added to grafana dashboard.
- phase 3: conditional upsert (`WHERE rev < EXCLUDED.rev`) on `updateAccountState` in `event_log.zig` to prevent concurrent workers from rolling back rev on the same DID.
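the phase-2 continuity check can be sketched roughly like this (in C rather than zlay's Zig; `check_chain`, the status names, and the exact rules are illustrative assumptions, not zlay's actual API — ATProto revs are TIDs, which sort lexicographically, so plain string comparison gives a usable ordering):

```c
#include <stddef.h>
#include <string.h>

/* illustrative sketch, not zlay's real code */
typedef enum { CHAIN_OK, CHAIN_STALE, CHAIN_BREAK } chain_status;

chain_status check_chain(const char *stored_rev, /* last rev applied for this DID, or NULL */
                         const char *since,      /* commit's claimed parent rev, or NULL   */
                         const char *rev)        /* commit's own rev                       */
{
    if (stored_rev == NULL)
        return CHAIN_OK;                           /* first commit seen for this DID */
    if (strcmp(rev, stored_rev) <= 0)
        return CHAIN_STALE;                        /* replay, or a rollback attempt  */
    if (since == NULL || strcmp(since, stored_rev) != 0)
        return CHAIN_BREAK;                        /* gap: bump chain_breaks, log only */
    return CHAIN_OK;
}
```

in the log-only phase 2 shipped here, a `CHAIN_BREAK` result only increments the metric; the commit is still applied.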
## where to find more context
- plan transcript: `~/.claude/projects/-Users-nate-tangled-sh--zzstoatzz-io-zat/5335a98d-69e2-44d6-9596-5832272df710.jsonl`
- memory file: `~/.claude/projects/-Users-nate-tangled-sh--zzstoatzz-io-zat/memory/MEMORY.md`
- zlay CLAUDE.md: `@zzstoatzz.io/zlay/CLAUDE.md` (deploy instructions, known issues)
- dashboard: `zlay/deploy/zlay-dashboard.json` (in this repo)
- grafana: https://zlay-metrics.waow.tech
## phase 4 (not yet implemented)
desync + resync on chain breaks. collect chain_breaks data from phase 2 first
to understand frequency and patterns before building the resync machinery.
design sketch is in the plan transcript.
## investigate: memory fragmentation
this is the main open item. pre-existing, unrelated to the chain plumbing work.
### what we observed (3h window, grafana)
- malloc arena (claimed from OS) climbing steadily toward ~3.5 GiB
- malloc in-use much lower (~500 MB) — the gap is fragmentation
- process RSS stairstepping upward to ~4 GiB, never returning pages to OS
- memory limit is 8 GiB (`zlay/deploy/zlay-values.yaml:57`)
### what's already been tried
`zlay/deploy/zlay-values.yaml` already sets:

- `MALLOC_ARENA_MAX=2` (limit glibc arena count)
- `MALLOC_TRIM_THRESHOLD_=131072` (return freed pages when free > 128 KB)

fragmentation is happening despite these tunings.
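for reference, those env vars have runtime equivalents via glibc's `mallopt` (a sketch assuming a glibc target; `apply_malloc_tuning` is a made-up name, and `mallopt` returns 1 on success, 0 on error):

```c
#include <malloc.h>  /* glibc-specific: mallopt, M_ARENA_MAX, M_TRIM_THRESHOLD */

/* runtime equivalent of MALLOC_ARENA_MAX=2 and MALLOC_TRIM_THRESHOLD_=131072 */
static int apply_malloc_tuning(void) {
    int ok = 1;
    ok &= mallopt(M_ARENA_MAX, 2);            /* cap glibc arena count at 2 */
    ok &= mallopt(M_TRIM_THRESHOLD, 131072);  /* trim heap top when free > 128 KB */
    return ok;
}
```

calling this from `main` would let the process tune itself without relying on the deployment env, which might be useful if further experiments need per-build values.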
### likely cause
arena-per-frame pattern: every websocket frame creates an ArenaAllocator,
decodes CBOR into it, processes, then frees. at ~700 frames/sec with ~2,748
concurrent PDS connections, this creates massive allocation churn. glibc ptmalloc
is known to be bad at returning pages to the OS under this pattern.
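boiled down to plain C, the churn pattern looks like this (a sketch: zlay actually uses Zig's `ArenaAllocator` and decodes CBOR, which the `memcpy` stands in for; the reuse variant keeps one scratch buffer alive across frames):

```c
#include <stdlib.h>
#include <string.h>

enum { SCRATCH_SIZE = 64 * 1024 };  /* arbitrary example arena size */

/* churn pattern: a fresh heap allocation for every frame (~700/sec) */
size_t process_frame_churny(const unsigned char *frame, size_t len) {
    unsigned char *scratch = malloc(SCRATCH_SIZE);  /* new "arena" per frame */
    if (scratch == NULL)
        return 0;
    size_t n = len < (size_t)SCRATCH_SIZE ? len : (size_t)SCRATCH_SIZE;
    memcpy(scratch, frame, n);                      /* stand-in for CBOR decode */
    free(scratch);   /* freed, but ptmalloc may keep the pages claimed */
    return n;
}

/* reuse pattern: caller owns one long-lived scratch buffer */
size_t process_frame_reuse(unsigned char *scratch, const unsigned char *frame, size_t len) {
    size_t n = len < (size_t)SCRATCH_SIZE ? len : (size_t)SCRATCH_SIZE;
    memcpy(scratch, frame, n);
    return n;        /* no malloc/free on the hot path */
}
```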
### things to investigate
- periodic `malloc_trim(0)` call (e.g. every 10s on a timer thread) to force page return. cheapest fix if it works.
- `MALLOC_MMAP_THRESHOLD_` tuning — lower it so more allocations go through mmap (which gets returned to the OS on free) instead of sbrk
- switch to jemalloc or mimalloc via `LD_PRELOAD` — both handle arena churn better than ptmalloc. mimalloc in particular is good at returning pages.
- zig's `std.heap.c_allocator` vs `std.heap.page_allocator` — check what the ArenaAllocator is backed by and whether switching the backing allocator helps
- profile allocation sizes: if most arenas are small and similar-sized, a fixed-size pool allocator that reuses arenas across frames would eliminate the churn entirely
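the first two bullets could be prototyped along these lines (a sketch assuming a glibc target; the function names are made up, and 64 KiB is just an example threshold — `malloc_trim(0)` returns 1 if it released memory, 0 otherwise, and setting `M_MMAP_THRESHOLD` via `mallopt` also disables glibc's dynamic threshold adjustment):

```c
#include <malloc.h>   /* glibc-specific: malloc_trim, mallopt */
#include <pthread.h>
#include <unistd.h>

/* timer thread: ask ptmalloc to return free heap pages to the OS every 10s */
static void *trim_loop(void *arg) {
    (void)arg;
    for (;;) {
        sleep(10);
        malloc_trim(0);  /* pad = 0: trim as aggressively as possible */
    }
    return NULL;
}

static void start_trim_thread(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, trim_loop, NULL) == 0)
        pthread_detach(t);  /* fire and forget */
}

/* runtime equivalent of lowering MALLOC_MMAP_THRESHOLD_ */
static int lower_mmap_threshold(void) {
    return mallopt(M_MMAP_THRESHOLD, 64 * 1024);  /* returns 1 on success */
}
```

if the timer-thread trim flattens the RSS stairstep on the dashboard, that would also confirm fragmentation (rather than a leak) as the cause before investing in an allocator swap.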