declarative relay deployment on hetzner relay.waow.tech
atproto

zlay: next steps#

context#

we just shipped inductive proof chain plumbing to zlay (commit 1ecf365 in @zzstoatzz.io/zlay). this session was run from the zat repo directory — going forward, work directly from this relay repo instead.

what landed#

  • phase 1: fixed extractOps/checkCommitStructure in validator.zig to read the firehose path field instead of nonexistent collection/rkey. verify_commit_diff was dead code before this fix.
  • phase 2: added since/prevData chain continuity checks and future-rev rejection in frame_worker.zig and subscriber.zig. log-only + chain_breaks metric (no commits dropped). panel added to grafana dashboard.
  • phase 3: conditional upsert (WHERE rev < EXCLUDED.rev) on updateAccountState in event_log.zig to prevent concurrent workers from rolling back rev on same DID.

where to find more context#

  • plan transcript: ~/.claude/projects/-Users-nate-tangled-sh--zzstoatzz-io-zat/5335a98d-69e2-44d6-9596-5832272df710.jsonl
  • memory file: ~/.claude/projects/-Users-nate-tangled-sh--zzstoatzz-io-zat/memory/MEMORY.md
  • zlay CLAUDE.md: @zzstoatzz.io/zlay/CLAUDE.md (deploy instructions, known issues)
  • dashboard: zlay/deploy/zlay-dashboard.json (in this repo)
  • grafana: https://zlay-metrics.waow.tech

phase 4 (not yet implemented)#

desync + resync on chain breaks. collect chain_breaks data from phase 2 first to understand frequency and patterns before building the resync machinery. design sketch is in the plan transcript.

investigate: memory fragmentation#

this is the main open item. pre-existing, unrelated to the chain plumbing work.

what we observed (3h window, grafana)#

  • malloc arena (claimed from OS) climbing steadily toward ~3.5 GiB
  • malloc in-use much lower (~500 MiB) — the gap is fragmentation
  • process RSS stairstepping upward to ~4 GiB, never returning pages to OS
  • memory limit is 8 GiB (zlay/deploy/zlay-values.yaml:57)

what's already been tried#

zlay/deploy/zlay-values.yaml already sets:

  • MALLOC_ARENA_MAX=2 (limit glibc arena count)
  • MALLOC_TRIM_THRESHOLD_=131072 (release the free block at the top of the heap back to the OS once it exceeds 128 KiB)

fragmentation is happening despite these tunings. one likely reason: the trim threshold only applies to contiguous free space at the top of the heap, so holes in the middle of an arena are never returned automatically.
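the grafana numbers can be cross-checked from inside the process. a minimal glibc-only sketch (helper names are made up; mallinfo2() needs glibc >= 2.33):

```c
/* sketch (glibc-only): read allocator state from inside the process so the
   claimed-vs-in-use gap can be logged alongside the grafana panels. */
#include <malloc.h>
#include <stddef.h>
#include <stdlib.h>

/* bytes claimed from the OS via sbrk for the main arena */
size_t heap_claimed(void)   { return mallinfo2().arena; }

/* bytes currently handed out and in use */
size_t heap_in_use(void)    { return mallinfo2().uordblks; }

/* bytes free()d but still held by malloc: the fragmentation gap */
size_t heap_free_held(void) { return mallinfo2().fordblks; }

/* reproduce the production shape in miniature: a run of decode-sized
   blocks freed beneath a surviving allocation, so the freed space is
   interior and cannot be trimmed from the heap top */
void *churn_with_pin(void) {
    void *p[64];
    for (int i = 0; i < 64; i++) p[i] = malloc(16 * 1024);
    void *pin = malloc(16 * 1024);   /* likely lands above the run */
    for (int i = 0; i < 64; i++) free(p[i]);
    return pin;                      /* caller frees */
}
```

if a periodic log line is preferred over counters, malloc_info(0, stderr) dumps the same data per arena as XML.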

likely cause#

arena-per-frame pattern: every websocket frame creates an ArenaAllocator, decodes CBOR into it, processes, then frees. at ~700 frames/sec with ~2,748 concurrent PDS connections, this creates massive allocation churn. glibc ptmalloc is known to be bad at returning pages to the OS under this pattern.
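for reference, the shape of that pattern translated from the zig ArenaAllocator-per-frame code into a C sketch (names invented; the real logic lives in frame_worker.zig):

```c
/* sketch: one bump region per frame. a single malloc backs all decode
   allocations and the whole region is freed when the frame is done —
   at ~700 frames/sec this alloc/free cycle is what hammers ptmalloc. */
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

typedef struct { unsigned char *base; size_t used, cap; } arena_t;

int arena_init(arena_t *a, size_t cap) {
    a->base = malloc(cap);
    a->used = 0;
    a->cap = cap;
    return a->base != NULL;
}

/* bump-pointer allocation; no growth in this sketch */
void *arena_alloc(arena_t *a, size_t n) {
    if (a->used + n > a->cap) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

void arena_deinit(arena_t *a) { free(a->base); }

/* per-frame lifecycle: create region, decode into it, free wholesale */
size_t process_frame(const unsigned char *payload, size_t len) {
    arena_t a;
    if (!arena_init(&a, 64 * 1024)) return 0;
    unsigned char *copy = arena_alloc(&a, len);  /* stand-in for CBOR decode */
    if (copy) memcpy(copy, payload, len);
    size_t used = a.used;
    arena_deinit(&a);
    return used;
}
```

the pool-allocator idea in the list below amounts to keeping `a.base` alive across frames and resetting `a.used = 0` instead of calling free.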

things to investigate#

  • periodic malloc_trim(0) call (e.g. every 10s on a timer thread) to force page return. cheapest fix if it works.
  • MALLOC_MMAP_THRESHOLD_ tuning — lower it so more allocations go through mmap (which gets returned to OS on free) instead of sbrk
  • switch to jemalloc or mimalloc via LD_PRELOAD — both handle arena churn better than ptmalloc. mimalloc in particular is good at returning pages.
  • zig's std.heap.c_allocator vs std.heap.page_allocator — check what the ArenaAllocator is backed by. page_allocator gets memory straight from mmap and unmaps on free, bypassing ptmalloc entirely, at the cost of a syscall per arena grow/free
  • profile allocation sizes: if most arenas are small and similar-sized, a fixed-size pool allocator that reuses arenas across frames would eliminate the churn entirely
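the first bullet can be prototyped in a few lines of C; from zig the same thing should work by declaring malloc_trim as an extern fn and driving it from an existing timer. interval and function names here are placeholders:

```c
/* sketch of the "cheapest fix": a detached thread that periodically calls
   malloc_trim(0) to force glibc to return freed pages. glibc-only; the
   10s interval is a guess to be tuned against the RSS panel. */
#include <malloc.h>
#include <pthread.h>
#include <unistd.h>
#include <stddef.h>

static void *trim_loop(void *arg) {
    unsigned interval_s = *(unsigned *)arg;
    for (;;) {
        sleep(interval_s);
        /* walks the arenas and releases free pages back to the OS;
           returns 1 if anything was released, 0 otherwise */
        malloc_trim(0);
    }
    return NULL;
}

/* sketch only: call once at startup */
int start_trim_thread(unsigned interval_s) {
    static unsigned interval;   /* must outlive this function */
    interval = interval_s;
    pthread_t t;
    if (pthread_create(&t, NULL, trim_loop, &interval) != 0) return -1;
    pthread_detach(t);
    return 0;
}
```

worth noting while measuring: malloc_trim's return value says whether anything was actually released, which could feed a counter next to chain_breaks.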