# deployment
zlay runs on a Hetzner CPX41 in Hillsboro OR, managed via k3s. all deployment is orchestrated from the relay repo using just recipes.
## build and deploy

the preferred method builds natively on the server (fast, no cross-compilation):

```sh
just zlay-publish-remote
```
this SSHs into the server and:
- `git pull --ff-only` in `/opt/zlay`
- `zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu`
- `buildah bud -f Dockerfile.runtime .` — thin runtime image with SHA tag
- pushes to k3s containerd via `buildah push` → `ctr images import`
- `kubectl set image deployment/zlay -n zlay main=<sha-tagged-image>` + `kubectl rollout status`
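end to end, the recipe is roughly equivalent to the following remote session. this is a sketch, not the actual just recipe: the ssh user, image name, tag format, and archive path are all assumptions.

```sh
# sketch only — user, image name, and paths are illustrative assumptions
ssh root@"$(just zlay-server-ip)" <<'EOF'
set -euo pipefail
cd /opt/zlay
git pull --ff-only
zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu
SHA=$(git rev-parse --short HEAD)
buildah bud -f Dockerfile.runtime -t "zlay:${SHA}" .
# hand the image to k3s containerd (no registry involved)
buildah push "zlay:${SHA}" "oci-archive:/tmp/zlay-${SHA}.tar"
ctr -n k8s.io images import "/tmp/zlay-${SHA}.tar"
kubectl set image deployment/zlay -n zlay "main=zlay:${SHA}"
kubectl rollout status deployment/zlay -n zlay
EOF
```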
the runtime image (Dockerfile.runtime) is minimal: debian bookworm-slim + ca-certificates + the binary.
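a plausible shape for that image, sketched from the description above (the binary path and name are assumptions, not the actual file):

```dockerfile
# hypothetical sketch of Dockerfile.runtime — binary location is an assumption
FROM debian:bookworm-slim
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
COPY zig-out/bin/zlay /usr/local/bin/zlay
ENTRYPOINT ["/usr/local/bin/zlay"]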
### why not Docker build?
the full Dockerfile exists for CI/standalone builds but is slow on Mac (cross-compilation + QEMU). zlay-publish-remote skips all of that by building on the target architecture.
### build flags
- `-Dtarget=x86_64-linux-gnu` — must use glibc, not musl. zig 0.15's C++ codegen for musl produces illegal instructions in RocksDB's LRU cache.
- `-Dcpu=baseline` — required when building inside Docker/QEMU (not needed for `zlay-publish-remote` since it builds natively).
- `-Doptimize=ReleaseSafe` — safety checks on, optimizations on. production default since 2026-03-05. previously caused OOM (see incident-2026-03-04.md) — resolved by the frame pool moving heavy work off reader threads.
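for a containerized (Docker/QEMU) build, the flags above combine into something like:

```sh
zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu -Dcpu=baseline
```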
## initial setup

```sh
just zlay-init        # terraform init
just zlay-infra       # create Hetzner server with k3s
just zlay-kubeconfig  # pull kubeconfig (~2 min after creation)
just zlay-deploy      # full deploy: cert-manager, postgres, relay, monitoring
```
point the DNS A record for `ZLAY_DOMAIN` at the server IP (`just zlay-server-ip`) before deploying.
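a quick way to confirm propagation before running the deploy (not one of the just recipes; assumes `dig` is installed):

```sh
# compare what DNS resolves against the server's actual IP
[ "$(dig +short "$ZLAY_DOMAIN" | tail -n1)" = "$(just zlay-server-ip)" ] \
  && echo "DNS OK" || echo "DNS not propagated yet"
```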
## environment variables

set in `.env` in the relay repo:
| variable | required | description |
|---|---|---|
| `HCLOUD_TOKEN` | yes | Hetzner Cloud API token |
| `ZLAY_DOMAIN` | yes | public domain (e.g. zlay.waow.tech) |
| `ZLAY_ADMIN_PASSWORD` | yes | bearer token for admin endpoints |
| `ZLAY_POSTGRES_PASSWORD` | yes | postgres password |
| `LETSENCRYPT_EMAIL` | yes | email for TLS certificates |
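a starting-point `.env` (every value below is a placeholder):

```sh
# example .env — all values are placeholders
HCLOUD_TOKEN=your-hetzner-api-token
ZLAY_DOMAIN=zlay.waow.tech
ZLAY_ADMIN_PASSWORD=change-me
ZLAY_POSTGRES_PASSWORD=change-me
LETSENCRYPT_EMAIL=ops@example.com
```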
## operations

```sh
just zlay-status  # nodes, pods, health
just zlay-logs    # tail relay logs
just zlay-health  # curl public health endpoint
just zlay-ssh     # ssh into server
```
## infrastructure
- server: Hetzner CPX41 — 16 vCPU (AMD), 32 GB RAM, 240 GB NVMe
- k3s: single-node kubernetes with traefik ingress
- cert-manager: automatic TLS via Let's Encrypt
- postgres: bitnami/postgresql helm chart (relay state, backfill progress)
- monitoring: prometheus + grafana via kube-prometheus-stack
- terraform: `infra/zlay/` in the relay repo
## memory tuning
four changes brought steady-state memory from ~6.6 GiB down to ~1.1 GiB at ~2,250 connected hosts (ReleaseSafe):
1. **shared TLS CA bundle.** the biggest single win. websocket.zig's TLS client calls `Bundle.rescan()` per connection, loading the system CA certificates into a per-connection arena. with ~2,750 PDS connections, that's ~2,750 copies of the CA bundle in memory (~800 KB each = ~2.2 GiB). fix: load the bundle once in the slurper and pass it to all subscribers via `config.ca_bundle`. memory dropped from ~3.3 GiB to ~1.2 GiB (~65% reduction).
2. **thread stack sizes.** zig's default thread stack is 16 MB. with ~2,750 subscriber threads that maps 44 GB of virtual memory. all `Thread.spawn` calls now use `main.default_stack_size` (8 MB). this is virtual memory — only touched pages count as RSS. 8 MB still supports ReleaseSafe's TLS handshake path (~134 KiB peak stack).
3. **`c_allocator` instead of `GeneralPurposeAllocator`.** GPA is a debug allocator — it tracks per-allocation metadata and never returns freed small allocations to the OS. since zlay links glibc (build.zig:42), `std.heap.c_allocator` gives us glibc malloc with per-thread arenas, madvise-based page return, and production-grade fragmentation mitigation.
4. **frame processing pool.** reader threads (one per PDS) now only do TLS read, header decode, cursor tracking, and rate limiting — then queue raw frames to a shared pool of 16 workers. this dramatically reduced per-thread RSS in ReleaseSafe (from ~3.9 MiB to ~0.45 MiB) by keeping crypto, DB, and broadcast off reader thread stacks.
## resource usage
| metric | value |
|---|---|
| memory | ~1.1 GiB at ~2,250 hosts (ReleaseSafe), projected ~1.3 GiB steady state |
| CPU | ~1.5 cores peak |
| requests | 1 GiB memory, 1000m CPU |
| limits | 8 GiB memory |
| PVC | 20 GiB (events + RocksDB collection index) |
| postgres | ~238 MiB |
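in the deployment manifest, the requests/limits above would appear roughly as follows (a sketch; the actual manifest layout is not shown in this doc):

```yaml
# sketch of the resources stanza implied by the table above
resources:
  requests:
    memory: 1Gi
    cpu: 1000m
  limits:
    memory: 8Gi
```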
## git push

the zlay repo is hosted on tangled. pushing requires the tangled SSH key:

```sh
GIT_SSH_COMMAND="ssh -i ~/.ssh/tangled_ed25519 -o IdentitiesOnly=yes" git push
```
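alternatively, an `~/.ssh/config` entry avoids the `GIT_SSH_COMMAND` prefix. the host alias and hostname below are assumptions; substitute the actual tangled SSH host:

```
# ~/.ssh/config — alias and HostName are assumptions
Host tangled
    HostName tangled.sh
    User git
    IdentityFile ~/.ssh/tangled_ed25519
    IdentitiesOnly yes
```

with this in place, remotes using the `tangled` alias pick up the key automatically.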