zlay — atproto relay implementation in zig (zlay.waow.tech)

deployment#

zlay runs on a Hetzner CPX41 in Hillsboro OR, managed via k3s. all deployment is orchestrated from the relay repo using just recipes.

build and deploy#

the preferred method builds natively on the server (fast, no cross-compilation):

just zlay-publish-remote

this SSHs into the server and:

  1. git pull --ff-only in /opt/zlay
  2. zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu — native x86_64 build
  3. buildah bud -t atcr.io/zzstoatzz.io/zlay:latest -f Dockerfile.runtime . — thin runtime image
  4. pushes to k3s containerd via buildah push + ctr images import
  5. kubectl rollout restart deployment/zlay -n zlay
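end to end, the remote sequence amounts to roughly the following (a sketch: the exact recipe lives in the justfile, and the `zlay-server` host alias and oci-archive handoff path are assumptions):

```shell
# sketch of what zlay-publish-remote runs over SSH; paths are illustrative
ssh zlay-server 'set -euo pipefail
  cd /opt/zlay
  git pull --ff-only
  zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu
  buildah bud -t atcr.io/zzstoatzz.io/zlay:latest -f Dockerfile.runtime .
  buildah push atcr.io/zzstoatzz.io/zlay:latest oci-archive:/tmp/zlay.tar
  sudo k3s ctr images import /tmp/zlay.tar
  kubectl rollout restart deployment/zlay -n zlay'
```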

the runtime image (Dockerfile.runtime) is minimal: debian bookworm-slim + ca-certificates + the binary.
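a minimal sketch of what such a runtime image could look like (the zig-out/bin/zlay path is zig's default output location, but treat it as an assumption):

```dockerfile
FROM debian:bookworm-slim
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
# the binary is built on the host by `zig build`, not inside this image
COPY zig-out/bin/zlay /usr/local/bin/zlay
ENTRYPOINT ["/usr/local/bin/zlay"]
```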

why not Docker build?#

the full Dockerfile exists for CI/standalone builds but is slow on Mac (cross-compilation + QEMU). zlay-publish-remote skips all of that by building on the target architecture.

build flags#

  • -Dtarget=x86_64-linux-gnu — must use glibc, not musl. zig 0.15's C++ codegen for musl produces illegal instructions in RocksDB's LRU cache.
  • -Dcpu=baseline — required when building inside Docker/QEMU (not needed for zlay-publish-remote since it builds natively).
  • -Doptimize=ReleaseSafe — safety checks on, optimizations on.
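put together, the two build invocations look like this (a sketch assembled from the flags above):

```shell
# native build on the server (the zlay-publish-remote path)
zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu

# inside Docker/QEMU: also pin the CPU to baseline
zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu -Dcpu=baseline
```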

initial setup#

just zlay-init          # terraform init
just zlay-infra         # create Hetzner server with k3s
just zlay-kubeconfig    # pull kubeconfig (~2 min after creation)
just zlay-deploy        # full deploy: cert-manager, postgres, relay, monitoring

point DNS A record for ZLAY_DOMAIN at the server IP (just zlay-server-ip) before deploying.
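one way to check the record before deploying (using dig is an assumption about available tooling):

```shell
# compare the A record against the server address
dig +short A "$ZLAY_DOMAIN"
just zlay-server-ip
```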

environment variables#

set in .env in the relay repo:

variable                 required  description
HCLOUD_TOKEN             yes       Hetzner Cloud API token
ZLAY_DOMAIN              yes       public domain (e.g. zlay.waow.tech)
ZLAY_ADMIN_PASSWORD      yes       bearer token for admin endpoints
ZLAY_POSTGRES_PASSWORD   yes       postgres password
LETSENCRYPT_EMAIL        yes       email for TLS certificates
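a sample .env with placeholder values (secrets elided):

```
HCLOUD_TOKEN=...                  # Hetzner Cloud API token
ZLAY_DOMAIN=zlay.waow.tech
ZLAY_ADMIN_PASSWORD=...           # bearer token for admin endpoints
ZLAY_POSTGRES_PASSWORD=...
LETSENCRYPT_EMAIL=you@example.com
```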

operations#

just zlay-status        # nodes, pods, health
just zlay-logs          # tail relay logs
just zlay-health        # curl public health endpoint
just zlay-ssh           # ssh into server
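by hand, the health check might look like this (the /xrpc/_health path is an assumption based on common atproto service convention; check what zlay actually serves):

```shell
curl -fsS "https://zlay.waow.tech/xrpc/_health"
```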

infrastructure#

  • server: Hetzner CPX41 — 16 vCPU (AMD), 32 GB RAM, 240 GB NVMe
  • k3s: single-node kubernetes with traefik ingress
  • cert-manager: automatic TLS via Let's Encrypt
  • postgres: bitnami/postgresql helm chart (relay state, backfill progress)
  • monitoring: prometheus + grafana via kube-prometheus-stack
  • terraform: infra/zlay/ in the relay repo

memory tuning#

two changes brought steady-state memory from ~6.6 GiB down to ~2.9 GiB at 2,738 connected hosts:

thread stack sizes. zig's default thread stack is 16 MB. with ~2,750 subscriber threads that maps 44 GB of virtual memory. most threads just read websockets and decode CBOR — 2 MB is generous. all Thread.spawn calls now pass .{ .stack_size = 2 * 1024 * 1024 }. the constant is defined in main.zig as default_stack_size for the threads spawned there; other modules use the literal directly.
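the virtual-memory math behind that, as a quick sanity check:

```shell
threads=2750
# default 16 MB stacks: ~44 GB of virtual mappings
echo "$((threads * 16)) MB"
# 2 MB stacks: ~5.5 GB
echo "$((threads * 2)) MB"
```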

c_allocator instead of GeneralPurposeAllocator. GPA is a debug allocator — it tracks per-allocation metadata and never returns freed small allocations to the OS. since zlay links glibc (build.zig:42), std.heap.c_allocator gives us glibc malloc with per-thread arenas, madvise-based page return, and production-grade fragmentation mitigation.

resource usage#

metric    value
memory    ~2.9 GiB steady state (~2,750 hosts)
CPU       ~1.5 cores peak
limits    8 GiB memory, 250m CPU request
PVC       20 GiB (events + RocksDB collection index)
postgres  ~238 MiB

git push#

the zlay repo is hosted on tangled. pushing requires the tangled SSH key:

GIT_SSH_COMMAND="ssh -i ~/.ssh/tangled_ed25519 -o IdentitiesOnly=yes" git push
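alternatively, the key selection can live in the repo config so a plain git push works (core.sshCommand is standard git; the checkout path is illustrative):

```shell
cd ~/src/zlay   # wherever the checkout lives
git config core.sshCommand "ssh -i ~/.ssh/tangled_ed25519 -o IdentitiesOnly=yes"
git push        # no GIT_SSH_COMMAND prefix needed now
```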