atproto relay implementation in zig zlay.waow.tech

docs: add backfill and deployment documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs/backfill.md
# collection index backfill

the collection index (RocksDB) only has entries for accounts that have created records since live indexing was deployed. the backfill imports historical data from a source relay that already has a complete `listReposByCollection` endpoint.

## how it works

the backfiller runs as a background thread, triggered via the admin API. it:

1. **discovers collections** from two sources (unioned):
   - [lexicon garden](https://lexicon.garden/llms.txt) `llms.txt` — ~700 known NSIDs, parsed from markdown links
   - RocksDB scan — collections already observed from the live firehose (RBC column family prefix scan)

2. **inserts progress rows** into postgres (`backfill_progress` table) for each collection. existing rows are skipped (`ON CONFLICT DO NOTHING`), so re-triggering is safe.

3. **pages through each collection** sequentially, calling `com.atproto.sync.listReposByCollection` on the source relay (default: `bsky.network`) with `limit=1000`. each DID in the response is added to the collection index via `addCollection`. the cursor and imported count are persisted after each page for resumability.

4. **marks complete** when a page returns no cursor (no more results).

## progress tracking

```sql
CREATE TABLE backfill_progress (
    collection TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    cursor TEXT NOT NULL DEFAULT '',
    imported_count BIGINT NOT NULL DEFAULT 0,
    completed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

- `cursor` — last pagination cursor from the source relay
- `imported_count` — total DIDs added for this collection
- `completed_at` — null while in progress, set when done
- if the process crashes or restarts, it resumes from the saved cursor

## admin API

requires bearer token auth (`RELAY_ADMIN_PASSWORD`).
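for illustration, an authenticated admin call can be sketched in Python; the helper name and the `urllib` usage are mine, not part of the relay, and the token value is a placeholder:

```python
# sketch: an authenticated admin API request. the helper name and urllib
# usage are illustrative; the token is whatever the admin password is set
# to on the server ("hunter2" below is a placeholder).
import urllib.request

def build_backfill_request(domain, admin_password, source="bsky.network"):
    url = f"https://{domain}/admin/backfill-collections?source={source}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {admin_password}"},
    )

req = build_backfill_request("zlay.waow.tech", "hunter2")
# urllib.request.urlopen(req) would send the request
```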
### trigger backfill

```
POST /admin/backfill-collections?source=bsky.network
```

returns 200 with the collection count if started, 409 if already running. only one backfill can run at a time.

### check status

```
GET /admin/backfill-collections
```

returns JSON:

```json
{
  "running": true,
  "total": 1269,
  "completed": 621,
  "in_progress": 648,
  "total_imported": 13628818,
  "collections": [
    {"collection": "app.bsky.feed.post", "source": "bsky.network", "imported": 28000000, "completed": true},
    {"collection": "app.bsky.feed.like", "source": "bsky.network", "imported": 6732000, "completed": false, "cursor": "did:plc:..."}
  ]
}
```

## using the script

the relay repo has a convenience script at `scripts/backfill-status`:

```bash
# check progress (summary with recent incomplete)
./scripts/backfill-status status

# trigger a new backfill
./scripts/backfill-status start [source]

# full JSON output
./scripts/backfill-status full
```

requires `ZLAY_ADMIN_PASSWORD` and `ZLAY_DOMAIN` in `.env`.

## performance characteristics

- collections are processed sequentially (one at a time)
- a 100ms pause between pages avoids hammering the source relay
- one HTTP client is reused across all pages for a given collection
- large collections like `app.bsky.feed.like` (~30M+ DIDs) take 1-2 hours each
- small/niche collections complete in seconds
- a full backfill of ~1269 collections takes several hours, dominated by the 5-6 largest `app.bsky.*` collections

## re-running

safe to trigger again after completion — existing progress rows are preserved and completed collections are skipped. useful if new collections appear (e.g. new lexicons published to the network).
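the page-and-resume behavior described under "how it works" can be sketched as follows; `fetch_page`, the `progress` dict (standing in for a `backfill_progress` row), and `add_did` are hypothetical stand-ins for the Zig implementation, not its actual API:

```python
# illustrative sketch of the resumable paging loop. fetch_page, progress,
# and add_did are hypothetical stand-ins, not the relay's actual Zig API.
import time

def backfill_collection(collection, fetch_page, progress, add_did,
                        page_pause_s=0.1):
    cursor = progress.get("cursor", "")
    while True:
        page = fetch_page(collection, cursor=cursor, limit=1000)
        for did in page["dids"]:
            add_did(collection, did)  # addCollection equivalent
        # persist cursor + count after each page so a crash can resume here
        progress["imported_count"] = progress.get("imported_count", 0) + len(page["dids"])
        cursor = page.get("cursor", "")
        progress["cursor"] = cursor
        if not cursor:  # no cursor in the response means the collection is done
            progress["completed"] = True
            return progress
        time.sleep(page_pause_s)  # be gentle with the source relay
```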
## source code

- `src/backfill.zig` — `Backfiller` struct with all backfill logic
- `src/event_log.zig` — `backfill_progress` table creation (in `init()`)
- `src/main.zig` — admin route handlers
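as a footnote, the JSON returned by `GET /admin/backfill-collections` is essentially an aggregation over `backfill_progress` rows; a sketch (row fields mirror the table, the function name is illustrative):

```python
# sketch: the status payload is an aggregation over backfill_progress
# rows. row field names mirror the postgres table; the function name is
# illustrative, not the relay's actual code.
def backfill_status(rows, running):
    completed = sum(1 for r in rows if r["completed_at"] is not None)
    return {
        "running": running,
        "total": len(rows),
        "completed": completed,
        "in_progress": len(rows) - completed,
        "total_imported": sum(r["imported_count"] for r in rows),
    }
```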
docs/deployment.md
# deployment

zlay runs on a Hetzner CPX41 in Hillsboro, OR, managed via k3s. all deployment is orchestrated from the [relay repo](https://tangled.org/zzstoatzz.io/relay) using `just` recipes.

## build and deploy

the preferred method builds natively on the server (fast, no cross-compilation):

```bash
just zlay-publish-remote
```

this SSHs into the server and:

1. `git pull --ff-only` in `/opt/zlay`
2. `zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu` — native x86_64 build
3. `buildah bud -t atcr.io/zzstoatzz.io/zlay:latest -f Dockerfile.runtime .` — thin runtime image
4. pushes to k3s containerd via `buildah push` → `ctr images import`
5. `kubectl rollout restart deployment/zlay -n zlay`

the runtime image (`Dockerfile.runtime`) is minimal: debian bookworm-slim + ca-certificates + the binary.

### why not Docker build?

the full `Dockerfile` exists for CI/standalone builds but is slow on a Mac (cross-compilation + QEMU). `zlay-publish-remote` skips all of that by building on the target architecture.

### build flags

- `-Dtarget=x86_64-linux-gnu` — **must use glibc**, not musl. zig 0.15's C++ codegen for musl produces illegal instructions in RocksDB's LRU cache.
- `-Dcpu=baseline` — required when building inside Docker/QEMU (not needed for `zlay-publish-remote`, since it builds natively).
- `-Doptimize=ReleaseSafe` — safety checks on, optimizations on.

## initial setup

```bash
just zlay-init        # terraform init
just zlay-infra       # create Hetzner server with k3s
just zlay-kubeconfig  # pull kubeconfig (~2 min after creation)
just zlay-deploy      # full deploy: cert-manager, postgres, relay, monitoring
```

point the DNS A record for `ZLAY_DOMAIN` at the server IP (`just zlay-server-ip`) before deploying.
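the remote publish sequence under "build and deploy" amounts to a fixed list of commands run on the server over SSH; a sketch as data (the `buildah push` → `ctr import` mechanics, i.e. the oci-archive path and `k8s.io` namespace, are assumptions, not verified against the actual justfile):

```python
# sketch of the zlay-publish-remote steps as a command list. the push/import
# details (oci-archive path, k8s.io namespace) are assumptions.
def publish_remote_commands(repo_dir="/opt/zlay",
                            image="atcr.io/zzstoatzz.io/zlay:latest"):
    return [
        f"git -C {repo_dir} pull --ff-only",
        f"cd {repo_dir} && zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu",
        f"cd {repo_dir} && buildah bud -t {image} -f Dockerfile.runtime .",
        f"buildah push {image} oci-archive:/tmp/zlay.tar"
        f" && ctr -n k8s.io images import /tmp/zlay.tar",
        "kubectl rollout restart deployment/zlay -n zlay",
    ]
```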
## environment variables

set in `.env` in the relay repo:

| variable | required | description |
|----------|----------|-------------|
| `HCLOUD_TOKEN` | yes | Hetzner Cloud API token |
| `ZLAY_DOMAIN` | yes | public domain (e.g. `zlay.waow.tech`) |
| `ZLAY_ADMIN_PASSWORD` | yes | bearer token for admin endpoints |
| `ZLAY_POSTGRES_PASSWORD` | yes | postgres password |
| `LETSENCRYPT_EMAIL` | yes | email for TLS certificates |

## operations

```bash
just zlay-status  # nodes, pods, health
just zlay-logs    # tail relay logs
just zlay-health  # curl public health endpoint
just zlay-ssh     # ssh into server
```

## infrastructure

- **server**: Hetzner CPX41 — 16 vCPU (AMD), 32 GB RAM, 240 GB NVMe
- **k3s**: single-node kubernetes with traefik ingress
- **cert-manager**: automatic TLS via Let's Encrypt
- **postgres**: bitnami/postgresql helm chart (relay state, backfill progress)
- **monitoring**: prometheus + grafana via kube-prometheus-stack
- **terraform**: `infra/zlay/` in the relay repo

## resource usage

| metric | value |
|--------|-------|
| memory | ~1.8 GiB steady state (1486 subscribers) |
| CPU | ~1.5 cores peak |
| limits / requests | 8 GiB memory limit, 250m CPU request |
| PVC | 20 GiB (events + RocksDB collection index) |
| postgres | ~131 MiB |

## git push

the zlay repo is hosted on tangled. pushing requires the tangled SSH key:

```bash
GIT_SSH_COMMAND="ssh -i ~/.ssh/tangled_ed25519 -o IdentitiesOnly=yes" git push
```