# collection index backfill

the collection index (RocksDB) only has entries for accounts that have created records since live indexing was deployed. the backfill imports historical data from a source relay that already has a complete `listReposByCollection` endpoint.

## how it works

the backfiller runs as a background thread, triggered via the admin API. it:

1. **discovers collections** from two sources (unioned):
   - [lexicon garden](https://lexicon.garden/llms.txt) `llms.txt` — ~700 known NSIDs, parsed from markdown links
   - RocksDB scan — collections already observed from the live firehose (RBC column family prefix scan)

2. **inserts progress rows** into postgres (`backfill_progress` table) for each collection. existing rows are skipped (`ON CONFLICT DO NOTHING`), so re-triggering is safe.

3. **pages through each collection** sequentially, calling `com.atproto.sync.listReposByCollection` on the source relay (default: `bsky.network`) with `limit=1000`. each DID in the response is added to the collection index via `addCollection`. the cursor and imported count are persisted after each page for resumability.

4. **marks complete** when a page returns no cursor (no more results).
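the per-collection loop (steps 3-4) can be sketched as follows. this is an illustrative Python sketch, not the Zig implementation — `fetch_page` and the in-memory `progress` dict are stand-ins for the source relay call and the postgres table:

```python
# illustrative sketch of the backfill paging loop; fetch_page and the
# progress dict stand in for the source relay and the postgres table.
from typing import Optional

PAGE_LIMIT = 1000

def fetch_page(cursor: str, dids: list[str]) -> tuple[list[str], Optional[str]]:
    """Fake listReposByCollection: pages through `dids`, cursor is an offset."""
    start = int(cursor) if cursor else 0
    page = dids[start:start + PAGE_LIMIT]
    next_cursor = str(start + PAGE_LIMIT) if start + PAGE_LIMIT < len(dids) else None
    return page, next_cursor

def backfill_collection(collection: str, dids: list[str], progress: dict) -> None:
    row = progress.setdefault(collection, {"cursor": "", "imported": 0, "completed": False})
    if row["completed"]:
        return  # completed collections are skipped on re-run
    while True:
        page, next_cursor = fetch_page(row["cursor"], dids)
        row["imported"] += len(page)   # addCollection for each DID in the page
        if next_cursor is None:        # no cursor => no more results
            row["completed"] = True
            return
        row["cursor"] = next_cursor    # persisted after each page for resumability

progress: dict = {}
backfill_collection("app.bsky.feed.post", [f"did:plc:{i}" for i in range(2500)], progress)
print(progress["app.bsky.feed.post"])  # → {'cursor': '2000', 'imported': 2500, 'completed': True}
```

because the cursor is saved after every page, restarting `backfill_collection` with a partially-filled `progress` row picks up exactly where the previous run stopped.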
## progress tracking

```sql
CREATE TABLE backfill_progress (
    collection TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    cursor TEXT NOT NULL DEFAULT '',
    imported_count BIGINT NOT NULL DEFAULT 0,
    completed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

- cursor = last pagination cursor from the source relay
- imported_count = total DIDs added for this collection
- completed_at = null while in progress, set when done
- if the process crashes or restarts, it resumes from the saved cursor
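the `ON CONFLICT DO NOTHING` semantics are what make re-triggering safe: a second insert never resets a saved cursor or count. a minimal demonstration using sqlite as a stand-in for postgres (column types simplified accordingly):

```python
# sqlite3 stand-in for the postgres table, showing why re-triggering is safe:
# ON CONFLICT DO NOTHING preserves the saved cursor and imported_count.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE backfill_progress (
    collection TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    cursor TEXT NOT NULL DEFAULT '',
    imported_count INTEGER NOT NULL DEFAULT 0,
    completed_at TEXT)""")

def register(collection: str, source: str) -> None:
    # existing rows are skipped, so a second trigger never resets progress
    db.execute("INSERT INTO backfill_progress (collection, source) VALUES (?, ?) "
               "ON CONFLICT (collection) DO NOTHING", (collection, source))

register("app.bsky.feed.like", "bsky.network")
db.execute("UPDATE backfill_progress SET cursor = 'did:plc:abc', imported_count = 5000 "
           "WHERE collection = 'app.bsky.feed.like'")
register("app.bsky.feed.like", "bsky.network")  # re-trigger: no-op

cursor, count = db.execute("SELECT cursor, imported_count FROM backfill_progress "
                           "WHERE collection = 'app.bsky.feed.like'").fetchone()
print(cursor, count)  # → did:plc:abc 5000 — saved progress survives the re-trigger
```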
## admin API

requires bearer token auth (`ZLAY_ADMIN_PASSWORD`).

### trigger backfill

```
POST /admin/backfill-collections?source=bsky.network
```

returns 200 with the collection count if started, 409 if a backfill is already running. only one backfill can run at a time.

### check status

```
GET /admin/backfill-collections
```

returns JSON:

```json
{
  "running": true,
  "total": 1269,
  "completed": 621,
  "in_progress": 648,
  "total_imported": 13628818,
  "collections": [
    {"collection": "app.bsky.feed.post", "source": "bsky.network", "imported": 28000000, "completed": true},
    {"collection": "app.bsky.feed.like", "source": "bsky.network", "imported": 6732000, "completed": false, "cursor": "did:plc:..."}
  ]
}
```
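a consumer of this JSON (a monitoring or status script, for example) might summarize it like this — a minimal sketch, not the actual `scripts/backfill-status` implementation:

```python
# minimal sketch of consuming the status JSON: sanity-check the totals
# and list collections still in progress.
import json

status = json.loads("""{
  "running": true,
  "total": 1269,
  "completed": 621,
  "in_progress": 648,
  "collections": [
    {"collection": "app.bsky.feed.post", "imported": 28000000, "completed": true},
    {"collection": "app.bsky.feed.like", "imported": 6732000, "completed": false}
  ]
}""")

# completed + in_progress should account for every tracked collection
assert status["completed"] + status["in_progress"] == status["total"]

incomplete = [c["collection"] for c in status["collections"] if not c["completed"]]
print(incomplete)  # → ['app.bsky.feed.like']
```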
## using the script

the relay repo has a convenience script at `scripts/backfill-status`:

```bash
# check progress (summary with recent incomplete collections)
./scripts/backfill-status status

# trigger a new backfill
./scripts/backfill-status start [source]

# full JSON output
./scripts/backfill-status full
```

requires `ZLAY_ADMIN_PASSWORD` and `ZLAY_DOMAIN` in `.env`.

## performance characteristics

- collections are processed sequentially (one at a time)
- 100ms pause between pages to avoid hammering the source relay
- one HTTP client is reused across all pages for a given collection
- large collections like `app.bsky.feed.like` (~30M+ DIDs) take 1-2 hours each
- small/niche collections complete in seconds
- a full backfill of ~1269 collections takes several hours, dominated by the 5-6 largest `app.bsky.*` collections
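the 1-2 hour figure follows from the paging parameters above. a back-of-envelope check (request latency not included):

```python
# back-of-envelope check of the timing figures above:
# ~30M DIDs at limit=1000 per page, 100ms pause between pages.
import math

dids = 30_000_000
page_limit = 1000
pause_s = 0.100

pages = math.ceil(dids / page_limit)     # 30,000 pages
pause_total_min = pages * pause_s / 60   # pauses alone: 50 minutes
print(pages, pause_total_min)  # → 30000 50.0
```

so the inter-page pause alone accounts for ~50 minutes on a 30M-DID collection; per-request latency on the source relay makes up the rest of the 1-2 hours.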
## re-running

safe to trigger again after completion — existing progress rows are preserved and completed collections are skipped. useful if new collections appear (e.g. new lexicons published to the network).

## source code

- `src/backfill.zig` — Backfiller struct with all backfill logic
- `src/event_log.zig` — `backfill_progress` table creation (in `init()`)
- `src/main.zig` — admin route handlers
docs/deployment.md
# deployment

zlay runs on a Hetzner CPX41 in Hillsboro, OR, managed via k3s. all deployment is orchestrated from the [relay repo](https://tangled.org/zzstoatzz.io/relay) using `just` recipes.

## build and deploy

the preferred method builds natively on the server (fast, no cross-compilation):

```bash
just zlay-publish-remote
```

this SSHs into the server and:

1. `git pull --ff-only` in `/opt/zlay`
2. `zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu` — native x86_64 build
3. `buildah bud -t atcr.io/zzstoatzz.io/zlay:latest -f Dockerfile.runtime .` — thin runtime image
4. pushes to k3s containerd via `buildah push` → `ctr images import`
5. `kubectl rollout restart deployment/zlay -n zlay`

the runtime image (`Dockerfile.runtime`) is minimal: debian bookworm-slim + ca-certificates + the binary.

### why not Docker build?

the full `Dockerfile` exists for CI/standalone builds but is slow on Mac (cross-compilation + QEMU). `zlay-publish-remote` skips all of that by building on the target architecture.

### build flags

- `-Dtarget=x86_64-linux-gnu` — **must use glibc**, not musl. zig 0.15's C++ codegen for musl produces illegal instructions in RocksDB's LRU cache.
- `-Dcpu=baseline` — required when building inside Docker/QEMU (not needed for `zlay-publish-remote` since it builds natively).
- `-Doptimize=ReleaseSafe` — safety checks on, optimizations on.

## initial setup

```bash
just zlay-init        # terraform init
just zlay-infra       # create Hetzner server with k3s
just zlay-kubeconfig  # pull kubeconfig (~2 min after creation)
just zlay-deploy      # full deploy: cert-manager, postgres, relay, monitoring
```

point the DNS A record for `ZLAY_DOMAIN` at the server IP (`just zlay-server-ip`) before deploying.

## environment variables

set in `.env` in the relay repo:

| variable | required | description |
|----------|----------|-------------|
| `HCLOUD_TOKEN` | yes | Hetzner Cloud API token |
| `ZLAY_DOMAIN` | yes | public domain (e.g. `zlay.waow.tech`) |
| `ZLAY_ADMIN_PASSWORD` | yes | bearer token for admin endpoints |
| `ZLAY_POSTGRES_PASSWORD` | yes | postgres password |
| `LETSENCRYPT_EMAIL` | yes | email for TLS certificates |

## operations

```bash
just zlay-status  # nodes, pods, health
just zlay-logs    # tail relay logs
just zlay-health  # curl public health endpoint
just zlay-ssh     # ssh into server
```

## infrastructure

- **server**: Hetzner CPX41 — 16 vCPU (AMD), 32 GB RAM, 240 GB NVMe
- **k3s**: single-node kubernetes with traefik ingress
- **cert-manager**: automatic TLS via Let's Encrypt
- **postgres**: bitnami/postgresql helm chart (relay state, backfill progress)
- **monitoring**: prometheus + grafana via kube-prometheus-stack
- **terraform**: `infra/zlay/` in the relay repo

## resource usage

| metric | value |
|--------|-------|
| memory | ~1.8 GiB steady state (1486 subscribers) |
| CPU | ~1.5 cores peak |
| limits | 8 GiB memory, 250m CPU request |
| PVC | 20 GiB (events + RocksDB collection index) |
| postgres | ~131 MiB |

## git push

the zlay repo is hosted on tangled. pushing requires the tangled SSH key:

```bash
GIT_SSH_COMMAND="ssh -i ~/.ssh/tangled_ed25519 -o IdentitiesOnly=yes" git push
```