declarative relay deployment on hetzner relay.waow.tech
# context: what's going on here

this file is a companion to [TODO.md](TODO.md). the TODO focuses on zlay's internal implementation work — validation plumbing, memory fragmentation, backfill mechanics. this file covers the operational picture from the outside: what the relays are doing in production, how they compare, and what a new operator should know.

## the two relays

we run two independent ATProto relays on separate Hetzner nodes:

- **relay.waow.tech** — the Go implementation ([indigo](https://github.com/bluesky-social/indigo)), Ashburn VA. this is the "known quantity." it's been running longer, it's the reference codebase, and it's well understood. collectiondir runs as a sidecar with a pebble DB.
- **zlay.waow.tech** — the Zig implementation ([zlay](https://tangled.org/zzstoatzz.io/zlay)), Hillsboro OR. this is the experimental one. has an inline RocksDB collection index and uses about half the memory of indigo for the same workload.

both relays work the same way architecturally: they connect directly to each PDS on the network via individual `subscribeRepos` WebSocket connections (one per host, ~2,800 concurrent). neither is a "fan-out relay" that re-serves an upstream firehose. `RELAY_UPSTREAM` in zlay (and the equivalent in indigo) is used once at bootstrap to call `listHosts` for PDS discovery — no event data flows through it after that.

both serve the full ATProto firehose and `listReposByCollection`. they're deployed on independent k3s clusters and don't depend on each other.

## what we've been measuring

we run coverage comparisons periodically (see `docs/coverage-comparison-*.md`) using:

- **[pulsar](https://tangled.org/mackuba.eu/pulsar)** — subscribes to multiple relay firehoses at once, counts events and unique DIDs over a 2-minute window. tells us whether the relays are seeing the same traffic.
- **coldir-compare** (a bash script at `/tmp/coldir-compare.sh`) — queries `listReposByCollection` across relay, zlay, and bsky.network for every indie NSID from [lexicon garden](https://lexicon.garden). tells us whether the collection directories are complete.
- **[tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)** — an indigo tool that discovers repos via `listReposByCollection`, backfills from PDS, and subscribes to the firehose. tells us whether the data is correct, not just present.

the reference relay for comparison is always **bsky.network** (Bluesky's own relay). we also include third-party relays — see "other public relays" below.

## what the numbers say (as of 2026-03-05)

### firehose: parity

relay and zlay are within ~2% of each other on both event count and unique DIDs. third-party relays (firehose.network, atproto.africa) show comparable numbers when included in tests. all relays on the network are seeing the same traffic.

### collection directory: zlay leads on most indie collections

zlay's backfill imported DIDs from bsky.network, so it starts with at least what bsky has. on many collections it has 30-60% more DIDs than the indigo relay, and matches or slightly trails bsky.network.

the indigo relay's collectiondir trails because:

1. its crawl-based backfill is slower and more fragile (bsky PDS shards rate-limit aggressively, crawl state is in-memory only, pod restart = progress lost)
2. it still carries ghost DIDs (accounts whose PDS is gone) and deactivated accounts that inflate some counts while leaving real gaps in others

zlay has its own gaps — 6 long-tail collections where it returns 0 results. these are collections not in lexicon garden's llms.txt and not present on bsky.network, so the backfill never found them. the live firehose will pick them up if anyone creates new records in those collections.
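the coldir-compare check above can be sketched in a few lines. this is a hypothetical illustration, not the actual bash script: `fetch_dids` stands in for the real `listReposByCollection` HTTP call and is injected so the comparison logic stays testable offline.

```python
from typing import Callable, Iterable

def compare_collections(
    hosts: Iterable[str],
    nsids: Iterable[str],
    fetch_dids: Callable[[str, str], set[str]],
) -> dict[str, dict[str, int]]:
    """Return {nsid: {host: did_count}} — how many DIDs each relay
    reports for each collection."""
    return {
        nsid: {host: len(fetch_dids(host, nsid)) for host in hosts}
        for nsid in nsids
    }

def gaps(report: dict[str, dict[str, int]], reference: str) -> dict[str, dict[str, int]]:
    """Collections where some host returns fewer DIDs than the reference relay."""
    out: dict[str, dict[str, int]] = {}
    for nsid, counts in report.items():
        ref = counts[reference]
        behind = {h: c for h, c in counts.items() if h != reference and c < ref}
        if behind:
            out[nsid] = behind
    return out
```

in practice `fetch_dids` would paginate the XRPC endpoint (see the limits section below for why limit ≤ 1000 is the portable choice) and the report would be diffed against the previous run.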
### the one collection where indigo leads

`xyz.statusphere.status` — indigo has ~848 DIDs, zlay has 815. statusphere accounts were probably indexed during an early crawl that zlay's backfill didn't cover. not a systemic issue.

## things a new operator should know

### zlay restarts

zlay's RSS grows over time due to glibc memory fragmentation (see TODO.md for details). the memory limit is 5 GiB. when it gets OOM-killed, k8s restarts it automatically. this is the main operational concern right now. the relay recovers cleanly — postgres has the cursor, RocksDB has the index — but there's a brief outage window.

the TODO has the investigation plan. short version: the arena-per-frame allocation pattern doesn't play well with ptmalloc. jemalloc/mimalloc via `LD_PRELOAD` or periodic `malloc_trim(0)` are the most promising fixes.

### indigo collectiondir is fragile

the indigo collectiondir's backfill has several sharp edges:

- bsky PDS shards share an IP-based rate limit. >2 concurrent crawls = HTTP 429, which kills the crawl with no retry.
- crawl state is in-memory. pod restart = all progress lost.
- port-forwards to the collectiondir die after ~80 minutes. crawls continue server-side but you can't monitor or submit new batches.

we built a [micro-PDS trick](docs/hacks.md) to work around the rate limit for targeted backfills — stand up a fake PDS serving just the DIDs you need, point `requestCrawl` at it. works great for small gaps, doesn't scale to full-network backfill.

submit bsky shards one at a time (`--batch-size 1`). indie PDS hosts can be batched freely.

### zlay's collection index backfill is better

zlay imports from bsky.network via `listReposByCollection` — no PDS crawling, no rate limit issues. progress is tracked in postgres (crash-resumable). it's currently at ~13.6M+ DIDs across ~1,000 collections. triggered via admin API, not a crawl job.
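the crash-resumable import loop can be sketched like this. a hedged illustration only — zlay is written in Zig, and `fetch_page`, `load_cursor`, and `save_cursor` are assumed stand-ins for the real XRPC call and the postgres-backed cursor store:

```python
from typing import Callable, Optional

def import_collection(
    nsid: str,
    fetch_page: Callable[..., tuple[list[str], Optional[str]]],
    load_cursor: Callable[[str], Optional[str]],
    save_cursor: Callable[[str, Optional[str]], None],
    limit: int = 1000,  # portable across all three relay implementations
) -> int:
    """Page through listReposByCollection, persisting the cursor after
    each page so an interrupted import resumes instead of restarting."""
    total = 0
    cursor = load_cursor(nsid)  # None on a fresh import
    while True:
        dids, cursor = fetch_page(nsid, cursor=cursor, limit=limit)
        total += len(dids)
        save_cursor(nsid, cursor)  # commit progress before the next page
        if cursor is None:  # server returned no cursor: last page
            break
    return total
```

the key property is that the cursor write happens after every page, so a crash mid-import costs at most one page of re-fetching rather than the whole collection.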
### admin auth differs

- **indigo relay**: HTTP basic auth (`admin:$RELAY_ADMIN_PASSWORD`)
- **indigo collectiondir**: bearer token (`Authorization: Bearer $COLLECTIONDIR_ADMIN_TOKEN`)
- **zlay**: bearer token on admin endpoints (port 3001)

### split ports on zlay

zlay serves the WebSocket firehose on port 3000 and HTTP (health, metrics, admin, XRPC) on port 3001. indigo serves everything on 2470 (metrics on 2471). this matters for health checks, port-forwards, and ServiceMonitor configuration.

### `listReposByCollection` limits

- indigo collectiondir: max limit 2000
- zlay: max limit 1000
- bsky.network: max limit 1000

if you're writing tooling that paginates, use limit ≤ 1000 to work against all three.

### DNS is manual

both relay domains are managed in Cloudflare. there's no terraform for DNS — just A records pointing at the server IPs (`just indigo server-ip` / `just zlay server-ip`).

### monitoring

each cluster has its own kube-prometheus-stack:

- `relay-metrics.waow.tech` — indigo grafana
- `zlay-metrics.waow.tech` — zlay grafana

dashboards are provisioned via configmaps. the layouts are aligned (events/sec, hosts, memory, threads/goroutines) so you can compare side-by-side.

## other public relays

we're not the only independent relay operators. worth knowing about:

### firehose.network (sri)

sri ([firehose.network](https://firehose.network), [status](https://status.vayumandala.com)) runs 3 full indigo relays globally with 72-hour replay windows:

| region | hostname |
|---|---|
| North America | `northamerica.firehose.network` |
| Europe | `europe.firehose.network` |
| Asia | `asia.firehose.network` |

plus 6 public Jetstream instances at `*.firehose.stream` (NYC, SFO, London, Frankfurt, Chennai, Canada). mature ops — automated PDS discovery, monitoring, public status page. uptime is consistently 99.9%+.
these are firehose-only relays (no `listReposByCollection`).

sri also builds [lexicon.store](https://lexicon.store) and [goals.garden](https://goals.garden).

### atproto.africa (BlackSky)

firehose-only relay, no collection directory. comparable firehose coverage in our tests.

## what's next

the TODO covers zlay-specific implementation work. from the operational side:

1. **fix zlay's memory growth** — this is the most impactful item. the relay works correctly; it just can't stay up indefinitely.
2. **close zlay's collection gaps** — the 6 zero-result collections need their NSIDs added to the backfill source list, or the collections need to appear on bsky.network so the importer picks them up.
3. **keep running coverage comparisons** — the `docs/coverage-comparison-*.md` files track progress over time. run pulsar + coldir-compare periodically to catch regressions.
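the regression check in item 3 boils down to the "~2% parity" figure from the firehose comparisons. a minimal, hypothetical helper for flagging divergence — the counts would come from a pulsar run, and the threshold is the observed parity band, not a protocol guarantee:

```python
def parity_ok(
    counts: dict[str, int],
    reference: str,
    tolerance: float = 0.02,  # ~2% band observed between healthy relays
) -> dict[str, bool]:
    """For each non-reference host, True if its event (or DID) count is
    within `tolerance` of the reference relay's count."""
    ref = counts[reference]
    return {
        host: abs(n - ref) <= tolerance * ref
        for host, n in counts.items()
        if host != reference
    }
```

a host flipping to `False` in a periodic run is the signal to go look at its firehose connections before the gap shows up in the collection directory.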