declarative relay deployment on hetzner relay.waow.tech
# context: what's going on here

this file is a companion to [TODO.md](TODO.md). the TODO focuses on zlay's internal implementation work — validation plumbing, memory fragmentation, backfill mechanics. this file covers the operational picture from the outside: what the relays are doing in production, how they compare, and what a new operator should know.

## the two relays

we run two independent ATProto relays on separate Hetzner nodes:

- **relay.waow.tech** — the Go implementation ([indigo](https://github.com/bluesky-social/indigo)), Ashburn VA. this is the "known quantity." it's been running longer, it's the reference codebase, and it's well understood. collectiondir runs as a sidecar with a pebble DB.
- **zlay.waow.tech** — the Zig implementation ([zlay](https://tangled.org/zzstoatzz.io/zlay)), Hillsboro OR. this is the experimental one. has an inline RocksDB collection index and uses about half the memory of indigo for the same workload.

both relays work the same way architecturally: they connect directly to each PDS on the network via individual `subscribeRepos` WebSocket connections (one per host, ~2,800 concurrent). neither is a "fan-out relay" that re-serves an upstream firehose. `RELAY_UPSTREAM` in zlay (and the equivalent in indigo) is used once at bootstrap to call `listHosts` for PDS discovery — no event data flows through it after that.

both serve the full ATProto firehose and `listReposByCollection`. they're deployed on independent k3s clusters and don't depend on each other.

## what we've been measuring

we run coverage comparisons periodically (see `docs/coverage-comparison-*.md`) using:

- **[pulsar](https://tangled.org/mackuba.eu/pulsar)** — subscribes to multiple relay firehoses at once, counts events and unique DIDs over a 2-minute window. tells us whether the relays are seeing the same traffic.
- **coldir-compare** (a bash script at `/tmp/coldir-compare.sh`) — queries `listReposByCollection` across relay, zlay, and bsky.network for every indie NSID from [lexicon garden](https://lexicon.garden). tells us whether the collection directories are complete.
- **[tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)** — an indigo tool that discovers repos via `listReposByCollection`, backfills from PDS, and subscribes to the firehose. tells us whether the data is correct, not just present.

the reference relay for comparison is always **bsky.network** (Bluesky's own relay). we also include third-party relays — see "other public relays" below.

## what the numbers say (as of 2026-03-05)

### firehose: parity

relay and zlay are within ~2% of each other on both event count and unique DIDs. third-party relays (firehose.network, atproto.africa) show comparable numbers when included in tests. all relays on the network are seeing the same traffic.

### collection directory: zlay leads on most indie collections

zlay's backfill imported DIDs from bsky.network, so it starts with at least what bsky has. on many collections it has 30-60% more DIDs than the indigo relay, and matches or slightly trails bsky.network.

the indigo relay's collectiondir trails because:

1. its crawl-based backfill is slower and more fragile (bsky PDS shards rate-limit aggressively, crawl state is in-memory only, pod restart = progress lost)
2. it still carries ghost DIDs (accounts whose PDS is gone) and deactivated accounts that inflate some counts while leaving real gaps in others

zlay has its own gaps — 6 long-tail collections where it returns 0 results. these are collections not in lexicon garden's llms.txt and not present on bsky.network, so the backfill never found them. the live firehose will pick them up if anyone creates new records in those collections.
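the coldir-compare check above can be sketched in a few lines. this is a hypothetical illustration, not the actual bash script: `fetch_dids` stands in for the real `listReposByCollection` HTTP call and is injected so the comparison logic stays testable offline.

```python
from typing import Callable, Iterable

def compare_collections(
    hosts: Iterable[str],
    nsids: Iterable[str],
    fetch_dids: Callable[[str, str], set[str]],
) -> dict[str, dict[str, int]]:
    """Return {nsid: {host: did_count}} — how many DIDs each relay
    reports for each collection."""
    return {
        nsid: {host: len(fetch_dids(host, nsid)) for host in hosts}
        for nsid in nsids
    }

def gaps(report: dict[str, dict[str, int]], reference: str) -> dict[str, dict[str, int]]:
    """Collections where some host returns fewer DIDs than the reference relay."""
    out: dict[str, dict[str, int]] = {}
    for nsid, counts in report.items():
        ref = counts[reference]
        behind = {h: c for h, c in counts.items() if h != reference and c < ref}
        if behind:
            out[nsid] = behind
    return out
```

in practice `fetch_dids` would paginate the XRPC endpoint (see the limits section below for why limit ≤ 1000 is the portable choice) and the report would be diffed against the previous run.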
### the one collection where indigo leads

`xyz.statusphere.status` — indigo has ~848 DIDs, zlay has 815. statusphere accounts were probably indexed during an early crawl that zlay's backfill didn't cover. not a systemic issue.

## things a new operator should know

### zlay restarts

zlay's RSS grows over time due to glibc memory fragmentation (see TODO.md for details). the memory limit is 5 GiB. when it gets OOM-killed, k8s restarts it automatically. this is the main operational concern right now. the relay recovers cleanly — postgres has the cursor, RocksDB has the index — but there's a brief outage window.

the TODO has the investigation plan. short version: the arena-per-frame allocation pattern doesn't play well with ptmalloc. jemalloc/mimalloc via `LD_PRELOAD` or periodic `malloc_trim(0)` are the most promising fixes.

### indigo collectiondir is fragile

the indigo collectiondir's backfill has several sharp edges:

- bsky PDS shards share an IP-based rate limit. >2 concurrent crawls = HTTP 429, which kills the crawl with no retry.
- crawl state is in-memory. pod restart = all progress lost.
- port-forwards to the collectiondir die after ~80 minutes. crawls continue server-side but you can't monitor or submit new batches.

we built a [micro-PDS trick](docs/hacks.md) to work around the rate limit for targeted backfills — stand up a fake PDS serving just the DIDs you need, point `requestCrawl` at it. works great for small gaps, doesn't scale to full-network backfill.

submit bsky shards one at a time (`--batch-size 1`). indie PDS hosts can be batched freely.

### zlay's collection index backfill is better

zlay imports from bsky.network via `listReposByCollection` — no PDS crawling, no rate limit issues. progress is tracked in postgres (crash-resumable). it's currently at ~13.6M+ DIDs across ~1,000 collections. triggered via admin API, not a crawl job.
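the crash-resumable import loop can be sketched like this. a hedged illustration only — zlay is written in Zig, and `fetch_page`, `load_cursor`, and `save_cursor` are assumed stand-ins for the real XRPC call and the postgres-backed cursor store:

```python
from typing import Callable, Optional

def import_collection(
    nsid: str,
    fetch_page: Callable[..., tuple[list[str], Optional[str]]],
    load_cursor: Callable[[str], Optional[str]],
    save_cursor: Callable[[str, Optional[str]], None],
    limit: int = 1000,  # portable across all three relay implementations
) -> int:
    """Page through listReposByCollection, persisting the cursor after
    each page so an interrupted import resumes instead of restarting."""
    total = 0
    cursor = load_cursor(nsid)  # None on a fresh import
    while True:
        dids, cursor = fetch_page(nsid, cursor=cursor, limit=limit)
        total += len(dids)
        save_cursor(nsid, cursor)  # commit progress before the next page
        if cursor is None:  # server returned no cursor: last page
            break
    return total
```

the key property is that the cursor write happens after every page, so a crash mid-import costs at most one page of re-fetching rather than the whole collection.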
### admin auth differs

- **indigo relay**: HTTP basic auth (`admin:$RELAY_ADMIN_PASSWORD`)
- **indigo collectiondir**: bearer token (`Authorization: Bearer $COLLECTIONDIR_ADMIN_TOKEN`)
- **zlay**: bearer token on admin endpoints (port 3001)

### split ports on zlay

zlay serves the WebSocket firehose on port 3000 and HTTP (health, metrics, admin, XRPC) on port 3001. indigo serves everything on 2470 (metrics on 2471). this matters for health checks, port-forwards, and ServiceMonitor configuration.

### `listReposByCollection` limits

- indigo collectiondir: max limit 2000
- zlay: max limit 1000
- bsky.network: max limit 1000

if you're writing tooling that paginates, use limit ≤ 1000 to work against all three.

### DNS is manual

both relay domains are managed in Cloudflare. there's no terraform for DNS — just A records pointing at the server IPs (`just indigo server-ip` / `just zlay server-ip`).

### monitoring

each cluster has its own kube-prometheus-stack:

- `relay-metrics.waow.tech` — indigo grafana
- `zlay-metrics.waow.tech` — zlay grafana

dashboards are provisioned via configmaps. the layouts are aligned (events/sec, hosts, memory, threads/goroutines) so you can compare side-by-side.

## other public relays

we're not the only independent relay operators. worth knowing about:

### firehose.network (sri)

sri ([firehose.network](https://firehose.network), [status](https://status.vayumandala.com)) runs 3 full indigo relays globally with 72-hour replay windows:

| region | hostname |
|---|---|
| North America | `northamerica.firehose.network` |
| Europe | `europe.firehose.network` |
| Asia | `asia.firehose.network` |

plus 6 public Jetstream instances at `*.firehose.stream` (NYC, SFO, London, Frankfurt, Chennai, Canada). mature ops — automated PDS discovery, monitoring, public status page. uptime is consistently 99.9%+.
these are firehose-only relays (no `listReposByCollection`).

sri also builds [lexicon.store](https://lexicon.store) and [goals.garden](https://goals.garden).

### atproto.africa (BlackSky)

firehose-only relay, no collection directory. comparable firehose coverage in our tests.

## what's next

the TODO covers zlay-specific implementation work. from the operational side:

1. **fix zlay's memory growth** — this is the most impactful item. the relay works correctly; it just can't stay up indefinitely.
2. **close zlay's collection gaps** — the 6 zero-result collections need their NSIDs added to the backfill source list, or the collections need to appear on bsky.network so the importer picks them up.
3. **keep running coverage comparisons** — the `docs/coverage-comparison-*.md` files track progress over time. run pulsar + coldir-compare periodically to catch regressions.
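the regression check in item 3 boils down to the "~2% parity" figure from the firehose comparisons. a minimal, hypothetical helper for flagging divergence — the counts would come from a pulsar run, and the threshold is the observed parity band, not a protocol guarantee:

```python
def parity_ok(
    counts: dict[str, int],
    reference: str,
    tolerance: float = 0.02,  # ~2% band observed between healthy relays
) -> dict[str, bool]:
    """For each non-reference host, True if its event (or DID) count is
    within `tolerance` of the reference relay's count."""
    ref = counts[reference]
    return {
        host: abs(n - ref) <= tolerance * ref
        for host, n in counts.items()
        if host != reference
    }
```

a host flipping to `False` in a periodic run is the signal to go look at its firehose connections before the gap shows up in the collection directory.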