# context: what's going on here

this file is a companion to [TODO.md](TODO.md). the TODO focuses on zlay's internal implementation work — validation plumbing, memory fragmentation, backfill mechanics. this file covers the operational picture from the outside: what the relays are doing in production, how they compare, and what a new operator should know.

## the two relays

we run two independent ATProto relays on separate Hetzner nodes:

- **relay.waow.tech** — the Go implementation ([indigo](https://github.com/bluesky-social/indigo)), Ashburn VA. this is the "known quantity": it's been running longer, it's the reference codebase, and it's well understood. collectiondir runs as a sidecar with a pebble DB.
- **zlay.waow.tech** — the Zig implementation ([zlay](https://tangled.org/zzstoatzz.io/zlay)), Hillsboro OR. this is the experimental one. it has an inline RocksDB collection index and uses about half the memory of indigo for the same workload.

both relays work the same way architecturally: they connect directly to each PDS on the network via individual `subscribeRepos` WebSocket connections (one per host, ~2,800 concurrent). neither is a "fan-out relay" that re-serves an upstream firehose. `RELAY_UPSTREAM` in zlay (and the equivalent in indigo) is used once at bootstrap to call `listHosts` for PDS discovery — no event data flows through it after that. both serve the full ATProto firehose and `listReposByCollection`. they're deployed on independent k3s clusters and don't depend on each other.

## what we've been measuring

we run coverage comparisons periodically (see `docs/coverage-comparison-*.md`) using:

- **[pulsar](https://tangled.org/mackuba.eu/pulsar)** — subscribes to multiple relay firehoses at once, counts events and unique DIDs over a 2-minute window. tells us whether the relays are seeing the same traffic.
- **coldir-compare** (a bash script at `/tmp/coldir-compare.sh`) — queries `listReposByCollection` across relay, zlay, and bsky.network for every indie NSID from [lexicon garden](https://lexicon.garden). tells us whether the collection directories are complete.
- **[tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)** — an indigo tool that discovers repos via `listReposByCollection`, backfills from PDS, and subscribes to the firehose. tells us whether the data is correct, not just present.

the reference relay for comparison is always **bsky.network** (Bluesky's own relay). we also include third-party relays — see "other public relays" below.

## what the numbers say (as of 2026-03-05)

### firehose: parity

relay and zlay are within ~2% of each other on both event count and unique DIDs. third-party relays (firehose.network, atproto.africa) show comparable numbers when included in tests. all relays on the network are seeing the same traffic.

### collection directory: zlay leads on most indie collections

zlay's backfill imported DIDs from bsky.network, so it starts with at least what bsky has. on many collections it has 30-60% more DIDs than the indigo relay, and matches or slightly trails bsky.network.

the indigo relay's collectiondir trails because:

1. its crawl-based backfill is slower and more fragile (bsky PDS shards rate-limit aggressively, crawl state is in-memory only, pod restart = progress lost)
2. it still carries ghost DIDs (accounts whose PDS is gone) and deactivated accounts that inflate some counts while leaving real gaps in others

zlay has its own gaps — 6 long-tail collections where it returns 0 results. these are collections not in lexicon garden's llms.txt and not present on bsky.network, so the backfill never found them. the live firehose will pick them up if anyone creates new records in those collections.

### the one collection where indigo leads

`xyz.statusphere.status` — indigo has ~848 DIDs, zlay has 815.
statusphere accounts were probably indexed during an early crawl that zlay's backfill didn't cover. not a systemic issue.

## things a new operator should know

### zlay restarts

zlay's RSS grows over time due to glibc memory fragmentation (see TODO.md for details). the memory limit is 5 GiB. when it gets OOM-killed, k8s restarts it automatically. this is the main operational concern right now. the relay recovers cleanly — postgres has the cursor, RocksDB has the index — but there's a brief outage window.

the TODO has the investigation plan. short version: the arena-per-frame allocation pattern doesn't play well with ptmalloc. jemalloc/mimalloc via `LD_PRELOAD` or periodic `malloc_trim(0)` are the most promising fixes.

### indigo collectiondir is fragile

the indigo collectiondir's backfill has several sharp edges:

- bsky PDS shards share an IP-based rate limit. >2 concurrent crawls = HTTP 429, which kills the crawl with no retry.
- crawl state is in-memory. pod restart = all progress lost.
- port-forwards to the collectiondir die after ~80 minutes. crawls continue server-side but you can't monitor or submit new batches.

we built a [micro-PDS trick](docs/hacks.md) to work around the rate limit for targeted backfills — stand up a fake PDS serving just the DIDs you need, point `requestCrawl` at it. works great for small gaps, doesn't scale to full-network backfill. submit bsky shards one at a time (`--batch-size 1`); indie PDS hosts can be batched freely.

### zlay's collection index backfill is better

zlay imports from bsky.network via `listReposByCollection` — no PDS crawling, no rate limit issues. progress is tracked in postgres (crash-resumable). it's currently at ~13.6M+ DIDs across ~1,000 collections. triggered via admin API, not a crawl job.
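the import loop described above reduces to cursor pagination over `listReposByCollection`. here's a minimal sketch of that pattern (not zlay's actual code): it assumes the standard `com.atproto.sync.listReposByCollection` XRPC query shape (`repos` array plus optional `cursor`), and the `fetch` hook exists only so the loop can be exercised without a live relay. a page size of 1000 keeps the same tooling working against all three services compared in this doc.

```python
import json
import urllib.parse
import urllib.request

def list_dids_by_collection(base_url, collection, page_size=1000, fetch=None):
    """Yield every DID a relay reports for one collection, following cursors.

    base_url: e.g. "https://zlay.waow.tech" (hypothetical usage, any relay
    serving listReposByCollection works). fetch: optional URL -> dict hook
    for testing; defaults to a real HTTP GET.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)

    cursor = None
    while True:
        params = {"collection": collection, "limit": page_size}
        if cursor:
            params["cursor"] = cursor
        url = (f"{base_url}/xrpc/com.atproto.sync.listReposByCollection?"
               + urllib.parse.urlencode(params))
        page = fetch(url)
        for repo in page.get("repos", []):
            yield repo["did"]
        cursor = page.get("cursor")
        if not cursor:
            # no cursor in the response means the last page was reached
            break
```

the generator shape means callers can count, diff, or stream DIDs without holding a full collection in memory.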
### admin auth differs

- **indigo relay**: HTTP basic auth (`admin:$RELAY_ADMIN_PASSWORD`)
- **indigo collectiondir**: bearer token (`Authorization: Bearer $COLLECTIONDIR_ADMIN_TOKEN`)
- **zlay**: bearer token on admin endpoints (port 3001)

### split ports on zlay

zlay serves the WebSocket firehose on port 3000 and HTTP (health, metrics, admin, XRPC) on port 3001. indigo serves everything on 2470 (metrics on 2471). this matters for health checks, port-forwards, and ServiceMonitor configuration.

### `listReposByCollection` limits

- indigo collectiondir: max limit 2000
- zlay: max limit 1000
- bsky.network: max limit 1000

if you're writing tooling that paginates, use limit ≤ 1000 to work against all three.

### DNS is manual

both relay domains are managed in Cloudflare. there's no terraform for DNS — just A records pointing at the server IPs (`just indigo server-ip` / `just zlay server-ip`).

### monitoring

each cluster has its own kube-prometheus-stack:

- `relay-metrics.waow.tech` — indigo grafana
- `zlay-metrics.waow.tech` — zlay grafana

dashboards are provisioned via configmaps. the layouts are aligned (events/sec, hosts, memory, threads/goroutines) so you can compare side-by-side.

## other public relays

we're not the only independent relay operators. worth knowing about:

### firehose.network (sri)

sri ([firehose.network](https://firehose.network), [status](https://status.vayumandala.com)) runs 3 full indigo relays globally with 72-hour replay windows:

| region | hostname |
|---|---|
| North America | `northamerica.firehose.network` |
| Europe | `europe.firehose.network` |
| Asia | `asia.firehose.network` |

plus 6 public Jetstream instances at `*.firehose.stream` (NYC, SFO, London, Frankfurt, Chennai, Canada). mature ops — automated PDS discovery, monitoring, public status page. uptime is consistently 99.9%+. these are firehose-only relays (no `listReposByCollection`). sri also builds [lexicon.store](https://lexicon.store) and [goals.garden](https://goals.garden).
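one recurring papercut when scripting against our own two deployments is the auth split described under "admin auth differs": indigo's relay wants basic auth, everything else wants a bearer token. a tiny helper makes that explicit (the service names here are this doc's shorthand, not real config keys, and the secrets come from the env vars listed in that section):

```python
import base64

def admin_headers(service, secret):
    """Build the Authorization header each admin surface expects.

    service: "indigo-relay", "indigo-collectiondir", or "zlay"
    (shorthand names for this sketch, not actual identifiers).
    """
    if service == "indigo-relay":
        # indigo relay: HTTP basic auth with fixed username "admin"
        token = base64.b64encode(f"admin:{secret}".encode()).decode()
        return {"Authorization": f"Basic {token}"}
    if service in ("indigo-collectiondir", "zlay"):
        # collectiondir and zlay both take a bearer token
        return {"Authorization": f"Bearer {secret}"}
    raise ValueError(f"unknown service: {service}")
```

forgetting this distinction produces confusing 401s: a bearer token sent to the indigo relay looks like a malformed basic-auth header, not a wrong password.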
### atproto.africa (BlackSky)

firehose-only relay, no collection directory. comparable firehose coverage in our tests.

## what's next

the TODO covers zlay-specific implementation work. from the operational side:

1. **fix zlay's memory growth** — this is the most impactful item. the relay works correctly; it just can't stay up indefinitely.
2. **close zlay's collection gaps** — the 6 zero-result collections need their NSIDs added to the backfill source list, or the collections need to appear on bsky.network so the importer picks them up.
3. **keep running coverage comparisons** — the `docs/coverage-comparison-*.md` files track progress over time. run pulsar + coldir-compare periodically to catch regressions.
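as a footnote on item 3: the core of a coldir-compare-style check is just diffing per-relay DID sets against their union for each collection. a toy version of that comparison (the real script is bash and queries the relays over HTTP; this only shows the set logic):

```python
def coverage_report(did_sets):
    """Given {relay_name: set_of_dids} for one collection, return the DIDs
    each relay is missing relative to the union of all relays.

    An empty list means that relay has full coverage for this collection
    (at least relative to the relays being compared).
    """
    union = set().union(*did_sets.values())
    return {name: sorted(union - dids) for name, dids in did_sets.items()}
```

run against the 6 zero-result collections, a check like this flags them immediately: the zlay entry would contain every DID the other relays report.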