# collection index backfill

the collection index (RocksDB) only has entries for accounts that have created records since live indexing was deployed. the backfill imports historical data from a source relay that already has a complete `listReposByCollection` endpoint.

## how it works

the backfiller runs as a background thread, triggered via admin API. it:

1. **discovers collections** from two sources (unioned):
   - [lexicon garden](https://lexicon.garden/llms.txt) `llms.txt` — ~700 known NSIDs, parsed from markdown links
   - RocksDB scan — collections already observed from the live firehose (RBC column family prefix scan)

2. **inserts progress rows** into postgres (`backfill_progress` table) for each collection. existing rows are skipped (`ON CONFLICT DO NOTHING`), so re-triggering is safe.

3. **pages through each collection** sequentially, calling `com.atproto.sync.listReposByCollection` on the source relay (default: `bsky.network`) with `limit=1000`. each DID in the response is added to the collection index via `addCollection`. cursor and imported count are persisted after each page for resumability.

4. **marks complete** when a page returns no cursor (no more results).
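the per-collection loop (steps 3-4) can be sketched as follows. this is illustrative Python, not the actual Zig implementation; `fetch_page`, `save_progress`, and `add_collection` are hypothetical stand-ins for the relay call, the postgres write, and the RocksDB write:

```python
def backfill_collection(collection, fetch_page, save_progress, add_collection):
    """page through one collection, persisting the cursor after every
    page so a crash can resume mid-collection.

    fetch_page(collection, cursor) -> (dids, next_cursor or None)
    """
    cursor = ""
    imported = 0
    while True:
        dids, next_cursor = fetch_page(collection, cursor)
        for did in dids:
            add_collection(did, collection)   # write to the collection index
        imported += len(dids)
        # persist progress before deciding whether to continue
        save_progress(collection, next_cursor or "", imported)
        if next_cursor is None:               # no cursor => collection complete
            return imported
        cursor = next_cursor
```

because progress is saved after every page, a restart replays at most one page of DIDs, which is harmless since index inserts are idempotent.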
## progress tracking

```sql
CREATE TABLE backfill_progress (
    collection TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    cursor TEXT NOT NULL DEFAULT '',
    imported_count BIGINT NOT NULL DEFAULT 0,
    completed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

- cursor = last pagination cursor from source relay
- imported_count = total DIDs added for this collection
- completed_at = null while in progress, set when done
- if the process crashes or restarts, it resumes from the saved cursor
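the idempotent insert and the resume query can be exercised with sqlite3 standing in for postgres (types simplified; the `ON CONFLICT DO NOTHING` clause behaves the same in both):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE backfill_progress (
        collection TEXT PRIMARY KEY,
        source TEXT NOT NULL,
        cursor TEXT NOT NULL DEFAULT '',
        imported_count INTEGER NOT NULL DEFAULT 0,
        completed_at TEXT
    )""")

# re-triggering is safe: the second insert is a no-op
for _ in range(2):
    db.execute(
        "INSERT INTO backfill_progress (collection, source) VALUES (?, ?) "
        "ON CONFLICT (collection) DO NOTHING",
        ("app.bsky.feed.post", "bsky.network"))

# on restart, resume every collection with no completed_at,
# starting from its saved cursor
rows = db.execute(
    "SELECT collection, cursor FROM backfill_progress "
    "WHERE completed_at IS NULL").fetchall()
```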
## admin API

requires bearer token auth (`RELAY_ADMIN_PASSWORD`).

### trigger backfill

```
POST /admin/backfill-collections?source=bsky.network
```

returns 200 with collection count if started, 409 if already running. only one backfill can run at a time.
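the 409 comes from a single-flight guard; a minimal sketch of that behavior (class and method names are hypothetical — the real guard lives in the Zig admin handlers):

```python
import threading

class BackfillGuard:
    """allow at most one backfill at a time; mirror the 200/409 responses."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_start(self):
        # non-blocking acquire: False means a backfill is already running
        if not self._lock.acquire(blocking=False):
            return 409
        return 200

    def finish(self):
        # called when the backfill thread exits, allowing a new trigger
        self._lock.release()
```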
### check status

```
GET /admin/backfill-collections
```

returns JSON:

```json
{
  "running": true,
  "total": 1269,
  "completed": 621,
  "in_progress": 648,
  "total_imported": 13628818,
  "collections": [
    {"collection": "app.bsky.feed.post", "source": "bsky.network", "imported": 28000000, "completed": true},
    {"collection": "app.bsky.feed.like", "source": "bsky.network", "imported": 6732000, "completed": false, "cursor": "did:plc:..."}
  ]
}
```
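the top-level counters are derivable from the per-collection list; a small client-side sketch (field names taken from the status payload, the two-entry JSON here is a trimmed example):

```python
import json

def summarize(payload):
    """derive total/completed/in_progress/total_imported from the list."""
    cols = payload["collections"]
    done = sum(1 for c in cols if c["completed"])
    return {
        "total": len(cols),
        "completed": done,
        "in_progress": len(cols) - done,
        "total_imported": sum(c["imported"] for c in cols),
    }

status = json.loads("""{
  "running": true,
  "collections": [
    {"collection": "app.bsky.feed.post", "imported": 28000000, "completed": true},
    {"collection": "app.bsky.feed.like", "imported": 6732000, "completed": false}
  ]
}""")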
## using the script

the relay repo has a convenience script at `scripts/backfill-status`:

```bash
# check progress (summary with recent incomplete)
./scripts/backfill-status status

# trigger a new backfill
./scripts/backfill-status start [source]

# full JSON output
./scripts/backfill-status full
```

requires `ZLAY_ADMIN_PASSWORD` and `ZLAY_DOMAIN` in `.env`.
## performance characteristics

- collections are processed sequentially (one at a time)
- 100ms pause between pages to avoid hammering the source relay
- one HTTP client is reused across all pages for a given collection
- large collections like `app.bsky.feed.like` (~30M+ DIDs) take 1-2 hours each
- small/niche collections complete in seconds
- full backfill of ~1269 collections takes several hours, dominated by the 5-6 largest `app.bsky.*` collections
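the 1-2 hour figure follows from the page size and pacing; a back-of-the-envelope estimate, assuming ~100ms round-trip latency per request (an assumed number, not measured):

```python
dids = 30_000_000       # ~30M DIDs in app.bsky.feed.like
page_size = 1000        # listReposByCollection limit=1000
pause_s = 0.100         # pause between pages
latency_s = 0.100       # assumed per-request round trip (illustrative)

pages = dids // page_size                # 30,000 pages
total_s = pages * (pause_s + latency_s)  # ~6,000 s
hours = total_s / 3600                   # ~1.7 h, within the 1-2 hour range
```

the fixed 100ms pause alone contributes ~50 minutes for a collection this size, so the pacing, not the index writes, dominates the wall-clock time.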
## re-running

safe to trigger again after completion — existing progress rows are preserved, and completed collections are skipped. useful if new collections appear (e.g. new lexicons published to the network).

## source code

- `src/backfill.zig` — Backfiller struct with all backfill logic
- `src/event_log.zig` — backfill_progress table creation (in `init()`)
- `src/main.zig` — admin route handlers