atproto relay implementation in zig zlay.waow.tech
# collection index backfill

the collection index (RocksDB) only has entries for accounts that have created records since live indexing was deployed. the backfill imports historical data from a source relay that already has a complete `listReposByCollection` endpoint.

## how it works

the backfiller runs as a background thread, triggered via admin API. it:

1. **discovers collections** from two sources (unioned):
   - [lexicon garden](https://lexicon.garden/llms.txt) `llms.txt` — ~700 known NSIDs, parsed from markdown links
   - RocksDB scan — collections already observed from the live firehose (RBC column family prefix scan)

2. **inserts progress rows** into postgres (`backfill_progress` table) for each collection. existing rows are skipped (`ON CONFLICT DO NOTHING`), so re-triggering is safe.

3. **pages through each collection** sequentially, calling `com.atproto.sync.listReposByCollection` on the source relay (default: `bsky.network`) with `limit=1000`. each DID in the response is added to the collection index via `addCollection`. cursor and imported count are persisted after each page for resumability.

4. **marks complete** when a page returns no cursor (no more results).

## progress tracking

```sql
CREATE TABLE backfill_progress (
    collection TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    cursor TEXT NOT NULL DEFAULT '',
    imported_count BIGINT NOT NULL DEFAULT 0,
    completed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

- cursor = last pagination cursor from source relay
- imported_count = total DIDs added for this collection
- completed_at = null while in progress, set when done
- if the process crashes or restarts, it resumes from the saved cursor

## admin API

requires bearer token auth (`RELAY_ADMIN_PASSWORD`).
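as an illustration, a client attaches the password as a bearer token on each admin request. a minimal Python sketch (the host `relay.example.com` and the helper name `admin_request` are hypothetical; the endpoint path and query parameter are from this doc):

```python
import urllib.request

# hypothetical helper: build (but don't send) an authenticated trigger request.
# the /admin/backfill-collections path and ?source= parameter match this doc;
# the host and function name are made up for illustration.
def admin_request(domain: str, password: str, source: str = "bsky.network") -> urllib.request.Request:
    url = f"https://{domain}/admin/backfill-collections?source={source}"
    req = urllib.request.Request(url, method="POST")
    req.add_header("Authorization", f"Bearer {password}")
    return req

req = admin_request("relay.example.com", "hunter2")
print(req.get_method(), req.full_url)
```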
### trigger backfill

```
POST /admin/backfill-collections?source=bsky.network
```

returns 200 with collection count if started, 409 if already running. only one backfill can run at a time.

### check status

```
GET /admin/backfill-collections
```

returns JSON:

```json
{
  "running": true,
  "total": 1269,
  "completed": 621,
  "in_progress": 648,
  "total_imported": 13628818,
  "collections": [
    {"collection": "app.bsky.feed.post", "source": "bsky.network", "imported": 28000000, "completed": true},
    {"collection": "app.bsky.feed.like", "source": "bsky.network", "imported": 6732000, "completed": false, "cursor": "did:plc:..."}
  ]
}
```

## using the script

the relay repo has a convenience script at `scripts/backfill-status`:

```bash
# check progress (summary with recent incomplete)
./scripts/backfill-status status

# trigger a new backfill
./scripts/backfill-status start [source]

# full JSON output
./scripts/backfill-status full
```

requires `ZLAY_ADMIN_PASSWORD` and `ZLAY_DOMAIN` in `.env`.

## performance characteristics

- collections are processed sequentially (one at a time)
- 100ms pause between pages to avoid hammering the source relay
- one HTTP client is reused across all pages for a given collection
- large collections like `app.bsky.feed.like` (~30M+ DIDs) take 1-2 hours each
- small/niche collections complete in seconds
- full backfill of ~1269 collections takes several hours, dominated by the 5-6 largest `app.bsky.*` collections

## re-running

safe to trigger again after completion — existing progress rows are preserved, completed collections are skipped. useful if new collections appear (e.g. new lexicons published to the network).
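the paging, cursor persistence, and skip-on-re-run behavior described above can be sketched as follows. this is illustrative Python, not the real implementation (that lives in `src/backfill.zig`): `fetch_page` stands in for `com.atproto.sync.listReposByCollection` on the source relay, and the `progress` dict stands in for the postgres `backfill_progress` table.

```python
# illustrative sketch of the per-collection paging loop; names are made up,
# the real logic is the Backfiller in src/backfill.zig.
def backfill_collection(collection, progress, fetch_page, add_collection, page_limit=1000):
    row = progress.setdefault(collection, {"cursor": "", "imported_count": 0, "completed": False})
    if row["completed"]:
        return row  # completed collections are skipped on re-trigger
    while True:
        page = fetch_page(collection, cursor=row["cursor"], limit=page_limit)
        for did in page["repos"]:
            add_collection(did, collection)  # add each DID to the collection index
        row["imported_count"] += len(page["repos"])
        next_cursor = page.get("cursor")
        if not next_cursor:  # no cursor in the response => no more results
            row["completed"] = True
            return row
        row["cursor"] = next_cursor  # persist after each page so a restart resumes here

# demo with a stubbed two-page source relay (keyed by cursor):
pages = {"": {"repos": ["did:plc:a", "did:plc:b"], "cursor": "c1"},
         "c1": {"repos": ["did:plc:c"]}}
seen = []
row = backfill_collection("app.bsky.feed.post", {},
                          lambda coll, cursor, limit: pages[cursor],
                          lambda did, coll: seen.append(did))
print(row["imported_count"], row["completed"])  # 3 True
```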
## source code

- `src/backfill.zig` — Backfiller struct with all backfill logic
- `src/event_log.zig` — backfill_progress table creation (in `init()`)
- `src/main.zig` — admin route handlers