# collection index backfill
the collection index (RocksDB) only has entries for accounts that have created records since live indexing was deployed. the backfill imports historical data from a source relay that already serves a complete `listReposByCollection` endpoint.
## how it works
the backfiller runs as a background thread, triggered via the admin API. it:

- discovers collections from two sources (unioned):
  - lexicon garden `llms.txt` — ~700 known NSIDs, parsed from markdown links
  - RocksDB scan — collections already observed from the live firehose (RBC column family prefix scan)
- inserts progress rows into postgres (`backfill_progress` table) for each collection. existing rows are skipped (`ON CONFLICT DO NOTHING`), so re-triggering is safe.
- pages through each collection sequentially, calling `com.atproto.sync.listReposByCollection` on the source relay (default: `bsky.network`) with `limit=1000`. each DID in the response is added to the collection index via `addCollection`. the cursor and imported count are persisted after each page for resumability.
- marks a collection complete when a page returns no cursor (no more results).
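the paging loop above can be sketched as follows. this is a minimal sketch, not the real implementation (which lives in `src/backfill.zig` and is written in Zig): `fetch_page`, `add_to_index`, and `save_progress` are hypothetical callbacks standing in for the HTTP call, the RocksDB `addCollection` write, and the postgres progress update.

```python
from typing import Callable, Optional

def backfill_collection(
    fetch_page: Callable[[str, Optional[str]], dict],
    add_to_index: Callable[[str, str], None],
    save_progress: Callable[[str, str, int], None],
    collection: str,
    start_cursor: Optional[str] = None,
) -> int:
    """Page through listReposByCollection-style results until no cursor remains.

    fetch_page(collection, cursor) -> {"repos": [{"did": ...}], "cursor": ...}
    (response shape assumed from com.atproto.sync.listReposByCollection)
    """
    cursor = start_cursor
    imported = 0
    while True:
        page = fetch_page(collection, cursor)
        for repo in page.get("repos", []):
            add_to_index(repo["did"], collection)  # stand-in for addCollection
            imported += 1
        cursor = page.get("cursor")
        # persist cursor + count after every page so a crash resumes from here
        save_progress(collection, cursor or "", imported)
        if not cursor:
            return imported  # no cursor => collection complete
```

passing `start_cursor` from the saved progress row is what makes a restart resume mid-collection instead of starting over.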
## progress tracking
```sql
CREATE TABLE backfill_progress (
    collection TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    cursor TEXT NOT NULL DEFAULT '',
    imported_count BIGINT NOT NULL DEFAULT 0,
    completed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```
- `cursor` = last pagination cursor from the source relay
- `imported_count` = total DIDs added for this collection
- `completed_at` = null while in progress, set when done
- if the process crashes or restarts, it resumes from the saved cursor
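the seed-then-update pattern can be sketched like this. sqlite3 is used here purely as a stand-in for postgres (its `ON CONFLICT` upsert semantics are close enough for illustration); the table is a trimmed copy of the schema above.

```python
import sqlite3  # stand-in for postgres in this sketch

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE backfill_progress (
        collection TEXT PRIMARY KEY,
        source TEXT NOT NULL,
        cursor TEXT NOT NULL DEFAULT '',
        imported_count INTEGER NOT NULL DEFAULT 0,
        completed_at TEXT
    )
""")

def seed(collection: str, source: str) -> None:
    # existing rows are preserved, so re-triggering a backfill is safe
    conn.execute(
        "INSERT INTO backfill_progress (collection, source) VALUES (?, ?) "
        "ON CONFLICT(collection) DO NOTHING",
        (collection, source),
    )

def save_page(collection: str, cursor: str, imported: int) -> None:
    # called after every page so a crash resumes from the saved cursor
    conn.execute(
        "UPDATE backfill_progress SET cursor = ?, imported_count = ? "
        "WHERE collection = ?",
        (cursor, imported, collection),
    )

seed("app.bsky.feed.like", "bsky.network")
seed("app.bsky.feed.like", "bsky.network")  # no-op: row already exists
save_page("app.bsky.feed.like", "did:plc:xyz", 1000)
```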
## admin API
requires bearer token auth (`RELAY_ADMIN_PASSWORD`).
### trigger backfill
```
POST /admin/backfill-collections?source=bsky.network
```
returns 200 with collection count if started, 409 if already running. only one backfill can run at a time.
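as a sketch, building that request from Python with only the standard library (the domain value is hypothetical; `build_trigger_request` is a helper invented for this example, not part of the relay):

```python
import urllib.request

def build_trigger_request(domain: str, password: str,
                          source: str = "bsky.network") -> urllib.request.Request:
    """POST that starts a backfill: expect 200 if started, 409 if already running."""
    return urllib.request.Request(
        f"https://{domain}/admin/backfill-collections?source={source}",
        method="POST",
        headers={"Authorization": f"Bearer {password}"},
    )

# usage (hypothetical values; send with urllib.request.urlopen(req)):
req = build_trigger_request("relay.example.com", "s3cret")
```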
### check status
```
GET /admin/backfill-collections
```
returns JSON:
```json
{
  "running": true,
  "total": 1269,
  "completed": 621,
  "in_progress": 648,
  "total_imported": 13628818,
  "collections": [
    {"collection": "app.bsky.feed.post", "source": "bsky.network", "imported": 28000000, "completed": true},
    {"collection": "app.bsky.feed.like", "source": "bsky.network", "imported": 6732000, "completed": false, "cursor": "did:plc:..."}
  ]
}
```
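a one-line summary can be derived from the top-level fields shown above; a minimal sketch (the `summarize` helper is invented for illustration):

```python
import json

def summarize(status: dict) -> str:
    # percent complete from the top-level counters in the status JSON
    pct = 100 * status["completed"] / status["total"] if status["total"] else 0.0
    return (f"{status['completed']}/{status['total']} collections complete "
            f"({pct:.1f}%), {status['total_imported']:,} DIDs imported")

sample = json.loads("""
{"running": true, "total": 1269, "completed": 621,
 "in_progress": 648, "total_imported": 13628818, "collections": []}
""")
print(summarize(sample))
# → 621/1269 collections complete (48.9%), 13,628,818 DIDs imported
```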
## using the script
the relay repo has a convenience script at `scripts/backfill-status`:
```sh
# check progress (summary with recent incomplete)
./scripts/backfill-status status

# trigger a new backfill
./scripts/backfill-status start [source]

# full JSON output
./scripts/backfill-status full
```
requires `ZLAY_ADMIN_PASSWORD` and `ZLAY_DOMAIN` in `.env`.
## performance characteristics
- collections are processed sequentially (one at a time)
- 100ms pause between pages to avoid hammering the source relay
- one HTTP client is reused across all pages for a given collection
- large collections like `app.bsky.feed.like` (~30M+ DIDs) take 1-2 hours each
- small/niche collections complete in seconds
- a full backfill of ~1269 collections takes several hours, dominated by the 5-6 largest `app.bsky.*` collections
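the 1-2 hour figure is consistent with the paging parameters. a back-of-envelope check, assuming ~100ms of request latency per page on top of the 100ms pause (the latency is an assumption, not a measured number):

```python
dids = 30_000_000   # approximate size of app.bsky.feed.like
page_size = 1_000   # limit=1000 per listReposByCollection call
pause_s = 0.1       # documented pause between pages
request_s = 0.1     # assumed round-trip latency per page (not measured)

pages = dids // page_size
hours = pages * (pause_s + request_s) / 3600
print(f"{pages} pages, ~{hours:.1f} hours")
# → 30000 pages, ~1.7 hours
```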
## re-running
safe to trigger again after completion — existing progress rows are preserved, completed collections are skipped. useful if new collections appear (e.g. new lexicons published to the network).
## source code
- `src/backfill.zig` — Backfiller struct with all backfill logic
- `src/event_log.zig` — `backfill_progress` table creation (in `init()`)
- `src/main.zig` — admin route handlers