atproto relay implementation in zig zlay.waow.tech

collection index backfill

the collection index (RocksDB) has entries only for accounts that have created records since live indexing was deployed. the backfill fills in the historical gap by importing from a source relay whose com.atproto.sync.listReposByCollection endpoint already covers the full network.

how it works

the backfiller runs as a background thread, triggered via the admin API. it:

  1. discovers collections from two sources (unioned):

    • lexicon garden llms.txt — ~700 known NSIDs, parsed from markdown links
    • RocksDB scan — collections already observed from the live firehose (RBC column family prefix scan)
  2. inserts progress rows into postgres (backfill_progress table) for each collection. existing rows are skipped (ON CONFLICT DO NOTHING), so re-triggering is safe.

  3. pages through each collection sequentially, calling com.atproto.sync.listReposByCollection on the source relay (default: bsky.network) with limit=1000. each DID in the response is added to the collection index via addCollection. cursor and imported count are persisted after each page for resumability.

  4. marks complete when a page returns no cursor (no more results).
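the per-collection loop in steps 3-4 can be sketched roughly as follows. this is a python sketch, not the zig implementation (that lives in src/backfill.zig); `fetch_page`, `add_to_index`, and `save_progress` are hypothetical stand-ins for the source-relay HTTP call, the RocksDB addCollection write, and the postgres progress update:

```python
import time

PAGE_LIMIT = 1000
PAUSE_SECONDS = 0.1  # 100ms pause between pages

def backfill_collection(collection, fetch_page, add_to_index, save_progress,
                        cursor="", imported=0, pause=PAUSE_SECONDS):
    """Page through listReposByCollection until the source stops
    returning a cursor, persisting progress after every page."""
    while True:
        page = fetch_page(collection, cursor=cursor, limit=PAGE_LIMIT)
        for repo in page["repos"]:
            add_to_index(collection, repo["did"])
            imported += 1
        # persisting cursor + count here is what makes a crash resumable
        cursor = page.get("cursor", "")
        save_progress(collection, cursor, imported)
        if not cursor:  # no cursor in the response means no more results
            return imported
        time.sleep(pause)
```

starting from a saved `cursor` and `imported` count resumes an interrupted run without re-fetching earlier pages.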

progress tracking

CREATE TABLE backfill_progress (
    collection TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    cursor TEXT NOT NULL DEFAULT '',
    imported_count BIGINT NOT NULL DEFAULT 0,
    completed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
  • cursor = last pagination cursor from source relay
  • imported_count = total DIDs added for this collection
  • completed_at = null while in progress, set when done
  • if the process crashes or restarts, it resumes from the saved cursor
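the lifecycle of these rows can be demoed with sqlite standing in for postgres (sqlite supports the same ON CONFLICT syntax for this schema); `seed` and `resume_points` are illustrative helpers, not names from the relay:

```python
import sqlite3

# simplified schema; the real table uses BIGINT/TIMESTAMPTZ in postgres
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE backfill_progress (
    collection TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    cursor TEXT NOT NULL DEFAULT '',
    imported_count INTEGER NOT NULL DEFAULT 0,
    completed_at TEXT)""")

def seed(collection, source):
    # re-triggering is safe: an existing row keeps its cursor and count
    db.execute("""INSERT INTO backfill_progress (collection, source)
                  VALUES (?, ?) ON CONFLICT(collection) DO NOTHING""",
               (collection, source))

def resume_points():
    # rows with completed_at still NULL resume from their saved cursor
    return db.execute("""SELECT collection, cursor FROM backfill_progress
                         WHERE completed_at IS NULL""").fetchall()
```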

admin API

requires bearer token auth (RELAY_ADMIN_PASSWORD).

trigger backfill

POST /admin/backfill-collections?source=bsky.network

returns 200 with the collection count if started, or 409 if a backfill is already running. only one backfill can run at a time.
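one way to build this request from python rather than the provided shell script; the relay URL here is a hypothetical deployment address, the path and bearer auth scheme are as described above:

```python
import urllib.request

RELAY_URL = "https://zlay.waow.tech"  # hypothetical; use your deployment

def trigger_backfill(admin_password, source="bsky.network"):
    """Build the POST request to start a backfill from `source`."""
    req = urllib.request.Request(
        f"{RELAY_URL}/admin/backfill-collections?source={source}",
        method="POST",
        headers={"Authorization": f"Bearer {admin_password}"},
    )
    return req  # pass to urllib.request.urlopen(req) to send
```

sending it with `urlopen` raises `HTTPError` on the 409, which a caller can catch to detect an already-running backfill.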

check status

GET /admin/backfill-collections

returns JSON:

{
  "running": true,
  "total": 1269,
  "completed": 621,
  "in_progress": 648,
  "total_imported": 36628818,
  "collections": [
    {"collection": "app.bsky.feed.post", "source": "bsky.network", "imported": 28000000, "completed": true},
    {"collection": "app.bsky.feed.like", "source": "bsky.network", "imported": 6732000, "completed": false, "cursor": "did:plc:..."}
  ]
}
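a small helper for turning that response into a one-line summary; purely illustrative, but the field names match the JSON shown above:

```python
def summarize(status):
    """Render the status JSON as a short progress line."""
    done = status["completed"]
    total = status["total"]
    pct = 100 * done / total if total else 0.0
    return (f"{'running' if status['running'] else 'idle'}: "
            f"{done}/{total} collections ({pct:.1f}%), "
            f"{status['total_imported']:,} DIDs imported")
```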

using the script

the relay repo has a convenience script at scripts/backfill-status:

# check progress (summary with recent incomplete)
./scripts/backfill-status status

# trigger a new backfill
./scripts/backfill-status start [source]

# full JSON output
./scripts/backfill-status full

requires ZLAY_ADMIN_PASSWORD and ZLAY_DOMAIN in .env.

performance characteristics

  • collections are processed sequentially (one at a time)
  • 100ms pause between pages to avoid hammering the source relay
  • one HTTP client is reused across all pages for a given collection
  • large collections like app.bsky.feed.like (~30M+ DIDs) take 1-2 hours each
  • small/niche collections complete in seconds
  • full backfill of ~1269 collections takes several hours, dominated by the 5-6 largest app.bsky.* collections
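the "1-2 hours" figure checks out on the back of an envelope, assuming roughly 100ms of request latency per page on top of the 100ms pause (the latency number is an assumption, not measured):

```python
dids = 30_000_000            # a large collection like app.bsky.feed.like
pages = dids // 1000         # limit=1000 per page -> 30,000 requests
secs_per_page = 0.1 + 0.1    # 100ms pause + ~100ms assumed latency
hours = pages * secs_per_page / 3600
```

at these assumptions `hours` comes out around 1.7, and page latency is the dominant lever: the pause alone accounts for 50 minutes.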

re-running

safe to trigger again after completion — existing progress rows are preserved, completed collections are skipped. useful if new collections appear (e.g. new lexicons published to the network).

source code

  • src/backfill.zig — Backfiller struct with all backfill logic
  • src/event_log.zig — backfill_progress table creation (in init())
  • src/main.zig — admin route handlers