Running an ATProto Relay for ATCR Hold Discovery#
This document explains what it takes to run an ATProto relay for indexing ATCR hold records, including infrastructure requirements, configuration, and trade-offs.
Overview#
What is an ATProto Relay?#
An ATProto relay is a service that:
- Subscribes to multiple PDS hosts and aggregates their data streams
- Outputs a combined "firehose" event stream for real-time network updates
- Validates data integrity and identity signatures
- Provides discovery endpoints like `com.atproto.sync.listReposByCollection`
The relay acts as a network-wide indexer, making it possible to discover which DIDs have records of specific types (collections).
Why ATCR Needs a Relay#
ATCR uses hold captain records (io.atcr.hold.captain) stored in hold PDSs to enable hold discovery. The listReposByCollection endpoint allows AppViews to efficiently discover all holds in the network without crawling every PDS individually.
The problem: Standard Bluesky relays appear to only index collections from did:plc DIDs, not did:web DIDs. Since ATCR holds use did:web (e.g., did:web:hold01.atcr.io), they aren't discoverable via Bluesky's public relays.
Recommended Approach: Phased Implementation#
ATCR's discovery needs evolve as the network grows. Start simple, scale as needed.
MVP: Minimal Discovery Service#
For initial deployment with a small number of holds (dozens, not thousands), build a lightweight custom discovery service focused solely on io.atcr.* collections.
Why Minimal Service for MVP?#
- Scope: Only index `io.atcr.*` collections (manifests, tags, captain/crew, sailor profiles)
- Opt-in: Only crawls PDSs that explicitly call `requestCrawl`
- Small scale: Dozens of holds, not millions of users
- Simple storage: SQLite sufficient for current scale
- Cost-effective: $5-10/month VPS
Architecture#
Inbound endpoints:
```
POST /xrpc/com.atproto.sync.requestCrawl
  → Hold registers itself for crawling

GET /xrpc/com.atproto.sync.listReposByCollection?collection=io.atcr.hold.captain
  → AppView discovers holds
```
Outbound (client to PDS):
1. com.atproto.repo.describeRepo → verify PDS exists
2. com.atproto.sync.getRepo → fetch full CAR file (initial backfill)
3. com.atproto.sync.subscribeRepos → WebSocket for real-time updates
4. Parse events → extract io.atcr.* records → index in SQLite
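The extract-and-filter logic in step 4 reduces to splitting the record path and checking the collection NSID prefix. A minimal sketch (the `atcrRecordKey` helper name is illustrative, not part of the codebase):

```go
package main

import (
	"fmt"
	"strings"
)

// atcrRecordKey splits a repo record path ("<collection>/<rkey>") and reports
// whether the record belongs to an io.atcr.* collection worth indexing.
func atcrRecordKey(path string) (collection, rkey string, ok bool) {
	parts := strings.SplitN(path, "/", 2)
	if len(parts) != 2 || parts[1] == "" {
		return "", "", false // malformed path
	}
	if !strings.HasPrefix(parts[0], "io.atcr.") {
		return "", "", false // not an ATCR collection
	}
	return parts[0], parts[1], true
}

func main() {
	c, r, ok := atcrRecordKey("io.atcr.hold.captain/self")
	fmt.Println(c, r, ok)
	_, _, ok = atcrRecordKey("app.bsky.feed.post/3k2aexample")
	fmt.Println(ok)
}
```

The same predicate serves both the CAR backfill and the live event stream, so it is worth keeping in one place.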
Data flow:
Initial crawl (on requestCrawl):
1. Hold POSTs requestCrawl → service queues crawl job
2. Service fetches getRepo (CAR file) from hold's PDS for backfill
3. Service parses CAR using indigo libraries
4. Service extracts io.atcr.* records (captain, crew, manifests, etc.)
5. Service stores: (did, collection, rkey, record_data) in SQLite
6. Service opens WebSocket to subscribeRepos for this DID
7. Service stores cursor for reconnection handling
Ongoing updates (WebSocket):
1. Receive commit events via subscribeRepos WebSocket
2. Parse event, filter to io.atcr.* collections only
3. Update indexed_records incrementally (insert/update/delete)
4. Update cursor after processing each event
5. On disconnect: reconnect with stored cursor to resume
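On reconnect (step 5), the stored cursor is passed as a query parameter so the PDS replays the events the service missed, per `com.atproto.sync.subscribeRepos`. A sketch of the URL construction (`subscribeURL` is a hypothetical helper; the negative-cursor convention for "no stored cursor" is this sketch's own):

```go
package main

import (
	"fmt"
	"net/url"
)

// subscribeURL builds the subscribeRepos WebSocket URL, resuming from the
// stored cursor when one exists (cursor < 0 means "start from live tail").
func subscribeURL(hostname string, cursor int64) string {
	u := url.URL{
		Scheme: "wss",
		Host:   hostname,
		Path:   "/xrpc/com.atproto.sync.subscribeRepos",
	}
	if cursor >= 0 {
		q := u.Query()
		q.Set("cursor", fmt.Sprintf("%d", cursor))
		u.RawQuery = q.Encode()
	}
	return u.String()
}

func main() {
	fmt.Println(subscribeURL("hold01.atcr.io", 42))
	fmt.Println(subscribeURL("hold01.atcr.io", -1))
}
```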
Discovery (AppView query):
1. AppView GETs listReposByCollection?collection=io.atcr.hold.captain
2. Service queries SQLite WHERE collection='io.atcr.hold.captain'
3. Service returns list of DIDs with that collection
Implementation Requirements#
Technologies:
- Go (reuse indigo libraries for CAR parsing and WebSocket)
- SQLite (sufficient for dozens/hundreds of holds)
- Standard HTTP server + WebSocket client
Core components:
- HTTP handlers (`cmd/atcr-discovery/handlers/`):
  - `requestCrawl` - queue crawl jobs
  - `listReposByCollection` - query indexed collections
- Crawler (`pkg/discovery/crawler.go`):
  - Fetch CAR files from PDSs for initial backfill
  - Parse with `github.com/bluesky-social/indigo/repo`
  - Extract records, filter to `io.atcr.*` only
- WebSocket subscriber (`pkg/discovery/subscriber.go`):
  - WebSocket client for `com.atproto.sync.subscribeRepos`
  - Event parsing and filtering
  - Cursor management and persistence
  - Automatic reconnection with resume
- Storage (`pkg/discovery/storage.go`):
  - SQLite schema for indexed records
  - Indexes on (collection, did) for fast queries
  - Cursor storage for reconnection
- Worker (`pkg/discovery/worker.go`):
  - Background crawl job processor
  - WebSocket connection manager
  - Health monitoring for subscriptions
Database schema:
```sql
CREATE TABLE indexed_records (
    did         TEXT NOT NULL,
    collection  TEXT NOT NULL,
    rkey        TEXT NOT NULL,
    record_data TEXT NOT NULL,              -- JSON
    indexed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (did, collection, rkey)
);

CREATE INDEX idx_collection ON indexed_records(collection);
CREATE INDEX idx_did ON indexed_records(did);

CREATE TABLE crawl_queue (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    hostname        TEXT NOT NULL UNIQUE,
    did             TEXT,
    status          TEXT DEFAULT 'pending', -- pending, in_progress, subscribed, failed
    last_crawled_at TIMESTAMP,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE subscriptions (
    did           TEXT PRIMARY KEY,
    hostname      TEXT NOT NULL,
    cursor        INTEGER,                  -- Last processed sequence number
    status        TEXT DEFAULT 'active',    -- active, disconnected, failed
    last_event_at TIMESTAMP,
    created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
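Against this schema, the two hot-path statements are the upsert shared by backfill and live events, and the lookup behind `listReposByCollection`. Hedged sketches (SQLite syntax; `ON CONFLICT ... DO UPDATE` requires SQLite 3.24+):

```sql
-- Upsert used by both backfill and live commit events
INSERT INTO indexed_records (did, collection, rkey, record_data)
VALUES (?, ?, ?, ?)
ON CONFLICT (did, collection, rkey)
DO UPDATE SET record_data = excluded.record_data,
              indexed_at  = CURRENT_TIMESTAMP;

-- Query backing listReposByCollection
SELECT DISTINCT did FROM indexed_records WHERE collection = ?;
```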
Leveraging indigo libraries:
```go
import (
	"bytes"
	"context"
	"fmt"
	"strings"

	comatproto "github.com/bluesky-social/indigo/api/atproto"
	"github.com/bluesky-social/indigo/events"
	"github.com/bluesky-social/indigo/events/schedulers/sequential"
	"github.com/bluesky-social/indigo/repo"
	"github.com/gorilla/websocket"
	"github.com/ipfs/go-cid"
)

// Initial backfill: parse the CAR file fetched via getRepo
r, err := repo.ReadRepoFromCar(ctx, bytes.NewReader(carData))
if err != nil {
	return err
}

// Iterate all records in the repo
err = r.ForEach(ctx, "", func(path string, nodeCid cid.Cid) error {
	// path is "<collection>/<rkey>", e.g. "io.atcr.hold.captain/self"
	parts := strings.SplitN(path, "/", 2)
	if len(parts) != 2 {
		return nil // skip invalid paths
	}
	collection, rkey := parts[0], parts[1]

	// Filter to io.atcr.* only
	if !strings.HasPrefix(collection, "io.atcr.") {
		return nil
	}

	// GetRecord returns the record as a cbor-gen value; serialize it
	// (e.g. to JSON) before handing it to the storage layer
	_, rec, err := r.GetRecord(ctx, path)
	if err != nil {
		return err
	}

	// store.IndexRecord is the service's own storage helper
	return store.IndexRecord(did, collection, rkey, rec)
})

// WebSocket subscription: listen for live updates
wsURL := fmt.Sprintf("wss://%s/xrpc/com.atproto.sync.subscribeRepos", hostname)
conn, _, err := websocket.DefaultDialer.Dial(wsURL, nil)
if err != nil {
	return err
}

// Handle commit events; each op carries a "<collection>/<rkey>" path, and
// record bytes for creates/updates travel in evt.Blocks (a CAR slice)
rsc := &events.RepoStreamCallbacks{
	RepoCommit: func(evt *comatproto.SyncSubscribeRepos_Commit) error {
		for _, op := range evt.Ops {
			parts := strings.SplitN(op.Path, "/", 2)
			if len(parts) != 2 || !strings.HasPrefix(parts[0], "io.atcr.") {
				continue
			}
			collection, rkey := parts[0], parts[1]

			switch op.Action {
			case "create", "update":
				rr, err := repo.ReadRepoFromCar(ctx, bytes.NewReader(evt.Blocks))
				if err != nil {
					return err
				}
				_, rec, err := rr.GetRecord(ctx, op.Path)
				if err != nil {
					return err
				}
				if err := store.IndexRecord(evt.Repo, collection, rkey, rec); err != nil {
					return err
				}
			case "delete":
				if err := store.DeleteRecord(evt.Repo, collection, rkey); err != nil {
					return err
				}
			}
		}
		// Persist the cursor so reconnects can resume from evt.Seq
		return store.UpdateCursor(evt.Repo, evt.Seq)
	},
}

// Process the stream (newer indigo versions also take a logger argument)
sched := sequential.NewScheduler(hostname, rsc.EventHandler)
return events.HandleRepoStream(ctx, conn, sched)
```
Infrastructure Requirements#
Minimum specs:
- 1 vCPU
- 1-2GB RAM
- 20GB SSD
- Minimal bandwidth (<1GB/day for dozens of holds)
Estimated cost:
- Hetzner CX11: €4.15/month (~$5/month)
- DigitalOcean Basic: $6/month
- Fly.io: ~$5-10/month
Deployment:
```bash
# Build
go build -o atcr-discovery ./cmd/atcr-discovery

# Run
export DATABASE_PATH="/var/lib/atcr-discovery/discovery.db"
export HTTP_ADDR=":8080"
./atcr-discovery
```
Limitations#
What it does NOT do:
- ❌ Serve an outbound `subscribeRepos` firehose (AppViews query via `listReposByCollection` instead)
- ❌ Full MST validation (trusts PDS validation)
- ❌ Scale to millions of accounts (SQLite limits)
- ❌ Multi-instance deployment (single process with SQLite)
When to migrate to full relay: When you have 1000+ holds, need PostgreSQL, or multi-instance deployment.
Future Scale: Full Relay (Sync v1.1)#
When ATCR grows beyond dozens of holds and needs real-time indexing, migrate to Bluesky's relay v1.1 implementation.
When to Upgrade#
Indicators:
- 100+ holds requesting frequent crawls
- Need real-time updates (re-crawl latency too high)
- Multiple AppView instances need coordinated discovery
- SQLite performance becomes bottleneck
Relay v1.1 Characteristics#
Released May 2025, this is Bluesky's current reference implementation.
Key features:
- Non-archival: Doesn't mirror full repository data, only processes firehose
- WebSocket subscriptions: Real-time updates from PDSs
- Scalable: 2 vCPU, 12GB RAM handles ~100M accounts
- PostgreSQL: Required for production scale
- Admin UI: Web dashboard for management
Source: github.com/bluesky-social/indigo/cmd/relay
Migration Path#
Step 1: Deploy relay v1.1
```bash
git clone https://github.com/bluesky-social/indigo.git
cd indigo
go build -o relay ./cmd/relay
export DATABASE_URL="postgres://relay:password@localhost:5432/atcr_relay"
./relay --admin-password="secure-password"
```
Step 2: Migrate data
- Export indexed records from SQLite
- Trigger crawls in relay for all known holds
- Verify relay indexes correctly
Step 3: Update AppView configuration
```bash
# Point to new relay
export ATCR_RELAY_ENDPOINT="https://relay.atcr.io"
```
Step 4: Decommission minimal service
- Monitor relay for stability
- Shut down old discovery service
Infrastructure Requirements (Full Relay)#
Minimum specs:
- 2 vCPU cores
- 12GB RAM
- 100GB SSD
- 30 Mbps bandwidth
Estimated cost:
- Hetzner: ~$30-40/month
- DigitalOcean: ~$50/month (with managed PostgreSQL)
- Fly.io: ~$35-50/month
Collection Indexing: The collectiondir Microservice#
The com.atproto.sync.listReposByCollection endpoint is not part of the relay core. It's provided by a separate microservice called collectiondir.
What is collectiondir?#
- Separate service that indexes collections for efficient discovery
- Optional: Not required by the ATProto spec, but very useful for AppViews
- Deployed alongside relay by Bluesky's public instances
Current Limitation: did:plc Only?#
Based on testing, Bluesky's public relays (with collectiondir) appear to:
- ✅ Index `io.atcr.*` collections from `did:plc` DIDs
- ❌ NOT index `io.atcr.*` collections from `did:web` DIDs
This means:
- ATCR manifests from users (did:plc) are discoverable
- ATCR hold captain records (did:web) are NOT discoverable
- The relay still stores all data (CAR file includes did:web records)
- The issue is specifically with indexing for `listReposByCollection`
Configuring collectiondir#
Documentation on configuring collectiondir is sparse. Possible approaches:
- Fork and modify: Clone indigo repo, modify collectiondir to index all DIDs
- Configuration file: Check if collectiondir accepts whitelist/configuration for indexed collections
- No filtering: Default behavior might be to index everything, but Bluesky's deployment filters
Action item: Review indigo/cmd/collectiondir source code to understand configuration options.
Multi-Relay Strategy#
Holds can request crawls from multiple relays simultaneously, which enables setups like the following.
Scenario: Bluesky + ATCR Relays#
Setup:
- Hold deploys with embedded PDS at `did:web:hold01.atcr.io`
- Hold creates captain record (`io.atcr.hold.captain/self`)
- Hold requests crawl from both:
  - Bluesky relay: `https://bsky.network/xrpc/com.atproto.sync.requestCrawl`
  - ATCR relay: `https://relay.atcr.io/xrpc/com.atproto.sync.requestCrawl`
Result:
- ✅ Bluesky relay indexes social posts (if hold owner posts)
- ✅ ATCR relay indexes hold captain records
- ✅ AppViews query ATCR relay for hold discovery
- ✅ Independent networks - Bluesky posts work regardless of ATCR relay
Request Crawl Script#
The existing script can be modified to support multiple relays:
```bash
#!/bin/bash
# deploy/request-crawl.sh

HOSTNAME=$1
BLUESKY_RELAY=${2:-"https://bsky.network"}
ATCR_RELAY=${3:-"https://relay.atcr.io"}

echo "Requesting crawl for $HOSTNAME from Bluesky relay..."
curl -X POST "$BLUESKY_RELAY/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOSTNAME\"}"

echo "Requesting crawl for $HOSTNAME from ATCR relay..."
curl -X POST "$ATCR_RELAY/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOSTNAME\"}"
```
Usage:
```bash
./deploy/request-crawl.sh hold01.atcr.io
```
Deployment: Minimal Discovery Service#
1. Infrastructure Setup#
Provision VPS:
- Hetzner CX11, DigitalOcean Basic, or Fly.io
- Public domain (e.g., `discovery.atcr.io`)
- TLS certificate (Let's Encrypt)
Configure reverse proxy (optional - nginx):
```nginx
upstream discovery {
    server 127.0.0.1:8080;
}

server {
    listen 443 ssl http2;
    server_name discovery.atcr.io;

    ssl_certificate     /etc/letsencrypt/live/discovery.atcr.io/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/discovery.atcr.io/privkey.pem;

    location / {
        proxy_pass http://discovery;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
2. Build and Deploy#
```bash
# Clone ATCR repo
git clone https://github.com/atcr-io/atcr.git
cd atcr

# Build discovery service
go build -o atcr-discovery ./cmd/atcr-discovery

# Run
export DATABASE_PATH="/var/lib/atcr-discovery/discovery.db"
export HTTP_ADDR=":8080"
export CRAWL_INTERVAL="12h"
./atcr-discovery
```
3. Update Hold Startup#
Each hold should request crawl on startup:
```bash
# In hold startup script or environment
export ATCR_DISCOVERY_URL="https://discovery.atcr.io"

# Request crawl from both Bluesky and ATCR
curl -X POST "https://bsky.network/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOLD_PUBLIC_URL\"}"

curl -X POST "$ATCR_DISCOVERY_URL/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOLD_PUBLIC_URL\"}"
```
4. Update AppView Configuration#
Point AppView discovery worker to the discovery service:
```bash
# In .env.appview or environment
export ATCR_RELAY_ENDPOINT="https://discovery.atcr.io"
export ATCR_HOLD_DISCOVERY_ENABLED="true"
export ATCR_HOLD_DISCOVERY_INTERVAL="6h"
```
5. Monitor and Maintain#
Monitoring:
- Check crawl queue status
- Monitor SQLite database size
- Track failed crawls
Maintenance:
- Re-crawl on schedule (every 6-24 hours)
- Prune stale records (>7 days old)
- Backup SQLite database regularly
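The scheduled prune from the maintenance list can be a single statement (the 7-day window matches the retention suggested above; SQLite's `datetime` modifier syntax):

```sql
-- Prune records that have not been re-indexed in the last 7 days
DELETE FROM indexed_records
WHERE indexed_at < datetime('now', '-7 days');
```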
Trade-Offs and Considerations#
Running Your Own Relay#
Pros:
- ✅ Full control over indexing (can index `did:web` holds)
- ✅ No dependency on third-party relay policies
- ✅ Can customize collection filters for ATCR-specific needs
- ✅ Relatively lightweight with modern relay implementation
Cons:
- ❌ Infrastructure cost (~$30-50/month minimum)
- ❌ Operational overhead (monitoring, updates, backups)
- ❌ Need to maintain as network grows
- ❌ Single point of failure for discovery (unless multi-relay)
Alternatives to Running a Relay#
1. Direct Registration API#
Holds POST to AppView on startup to register themselves:
Pros:
- ✅ Simplest implementation
- ✅ No relay infrastructure needed
- ✅ Immediate registration (no crawl delay)
Cons:
- ❌ Ties holds to specific AppView instances
- ❌ Breaks decentralized discovery model
- ❌ Each AppView has different hold registry
2. Static Discovery File#
Maintain https://atcr.io/.well-known/holds.json:
Pros:
- ✅ No infrastructure beyond static hosting
- ✅ All AppViews share same registry
- ✅ Simple to implement
Cons:
- ❌ Manual process (PRs/issues to add holds)
- ❌ Not real-time discovery
- ❌ Centralized control point
3. Hybrid Approach#
Combine multiple discovery mechanisms:
```go
func (w *HoldDiscoveryWorker) DiscoverHolds(ctx context.Context) error {
	// 1. Fetch static registry
	staticHolds := w.fetchStaticRegistry()

	// 2. Query relay (if available)
	relayHolds := w.queryRelay(ctx)

	// 3. Accept direct registrations
	registeredHolds := w.getDirectRegistrations()

	// Merge and deduplicate
	allHolds := mergeHolds(staticHolds, relayHolds, registeredHolds)

	// Cache in database
	for _, hold := range allHolds {
		w.cacheHold(hold)
	}
	return nil
}
```
Pros:
- ✅ Multiple discovery paths (resilient)
- ✅ Gradual migration to relay-based discovery
- ✅ Supports both centralized bootstrap and decentralized growth
Cons:
- ❌ More complex implementation
- ❌ Potential for stale data if sources conflict
Recommendations for ATCR#
Phase 1: MVP (Now - 1000 holds)#
Build minimal discovery service with WebSocket (~$5-10/month):
- Implement `requestCrawl` + `listReposByCollection` endpoints
- Initial backfill via `getRepo` (CAR file parsing)
- Real-time updates via WebSocket `subscribeRepos`
- SQLite storage with cursor management
- Filter to `io.atcr.*` collections only
Deliverables:
- `cmd/atcr-discovery` service
- SQLite schema with cursor storage
- CAR file parser (indigo libraries)
- WebSocket subscriber with reconnection
- Deployment scripts
Cost: ~$5-10/month VPS
Why: Minimal infrastructure, real-time updates, full control over indexing, sufficient for hundreds of holds.
Phase 2: Migrate to Full Relay (1000+ holds)#
Deploy Bluesky relay v1.1 when scaling needed (~$30-50/month):
- Set up PostgreSQL database
- Deploy indigo relay with admin UI
- Migrate indexed data from SQLite
- Configure for `io.atcr.*` collection filtering (if possible)
- Handle thousands of concurrent WebSocket connections
Cost: ~$30-50/month
Why: Proven scalability to 100M+ accounts, standardized protocol, community support, production-ready infrastructure.
Phase 3: Multi-Relay Federation (Future)#
Decentralized relay network:
- Multiple ATCR relays operated independently
- AppViews query multiple relays (fallback/redundancy)
- Holds request crawls from all known ATCR relays
- Cross-relay synchronization (optional)
Why: No single point of failure, fully decentralized discovery, geographic distribution.
Next Steps#
For MVP Implementation#
- Create `cmd/atcr-discovery` package structure
  - HTTP handlers for XRPC endpoints (`requestCrawl`, `listReposByCollection`)
  - Crawler with indigo CAR parsing for initial backfill
  - WebSocket subscriber for real-time updates
  - SQLite storage layer with cursor management
  - Background worker for managing subscriptions
- Database schema
  - `indexed_records` table for collection data
  - `crawl_queue` table for crawl job management
  - `subscriptions` table for WebSocket cursor tracking
  - Indexes for efficient queries
- WebSocket implementation
  - Use `github.com/bluesky-social/indigo/events` for event handling
  - Implement reconnection logic with cursor resume
  - Filter events to `io.atcr.*` collections only
  - Health monitoring for active subscriptions
- Testing strategy
  - Unit tests for CAR parsing
  - Unit tests for event filtering
  - Integration tests with mock PDSs and WebSocket
  - Connection failure and reconnection testing
  - Load testing with SQLite
- Deployment
  - Dockerfile for discovery service
  - Deployment scripts (systemd, docker-compose)
  - Monitoring setup (logs, metrics, WebSocket health)
  - Alert on subscription failures
- Documentation
  - API documentation for XRPC endpoints
  - Deployment guide
  - Troubleshooting guide (WebSocket connection issues)
Open Questions#
- CAR parsing edge cases: How to handle malformed CAR files or invalid records?
- WebSocket reconnection: What's the optimal backoff strategy for reconnection attempts?
- Subscription management: How many concurrent WebSocket connections can SQLite handle?
- Rate limiting: Should discovery service rate-limit requestCrawl to prevent abuse?
- Authentication: Should requestCrawl require authentication, or remain open?
- Cursor storage: Should cursors be persisted immediately or batched for performance?
- Monitoring: What metrics are most important for operational visibility (active subs, event rate, lag)?
- Error handling: When a WebSocket dies, should we re-backfill via getRepo or trust cursor resume?