AT Protocol Backfill Service (Python)#
This is a Python implementation of the TypeScript backfill service, providing historical data backfilling capabilities for the AT Protocol.
Features#
- Configurable Backfill Duration
  - Specific number of days (e.g., BACKFILL_DAYS=7)
  - Total history backfill with BACKFILL_DAYS=-1
  - Disabled with BACKFILL_DAYS=0 (default)
- Resume Support
  - Saves progress to database periodically
  - Can resume from last saved cursor after interruption
- Resource Management
  - Configurable batch processing
  - Memory monitoring and throttling
  - Concurrent processing limits
  - Background/idle processing support
- Integration
  - Runs automatically with unified worker when enabled
  - Only runs on primary worker (WORKER_ID=0)
  - Can also run standalone
Configuration#
All configuration is done through environment variables:
Core Settings#
- BACKFILL_DAYS: Number of days to backfill (0=disabled, -1=total history, >0=specific days)
- RELAY_URL: AT Protocol relay URL (default: wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos)
- DATABASE_URL: PostgreSQL connection string
- WORKER_ID: Worker ID (backfill only runs on worker 0)
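The three BACKFILL_DAYS modes can be read as a mapping from the setting to a cutoff timestamp. The sketch below is illustrative only: backfill_cutoff is a hypothetical helper name, not a function taken from this service, and the real service may derive its cutoff differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backfill_cutoff(days: int, now: datetime) -> Optional[datetime]:
    """Return the oldest event time to keep, or None for "no lower bound".

    Hypothetical helper. Raises ValueError when backfill is disabled (0)
    or the value is not one of the documented modes.
    """
    if days == -1:
        return None  # total history: no cutoff
    if days > 0:
        return now - timedelta(days=days)
    raise ValueError(f"backfill disabled or invalid BACKFILL_DAYS={days}")


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
print(backfill_cutoff(7, now))   # 2024-01-24 00:00:00+00:00
print(backfill_cutoff(-1, now))  # None
```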
Resource Throttling#
- BACKFILL_BATCH_SIZE: Events per batch (default: 5)
- BACKFILL_BATCH_DELAY_MS: Delay between batches in milliseconds (default: 2000)
- BACKFILL_MAX_CONCURRENT: Maximum concurrent event processing (default: 2)
- BACKFILL_MAX_MEMORY_MB: Memory limit in MB (default: 512)
- BACKFILL_USE_IDLE: Use idle processing (default: true)
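As a rough illustration of how these variables and their defaults fit together, the following sketch loads them into a small dataclass. ThrottleConfig and from_env are hypothetical names, and the boolean parsing rule (anything other than "false" counts as true) is an assumption rather than the service's documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ThrottleConfig:
    """Hypothetical container for the resource throttling settings."""

    batch_size: int = 5
    batch_delay_ms: int = 2000
    max_concurrent: int = 2
    max_memory_mb: int = 512
    use_idle: bool = True

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ThrottleConfig":
        def as_int(name: str, default: int) -> int:
            # Missing variables fall back to the documented default.
            return int(env.get(name, default))

        return cls(
            batch_size=as_int("BACKFILL_BATCH_SIZE", 5),
            batch_delay_ms=as_int("BACKFILL_BATCH_DELAY_MS", 2000),
            max_concurrent=as_int("BACKFILL_MAX_CONCURRENT", 2),
            max_memory_mb=as_int("BACKFILL_MAX_MEMORY_MB", 512),
            # Assumption: only the literal "false" (any case) disables idle mode.
            use_idle=env.get("BACKFILL_USE_IDLE", "true").lower() != "false",
        )
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests, for example ThrottleConfig.from_env({"BACKFILL_BATCH_SIZE": "20"}).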
Usage#
Quick Start with Docker Compose (Recommended)#
The backfill service is now automatically integrated into the docker-compose setup. To enable backfill:
- Set BACKFILL_DAYS in your environment:
  export BACKFILL_DAYS=7 # Backfill last 7 days
  # OR for all history:
  export BACKFILL_DAYS=-1
- Start or restart your services:
  docker-compose up -d
The python-backfill-worker service will automatically:
- Start when BACKFILL_DAYS is set to a non-zero value
- Begin processing historical data in the background
- Continue running until all historical data is processed
- Save progress periodically for resume capability
Example: Backfill last 30 days with moderate speed
export BACKFILL_DAYS=30
export BACKFILL_BATCH_SIZE=20
export BACKFILL_BATCH_DELAY_MS=1000
export BACKFILL_MAX_MEMORY_MB=1024
docker-compose up -d
To check backfill progress:
# View backfill worker logs
docker-compose logs -f python-backfill-worker
# Check progress in database
docker-compose exec db psql -U postgres -d atproto -c \
"SELECT * FROM firehose_cursor WHERE service = 'backfill';"
Manual Execution#
With Unified Worker#
The backfill service automatically starts when:
- BACKFILL_DAYS is set to a non-zero value
- The worker is the primary worker (WORKER_ID=0 or not set)
# Backfill last 7 days
BACKFILL_DAYS=7 python unified_worker.py
# Backfill entire available history
BACKFILL_DAYS=-1 python unified_worker.py
# With custom resource limits
BACKFILL_DAYS=30 \
BACKFILL_BATCH_SIZE=10 \
BACKFILL_MAX_MEMORY_MB=1024 \
python unified_worker.py
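The two start conditions for the unified worker can be expressed as a single check. This is a minimal sketch under the assumption that both variables are plain integers; should_start_backfill is a hypothetical name and is not taken from unified_worker.py.

```python
from typing import Mapping


def should_start_backfill(env: Mapping[str, str]) -> bool:
    """Hypothetical check mirroring the two documented start conditions."""
    days = int(env.get("BACKFILL_DAYS", "0"))   # unset means disabled
    worker_id = int(env.get("WORKER_ID", "0"))  # unset means primary worker
    return days != 0 and worker_id == 0


print(should_start_backfill({"BACKFILL_DAYS": "7"}))                     # True
print(should_start_backfill({"BACKFILL_DAYS": "7", "WORKER_ID": "1"}))   # False
print(should_start_backfill({}))                                         # False
```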
Standalone Mode#
You can also run the backfill service independently:
# Run standalone backfill
BACKFILL_DAYS=7 python backfill_service.py
Architecture#
The backfill service mirrors the TypeScript implementation with these key components:
- BackfillService: Main service class that manages the backfill process
- EventProcessor: Reuses the same event processor as the main worker
- Progress Tracking: Saves cursor position to firehose_cursor table
- Memory Management: Monitors RSS memory and throttles processing
- Batching: Processes events in configurable batches with delays
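To make the batching and concurrency components more concrete, here is a minimal asyncio sketch of batch processing with a delay between batches and a cap on concurrent handlers. It is not the BackfillService implementation: process_in_batches and handler are hypothetical names, and events are assumed to be plain dicts.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def process_in_batches(
    events: Iterable[dict],
    handler: Callable[[dict], Awaitable[None]],
    batch_size: int = 5,
    batch_delay_ms: int = 2000,
    max_concurrent: int = 2,
) -> int:
    """Process events batch by batch; return the number of events handled."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(event: dict) -> None:
        # At most max_concurrent handlers run at the same time.
        async with semaphore:
            await handler(event)

    async def flush(batch: List[dict]) -> int:
        await asyncio.gather(*(guarded(e) for e in batch))
        return len(batch)

    processed = 0
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            processed += await flush(batch)
            batch = []
            # Pause between batches so the backfill stays a background task.
            await asyncio.sleep(batch_delay_ms / 1000)
    if batch:
        processed += await flush(batch)  # final partial batch
    return processed
```

A caller would run it with something like asyncio.run(process_in_batches(events, handle_event)), where handle_event is whatever coroutine processes a single event.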
Differences from TypeScript Version#
The Python version aims for feature parity with the TypeScript version, with some implementation differences:
- Memory Monitoring: Uses psutil instead of Node.js process.memoryUsage()
- Async Handling: Uses Python's asyncio throughout
- Cursor Management: Manual cursor tracking (Python atproto library limitation)
- No Signature Verification: Currently always disabled for performance
Progress Tracking#
Progress is saved to the firehose_cursor table with service name "backfill":
SELECT * FROM firehose_cursor WHERE service = 'backfill';
Performance Considerations#
The default settings are very conservative to ensure backfill runs as a true background task:
- Small batch size (5 events)
- Long delays between batches (2 seconds)
- Low concurrency (2 concurrent operations)
- Memory limit (512MB)
For faster backfilling on dedicated resources, you can increase these limits:
# Aggressive backfill settings
BACKFILL_BATCH_SIZE=100 \
BACKFILL_BATCH_DELAY_MS=100 \
BACKFILL_MAX_CONCURRENT=10 \
BACKFILL_MAX_MEMORY_MB=2048 \
BACKFILL_DAYS=30 \
python unified_worker.py
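A quick calculation shows how much the batch settings alone change the throughput ceiling. The sketch assumes the service handles one batch and then waits the full delay, and it ignores processing time, concurrency limits, and memory pauses, so real rates will be lower. max_events_per_second is a hypothetical helper name.

```python
def max_events_per_second(batch_size: int, batch_delay_ms: int) -> float:
    """Upper bound on throughput if processing itself took no time."""
    if batch_delay_ms <= 0:
        return float("inf")  # no delay means no delay-imposed ceiling
    return batch_size * 1000 / batch_delay_ms


print(max_events_per_second(5, 2000))   # 2.5    (default settings)
print(max_events_per_second(100, 100))  # 1000.0 (aggressive settings)
```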
Monitoring#
The backfill service logs detailed progress information:
- Events received, processed, and skipped
- Processing rate (events/second)
- Memory usage
- Cursor position
Example log output:
[BACKFILL] Progress: 10000 received, 9500 processed, 500 skipped (250 evt/s)
[BACKFILL] Memory: 245MB / 512MB limit
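For external monitoring, progress lines in this shape can be parsed with a regular expression. The sketch assumes real log lines match the example above exactly, which may not hold for every version of the service; parse_progress is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Pattern derived from the example progress line; an assumption about the real format.
PROGRESS_RE = re.compile(
    r"\[BACKFILL\] Progress: (?P<received>\d+) received, "
    r"(?P<processed>\d+) processed, (?P<skipped>\d+) skipped "
    r"\((?P<rate>\d+(?:\.\d+)?) evt/s\)"
)


def parse_progress(line: str) -> Optional[Dict[str, float]]:
    """Return the counters from a progress line, or None if it does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}


line = "[BACKFILL] Progress: 10000 received, 9500 processed, 500 skipped (250 evt/s)"
print(parse_progress(line))
```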
Error Handling#
- Duplicate Records: Silently skipped (common during backfill)
- DID Resolution Timeouts: Logged but processing continues
- Memory Limits: Processing pauses until memory is freed
- Fatal Errors: Service stops and saves progress for resume
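The memory-limit behavior can be pictured as a polling loop that holds processing until RSS falls back under the limit. This is a sketch under stated assumptions, not the service's code: wait_for_memory and rss_mb are hypothetical names, the one-second poll interval is arbitrary, and the reader function is injectable so the loop can be tested without psutil installed.

```python
import asyncio
from typing import Callable


def rss_mb() -> float:
    """Resident set size of this process in MB (requires psutil)."""
    import psutil  # third-party; imported lazily so the sketch loads without it

    return psutil.Process().memory_info().rss / (1024 * 1024)


async def wait_for_memory(
    limit_mb: float,
    read_mb: Callable[[], float] = rss_mb,
    poll_seconds: float = 1.0,
) -> int:
    """Sleep while usage is at or above limit_mb; return the number of waits."""
    waits = 0
    while read_mb() >= limit_mb:
        waits += 1
        await asyncio.sleep(poll_seconds)
    return waits
```

A batch loop could call await wait_for_memory(512) before each batch, which pauses processing without dropping events.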
Database Schema#
The service uses the existing firehose_cursor table:
CREATE TABLE firehose_cursor (
id SERIAL PRIMARY KEY,
service VARCHAR(255) NOT NULL UNIQUE,
cursor TEXT,
last_event_time TIMESTAMP,
updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);
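Because service is UNIQUE, saving progress can be written as a single upsert. The sketch below assumes a DB-API cursor that uses the %s parameter style (as psycopg does); the service's actual database layer is not shown in this document, and save_cursor and load_cursor are hypothetical names.

```python
from typing import Any, Optional

# Insert the row on first save, update it on every later save.
SAVE_CURSOR_SQL = """
INSERT INTO firehose_cursor (service, cursor, last_event_time, updated_at)
VALUES (%s, %s, %s, NOW())
ON CONFLICT (service) DO UPDATE
SET cursor = EXCLUDED.cursor,
    last_event_time = EXCLUDED.last_event_time,
    updated_at = NOW()
"""

LOAD_CURSOR_SQL = "SELECT cursor FROM firehose_cursor WHERE service = %s"


def save_cursor(
    db_cursor: Any, cursor: str, last_event_time: Any, service: str = "backfill"
) -> None:
    """Persist the current position so an interrupted run can resume."""
    db_cursor.execute(SAVE_CURSOR_SQL, (service, cursor, last_event_time))


def load_cursor(db_cursor: Any, service: str = "backfill") -> Optional[str]:
    """Return the last saved position, or None if nothing was saved yet."""
    db_cursor.execute(LOAD_CURSOR_SQL, (service,))
    row = db_cursor.fetchone()
    return row[0] if row else None
```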