A third-party ATProto AppView
# AT Protocol Backfill Service (Python)

This is a Python implementation of the TypeScript backfill service, providing historical data backfilling capabilities for the AT Protocol.

## Features

- **Configurable Backfill Duration**
  - Specific number of days (e.g., `BACKFILL_DAYS=7`)
  - Total history backfill with `BACKFILL_DAYS=-1`
  - Disabled with `BACKFILL_DAYS=0` (default)

- **Resume Support**
  - Saves progress to database periodically
  - Can resume from last saved cursor after interruption

- **Resource Management**
  - Configurable batch processing
  - Memory monitoring and throttling
  - Concurrent processing limits
  - Background/idle processing support

- **Integration**
  - Runs automatically with unified worker when enabled
  - Only runs on primary worker (`WORKER_ID=0`)
  - Can also run standalone

## Configuration

All configuration is done through environment variables:

### Core Settings

- `BACKFILL_DAYS`: Number of days to backfill (`0` = disabled, `-1` = total history, `>0` = specific number of days)
- `RELAY_URL`: AT Protocol relay URL (default: `wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos`)
- `DATABASE_URL`: PostgreSQL connection string
- `WORKER_ID`: Worker ID (backfill only runs on worker 0)

### Resource Throttling

- `BACKFILL_BATCH_SIZE`: Events per batch (default: 5)
- `BACKFILL_BATCH_DELAY_MS`: Delay between batches in milliseconds (default: 2000)
- `BACKFILL_MAX_CONCURRENT`: Maximum concurrent event processing (default: 2)
- `BACKFILL_MAX_MEMORY_MB`: Memory limit in MB (default: 512)
- `BACKFILL_USE_IDLE`: Use idle processing (default: true)
## Usage

### Quick Start with Docker Compose (Recommended)

The backfill service is **automatically integrated** into the docker-compose setup. To enable backfill:

1. Set `BACKFILL_DAYS` in your environment:

   ```bash
   export BACKFILL_DAYS=7   # Backfill last 7 days
   # OR for all history:
   export BACKFILL_DAYS=-1
   ```

2. Start or restart your services:

   ```bash
   docker-compose up -d
   ```

The `python-backfill-worker` service will automatically:

- Start when `BACKFILL_DAYS` is set to a non-zero value
- Begin processing historical data in the background
- Continue running until all historical data is processed
- Save progress periodically for resume capability

**Example: Backfill last 30 days with moderate speed**

```bash
export BACKFILL_DAYS=30
export BACKFILL_BATCH_SIZE=20
export BACKFILL_BATCH_DELAY_MS=1000
export BACKFILL_MAX_MEMORY_MB=1024
docker-compose up -d
```

To check backfill progress:

```bash
# View backfill worker logs
docker-compose logs -f python-backfill-worker

# Check progress in database
docker-compose exec db psql -U postgres -d atproto -c \
  "SELECT * FROM firehose_cursor WHERE service = 'backfill';"
```

### Manual Execution

#### With Unified Worker

The backfill service automatically starts when:

1. `BACKFILL_DAYS` is set to a non-zero value
2. The worker is the primary worker (`WORKER_ID=0` or not set)

```bash
# Backfill last 7 days
BACKFILL_DAYS=7 python unified_worker.py

# Backfill entire available history
BACKFILL_DAYS=-1 python unified_worker.py

# With custom resource limits
BACKFILL_DAYS=30 \
BACKFILL_BATCH_SIZE=10 \
BACKFILL_MAX_MEMORY_MB=1024 \
python unified_worker.py
```

#### Standalone Mode

You can also run the backfill service independently:

```bash
# Run standalone backfill
BACKFILL_DAYS=7 python backfill_service.py
```
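The throttling knobs used above (`BACKFILL_BATCH_SIZE`, `BACKFILL_BATCH_DELAY_MS`, `BACKFILL_MAX_CONCURRENT`) correspond to a standard asyncio batching pattern. The function below is a sketch of that pattern, not the service's actual code:

```python
import asyncio

async def process_in_batches(events, handler, batch_size=5,
                             batch_delay_ms=2000, max_concurrent=2):
    """Illustrative throttled batch loop: events are handled in small
    batches, with a delay between batches and a semaphore capping
    concurrent work (defaults mirror the documented settings)."""
    sem = asyncio.Semaphore(max_concurrent)

    async def handle_one(event):
        async with sem:                       # cap concurrency
            await handler(event)

    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        await asyncio.gather(*(handle_one(e) for e in batch))
        if start + batch_size < len(events):  # no delay after the last batch
            await asyncio.sleep(batch_delay_ms / 1000)
```

With the conservative defaults, this works out to roughly 2.5 events per second, which is why the aggressive settings further down raise the batch size and shrink the delay.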
## Architecture

The backfill service mirrors the TypeScript implementation with these key components:

1. **BackfillService**: Main service class that manages the backfill process
2. **EventProcessor**: Reuses the same event processor as the main worker
3. **Progress Tracking**: Saves cursor position to the `firehose_cursor` table
4. **Memory Management**: Monitors RSS memory and throttles processing
5. **Batching**: Processes events in configurable batches with delays

## Differences from TypeScript Version

While maintaining feature parity, there are some implementation differences:

1. **Memory Monitoring**: Uses `psutil` instead of Node.js `process.memoryUsage()`
2. **Async Handling**: Uses Python's `asyncio` throughout
3. **Cursor Management**: Manual cursor tracking (a limitation of the Python atproto library)
4. **No Signature Verification**: Currently always disabled for performance

## Progress Tracking

Progress is saved to the `firehose_cursor` table under the service name `backfill`:

```sql
SELECT * FROM firehose_cursor WHERE service = 'backfill';
```

## Performance Considerations

The default settings are very conservative to ensure backfill runs as a true background task:

- Small batch size (5 events)
- Long delays between batches (2 seconds)
- Low concurrency (2 concurrent operations)
- Memory limit (512 MB)

For faster backfilling on dedicated resources, you can increase these limits:

```bash
# Aggressive backfill settings
BACKFILL_BATCH_SIZE=100 \
BACKFILL_BATCH_DELAY_MS=100 \
BACKFILL_MAX_CONCURRENT=10 \
BACKFILL_MAX_MEMORY_MB=2048 \
BACKFILL_DAYS=30 \
python unified_worker.py
```

## Monitoring

The backfill service logs detailed progress information:

- Events received, processed, and skipped
- Processing rate (events/second)
- Memory usage
- Cursor position

Example log output:

```
[BACKFILL] Progress: 10000 received, 9500 processed, 500 skipped (250 evt/s)
[BACKFILL] Memory: 245MB / 512MB limit
```
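The memory figure in the log line above comes from RSS monitoring via `psutil`, as noted under "Differences from TypeScript Version". A minimal sketch of the pattern (function names here are illustrative, not the service's actual API):

```python
import asyncio
import psutil  # the Python port uses psutil in place of process.memoryUsage()

def rss_mb() -> float:
    """Resident set size of the current process in MB, as reported by psutil."""
    return psutil.Process().memory_info().rss / (1024 * 1024)

async def throttle_on_memory(limit_mb: float, measure=rss_mb,
                             poll_s: float = 1.0) -> None:
    """Pause until measured RSS drops below the limit
    (BACKFILL_MAX_MEMORY_MB); called between batches."""
    while measure() >= limit_mb:
        await asyncio.sleep(poll_s)
```

Polling between batches (rather than on every event) keeps the overhead of the memory check negligible.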
## Error Handling

- **Duplicate Records**: Silently skipped (common during backfill)
- **DID Resolution Timeouts**: Logged but processing continues
- **Memory Limits**: Processing pauses until memory is freed
- **Fatal Errors**: Service stops and saves progress for resume

## Database Schema

The service uses the existing `firehose_cursor` table:

```sql
CREATE TABLE firehose_cursor (
    id SERIAL PRIMARY KEY,
    service VARCHAR(255) NOT NULL UNIQUE,
    cursor TEXT,
    last_event_time TIMESTAMP,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);
```
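Because `service` is `UNIQUE`, saving the cursor is naturally an upsert keyed on that column. A sketch of the pattern below uses `sqlite3` purely to keep the example self-contained; production uses PostgreSQL (with `%s`/`$1` placeholders instead of `?`), but the `INSERT ... ON CONFLICT` semantics are the same. The function name is illustrative:

```python
import sqlite3

# Upsert keyed on the UNIQUE service column: insert a new row for this
# service, or update the cursor of the existing one.
UPSERT_CURSOR = """
    INSERT INTO firehose_cursor (service, cursor, updated_at)
    VALUES (?, ?, CURRENT_TIMESTAMP)
    ON CONFLICT (service) DO UPDATE SET
        cursor = excluded.cursor,
        updated_at = CURRENT_TIMESTAMP
"""

def save_cursor(conn, service: str, cursor_value: str) -> None:
    """Persist the current backfill position; repeated calls for the same
    service overwrite the previous cursor rather than adding rows."""
    conn.execute(UPSERT_CURSOR, (service, cursor_value))
    conn.commit()
```

This is what makes resume-after-interruption cheap: restarting the service only needs one `SELECT` on the `backfill` row to pick up where it left off.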