# AT Protocol Backfill Service (Python)

This is a Python implementation of the TypeScript backfill service, providing historical data backfilling for the AT Protocol.

## Features

- **Configurable Backfill Duration**
  - A specific number of days (e.g., `BACKFILL_DAYS=7`)
  - Total history with `BACKFILL_DAYS=-1`
  - Disabled with `BACKFILL_DAYS=0` (the default)

- **Resume Support**
  - Saves progress to the database periodically
  - Can resume from the last saved cursor after an interruption

- **Resource Management**
  - Configurable batch processing
  - Memory monitoring and throttling
  - Concurrent processing limits
  - Background/idle processing support

- **Integration**
  - Runs automatically with the unified worker when enabled
  - Only runs on the primary worker (`WORKER_ID=0`)
  - Can also run standalone

## Configuration

All configuration is done through environment variables.

### Core Settings

- `BACKFILL_DAYS`: Number of days to backfill (`0` = disabled, `-1` = total history, `>0` = that many days)
- `RELAY_URL`: AT Protocol relay URL (default: `wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos`)
- `DATABASE_URL`: PostgreSQL connection string
- `WORKER_ID`: Worker ID (backfill only runs on worker 0)

### Resource Throttling

- `BACKFILL_BATCH_SIZE`: Events per batch (default: 5)
- `BACKFILL_BATCH_DELAY_MS`: Delay between batches in milliseconds (default: 2000)
- `BACKFILL_MAX_CONCURRENT`: Maximum concurrent event processing (default: 2)
- `BACKFILL_MAX_MEMORY_MB`: Memory limit in MB (default: 512)
- `BACKFILL_USE_IDLE`: Use idle processing (default: true)

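As a rough illustration of how these variables might be read, here is a minimal sketch. The `BackfillConfig` class and `load_config` helper are hypothetical names; only the variable names and defaults come from the tables above.

```python
import os
from dataclasses import dataclass

@dataclass
class BackfillConfig:
    days: int
    batch_size: int
    batch_delay_ms: int
    max_concurrent: int
    max_memory_mb: int
    use_idle: bool

def load_config() -> BackfillConfig:
    """Read backfill settings from the environment, falling back to the documented defaults."""
    return BackfillConfig(
        days=int(os.getenv("BACKFILL_DAYS", "0")),
        batch_size=int(os.getenv("BACKFILL_BATCH_SIZE", "5")),
        batch_delay_ms=int(os.getenv("BACKFILL_BATCH_DELAY_MS", "2000")),
        max_concurrent=int(os.getenv("BACKFILL_MAX_CONCURRENT", "2")),
        max_memory_mb=int(os.getenv("BACKFILL_MAX_MEMORY_MB", "512")),
        use_idle=os.getenv("BACKFILL_USE_IDLE", "true").lower() == "true",
    )
```
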
## Usage

### Quick Start with Docker Compose (Recommended)

The backfill service is **automatically integrated** into the docker-compose setup. To enable backfill:

1. Set `BACKFILL_DAYS` in your environment:
   ```bash
   export BACKFILL_DAYS=7   # Backfill the last 7 days
   # OR, for all history:
   export BACKFILL_DAYS=-1
   ```

2. Start or restart your services:
   ```bash
   docker-compose up -d
   ```

The `python-backfill-worker` service will automatically:

- Start when `BACKFILL_DAYS` is set to a non-zero value
- Begin processing historical data in the background
- Continue running until all historical data is processed
- Save progress periodically for resume capability

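For reference, the compose service definition might look roughly like the fragment below. This is a sketch, not the project's actual `docker-compose.yml`; the build context, command, and dependency names are assumptions.

```yaml
  python-backfill-worker:
    build: .
    command: python backfill_service.py
    environment:
      BACKFILL_DAYS: ${BACKFILL_DAYS:-0}
      DATABASE_URL: ${DATABASE_URL}
    depends_on:
      - db
    restart: unless-stopped
```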
**Example: backfill the last 30 days at moderate speed**

```bash
export BACKFILL_DAYS=30
export BACKFILL_BATCH_SIZE=20
export BACKFILL_BATCH_DELAY_MS=1000
export BACKFILL_MAX_MEMORY_MB=1024
docker-compose up -d
```

To check backfill progress:

```bash
# View backfill worker logs
docker-compose logs -f python-backfill-worker

# Check progress in the database
docker-compose exec db psql -U postgres -d atproto -c \
  "SELECT * FROM firehose_cursor WHERE service = 'backfill';"
```

### Manual Execution

#### With the Unified Worker

The backfill service automatically starts when:

1. `BACKFILL_DAYS` is set to a non-zero value
2. The worker is the primary worker (`WORKER_ID=0`, or the variable is unset)

```bash
# Backfill the last 7 days
BACKFILL_DAYS=7 python unified_worker.py

# Backfill the entire available history
BACKFILL_DAYS=-1 python unified_worker.py

# With custom resource limits
BACKFILL_DAYS=30 \
BACKFILL_BATCH_SIZE=10 \
BACKFILL_MAX_MEMORY_MB=1024 \
python unified_worker.py
```

#### Standalone Mode

You can also run the backfill service independently:

```bash
# Run standalone backfill
BACKFILL_DAYS=7 python backfill_service.py
```

## Architecture

The backfill service mirrors the TypeScript implementation with these key components:

1. **BackfillService**: Main service class that manages the backfill process
2. **EventProcessor**: Reuses the same event processor as the main worker
3. **Progress Tracking**: Saves the cursor position to the `firehose_cursor` table
4. **Memory Management**: Monitors RSS memory and throttles processing
5. **Batching**: Processes events in configurable batches with delays

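The batching component (item 5) can be sketched as a small `asyncio` loop. This illustrates the pattern, not the actual implementation; `process_in_batches` and `handler` are hypothetical names.

```python
import asyncio

async def process_in_batches(events, handler, batch_size=5,
                             batch_delay_ms=2000, max_concurrent=2):
    """Process events in small batches with a delay between batches.

    `handler` is an async callable applied to each event; concurrency within
    a batch is capped by a semaphore, mirroring BACKFILL_MAX_CONCURRENT.
    """
    sem = asyncio.Semaphore(max_concurrent)
    processed = 0

    async def run(event):
        async with sem:
            await handler(event)

    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        await asyncio.gather(*(run(e) for e in batch))
        processed += len(batch)
        if start + batch_size < len(events):
            # Back off between batches so backfill stays a background task.
            await asyncio.sleep(batch_delay_ms / 1000)
    return processed
```
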
## Differences from the TypeScript Version

While maintaining feature parity, there are some implementation differences:

1. **Memory Monitoring**: Uses `psutil` instead of Node.js's `process.memoryUsage()`
2. **Async Handling**: Uses Python's `asyncio` throughout
3. **Cursor Management**: Manual cursor tracking (a limitation of the Python atproto library)
4. **No Signature Verification**: Signature checks are currently always disabled, for performance

## Progress Tracking

Progress is saved to the `firehose_cursor` table under the service name `backfill`:

```sql
SELECT * FROM firehose_cursor WHERE service = 'backfill';
```

## Performance Considerations

The default settings are deliberately conservative so that backfill runs as a true background task:

- Small batch size (5 events)
- Long delays between batches (2 seconds)
- Low concurrency (2 concurrent operations)
- Modest memory limit (512 MB)

For faster backfilling on dedicated resources, you can raise these limits:

```bash
# Aggressive backfill settings
BACKFILL_BATCH_SIZE=100 \
BACKFILL_BATCH_DELAY_MS=100 \
BACKFILL_MAX_CONCURRENT=10 \
BACKFILL_MAX_MEMORY_MB=2048 \
BACKFILL_DAYS=30 \
python unified_worker.py
```

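The memory limit works by pausing work whenever the process's resident set size crosses the threshold. A self-contained, Linux-only sketch of that check (the real service reads RSS via `psutil`; the function names here are illustrative):

```python
import asyncio
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def rss_mb() -> float:
    """Current resident set size in MB, read from /proc (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * PAGE_SIZE / (1024 * 1024)

async def throttle_on_memory(limit_mb: float, poll_seconds: float = 1.0) -> None:
    """Sleep until RSS drops back under the configured limit."""
    while rss_mb() > limit_mb:
        await asyncio.sleep(poll_seconds)
```
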
## Monitoring

The backfill service logs detailed progress information:

- Events received, processed, and skipped
- Processing rate (events/second)
- Memory usage
- Cursor position

Example log output:

```
[BACKFILL] Progress: 10000 received, 9500 processed, 500 skipped (250 evt/s)
[BACKFILL] Memory: 245MB / 512MB limit
```

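A progress line like the one above can be produced from a handful of counters; a minimal sketch (the `BackfillStats` class is illustrative, not the service's actual API):

```python
import time

class BackfillStats:
    """Tracks the counters behind the progress line shown above."""

    def __init__(self):
        self.received = 0
        self.processed = 0
        self.skipped = 0
        self.started = time.monotonic()

    def rate(self) -> float:
        """Events processed per second since the service started."""
        elapsed = time.monotonic() - self.started
        return self.processed / elapsed if elapsed > 0 else 0.0

    def progress_line(self) -> str:
        return (f"[BACKFILL] Progress: {self.received} received, "
                f"{self.processed} processed, {self.skipped} skipped "
                f"({self.rate():.0f} evt/s)")
```
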
## Error Handling

- **Duplicate Records**: Silently skipped (common during backfill)
- **DID Resolution Timeouts**: Logged, but processing continues
- **Memory Limits**: Processing pauses until memory is freed
- **Fatal Errors**: The service stops and saves progress for resume

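Sketched as code, the per-event error policy might look like this. The exception names are hypothetical stand-ins for whatever the event processor actually raises; fatal errors simply propagate out of the function.

```python
import logging

logger = logging.getLogger("backfill")

# Hypothetical exception types standing in for the real error classes.
class DuplicateRecordError(Exception): ...
class DidResolutionTimeout(Exception): ...

def handle_event(event, processor, stats):
    """Apply the error policy above to one event: skip duplicates silently,
    log DID-resolution timeouts and continue, let anything fatal re-raise."""
    try:
        processor(event)
        stats["processed"] += 1
    except DuplicateRecordError:
        stats["skipped"] += 1  # common during backfill; no log noise
    except DidResolutionTimeout:
        logger.warning("DID resolution timed out for %r; continuing", event)
        stats["skipped"] += 1
```
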
## Database Schema

The service uses the existing `firehose_cursor` table:

```sql
CREATE TABLE firehose_cursor (
    id SERIAL PRIMARY KEY,
    service VARCHAR(255) NOT NULL UNIQUE,
    cursor TEXT,
    last_event_time TIMESTAMP,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);
```
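Saving progress against this table is an upsert keyed on the unique `service` column. A minimal sketch, using `sqlite3` so the example is self-contained (the service itself targets PostgreSQL, where the same `INSERT ... ON CONFLICT` shape applies with `%s` placeholders):

```python
import sqlite3

def save_cursor(conn, service: str, cursor_value: str) -> None:
    """Upsert the latest cursor for a service into firehose_cursor."""
    conn.execute(
        """
        INSERT INTO firehose_cursor (service, cursor, updated_at)
        VALUES (?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT (service) DO UPDATE SET
            cursor = excluded.cursor,
            updated_at = CURRENT_TIMESTAMP
        """,
        (service, cursor_value),
    )
    conn.commit()

def load_cursor(conn, service: str):
    """Return the last saved cursor for a service, or None if none exists."""
    row = conn.execute(
        "SELECT cursor FROM firehose_cursor WHERE service = ?", (service,)
    ).fetchone()
    return row[0] if row else None
```
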