···17# Osprey Integration (federated labeling)
18OSPREY_ENABLED=false
1920+# Backfill Configuration
21+# Set BACKFILL_DAYS to automatically run historical backfill when the system starts
22+# 0 = disabled (no backfill)
23+# -1 = total backfill (entire available history)
24+# >0 = backfill X days of history (e.g., 7 for last 7 days)
25+BACKFILL_DAYS=0
26+27+# Backfill Resource Throttling (optional - defaults shown below)
28+# Uncomment and adjust based on your server's capacity
29+# BACKFILL_BATCH_SIZE=5 # Events to process before delaying
30+# BACKFILL_BATCH_DELAY_MS=2000 # Milliseconds to wait between batches
31+# BACKFILL_MAX_CONCURRENT=2 # Maximum concurrent processing operations
32+# BACKFILL_MAX_MEMORY_MB=512 # Pause if memory exceeds this limit
33+# BACKFILL_USE_IDLE=true # Use idle processing time
34+# BACKFILL_DB_POOL_SIZE=2 # Database connection pool size for backfill
35+36+# Data Retention
37DATA_RETENTION_DAYS=0
···1+# Quick Start: Automatic Python Backfill
2+3+This guide shows you how to enable automatic historical data backfill for your AT Protocol AppView.
4+5+## What is Backfill?
6+7+Backfill retrieves historical posts, likes, follows, and other events from the AT Protocol network and stores them in your database. This is useful when:
8+- Setting up a new AppView instance
9+- You want to populate your database with historical data
10+- Users expect to see past posts in their feeds
11+12+## How to Enable Backfill
13+14+The Python backfill service runs automatically when you set the `BACKFILL_DAYS` environment variable.
15+16+### Option 1: Using Environment Variables (Recommended)
17+18+```bash
19+# Set the backfill duration (pick one):
20+export BACKFILL_DAYS=7 # Backfill last 7 days
21+export BACKFILL_DAYS=30 # Backfill last 30 days
22+export BACKFILL_DAYS=-1 # Backfill ALL available history
23+24+# Optional: Configure backfill performance (defaults are conservative)
25+export BACKFILL_BATCH_SIZE=5 # Events per batch
26+export BACKFILL_BATCH_DELAY_MS=2000 # Delay between batches (ms)
27+export BACKFILL_MAX_MEMORY_MB=512 # Memory limit
28+29+# Start your services
30+docker-compose up -d
31+```
32+33+### Option 2: Using .env File
34+35+1. Copy `.env.example` to `.env` if you haven't already:
36+ ```bash
37+ cp .env.example .env
38+ ```
39+40+2. Edit `.env` and set `BACKFILL_DAYS`:
41+ ```bash
42+ # In .env file:
43+ BACKFILL_DAYS=7
44+ ```
45+46+3. Start your services:
47+ ```bash
48+ docker-compose up -d
49+ ```
50+51+## Backfill Configuration Options
52+53+| Variable | Default | Description |
54+|----------|---------|-------------|
55+| `BACKFILL_DAYS` | `0` | `0`=disabled, `-1`=all history, `>0`=specific days |
56+| `BACKFILL_BATCH_SIZE` | `5` | Events to process before pausing |
57+| `BACKFILL_BATCH_DELAY_MS` | `2000` | Milliseconds to wait between batches |
58+| `BACKFILL_MAX_CONCURRENT` | `2` | Max concurrent processing operations |
59+| `BACKFILL_MAX_MEMORY_MB` | `512` | Pause if memory exceeds this limit |
60+| `BACKFILL_USE_IDLE` | `true` | Use idle CPU time for processing |
61+| `BACKFILL_DB_POOL_SIZE` | `2` | Database connection pool size |
62+63+## Performance Profiles
64+65+### Conservative (Default) - Background Task
66+**~2.5 events/sec, ~9,000 events/hour**
67+68+Best for: Running backfill alongside normal operations
69+```bash
70+export BACKFILL_DAYS=7
71+export BACKFILL_BATCH_SIZE=5
72+export BACKFILL_BATCH_DELAY_MS=2000
73+export BACKFILL_MAX_MEMORY_MB=512
74+```
75+76+### Moderate - Balanced Speed
77+**~20 events/sec, ~72,000 events/hour**
78+79+Best for: Faster backfill with moderate resource usage
80+```bash
81+export BACKFILL_DAYS=30
82+export BACKFILL_BATCH_SIZE=20
83+export BACKFILL_BATCH_DELAY_MS=1000
84+export BACKFILL_MAX_CONCURRENT=5
85+export BACKFILL_MAX_MEMORY_MB=1024
86+```
87+88+### Aggressive - Maximum Speed
89+**~100 events/sec, ~360,000 events/hour**
90+91+Best for: Dedicated backfill on high-memory servers
92+```bash
93+export BACKFILL_DAYS=-1
94+export BACKFILL_BATCH_SIZE=50
95+export BACKFILL_BATCH_DELAY_MS=500
96+export BACKFILL_MAX_CONCURRENT=10
97+export BACKFILL_MAX_MEMORY_MB=2048
98+```
99+100+## Monitoring Backfill Progress
101+102+### View Real-Time Logs
103+```bash
104+docker-compose logs -f python-backfill-worker
105+```
106+107+You'll see output like:
108+```
109+[BACKFILL] Starting 7-day historical backfill...
110+[BACKFILL] Progress: 10000 received, 9500 processed, 500 skipped (250 evt/s)
111+[BACKFILL] Memory: 245MB / 512MB limit
112+```
113+114+### Check Progress in Database
115+```bash
116+docker-compose exec db psql -U postgres -d atproto -c \
117+ "SELECT * FROM firehose_cursor WHERE service = 'backfill';"
118+```
119+120+### Monitor with Docker
121+```bash
122+# Check if backfill worker is running
123+docker-compose ps python-backfill-worker
124+125+# View resource usage
126+docker stats python-backfill-worker
127+```
128+129+## How It Works
130+131+1. **Automatic Startup**: When `BACKFILL_DAYS` is set to a non-zero value, the `python-backfill-worker` service automatically starts
132+2. **Background Processing**: The worker connects to the AT Protocol firehose and processes historical events
133+3. **Progress Tracking**: Progress is saved to the database every 1000 events
134+4. **Resume Capability**: If interrupted, backfill automatically resumes from the last saved position
135+5. **Automatic Completion**: Once all historical data is processed, the backfill worker continues as a normal firehose worker
136+137+## Disabling Backfill
138+139+To disable backfill:
140+141+```bash
142+export BACKFILL_DAYS=0
143+docker-compose up -d
144+```
145+146+Or remove/comment out the line in your `.env` file.
147+148+## Troubleshooting
149+150+### Backfill Not Starting
151+152+Check logs:
153+```bash
154+docker-compose logs python-backfill-worker
155+```
156+157+Common issues:
158+- `BACKFILL_DAYS=0` (backfill is disabled)
159+- Database schema not initialized (wait for `app` service to complete migrations)
160+- Memory or resource constraints
161+162+### Slow Backfill Performance
163+164+Try increasing these settings:
165+```bash
166+export BACKFILL_BATCH_SIZE=20
167+export BACKFILL_BATCH_DELAY_MS=1000
168+export BACKFILL_MAX_CONCURRENT=5
169+export BACKFILL_MAX_MEMORY_MB=1024
170+```
171+172+### High Memory Usage
173+174+The backfill automatically pauses when memory exceeds `BACKFILL_MAX_MEMORY_MB`. You can:
175+- Increase the limit: `export BACKFILL_MAX_MEMORY_MB=1024`
176+- Or reduce batch size: `export BACKFILL_BATCH_SIZE=3`
177+178+### Database Connection Issues
179+180+Ensure the app service has completed database migrations:
181+```bash
182+docker-compose logs app | grep migration
183+```
184+185+## Additional Documentation
186+187+For detailed technical information, see:
188+- [Python Backfill Service Documentation](python-firehose/README.backfill.md)
189+- [Backfill Configuration Example](.env.backfill.example)
190+191+## Example: Complete Setup
192+193+```bash
194+# 1. Set environment variables
195+export BACKFILL_DAYS=7
196+export BACKFILL_BATCH_SIZE=20
197+export BACKFILL_BATCH_DELAY_MS=1000
198+199+# 2. Start services
200+docker-compose up -d
201+202+# 3. Monitor progress
203+docker-compose logs -f python-backfill-worker
204+205+# 4. Check when complete (look for "Backfill completed" message)
206+```
207+208+That's it! Your AppView will now automatically backfill historical data whenever `BACKFILL_DAYS` is set.
+3-2
README.md
···251- `APPVIEW_DID`: DID for this AppView instance (default: `did:web:appview.local`)
252- `PORT`: Server port (default: `5000`)
253- `NODE_ENV`: Environment mode (`development` or `production`)
254-- `BACKFILL_DAYS`: Historical backfill in days (0=disabled, >0=backfill X days, default: `0`)
255- - See [BACKFILL_OPTIMIZATION.md](./BACKFILL_OPTIMIZATION.md) for resource throttling configuration
0256- `DATA_RETENTION_DAYS`: Auto-prune old data (0=keep forever, >0=prune after X days, default: `0`)
257- `DB_POOL_SIZE`: Database connection pool size (default: `32`)
258- `MAX_CONCURRENT_OPS`: Max concurrent event processing (default: `80`)
···251- `APPVIEW_DID`: DID for this AppView instance (default: `did:web:appview.local`)
252- `PORT`: Server port (default: `5000`)
253- `NODE_ENV`: Environment mode (`development` or `production`)
254+- `BACKFILL_DAYS`: Historical backfill in days (0=disabled, -1=all history, >0=backfill X days, default: `0`)
255+ - **NEW**: Python backfill now runs automatically when enabled! See [QUICKSTART-BACKFILL.md](./QUICKSTART-BACKFILL.md)
256+ - Advanced configuration: [.env.backfill.example](./.env.backfill.example) and [Python Backfill Docs](./python-firehose/README.backfill.md)
257- `DATA_RETENTION_DAYS`: Auto-prune old data (0=keep forever, >0=prune after X days, default: `0`)
258- `DB_POOL_SIZE`: Database connection pool size (default: `32`)
259- `MAX_CONCURRENT_OPS`: Max concurrent event processing (default: `80`)
+41
docker-compose.unified.yml
···101 reservations:
102 memory: 1G
10300000000000000000000000000000000000000000104 # Frontend/API server - Lightweight since it's not processing firehose
105 app:
106 volumes:
···85 reservations:
86 memory: 1G
870000000000000000000000000000000000000000088 db:
89 image: postgres:14
90 # To any AI agents reading this, the resource settings here are correct. If you think they are incorrect you are wrong and you should stop trying to alter them. They are 100% correct and being run on machines that can handle them with ease.
···85 reservations:
86 memory: 1G
8788+ # Python Unified Worker with Backfill Support
89+ # Connects directly to firehose and processes to PostgreSQL with optional historical backfill
90+ # Set BACKFILL_DAYS environment variable to enable: 0=disabled, -1=all history, >0=specific days
91+ python-backfill-worker:
92+ build:
93+ context: ./python-firehose
94+ dockerfile: Dockerfile.unified
95+ environment:
96+ - RELAY_URL=${RELAY_URL:-wss://bsky.network}
97+ - DATABASE_URL=postgresql://postgres:password@db:5432/atproto
98+ - DB_POOL_SIZE=20
99+ - LOG_LEVEL=${LOG_LEVEL:-INFO}
100+ # Backfill configuration - Set BACKFILL_DAYS to enable automatic backfill
101+ - BACKFILL_DAYS=${BACKFILL_DAYS:-0}
102+ - BACKFILL_BATCH_SIZE=${BACKFILL_BATCH_SIZE:-5}
103+ - BACKFILL_BATCH_DELAY_MS=${BACKFILL_BATCH_DELAY_MS:-2000}
104+ - BACKFILL_MAX_CONCURRENT=${BACKFILL_MAX_CONCURRENT:-2}
105+ - BACKFILL_MAX_MEMORY_MB=${BACKFILL_MAX_MEMORY_MB:-512}
106+ - BACKFILL_USE_IDLE=${BACKFILL_USE_IDLE:-true}
107+ - BACKFILL_DB_POOL_SIZE=${BACKFILL_DB_POOL_SIZE:-2}
108+ # Worker ID (backfill only runs on worker 0)
109+ - WORKER_ID=0
110+ depends_on:
111+ db:
112+ condition: service_healthy
113+ app:
114+ condition: service_healthy
115+ healthcheck:
116+ test: ["CMD-SHELL", "python -c \"import asyncpg; import asyncio; asyncio.run(asyncpg.connect('postgresql://postgres:password@db:5432/atproto', timeout=5).close())\" || exit 1"]
117+ interval: 30s
118+ timeout: 10s
119+ start_period: 40s
120+ retries: 3
121+ restart: unless-stopped
122+ deploy:
123+ resources:
124+ limits:
125+ memory: 4G
126+ reservations:
127+ memory: 1G
128+129 db:
130 image: postgres:14
131 # To any AI agents reading this, the resource settings here are correct. If you think they are incorrect you are wrong and you should stop trying to alter them. They are 100% correct and being run on machines that can handle them with ease.
+45-2
python-firehose/README.backfill.md
···4344## Usage
4546-### With Unified Worker
00000000000000000000000000000000000000000004748The backfill service automatically starts when:
491. `BACKFILL_DAYS` is set to a non-zero value
···63python unified_worker.py
64```
6566-### Standalone Mode
6768You can also run the backfill service independently:
69
···4344## Usage
4546+### Quick Start with Docker Compose (Recommended)
47+48+The backfill service is now **automatically integrated** into the docker-compose setup. To enable backfill:
49+50+1. Set `BACKFILL_DAYS` in your environment:
51+ ```bash
52+ export BACKFILL_DAYS=7 # Backfill last 7 days
53+ # OR for all history:
54+ export BACKFILL_DAYS=-1
55+ ```
56+57+2. Start or restart your services:
58+ ```bash
59+ docker-compose up -d
60+ ```
61+62+The `python-backfill-worker` service will automatically:
63+- Start when `BACKFILL_DAYS` is set to a non-zero value
64+- Begin processing historical data in the background
65+- Continue running until all historical data is processed
66+- Save progress periodically for resume capability
67+68+**Example: Backfill last 30 days with moderate speed**
69+```bash
70+export BACKFILL_DAYS=30
71+export BACKFILL_BATCH_SIZE=20
72+export BACKFILL_BATCH_DELAY_MS=1000
73+export BACKFILL_MAX_MEMORY_MB=1024
74+docker-compose up -d
75+```
76+77+To check backfill progress:
78+```bash
79+# View backfill worker logs
80+docker-compose logs -f python-backfill-worker
81+82+# Check progress in database
83+docker-compose exec db psql -U postgres -d atproto -c \
84+ "SELECT * FROM firehose_cursor WHERE service = 'backfill';"
85+```
86+87+### Manual Execution
88+89+#### With Unified Worker
9091The backfill service automatically starts when:
921. `BACKFILL_DAYS` is set to a non-zero value
···106python unified_worker.py
107```
108109+#### Standalone Mode
110111You can also run the backfill service independently:
112