···47|----------|---------|-------------|
48| `APP_NAME` | `leaflet-search` | name shown in startup logs |
49| `DASHBOARD_URL` | `https://pub-search.waow.tech/dashboard.html` | redirect target for `/dashboard` |
50-| `TAP_HOST` | `leaflet-search-tap.fly.dev` | TAP websocket host |
51-| `TAP_PORT` | `443` | TAP websocket port |
52| `PORT` | `3000` | HTTP server port |
53| `TURSO_URL` | - | Turso database URL (required) |
54| `TURSO_TOKEN` | - | Turso auth token (required) |
···61- [Fly.io](https://fly.io) hosts backend + tap
62- [Turso](https://turso.tech) cloud SQLite with vector support
63- [Voyage AI](https://voyageai.com) embeddings (voyage-3-lite)
64-- [Tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
65- [Zig](https://ziglang.org) HTTP server, search API, content indexing
66- [Cloudflare Pages](https://pages.cloudflare.com) static frontend
67
···47|----------|---------|-------------|
48| `APP_NAME` | `leaflet-search` | name shown in startup logs |
49| `DASHBOARD_URL` | `https://pub-search.waow.tech/dashboard.html` | redirect target for `/dashboard` |
50+| `TAP_HOST` | `leaflet-search-tap.fly.dev` | tap websocket host |
51+| `TAP_PORT` | `443` | tap websocket port |
52| `PORT` | `3000` | HTTP server port |
53| `TURSO_URL` | - | Turso database URL (required) |
54| `TURSO_TOKEN` | - | Turso auth token (required) |
···61- [Fly.io](https://fly.io) hosts backend + tap
62- [Turso](https://turso.tech) cloud SQLite with vector support
63- [Voyage AI](https://voyageai.com) embeddings (voyage-3-lite)
64+- [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
65- [Zig](https://ziglang.org) HTTP server, search API, content indexing
66- [Cloudflare Pages](https://pages.cloudflare.com) static frontend
67
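a minimal sketch of how a process could resolve the environment variables from the table earlier in this hunk. it is written in python purely for illustration (the backend itself is Zig), covers only the rows visible here, and `load_config` plus the returned dict shape are hypothetical names, not part of the project.

```python
import os

# defaults copied from the visible table rows; None marks a required variable
DEFAULTS = {
    "APP_NAME": "leaflet-search",
    "DASHBOARD_URL": "https://pub-search.waow.tech/dashboard.html",
    "TAP_HOST": "leaflet-search-tap.fly.dev",
    "TAP_PORT": "443",
    "PORT": "3000",
    "TURSO_URL": None,
    "TURSO_TOKEN": None,
}


def load_config(env=None):
    """Resolve each variable from env, falling back to the table defaults."""
    env = os.environ if env is None else env
    config = {}
    missing = []
    for name, default in DEFAULTS.items():
        # an empty string is treated the same as an unset variable
        value = env.get(name) or default
        if value is None:
            missing.append(name)
        config[name] = value
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    # ports are numeric, so convert them once here
    config["TAP_PORT"] = int(config["TAP_PORT"])
    config["PORT"] = int(config["PORT"])
    return config
```

failing fast on `TURSO_URL` and `TURSO_TOKEN` mirrors the "(required)" notes in the table; everything else falls back silently.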
+3-3
docs/standard-search-planning.md
···221- keep existing block parser for `pub.leaflet.*`
222- platform detection from `content.$type`
223224-### PR3: TAP subscriber for site.standard.document
225- subscribe to `site.standard.document` + `site.standard.publication`
226- route to appropriate extractor
227- starts ingesting pckt.blog content
···254 2. ~~find and examine offprint records~~ (done - no public content yet)
255 3. ~~PR1: database schema~~ (merged)
256 4. PR2: generalized content extraction
257-5. PR3: TAP subscriber
258 6. PR4: API platform filter
259 7. consider witness cache architecture (see below)
260···275### current leaflet-search architecture (no witness cache)
276277```
278-Firehose → TAP → Parse & Transform → Store DERIVED data → Discard raw record
279```
280281we store:
···221- keep existing block parser for `pub.leaflet.*`
222- platform detection from `content.$type`
223224+### PR3: tap subscriber for site.standard.document
225- subscribe to `site.standard.document` + `site.standard.publication`
226- route to appropriate extractor
227- starts ingesting pckt.blog content
···254 2. ~~find and examine offprint records~~ (done - no public content yet)
255 3. ~~PR1: database schema~~ (merged)
256 4. PR2: generalized content extraction
257+5. PR3: tap subscriber
258 6. PR4: API platform filter
259 7. consider witness cache architecture (see below)
260···275### current leaflet-search architecture (no witness cache)
276277```
278+Firehose → tap → Parse & Transform → Store DERIVED data → Discard raw record
279```
280281we store:
+27-27
docs/tap.md
···1# tap (firehose sync)
23-leaflet-search uses [TAP](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose.
45## what is tap?
67tap subscribes to the ATProto firehose, filters for specific collections (e.g., `pub.leaflet.document`), and broadcasts matching events to websocket clients. it also does initial crawling/backfilling of existing records.
89-key behavior: **TAP backfills historical data when repos are added**. when a repo is added to tracking:
10-1. TAP fetches the full repo from the account's PDS using `com.atproto.sync.getRepo`
112. live firehose events during backfill are buffered in memory
123. historical events (marked `live: false`) are delivered first
134. after historical events complete, buffered live events are released
145. subsequent firehose events arrive immediately marked as `live: true`
1516-TAP enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first.
1718## message format
1920-TAP sends JSON messages over websocket. record events look like:
2122```json
23{
···4647## gotchas
4849-1. **action is a string, not an enum** - TAP sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison.
5051-2. **collection filters apply to output** - `TAP_COLLECTION_FILTERS` controls which records TAP sends to clients. records from other collections are fetched but not forwarded.
52533. **signal collection vs collection filters** - `TAP_SIGNAL_COLLECTION` controls auto-discovery of repos (which repos to track), while `TAP_COLLECTION_FILTERS` controls which records from those repos to output. a repo must either be auto-discovered via signal collection OR manually added via `/repos/add`.
54···6566## memory and performance tuning
6768-TAP loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine.
6970### recommended settings for leaflet-search
71···87- **lower firehose/outbox**: we track ~1000 repos, not millions - defaults are overkill
88- **smaller ident cache**: we don't need 2M cached identities
8990-if TAP keeps OOM'ing, check logs for large repo resyncs:
91```bash
92fly logs -a leaflet-search-tap | grep "parsing repo CAR" | grep -E "size\":[0-9]{8,}"
93```
···99just check
100```
101102-shows TAP machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts.
103104example output:
105```
106-=== TAP Status ===
107app 781417db604d48 23 ewr started ...
108109=== Recent Indexing Activity ===
···117...
118```
119120-if "Last indexed" is more than a day behind "Today", TAP may be down or catching up.
121122## checking catch-up progress
123124-when TAP restarts after downtime, it replays the firehose from its saved cursor. to check progress:
125126```bash
127# see current firehose position (look for timestamps in log messages)
128fly logs -a leaflet-search-tap | grep -E '"time".*"seq"' | tail -3
129```
130131-the `"time"` field in log messages shows how far behind TAP is. compare to current time to estimate catch-up.
132133catch-up speed varies:
134- **~0.3x** when resync queue is full (large repos being fetched)
···143144look for:
145- `"connected to firehose"` - successfully connected to bsky relay
146-- `"websocket connected"` - backend connected to TAP
147- `"dialing failed"` / `"i/o timeout"` - network issues
148149### check backend is receiving
···152```
153154look for:
155-- `tap connected!` - connected to TAP
156- `tap: msg_type=record` - receiving messages
157- `indexed document:` - successfully processing
158···160161| symptom | cause | fix |
162|---------|-------|-----|
163-| TAP machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 |
164-| `websocket handshake failed: error.Timeout` | TAP not running or network issue | restart TAP, check regions match |
165-| `dialing failed: lookup ... i/o timeout` | DNS issues reaching bsky relay | restart TAP, transient network issue |
166| messages received but not indexed | extraction failing (type mismatch) | enable zat debug logging, check field types |
167-| repo shows `records: 0` after adding | resync failed or collection not in filters | check TAP logs for resync errors, verify `TAP_COLLECTION_FILTERS` |
168-| new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart TAP |
169-| indexing stopped, TAP shows "started" | TAP catching up from downtime | check firehose position in logs, wait for catch-up |
170171-## TAP API endpoints
172173-TAP exposes HTTP endpoints for monitoring and control:
174175| endpoint | description |
176|----------|-------------|
···196197## fly.io deployment
198199-both TAP and backend should be in the same region for internal networking:
200201```bash
202# check current regions
203fly status -a leaflet-search-tap
204fly status -a leaflet-search-backend
205206-# restart TAP if needed
207fly machine restart -a leaflet-search-tap <machine-id>
208```
209···211212## references
213214-- [TAP source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)
215- [ATProto firehose docs](https://atproto.com/specs/sync#firehose)
···1# tap (firehose sync)
23+leaflet-search uses [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose.
45## what is tap?
67tap subscribes to the ATProto firehose, filters for specific collections (e.g., `pub.leaflet.document`), and broadcasts matching events to websocket clients. it also does initial crawling/backfilling of existing records.
89+key behavior: **tap backfills historical data when repos are added**. when a repo is added to tracking:
10+1. tap fetches the full repo from the account's PDS using `com.atproto.sync.getRepo`
112. live firehose events during backfill are buffered in memory
123. historical events (marked `live: false`) are delivered first
134. after historical events complete, buffered live events are released
145. subsequent firehose events arrive immediately marked as `live: true`
1516+tap enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first.
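the ordering described above can be modelled in a few lines. this is a toy python model of the buffering behaviour for a single repo, not tap's actual implementation; `RepoBackfill` and its methods are hypothetical names, and marking the released buffered events as `live: true` is an assumption.

```python
class RepoBackfill:
    """Toy model of one repo: buffer live events until history is delivered."""

    def __init__(self):
        self.backfilling = True
        self.buffered = []   # live events that arrived during backfill
        self.delivered = []  # events in the order a client would see them

    def on_live_event(self, event):
        if self.backfilling:
            # step 2: hold live events in memory while the backfill runs
            self.buffered.append(event)
        else:
            # step 5: after backfill, live events pass straight through
            self.delivered.append({**event, "live": True})

    def finish_backfill(self, historical_events):
        # step 3: historical events go out first, marked live: false
        for event in historical_events:
            self.delivered.append({**event, "live": False})
        # step 4: buffered live events are released in arrival order
        # (assumed here to be marked live: true)
        for event in self.buffered:
            self.delivered.append({**event, "live": True})
        self.buffered.clear()
        self.backfilling = False


repo = RepoBackfill()
repo.on_live_event({"rkey": "c"})                     # arrives mid-backfill
repo.finish_backfill([{"rkey": "a"}, {"rkey": "b"}])  # history from getRepo
repo.on_live_event({"rkey": "d"})                     # arrives after backfill
print([(e["rkey"], e["live"]) for e in repo.delivered])
# [('a', False), ('b', False), ('c', True), ('d', True)]
```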
1718## message format
1920+tap sends JSON messages over websocket. record events look like:
2122```json
23{
···4647## gotchas
4849+1. **action is a string, not an enum** - tap sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison.
5051+2. **collection filters apply to output** - `TAP_COLLECTION_FILTERS` controls which records tap sends to clients. records from other collections are fetched but not forwarded.
52533. **signal collection vs collection filters** - `TAP_SIGNAL_COLLECTION` controls auto-discovery of repos (which repos to track), while `TAP_COLLECTION_FILTERS` controls which records from those repos to output. a repo must either be auto-discovered via signal collection OR manually added via `/repos/add`.
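gotcha 1 in practice: a hedged python sketch that compares `action` as a plain string. the message nesting used here (a top-level `type` plus a `record` object carrying `action` and `collection`) is an assumption for illustration, since the full message format is not reproduced in this excerpt. `classify` is a hypothetical helper, and only `"create"` is confirmed by the text above; `"update"` and `"delete"` are assumed.

```python
import json

# only "create" appears in this doc; the other two are assumed
KNOWN_ACTIONS = {"create", "update", "delete"}


def classify(raw_message):
    """Return (action, collection) for a record event, or None to skip it."""
    msg = json.loads(raw_message)
    if msg.get("type") != "record":
        return None
    record = msg.get("record") or {}
    action = record.get("action")
    # compare as a string; raise on anything unexpected instead of
    # letting extraction fail silently
    if not isinstance(action, str) or action not in KNOWN_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    return action, record.get("collection")
```

raising on an unknown action turns the "silently fail" case into a visible error in the logs.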
54···6566## memory and performance tuning
6768+tap loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine.
6970### recommended settings for leaflet-search
71···87- **lower firehose/outbox**: we track ~1000 repos, not millions - defaults are overkill
88- **smaller ident cache**: we don't need 2M cached identities
8990+if tap keeps OOM'ing, check logs for large repo resyncs:
91```bash
92fly logs -a leaflet-search-tap | grep "parsing repo CAR" | grep -E "size\":[0-9]{8,}"
93```
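the same check as the grep above, sketched in python so the threshold is explicit. eight or more digits means roughly 10 MB and up. like the grep, it assumes each relevant log line contains `parsing repo CAR` and a `"size":<bytes>` field; `large_car_sizes` is a hypothetical helper.

```python
import re

SIZE_RE = re.compile(r'"size":(\d+)')


def large_car_sizes(log_lines, min_bytes=10_000_000):
    """Yield CAR sizes (bytes) from resync log lines at or above min_bytes."""
    for line in log_lines:
        if "parsing repo CAR" not in line:
            continue
        match = SIZE_RE.search(line)
        if match and int(match.group(1)) >= min_bytes:
            yield int(match.group(1))
```

feeding it the output of `fly logs -a leaflet-search-tap` line by line would list the repo sizes most likely to be behind an OOM.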
···99just check
100```
101102+shows tap machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts.
103104example output:
105```
106+=== tap status ===
107app 781417db604d48 23 ewr started ...
108109=== Recent Indexing Activity ===
···117...
118```
119120+if "Last indexed" is more than a day behind "Today", tap may be down or catching up.
121122## checking catch-up progress
123124+when tap restarts after downtime, it replays the firehose from its saved cursor. to check progress:
125126```bash
127# see current firehose position (look for timestamps in log messages)
128fly logs -a leaflet-search-tap | grep -E '"time".*"seq"' | tail -3
129```
130131+the `"time"` field in log messages shows how far behind tap is. compare to current time to estimate catch-up.
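a small python sketch of that comparison. it assumes the log line carries a `"time"` field holding an RFC 3339 timestamp with a `Z` suffix or explicit offset; the exact log layout is not shown here, and `parse_log_time` and `lag_seconds` are hypothetical helpers.

```python
import re
from datetime import datetime, timezone

TIME_RE = re.compile(r'"time"\s*:\s*"([^"]+)"')


def parse_log_time(line):
    """Extract the "time" field from a log line as an aware UTC datetime."""
    match = TIME_RE.search(line)
    if match is None:
        return None
    text = match.group(1).replace("Z", "+00:00")
    # trim fractional seconds to microseconds so fromisoformat accepts them
    text = re.sub(r"(\.\d{6})\d+", r"\1", text)
    return datetime.fromisoformat(text).astimezone(timezone.utc)


def lag_seconds(line, now=None):
    """Seconds between the logged firehose time and now (or a supplied now)."""
    logged = parse_log_time(line)
    if logged is None:
        return None
    now = now or datetime.now(timezone.utc)
    return (now - logged).total_seconds()
```

sampling the lag twice a few minutes apart shows whether the gap is shrinking, which is a more direct signal than reading a single timestamp.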
132133catch-up speed varies:
134- **~0.3x** when resync queue is full (large repos being fetched)
···143144look for:
145- `"connected to firehose"` - successfully connected to bsky relay
146+- `"websocket connected"` - backend connected to tap
147- `"dialing failed"` / `"i/o timeout"` - network issues
148149### check backend is receiving
···152```
153154look for:
155+- `tap connected!` - connected to tap
156- `tap: msg_type=record` - receiving messages
157- `indexed document:` - successfully processing
158···160161| symptom | cause | fix |
162|---------|-------|-----|
163+| tap machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 |
164+| `websocket handshake failed: error.Timeout` | tap not running or network issue | restart tap, check regions match |
165+| `dialing failed: lookup ... i/o timeout` | DNS issues reaching bsky relay | restart tap, transient network issue |
166| messages received but not indexed | extraction failing (type mismatch) | enable zat debug logging, check field types |
167+| repo shows `records: 0` after adding | resync failed or collection not in filters | check tap logs for resync errors, verify `TAP_COLLECTION_FILTERS` |
168+| new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart tap |
169+| indexing stopped, tap shows "started" | tap catching up from downtime | check firehose position in logs, wait for catch-up |
170171+## tap API endpoints
172173+tap exposes HTTP endpoints for monitoring and control:
174175| endpoint | description |
176|----------|-------------|
···196197## fly.io deployment
198199+both tap and backend should be in the same region for internal networking:
200201```bash
202# check current regions
203fly status -a leaflet-search-tap
204fly status -a leaflet-search-backend
205206+# restart tap if needed
207fly machine restart -a leaflet-search-tap <machine-id>
208```
209···211212## references
213214+- [tap source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)
215- [ATProto firehose docs](https://atproto.com/specs/sync#firehose)
+1-1
tap/justfile
···2728# check indexing status - shows most recent indexed documents
29check:
30- @echo "=== TAP Status ==="
31 @fly status --app leaflet-search-tap 2>/dev/null | grep -E "(STATE|started|stopped)"
32 @echo ""
33 @echo "=== Recent Indexing Activity ==="
···2728# check indexing status - shows most recent indexed documents
29check:
30+ @echo "=== tap status ==="
31 @fly status --app leaflet-search-tap 2>/dev/null | grep -E "(STATE|started|stopped)"
32 @echo ""
33 @echo "=== Recent Indexing Activity ==="