···4747|----------|---------|-------------|
4848| `APP_NAME` | `leaflet-search` | name shown in startup logs |
4949| `DASHBOARD_URL` | `https://pub-search.waow.tech/dashboard.html` | redirect target for `/dashboard` |
5050-| `TAP_HOST` | `leaflet-search-tap.fly.dev` | TAP websocket host |
5151-| `TAP_PORT` | `443` | TAP websocket port |
5050+| `TAP_HOST` | `leaflet-search-tap.fly.dev` | tap websocket host |
5151+| `TAP_PORT` | `443` | tap websocket port |
5252| `PORT` | `3000` | HTTP server port |
5353| `TURSO_URL` | - | Turso database URL (required) |
5454| `TURSO_TOKEN` | - | Turso auth token (required) |
···6161- [Fly.io](https://fly.io) hosts backend + tap
6262- [Turso](https://turso.tech) cloud SQLite with vector support
6363- [Voyage AI](https://voyageai.com) embeddings (voyage-3-lite)
6464-- [Tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
6464+- [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
6565- [Zig](https://ziglang.org) HTTP server, search API, content indexing
6666- [Cloudflare Pages](https://pages.cloudflare.com) static frontend
6767
+3-3
docs/standard-search-planning.md
···221221- keep existing block parser for `pub.leaflet.*`
222222- platform detection from `content.$type`
223223224224-### PR3: TAP subscriber for site.standard.document
224224+### PR3: tap subscriber for site.standard.document
225225- subscribe to `site.standard.document` + `site.standard.publication`
226226- route to appropriate extractor
227227- starts ingesting pckt.blog content
···2542542. ~~find and examine offprint records~~ (done - no public content yet)
2552553. ~~PR1: database schema~~ (merged)
2562564. PR2: generalized content extraction
257257-5. PR3: TAP subscriber
257257+5. PR3: tap subscriber
2582586. PR4: API platform filter
2592597. consider witness cache architecture (see below)
260260···275275### current leaflet-search architecture (no witness cache)
276276277277```
278278-Firehose → TAP → Parse & Transform → Store DERIVED data → Discard raw record
278278+Firehose → tap → Parse & Transform → Store DERIVED data → Discard raw record
279279```
280280281281we store:
+27-27
docs/tap.md
···11# tap (firehose sync)
2233-leaflet-search uses [TAP](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose.
33+leaflet-search uses [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose.
4455## what is tap?
6677tap subscribes to the ATProto firehose, filters for specific collections (e.g., `pub.leaflet.document`), and broadcasts matching events to websocket clients. it also does initial crawling/backfilling of existing records.
8899-key behavior: **TAP backfills historical data when repos are added**. when a repo is added to tracking:
1010-1. TAP fetches the full repo from the account's PDS using `com.atproto.sync.getRepo`
99+key behavior: **tap backfills historical data when repos are added**. when a repo is added to tracking:
1010+1. tap fetches the full repo from the account's PDS using `com.atproto.sync.getRepo`
11112. live firehose events during backfill are buffered in memory
12123. historical events (marked `live: false`) are delivered first
13134. after historical events complete, buffered live events are released
14145. subsequent firehose events arrive immediately marked as `live: true`
15151616-TAP enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first.
1616+tap enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first.
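
the delivery order above can be sketched as a toy model for a single repo. this is only an illustration of the ordering described here, not tap's actual implementation; the class and method names (`RepoBackfill`, `on_historical`, `on_firehose`, `finish_backfill`) are hypothetical.

```python
from collections import deque


class RepoBackfill:
    """Toy model of the per-repo delivery order (hypothetical, not tap's code)."""

    def __init__(self):
        self.backfilling = True
        self.buffered_live = deque()  # live events held in memory during backfill
        self.delivered = []           # what a websocket client would receive, in order

    def on_historical(self, record):
        # records from the full-repo fetch are delivered first, marked live: false
        self.delivered.append({"live": False, "record": record})

    def on_firehose(self, record):
        event = {"live": True, "record": record}
        if self.backfilling:
            # firehose events that arrive mid-backfill wait in the buffer
            self.buffered_live.append(event)
        else:
            # after backfill, firehose events are delivered immediately
            self.delivered.append(event)

    def finish_backfill(self):
        # release buffered live events, in arrival order, after all historical ones
        self.backfilling = False
        while self.buffered_live:
            self.delivered.append(self.buffered_live.popleft())


repo = RepoBackfill()
repo.on_firehose("post-3")    # arrives during backfill, gets buffered
repo.on_historical("post-1")
repo.on_historical("post-2")
repo.finish_backfill()
repo.on_firehose("post-4")    # arrives after backfill
order = [(e["record"], e["live"]) for e in repo.delivered]
```

here `order` lists the two historical records first, then the buffered live event, then the later live event.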
17171818## message format
19192020-TAP sends JSON messages over websocket. record events look like:
2020+tap sends JSON messages over websocket. record events look like:
21212222```json
2323{
···46464747## gotchas
48484949-1. **action is a string, not an enum** - TAP sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison.
4949+1. **action is a string, not an enum** - tap sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison.
50505151-2. **collection filters apply to output** - `TAP_COLLECTION_FILTERS` controls which records TAP sends to clients. records from other collections are fetched but not forwarded.
5151+2. **collection filters apply to output** - `TAP_COLLECTION_FILTERS` controls which records tap sends to clients. records from other collections are fetched but not forwarded.
525253533. **signal collection vs collection filters** - `TAP_SIGNAL_COLLECTION` controls auto-discovery of repos (which repos to track), while `TAP_COLLECTION_FILTERS` controls which records from those repos to output. a repo must either be auto-discovered via signal collection OR manually added via `/repos/add`.
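
for gotcha 1, a minimal sketch of comparing `action` as a plain string and failing loudly instead of silently. assumptions: `action` is read from the top level of a simplified message (adjust the lookup to wherever it sits in the real payload), and the `update`/`delete` values are assumed, since only `"create"` is shown above. `parse_action` is a hypothetical helper.

```python
import json

# assumed set of actions; only "create" is confirmed by the text above
KNOWN_ACTIONS = {"create", "update", "delete"}


def parse_action(raw: str) -> str:
    """Return the action string from a JSON message, raising on anything unexpected."""
    msg = json.loads(raw)
    action = msg.get("action")
    if not isinstance(action, str):
        # surface the type mismatch instead of dropping the message silently
        raise ValueError(f"expected string action, got {type(action).__name__}")
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return action


action = parse_action('{"action": "create"}')
```

raising on a non-string or unknown value makes a type mismatch show up in logs rather than as messages that are received but never indexed.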
5454···65656666## memory and performance tuning
67676868-TAP loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine.
6868+tap loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine.
69697070### recommended settings for leaflet-search
7171···8787- **lower firehose/outbox**: we track ~1000 repos, not millions - defaults are overkill
8888- **smaller ident cache**: we don't need 2M cached identities
89899090-if TAP keeps OOM'ing, check logs for large repo resyncs:
9090+if tap keeps OOM'ing, check logs for large repo resyncs:
9191```bash
9292fly logs -a leaflet-search-tap | grep "parsing repo CAR" | grep -E "size\":[0-9]{8,}"
9393```
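
the same filter can be sketched in python when working from saved logs. it reuses the grep pattern above: lines containing `parsing repo CAR` with a `size` of 8 or more digits (10,000,000 and up; that the unit is bytes is an assumption). the sample log lines are made up for illustration and are not real tap output.

```python
import re

# same pattern as the grep above: size":<8 or more digits>
LARGE_SIZE = re.compile(r'size":(\d{8,})')


def large_car_sizes(lines):
    """Return the size values of 'parsing repo CAR' log lines with 8+ digit sizes."""
    sizes = []
    for line in lines:
        if "parsing repo CAR" not in line:
            continue
        match = LARGE_SIZE.search(line)
        if match:
            sizes.append(int(match.group(1)))
    return sizes


# hypothetical log lines, shaped only to exercise the filter
sample = [
    '{"msg":"parsing repo CAR","size":123456789}',
    '{"msg":"parsing repo CAR","size":1234}',
    '{"msg":"something else","size":123456789}',
]
sizes = large_car_sizes(sample)
```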
···9999just check
100100```
101101102102-shows TAP machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts.
102102+shows tap machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts.
103103104104example output:
105105```
106106-=== TAP Status ===
106106+=== tap status ===
107107app 781417db604d48 23 ewr started ...
108108109109=== Recent Indexing Activity ===
···117117...
118118```
119119120120-if "Last indexed" is more than a day behind "Today", TAP may be down or catching up.
120120+if "Last indexed" is more than a day behind "Today", tap may be down or catching up.
121121122122## checking catch-up progress
123123124124-when TAP restarts after downtime, it replays the firehose from its saved cursor. to check progress:
124124+when tap restarts after downtime, it replays the firehose from its saved cursor. to check progress:
125125126126```bash
127127# see current firehose position (look for timestamps in log messages)
128128fly logs -a leaflet-search-tap | grep -E '"time".*"seq"' | tail -3
129129```
130130131131-the `"time"` field in log messages shows how far behind TAP is. compare to current time to estimate catch-up.
131131+the `"time"` field in log messages shows how far behind tap is. compare to current time to estimate catch-up.
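
a small sketch of that comparison, assuming the `"time"` value is an ISO 8601 / RFC 3339 UTC timestamp (the exact log format is not shown here, so treat the sample value as hypothetical). `lag_seconds` is an illustrative helper, not part of tap.

```python
from datetime import datetime, timezone


def lag_seconds(event_time: str, now: datetime) -> float:
    """Seconds between a log line's "time" value and the given current time."""
    # datetime.fromisoformat on older pythons does not accept a trailing "Z"
    seen = datetime.fromisoformat(event_time.replace("Z", "+00:00"))
    return (now - seen).total_seconds()


# fixed "now" so the example is reproducible; use datetime.now(timezone.utc) in practice
now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
lag = lag_seconds("2025-01-01T11:30:00Z", now)
```

with these sample values `lag` is 1800 seconds, i.e. the firehose position is 30 minutes behind.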
132132133133catch-up speed varies:
134134- **~0.3x** when resync queue is full (large repos being fetched)
···143143144144look for:
145145- `"connected to firehose"` - successfully connected to bsky relay
146146-- `"websocket connected"` - backend connected to TAP
146146+- `"websocket connected"` - backend connected to tap
147147- `"dialing failed"` / `"i/o timeout"` - network issues
148148149149### check backend is receiving
···152152```
153153154154look for:
155155-- `tap connected!` - connected to TAP
155155+- `tap connected!` - connected to tap
156156- `tap: msg_type=record` - receiving messages
157157- `indexed document:` - successfully processing
158158···160160161161| symptom | cause | fix |
162162|---------|-------|-----|
163163-| TAP machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 |
164164-| `websocket handshake failed: error.Timeout` | TAP not running or network issue | restart TAP, check regions match |
165165-| `dialing failed: lookup ... i/o timeout` | DNS issues reaching bsky relay | restart TAP, transient network issue |
163163+| tap machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 |
164164+| `websocket handshake failed: error.Timeout` | tap not running or network issue | restart tap, check regions match |
165165+| `dialing failed: lookup ... i/o timeout` | DNS issues reaching bsky relay | restart tap, transient network issue |
166166| messages received but not indexed | extraction failing (type mismatch) | enable zat debug logging, check field types |
167167-| repo shows `records: 0` after adding | resync failed or collection not in filters | check TAP logs for resync errors, verify `TAP_COLLECTION_FILTERS` |
168168-| new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart TAP |
169169-| indexing stopped, TAP shows "started" | TAP catching up from downtime | check firehose position in logs, wait for catch-up |
167167+| repo shows `records: 0` after adding | resync failed or collection not in filters | check tap logs for resync errors, verify `TAP_COLLECTION_FILTERS` |
168168+| new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart tap |
169169+| indexing stopped, tap shows "started" | tap catching up from downtime | check firehose position in logs, wait for catch-up |
170170171171-## TAP API endpoints
171171+## tap API endpoints
172172173173-TAP exposes HTTP endpoints for monitoring and control:
173173+tap exposes HTTP endpoints for monitoring and control:
174174175175| endpoint | description |
176176|----------|-------------|
···196196197197## fly.io deployment
198198199199-both TAP and backend should be in the same region for internal networking:
199199+both tap and backend should be in the same region for internal networking:
200200201201```bash
202202# check current regions
203203fly status -a leaflet-search-tap
204204fly status -a leaflet-search-backend
205205206206-# restart TAP if needed
206206+# restart tap if needed
207207fly machine restart -a leaflet-search-tap <machine-id>
208208```
209209···211211212212## references
213213214214-- [TAP source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)
214214+- [tap source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)
215215- [ATProto firehose docs](https://atproto.com/specs/sync#firehose)
+1-1
tap/justfile
···27272828# check indexing status - shows most recent indexed documents
2929check:
3030- @echo "=== TAP Status ==="
3030+ @echo "=== tap status ==="
3131 @fly status --app leaflet-search-tap 2>/dev/null | grep -E "(STATE|started|stopped)"
3232 @echo ""
3333 @echo "=== Recent Indexing Activity ==="