search for standard sites pub-search.waow.tech
search zig blog atproto

docs: lowercase tap references (not an acronym)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

+34 -34
+3 -3
README.md
···
 |----------|---------|-------------|
 | `APP_NAME` | `leaflet-search` | name shown in startup logs |
 | `DASHBOARD_URL` | `https://pub-search.waow.tech/dashboard.html` | redirect target for `/dashboard` |
-| `TAP_HOST` | `leaflet-search-tap.fly.dev` | TAP websocket host |
-| `TAP_PORT` | `443` | TAP websocket port |
+| `TAP_HOST` | `leaflet-search-tap.fly.dev` | tap websocket host |
+| `TAP_PORT` | `443` | tap websocket port |
 | `PORT` | `3000` | HTTP server port |
 | `TURSO_URL` | - | Turso database URL (required) |
 | `TURSO_TOKEN` | - | Turso auth token (required) |
···
 - [Fly.io](https://fly.io) hosts backend + tap
 - [Turso](https://turso.tech) cloud SQLite with vector support
 - [Voyage AI](https://voyageai.com) embeddings (voyage-3-lite)
-- [Tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
+- [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
 - [Zig](https://ziglang.org) HTTP server, search API, content indexing
 - [Cloudflare Pages](https://pages.cloudflare.com) static frontend

+3 -3
docs/standard-search-planning.md
···
 - keep existing block parser for `pub.leaflet.*`
 - platform detection from `content.$type`

-### PR3: TAP subscriber for site.standard.document
+### PR3: tap subscriber for site.standard.document
 - subscribe to `site.standard.document` + `site.standard.publication`
 - route to appropriate extractor
 - starts ingesting pckt.blog content
···
 2. ~~find and examine offprint records~~ (done - no public content yet)
 3. ~~PR1: database schema~~ (merged)
 4. PR2: generalized content extraction
-5. PR3: TAP subscriber
+5. PR3: tap subscriber
 6. PR4: API platform filter
 7. consider witness cache architecture (see below)

···
 ### current leaflet-search architecture (no witness cache)

 ```
-Firehose → TAP → Parse & Transform → Store DERIVED data → Discard raw record
+Firehose → tap → Parse & Transform → Store DERIVED data → Discard raw record
 ```

 we store:
+27 -27
docs/tap.md
···
 # tap (firehose sync)

-leaflet-search uses [TAP](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose.
+leaflet-search uses [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose.

 ## what is tap?

 tap subscribes to the ATProto firehose, filters for specific collections (e.g., `pub.leaflet.document`), and broadcasts matching events to websocket clients. it also does initial crawling/backfilling of existing records.

-key behavior: **TAP backfills historical data when repos are added**. when a repo is added to tracking:
-1. TAP fetches the full repo from the account's PDS using `com.atproto.sync.getRepo`
+key behavior: **tap backfills historical data when repos are added**. when a repo is added to tracking:
+1. tap fetches the full repo from the account's PDS using `com.atproto.sync.getRepo`
 2. live firehose events during backfill are buffered in memory
 3. historical events (marked `live: false`) are delivered first
 4. after historical events complete, buffered live events are released
 5. subsequent firehose events arrive immediately marked as `live: true`

-TAP enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first.
+tap enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first.

 ## message format

-TAP sends JSON messages over websocket. record events look like:
+tap sends JSON messages over websocket. record events look like:

 ```json
 {
···

 ## gotchas

-1. **action is a string, not an enum** - TAP sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison.
+1. **action is a string, not an enum** - tap sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison.

-2. **collection filters apply to output** - `TAP_COLLECTION_FILTERS` controls which records TAP sends to clients. records from other collections are fetched but not forwarded.
+2. **collection filters apply to output** - `TAP_COLLECTION_FILTERS` controls which records tap sends to clients. records from other collections are fetched but not forwarded.

 3. **signal collection vs collection filters** - `TAP_SIGNAL_COLLECTION` controls auto-discovery of repos (which repos to track), while `TAP_COLLECTION_FILTERS` controls which records from those repos to output. a repo must either be auto-discovered via signal collection OR manually added via `/repos/add`.

···

 ## memory and performance tuning

-TAP loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine.
+tap loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine.

 ### recommended settings for leaflet-search

···
 - **lower firehose/outbox**: we track ~1000 repos, not millions - defaults are overkill
 - **smaller ident cache**: we don't need 2M cached identities

-if TAP keeps OOM'ing, check logs for large repo resyncs:
+if tap keeps OOM'ing, check logs for large repo resyncs:
 ```bash
 fly logs -a leaflet-search-tap | grep "parsing repo CAR" | grep -E "size\":[0-9]{8,}"
 ```
···
 just check
 ```

-shows TAP machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts.
+shows tap machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts.

 example output:
 ```
-=== TAP Status ===
+=== tap status ===
 app 781417db604d48 23 ewr started ...

 === Recent Indexing Activity ===
···
 ...
 ```

-if "Last indexed" is more than a day behind "Today", TAP may be down or catching up.
+if "Last indexed" is more than a day behind "Today", tap may be down or catching up.

 ## checking catch-up progress

-when TAP restarts after downtime, it replays the firehose from its saved cursor. to check progress:
+when tap restarts after downtime, it replays the firehose from its saved cursor. to check progress:

 ```bash
 # see current firehose position (look for timestamps in log messages)
 fly logs -a leaflet-search-tap | grep -E '"time".*"seq"' | tail -3
 ```

-the `"time"` field in log messages shows how far behind TAP is. compare to current time to estimate catch-up.
+the `"time"` field in log messages shows how far behind tap is. compare to current time to estimate catch-up.

 catch-up speed varies:
 - **~0.3x** when resync queue is full (large repos being fetched)
···

 look for:
 - `"connected to firehose"` - successfully connected to bsky relay
-- `"websocket connected"` - backend connected to TAP
+- `"websocket connected"` - backend connected to tap
 - `"dialing failed"` / `"i/o timeout"` - network issues

 ### check backend is receiving
···
 ```

 look for:
-- `tap connected!` - connected to TAP
+- `tap connected!` - connected to tap
 - `tap: msg_type=record` - receiving messages
 - `indexed document:` - successfully processing

···

 | symptom | cause | fix |
 |---------|-------|-----|
-| TAP machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 |
-| `websocket handshake failed: error.Timeout` | TAP not running or network issue | restart TAP, check regions match |
-| `dialing failed: lookup ... i/o timeout` | DNS issues reaching bsky relay | restart TAP, transient network issue |
+| tap machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 |
+| `websocket handshake failed: error.Timeout` | tap not running or network issue | restart tap, check regions match |
+| `dialing failed: lookup ... i/o timeout` | DNS issues reaching bsky relay | restart tap, transient network issue |
 | messages received but not indexed | extraction failing (type mismatch) | enable zat debug logging, check field types |
-| repo shows `records: 0` after adding | resync failed or collection not in filters | check TAP logs for resync errors, verify `TAP_COLLECTION_FILTERS` |
-| new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart TAP |
-| indexing stopped, TAP shows "started" | TAP catching up from downtime | check firehose position in logs, wait for catch-up |
+| repo shows `records: 0` after adding | resync failed or collection not in filters | check tap logs for resync errors, verify `TAP_COLLECTION_FILTERS` |
+| new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart tap |
+| indexing stopped, tap shows "started" | tap catching up from downtime | check firehose position in logs, wait for catch-up |

-## TAP API endpoints
+## tap API endpoints

-TAP exposes HTTP endpoints for monitoring and control:
+tap exposes HTTP endpoints for monitoring and control:

 | endpoint | description |
 |----------|-------------|
···

 ## fly.io deployment

-both TAP and backend should be in the same region for internal networking:
+both tap and backend should be in the same region for internal networking:

 ```bash
 # check current regions
 fly status -a leaflet-search-tap
 fly status -a leaflet-search-backend

-# restart TAP if needed
+# restart tap if needed
 fly machine restart -a leaflet-search-tap <machine-id>
 ```

···

 ## references

-- [TAP source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)
+- [tap source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)
 - [ATProto firehose docs](https://atproto.com/specs/sync#firehose)
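
reviewer note on the first gotcha in docs/tap.md ("action is a string, not an enum"): a minimal Python sketch of the string-comparison approach, not the backend's actual Zig code. only `"action": "create"` and the `live` flag come from the doc; the envelope shape (`type`, nested `record`), the `update`/`delete` action names, and the `classify` helper are assumptions made for illustration.

```python
import json
from typing import Optional, Tuple

# assumed set of action strings; the doc only confirms "create"
KNOWN_ACTIONS = {"create", "update", "delete"}


def classify(raw: str) -> Optional[Tuple[str, bool]]:
    """Return (action, live) for a record event, or None for other message types.

    The message layout used here is hypothetical: a top-level "type" field
    and a nested "record" object carrying "action" and "live".
    """
    msg = json.loads(raw)
    if msg.get("type") != "record":
        return None  # not a record event, nothing to index
    rec = msg.get("record") or {}
    action = rec.get("action")
    # compare the action as a plain string instead of decoding into an enum,
    # and raise on anything unexpected so the failure is not silent
    if not isinstance(action, str) or action not in KNOWN_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    return action, bool(rec.get("live", False))
```

calling `classify` on a backfilled create event would return the action string together with `live=False`; raising on an unknown action is a design choice here so that a type mismatch shows up in logs rather than dropping documents quietly.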
+1 -1
tap/justfile
···

 # check indexing status - shows most recent indexed documents
 check:
-    @echo "=== TAP Status ==="
+    @echo "=== tap status ==="
     @fly status --app leaflet-search-tap 2>/dev/null | grep -E "(STATE|started|stopped)"
     @echo ""
     @echo "=== Recent Indexing Activity ==="