search for standard sites pub-search.waow.tech
search zig blog atproto

docs: update search-architecture, api, tap, and performance-saga

- fix FTS5 description: local replica, not Turso directly
- fix doc count (~11k not ~25k), query count (14 not 10)
- fix /similar: uses turbopuffer, not Turso
- remove phantom /platforms endpoint from api.md
- fix v2 format (hasMore not offset), timing keys (search_keyword etc)
- add coverImage/handle fields, clarify hybrid-only score/source
- fix tap examples: wget not curl (container has no curl)
- add missing tap env vars (TAP_RELAY_URL, cursor/timeout settings)
- fix voyage model in perf saga (voyage-3-lite 512 dims)
- add embedder thread to architecture diagram

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+60 -55
+40 -41
docs/api.md
···
 | `tag` | string | no | filter by tag (documents only) |
 | `platform` | string | no | filter by platform: `leaflet`, `pckt`, `offprint`, `greengale`, `whitewind`, `other` |
 | `since` | string | no | ISO date, filter to documents created after |
-| `mode` | string | no | `keyword` (default), `semantic`, or `hybrid`. semantic uses voyage-4-lite embeddings + turbopuffer ANN. hybrid merges keyword + semantic via reciprocal rank fusion. semantic/hybrid ignore `tag` and `since` filters. |
-| `format` | string | no | `v2` wraps response in `{"results": [...], "total": N, "offset": N}` |
+| `mode` | string | no | `keyword` (default), `semantic`, or `hybrid`. semantic uses voyage-4-lite embeddings + turbopuffer ANN. hybrid merges keyword + semantic via reciprocal rank fusion. |
+| `format` | string | no | `v2` wraps response in `{"results": [...], "total": N, "hasMore": bool}` |
 | `limit` | int | no | max results to return (default 20) |
 | `offset` | int | no | pagination offset |

 *at least one of `q` or `tag` required
+
+**filter behavior by mode:**
+- **keyword**: respects all filters (`tag`, `platform`, `since`)
+- **semantic**: respects `platform` only. ignores `tag` and `since`.
+- **hybrid**: keyword half respects all filters, semantic half respects `platform` only. results merged via RRF.
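The hybrid merge described in the new docs (keyword and semantic results combined via reciprocal rank fusion) can be illustrated with a small sketch. This is not the project's implementation: the function name `rrf_merge`, the constant `k=60` (a common default for RRF), and the use of plain document ids are all assumptions made for illustration. The raw RRF sums below are also on a different scale than the `0.85` shown in the example response, so the service presumably scores or normalizes differently.

```python
from collections import defaultdict


def rrf_merge(keyword_ids, semantic_ids, k=60):
    """Merge two ranked id lists with reciprocal rank fusion.

    Hypothetical helper: each list contributes 1 / (k + rank) per document,
    with rank starting at 1. k=60 is an assumed, conventional constant.
    """
    scores = defaultdict(float)
    sources = defaultdict(list)
    for name, ranked in (("keyword", keyword_ids), ("semantic", semantic_ids)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
            sources[doc_id].append(name)
    # Highest fused score first; ties broken by id for a stable order.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return [
        {"id": d, "score": scores[d], "source": "+".join(sources[d])}
        for d in ordered
    ]


if __name__ == "__main__":
    for row in rrf_merge(["a", "b", "c"], ["b", "d"]):
        print(row)
```

A document found by both halves accumulates two contributions, which is why it tends to rank first and carries a combined `source` such as `keyword+semantic`.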

 **response:**
 ```json
···
     "basePath": "gyst.leaflet.pub",
     "platform": "leaflet",
     "path": "/001",
-    "source": "keyword",
-    "score": 0.85
+    "coverImage": "",
+    "handle": "@user.bsky.social"
   }
 ]
 ```
···
 {
   "results": [ /* same as above */ ],
   "total": 89,
-  "offset": 0
+  "hasMore": false
+}
+```
+
+hybrid mode adds `source` and `score` fields:
+```json
+{
+  "source": "keyword+semantic",
+  "score": 0.85
 }
 ```

···
 ]
 ```

-### platforms
-
-```
-GET /platforms
-```
-
-document counts by platform.
-
-**response:**
-```json
-[
-  {"platform": "leaflet", "count": 2500},
-  {"platform": "pckt", "count": 800},
-  {"platform": "greengale", "count": 150},
-  {"platform": "offprint", "count": 50},
-  {"platform": "other", "count": 100}
-]
-```
-
 ### stats

 ```
···
 **response:**
 ```json
 {
-  "documents": 3500,
-  "publications": 120,
-  "embeddings": 3200,
+  "documents": 11445,
+  "publications": 2603,
+  "embeddings": 10900,
   "searches": 5000,
   "errors": 5,
   "cache_hits": 1200,
   "cache_misses": 800,
   "timing": {
-    "search": {"count": 1000, "avg_ms": 25, "p50_ms": 20, "p95_ms": 50, "p99_ms": 80, "max_ms": 150},
-    "similar": {"count": 200, "avg_ms": 150, "p50_ms": 140, "p95_ms": 200, "p99_ms": 250, "max_ms": 300},
-    "tags": {"count": 500, "avg_ms": 5, "p50_ms": 4, "p95_ms": 10, "p99_ms": 15, "max_ms": 25},
-    "popular": {"count": 300, "avg_ms": 3, "p50_ms": 2, "p95_ms": 5, "p99_ms": 8, "max_ms": 12}
+    "search_keyword": {"count": 1000, "avg_ms": 25, "p50_ms": 20, "p95_ms": 50, "p99_ms": 80, "max_ms": 150},
+    "search_semantic": {"count": 100, "avg_ms": 350, "p50_ms": 340, ...},
+    "search_hybrid": {"count": 50, "avg_ms": 380, ...},
+    "similar": {"count": 200, "avg_ms": 150, ...},
+    "tags": {"count": 500, "avg_ms": 5, ...},
+    "popular": {"count": 300, "avg_ms": 3, ...}
   }
 }
 ```
···
 GET /api/dashboard
 ```

-rich dashboard data for analytics UI.
+rich dashboard data for analytics UI. includes platform counts (no separate `/platforms` endpoint).

 **response:**
 ```json
 {
   "startedAt": 1705000000,
   "searches": 5000,
-  "publications": 120,
-  "documents": 3500,
-  "platforms": [{"platform": "leaflet", "count": 2500}],
+  "publications": 2603,
+  "documents": 11445,
+  "platforms": [{"platform": "leaflet", "count": 5399}],
   "tags": [{"tag": "programming", "count": 42}],
   "timeline": [{"date": "2025-01-15", "count": 25}],
   "topPubs": [{"name": "gyst", "basePath": "gyst.leaflet.pub", "count": 150}],
···

 ## building URLs

-documents can be accessed on the web via their `basePath` and `rkey`:
-- articles: `https://{basePath}/{rkey}` or `https://{basePath}{path}` if path is set
-- publications: `https://{basePath}`
+documents can be accessed on the web via their `basePath` and platform-specific patterns:

-examples:
-- `https://gyst.leaflet.pub/3ldasifz7bs2l`
-- `https://greengale.app/3fz.org/001`
+| platform | URL pattern | example |
+|----------|-------------|---------|
+| leaflet | `https://{basePath}/{rkey}` | `https://gyst.leaflet.pub/3ldasifz7bs2l` |
+| pckt | `https://{basePath}{path}` | `https://devlog.pckt.blog/some-slug` |
+| offprint | `https://{basePath}{path}` | `https://dalisay.offprint.app/a/3me5ucj7vxf23-title-slug` |
+| greengale | `https://{basePath}{path}` | `https://3fz.greengale.app/001` |
+| whitewind | `https://whtwnd.com/{did}/{rkey}` | `https://whtwnd.com/did:plc:.../3abc123` |
+| publications | `https://{basePath}` | `https://gyst.leaflet.pub` |
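For API consumers, the per-platform URL patterns added to `docs/api.md` could be applied with a helper along these lines. This is a minimal sketch, not code from the repository: the function names and parameters are hypothetical, and it assumes the caller already has `basePath`, `rkey`, `path`, and (for whitewind) the author's DID available. The `other` platform is not covered by the documented patterns, so the sketch raises rather than guessing.

```python
def document_url(platform, base_path="", rkey="", path="", did=""):
    """Build a web URL for a document from the documented patterns.

    Hypothetical helper; argument names mirror the API fields `basePath`,
    `rkey`, and `path`. `did` is an assumed input for whitewind.
    """
    if platform == "leaflet":
        return f"https://{base_path}/{rkey}"
    if platform in ("pckt", "offprint", "greengale"):
        # `path` is expected to start with "/", as in "/001".
        return f"https://{base_path}{path}"
    if platform == "whitewind":
        return f"https://whtwnd.com/{did}/{rkey}"
    # No pattern is documented for other platforms.
    raise ValueError(f"no documented URL pattern for platform {platform!r}")


def publication_url(base_path):
    """Publications live at the bare base path."""
    return f"https://{base_path}"


if __name__ == "__main__":
    print(document_url("leaflet", base_path="gyst.leaflet.pub", rkey="3ldasifz7bs2l"))
    print(document_url("greengale", base_path="3fz.greengale.app", path="/001"))
    print(publication_url("gyst.leaflet.pub"))
```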
+3 -2
docs/performance-saga.md
···

 ## what happened

-attempted to add a vector similarity search feature using voyage-3 embeddings (1024 dims) stored in turso with a DiskANN index. the embedding model change had a different shape than what was stored, turso performance degraded badly, and multiple attempts to back out the changes failed to restore performance.
+attempted to add a vector similarity search feature using voyage-3-lite embeddings (512 dims) stored in turso with a DiskANN index. the embedding model change had a different shape than what was stored, turso performance degraded badly, and multiple attempts to back out the changes failed to restore performance.

 ## the problems we found and fixed

···
 ├── HTTP thread pool (16 workers)
 ├── local SQLite (read_conn for search, conn+mutex for writes)
 ├── turso client (fallback for unsupported queries)
-├── sync thread (turso → local, periodic)
+├── sync thread (turso → local, full on startup + periodic incremental)
 ├── tap consumer (firehose → turso)
+├── embedder (voyage-4-lite → turbopuffer, background)
 ├── stats buffer (periodic flush to turso)
 └── activity tracker
 ```
+6 -6
docs/search-architecture.md
···

 ## current: SQLite FTS5

-we use SQLite's built-in full-text search (FTS5) via Turso.
+keyword search uses SQLite's FTS5 on a local read replica, synced from Turso (the source of truth).

 ### why FTS5 works for now

-- **scale**: ~25k documents. FTS5 handles this trivially.
+- **scale**: ~11k documents. FTS5 handles this trivially.
 - **latency**: keyword p50 ~9ms (local SQLite replica), semantic p50 ~345ms (voyage + turbopuffer), hybrid p50 ~360ms.
 - **cost**: $0. included with Turso free tier.
 - **ops**: zero. no separate service to run.
···

 buildFtsQuery(): "crypto OR casino*"

-FTS5 MATCH query with BM25 + recency decay
+FTS5 MATCH query with BM25 + recency decay (on local SQLite replica)

 results with snippet()
 ```
···

 | component | FTS5-specific |
 |-----------|---------------|
-| 10 query definitions | `MATCH`, `snippet()`, `ORDER BY rank` |
+| 14 query definitions | `MATCH`, `snippet()`, `ORDER BY rank` |
 | `buildFtsQuery()` | constructs FTS5 syntax |
 | schema | `documents_fts`, `publications_fts` virtual tables |

···
 2. add Elasticsearch as search index
 3. sync documents to ES on write (async)
 4. point `/search` at Elasticsearch
-5. keep `/similar` on Turso (vector search)
+5. keep `/similar` on turbopuffer (vector search)

 the `search()` function would change from SQL queries to ES client calls. result types stay the same. HTTP layer unchanged.

···

 ### vector search scaling

-similarity search currently uses voyage-4-lite embeddings (1024 dims) with turbopuffer ANN index. this handles ~25k docs well. at larger scale:
+similarity search currently uses voyage-4-lite embeddings (1024 dims) with turbopuffer ANN index. this handles ~11k docs well. at larger scale:

 - **Elasticsearch**: has vector search (dense_vector + kNN)
 - **dedicated vector DB**: Qdrant, Pinecone, Weaviate
+11 -6
docs/tap.md
···
 memory = '2gb' # 1gb is not enough

 [env]
-TAP_RESYNC_PARALLELISM = '1' # only one repo CAR in memory at a time (default: 5)
-TAP_FIREHOSE_PARALLELISM = '5' # concurrent event processors (default: 10)
-TAP_OUTBOX_CAPACITY = '10000' # event buffer size (default: 100000)
-TAP_IDENT_CACHE_SIZE = '10000' # identity cache entries (default: 2000000)
+TAP_RELAY_URL = 'https://relay.waow.tech' # custom relay (not default bsky.network)
+TAP_RESYNC_PARALLELISM = '1' # only one repo CAR in memory at a time (default: 5)
+TAP_FIREHOSE_PARALLELISM = '5' # concurrent event processors (default: 10)
+TAP_OUTBOX_CAPACITY = '10000' # event buffer size (default: 100000)
+TAP_IDENT_CACHE_SIZE = '10000' # identity cache entries (default: 2000000)
+TAP_CURSOR_SAVE_INTERVAL = '5s' # how often to persist firehose cursor
+TAP_REPO_FETCH_TIMEOUT = '600s' # timeout for repo CAR fetches
 ```

 ### why these values?
···
 | `/repos/add` | POST with `{"dids":["did:plc:..."]}` to add repos |
 | `/repos/remove` | POST with `{"dids":["did:plc:..."]}` to remove repos |

+**note:** the tap container has no `curl` — use `wget` instead.
+
 example: check repo status
 ```bash
-fly ssh console -a leaflet-search-tap -C "curl -s localhost:2480/info/did:plc:abc123"
+fly ssh console -a leaflet-search-tap -C "wget -qO- http://localhost:2480/info/did:plc:abc123"
 ```

 example: manually add a repo for backfill
 ```bash
-fly ssh console -a leaflet-search-tap -C 'curl -X POST -H "Content-Type: application/json" -d "{\"dids\":[\"did:plc:abc123\"]}" localhost:2480/repos/add'
+fly ssh console -a leaflet-search-tap -C 'wget -qO- --post-data="{\"dids\":[\"did:plc:abc123\"]}" --header="Content-Type: application/json" http://localhost:2480/repos/add'
 ```

 ## fly.io deployment