docs: update README, API docs, and architecture for current state

reflects voyage-4-lite + turbopuffer, hybrid search mode, whitewind
platform, content dedup, format=v2, and corrected code paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+81 -34
+7 -1
CLAUDE.md
···
  - **db**: Turso (source of truth) + local SQLite read replica (FTS queries)

  ## platforms
- - leaflet, pckt, offprint, greengale: known platforms (detected via basePath)
+ - leaflet, pckt, offprint, greengale, whitewind: known platforms
+ - leaflet/pckt/offprint/greengale detected via basePath; whitewind via `com.whtwnd.*` collection
  - other: site.standard.* documents not from a known platform

  ## search ranking
···

  ## zig dependencies
  - update a dependency hash: `zig fetch --save <url>` (fetches and updates build.zig.zon automatically)
+
+ ## MCP server
+ - hosted: `claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'`
+ - local dev: `cd mcp && uv run pytest` for tests
+ - deployed on fastmcp.app

  ## common tasks
  - check indexing: `curl -s https://leaflet-search-backend.fly.dev/api/dashboard | jq`
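The platform rule added to CLAUDE.md above can be sketched in Python. This is an illustration only, not the backend's Zig code: `detect_platform` is a hypothetical helper, and matching a platform name against a dot-separated label of `basePath` is an assumption (the API docs show `gyst.leaflet.pub` mapping to `leaflet`, but the real matching rule is not spelled out in this commit).

```python
from typing import Optional

# Platforms that the notes say are detected via basePath.
BASEPATH_PLATFORMS = ("leaflet", "pckt", "offprint", "greengale")


def detect_platform(collection: str, base_path: Optional[str]) -> str:
    """Hypothetical sketch: classify a record as a known platform or "other"."""
    # whitewind is identified by its record collection, not by basePath.
    if collection.startswith("com.whtwnd."):
        return "whitewind"
    # Assumption: a known platform's name appears as one label of the host.
    host = (base_path or "").lower().split("/")[0]
    labels = host.split(".")
    for name in BASEPATH_PLATFORMS:
        if name in labels:
            return name
    # site.standard.* documents not from a known platform.
    return "other"
```

The sketch leaves out the `content.$type` fallback for custom domains that docs/content-extraction.md mentions, since its exact values are not shown here.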
+18 -15
README.md
···

  ## how it works

- 1. **[tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)** syncs content from ATProto firehose (signals on `site.standard.document`, filters `pub.leaflet.*` + `site.standard.*`)
- 2. **backend** indexes content into SQLite FTS5 via [Turso](https://turso.tech), serves search API
+ 1. **[tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)** syncs content from ATProto firehose (`pub.leaflet.*`, `site.standard.*`, `com.whtwnd.*`)
+ 2. **backend** indexes content into SQLite FTS5 via [Turso](https://turso.tech), serves search API with keyword, semantic, and hybrid modes
  3. **site** static frontend on Cloudflare Pages
+ 4. **mcp** server for AI agents (Claude Code, etc.)

  ## MCP server

···
  ## api

  ```
- GET /search?q=<query>&tag=<tag>&platform=<platform>&since=<date> # full-text search
- GET /similar?uri=<at-uri> # find similar documents
- GET /tags # list all tags with counts
- GET /popular # popular search queries
- GET /stats # counts + request latency (p50/p95)
- GET /health # health check
+ GET /search?q=<query>&mode=keyword|semantic|hybrid&platform=<platform>&tag=<tag>&since=<date>&format=v2
+ GET /similar?uri=<at-uri>&format=v2
+ GET /tags
+ GET /popular
+ GET /stats
+ GET /health
  ```

- search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). each result includes a `platform` field (leaflet, pckt, offprint, greengale, or other). tag and platform filtering apply to documents only.
+ search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). each result includes a `platform` field (leaflet, pckt, offprint, greengale, whitewind, or other). use `format=v2` for a wrapped response with `total`, `offset`, and `results` fields.

- **ranking**: results use hybrid BM25 + recency scoring. text relevance is primary, but recent documents get a boost (~1 point per 30 days). the `since` parameter filters to documents created after the given ISO date (e.g., `since=2025-01-01`).
+ **modes**: `keyword` (default) uses FTS5 with BM25 + recency scoring. `semantic` uses voyage embeddings + [turbopuffer](https://turbopuffer.com) ANN. `hybrid` merges both via reciprocal rank fusion.

- `/similar` uses [Voyage AI](https://voyageai.com) embeddings with brute-force cosine similarity (~0.15s for 3500 docs).
+ **ranking**: keyword results use hybrid BM25 + recency scoring. text relevance is primary, but recent documents get a boost (~1 point per 30 days). the `since` parameter filters to documents created after the given ISO date (e.g., `since=2025-01-01`).
+
+ `/similar` uses [Voyage AI](https://voyageai.com) embeddings with [turbopuffer](https://turbopuffer.com) ANN search.

  ## configuration

···
  ## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)

  - [Fly.io](https://fly.io) hosts [Zig](https://ziglang.org) search API and content indexing
- - [Turso](https://turso.tech) cloud SQLite with [Voyage AI](https://voyageai.com) vector support
+ - [Turso](https://turso.tech) cloud SQLite (source of truth) + local read replica (FTS queries)
+ - [turbopuffer](https://turbopuffer.com) ANN vector search
+ - [Voyage AI](https://voyageai.com) embeddings (voyage-4-lite, 1024 dims)
  - [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
  - [Cloudflare Pages](https://pages.cloudflare.com) static frontend

  ## embeddings

- documents are embedded using Voyage AI's `voyage-3-lite` model (512 dimensions). the backend automatically generates embeddings for new documents via a background worker - no manual backfill needed.
-
- **note:** we use brute-force cosine similarity instead of a vector index. Turso's DiskANN index has ~60s write latency per row, making it impractical for incremental updates. brute-force on 3500 vectors runs in ~0.15s which is fine for this scale.
+ documents are embedded using Voyage AI's `voyage-4-lite` model (1024 dimensions). the backend automatically generates embeddings for new documents via a background worker — no manual backfill needed. similarity search uses turbopuffer's ANN index for fast nearest-neighbor queries across ~25k documents.
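As a rough illustration of the updated `/search` signature in the README diff above, the following Python sketch builds a query URL from the documented parameters. `build_search_url` is a hypothetical helper. The host is taken from the `curl` example in CLAUDE.md, and serving `/search` directly under that host is an assumption.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: the search API is served at the root of this host.
BASE_URL = "https://leaflet-search-backend.fly.dev"
VALID_MODES = ("keyword", "semantic", "hybrid")


def build_search_url(
    query: str,
    mode: str = "keyword",
    platform: Optional[str] = None,
    tag: Optional[str] = None,
    since: Optional[str] = None,
    v2: bool = True,
) -> str:
    """Hypothetical helper: assemble a /search URL from the documented params."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    params = {"q": query, "mode": mode}
    # Optional filters are only sent when set.
    if platform:
        params["platform"] = platform
    if tag:
        params["tag"] = tag
    if since:
        params["since"] = since  # ISO date, e.g. 2025-01-01
    if v2:
        params["format"] = "v2"  # wrapped response with total/offset/results
    return f"{BASE_URL}/search?{urlencode(params)}"
```

Note that docs/api.md states that `semantic` and `hybrid` ignore the `tag` and `since` filters, so passing them with those modes has no effect.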
+18 -4
docs/api.md
···
  |-------|------|----------|-------------|
  | `q` | string | no* | search query (titles and content) |
  | `tag` | string | no | filter by tag (documents only) |
- | `platform` | string | no | filter by platform: `leaflet`, `pckt`, `offprint`, `greengale`, `other` |
+ | `platform` | string | no | filter by platform: `leaflet`, `pckt`, `offprint`, `greengale`, `whitewind`, `other` |
  | `since` | string | no | ISO date, filter to documents created after |
- | `mode` | string | no | `keyword` (default) or `semantic`. semantic uses vector similarity via voyage embeddings + turbopuffer ANN. ignores `tag` and `since` filters. |
+ | `mode` | string | no | `keyword` (default), `semantic`, or `hybrid`. semantic uses voyage-4-lite embeddings + turbopuffer ANN. hybrid merges keyword + semantic via reciprocal rank fusion. semantic/hybrid ignore `tag` and `since` filters. |
+ | `format` | string | no | `v2` wraps response in `{"results": [...], "total": N, "offset": N}` |
+ | `limit` | int | no | max results to return (default 20) |
+ | `offset` | int | no | pagination offset |

  *at least one of `q` or `tag` required

···
      "rkey": "abc123",
      "basePath": "gyst.leaflet.pub",
      "platform": "leaflet",
-     "path": "/001"
+     "path": "/001",
+     "source": "keyword",
+     "score": 0.85
    }
  ]
  ```

+ with `format=v2`:
+ ```json
+ {
+   "results": [ /* same as above */ ],
+   "total": 89,
+   "offset": 0
+ }
+ ```
+
  **result types:**
  - `article`: document in a publication
  - `looseleaf`: standalone document (no publication)
···
  GET /similar?uri=<at-uri>
  ```

- find semantically similar documents using vector similarity (voyage-3-lite embeddings).
+ find semantically similar documents using vector similarity (voyage-4-lite embeddings + turbopuffer ANN).

  **parameters:**
  | param | type | required | description |
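Because the response is a bare array by default and a wrapped object with `format=v2`, a client may want to handle both shapes. The Python sketch below shows one way to do that and to step through pages with `limit` and `offset`. `normalize_response` and `next_offset` are hypothetical names, and the sample payloads are trimmed versions of the examples in the docs.

```python
import json
from typing import Any, List, Optional, Tuple


def normalize_response(payload: Any, offset: int = 0) -> Tuple[List[dict], Optional[int], int]:
    """Return (results, total, offset) for either documented response shape."""
    if isinstance(payload, dict):
        # format=v2: {"results": [...], "total": N, "offset": N}
        return payload["results"], payload["total"], payload["offset"]
    # Default format: a bare list of results, so the total is unknown.
    return payload, None, offset


def next_offset(total: int, offset: int, page_len: int) -> Optional[int]:
    """Offset of the next page, or None when the last page has been reached."""
    following = offset + page_len
    if page_len == 0 or following >= total:
        return None
    return following


# Sample payloads shaped like the documented examples (fields trimmed).
legacy = json.loads('[{"rkey": "abc123", "platform": "leaflet", "score": 0.85}]')
wrapped = json.loads('{"results": [{"rkey": "abc123"}], "total": 89, "offset": 0}')

results, total, offset = normalize_response(wrapped)
following = next_offset(total, offset, page_len=20)  # 20 is the documented default limit
```

With the default format there is no `total`, so a client in that mode can only detect the last page by receiving fewer results than `limit`.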
+15 -3
docs/content-extraction.md
···

  3. if neither matches → `other`

+ ## whitewind
+
+ [WhiteWind](https://whtwnd.com) (`com.whtwnd.blog.entry`) stores content as markdown in the `content` field (a string, not a blocks structure). extraction is trivial — just use the string directly. author-only posts (`visibility: "author"`) are skipped.
+
+ ## deduplication
+
+ two layers prevent duplicate results:
+
+ 1. **ingestion-time**: content hash (wyhash of `title + \x00 + content`) per author. if the same author publishes identical content across platforms (different rkeys), only the first is indexed.
+ 2. **search-time**: `(did, title)` dedup collapses any remaining duplicates in results (e.g. records indexed before content hash was added).
+
  ## summary

  - **pckt/offprint/greengale**: use `textContent` directly
  - **leaflet**: extract from `content.pages[].blocks[].block.plaintext`
- - **deduplication**: `ON CONFLICT` on `(did, rkey)` or `uri`
+ - **whitewind**: use `content` string directly (markdown)
+ - **deduplication**: content hash at ingestion + `(did, title)` at search time
  - **platform**: infer from basePath, fallback to content.$type for custom domains

  ## code references

- - `backend/src/extractor.zig` - content extraction logic, content_type field
- - `backend/src/indexer.zig:99-118` - platform detection from basePath + content_type
+ - `backend/src/ingest/extractor.zig` - content extraction logic, content_type field
+ - `backend/src/ingest/indexer.zig` - platform detection from basePath + content_type, content hash dedup
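The two dedup layers described in the new `## deduplication` section can be sketched as follows. This is a Python illustration, not the Zig implementation. `hashlib.blake2b` stands in for wyhash, which is not in the Python standard library, and the record fields (`did`, `title`, `content`) and function names are assumptions.

```python
import hashlib
from typing import Dict, Iterable, List, Set, Tuple


def content_key(did: str, title: str, content: str) -> Tuple[str, str]:
    """Per-author key over title + NUL + content (blake2b stands in for wyhash)."""
    payload = f"{title}\x00{content}".encode("utf-8")
    return did, hashlib.blake2b(payload, digest_size=8).hexdigest()


def dedup_at_ingestion(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Layer 1: index only the first record per (author, content hash)."""
    seen: Set[Tuple[str, str]] = set()
    kept = []
    for record in records:
        key = content_key(record["did"], record["title"], record["content"])
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


def dedup_at_search(results: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Layer 2: collapse results that share (did, title), keeping the first."""
    seen: Set[Tuple[str, str]] = set()
    kept = []
    for result in results:
        key = (result["did"], result["title"])
        if key not in seen:
            seen.add(key)
            kept.append(result)
    return kept
```

The NUL separator keeps `("ab", "c")` and `("a", "bc")` from hashing to the same key. The search-time layer is looser on purpose: it matches on title alone, so it also catches older records that were indexed before the content hash existed.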
+9 -6
docs/search-architecture.md
···

  ### why FTS5 works for now

- - **scale**: ~3500 documents. FTS5 handles this trivially.
- - **latency**: 10-50ms for search queries. fine for our use case.
+ - **scale**: ~25k documents. FTS5 handles this trivially.
+ - **latency**: keyword p50 ~9ms (local SQLite replica), semantic p50 ~345ms (voyage + turbopuffer), hybrid p50 ~360ms.
  - **cost**: $0. included with Turso free tier.
  - **ops**: zero. no separate service to run.
- - **simplicity**: one database for everything (docs, FTS, vectors, cache).
+ - **simplicity**: Turso as source of truth, local SQLite read replica for FTS queries.

  ### how it works

···
  ### what's already decoupled

  - result types (`SearchResultJson`, `Doc`, `Pub`)
- - similarity search (uses `vector_distance_cos`, not FTS5)
+ - similarity search (uses voyage-4-lite embeddings + turbopuffer ANN, not FTS5)
+ - hybrid mode (merges keyword + semantic via reciprocal rank fusion, k=60)
+ - search-time dedup by `(did, title)` — collapses cross-platform duplicates
+ - ingestion-time dedup by content hash — prevents duplicates at write time
  - caching logic
  - HTTP layer (server.zig just calls `search()`)

···

  ### vector search scaling

- similarity search currently uses brute-force `vector_distance_cos` with caching. at scale:
+ similarity search currently uses voyage-4-lite embeddings (1024 dims) with turbopuffer ANN index. this handles ~25k docs well. at larger scale:

  - **Elasticsearch**: has vector search (dense_vector + kNN)
  - **dedicated vector DB**: Qdrant, Pinecone, Weaviate
  - **pgvector**: if on Postgres

- could consolidate text + vector in Elasticsearch, or keep them separate.
+ could consolidate text + vector in Elasticsearch, or keep them separate. turbopuffer scales well so may not need to change.

  ## summary

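The hybrid mode above merges the keyword and semantic rankings with reciprocal rank fusion at k=60. A minimal Python sketch of the standard RRF formula follows. It is not taken from the backend: the 1-based ranks, the tie-break on URI, and the function name are assumptions.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document URIs: score(d) = sum of 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        # Ranks are 1-based here; a document missing from a list adds nothing.
        for rank, uri in enumerate(ranking, start=1):
            scores[uri] = scores.get(uri, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by URI to keep the order stable.
    return sorted(scores, key=lambda uri: (-scores[uri], uri))


keyword_hits = ["at://a", "at://b", "at://c"]
semantic_hits = ["at://b", "at://d", "at://a"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# fused == ["at://b", "at://a", "at://d", "at://c"]
```

RRF only uses rank positions, so BM25 scores and vector distances never need to be put on a common scale. Documents found by both modes rise to the top, which is why `at://b` and `at://a` outrank the single-list hits here.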
+14 -5
mcp/README.md
···
  # pub search MCP

- MCP server for [pub search](https://pub-search.waow.tech) - search ATProto publishing platforms (Leaflet, pckt, standard.site).
+ MCP server for [pub search](https://pub-search.waow.tech) - search ATProto publishing platforms (Leaflet, pckt, Offprint, Greengale, WhiteWind, and others using standard.site).

  ## usage

···
  claude mcp add pub-search -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' pub-search
  ```

- ## workflow
+ ## tools
+
+ | tool | description |
+ |------|-------------|
+ | `search` | search documents by query, tag, platform, or date |
+ | `get_document` | retrieve full content by AT-URI |
+ | `find_similar` | find semantically similar documents |
+ | `get_tags` | list all tags with document counts |
+ | `get_popular` | see popular search queries |
+ | `get_stats` | index statistics (document/publication counts) |

- 1. **search** for documents by query or tag
- 2. **get_document** to retrieve full content by AT-URI
+ ## workflow

  ```
- search("space station") → [{uri: "at://...", title: "...", snippet: "..."}]
+ search("space station") → [{uri: "at://...", title: "...", snippet: "...", url: "..."}]
  get_document("at://...") → {title: "...", content: "full article text..."}
+ find_similar("at://...") → [{uri: "at://...", title: "...", snippet: "..."}]
  ```

  ## development