# pub search

by [@zzstoatzz.io](https://bsky.app/profile/zzstoatzz.io)

search ATProto publishing platforms ([leaflet](https://leaflet.pub), [pckt](https://pckt.blog), [offprint](https://offprint.app), [greengale](https://greengale.app), and others using [standard.site](https://standard.site)).

**live:** [pub-search.waow.tech](https://pub-search.waow.tech)

> formerly "leaflet-search" - generalized to support multiple publishing platforms

## how it works

1. **[tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)** syncs content from ATProto firehose (`pub.leaflet.*`, `site.standard.*`, `com.whtwnd.*`)
2. **backend** indexes content into SQLite FTS5 via [Turso](https://turso.tech), serves search API with keyword, semantic, and hybrid modes
3. **site** static frontend on Cloudflare Pages
4. **mcp** server for AI agents (Claude Code, etc.)

## MCP server

search is also exposed as an MCP server for AI agents like Claude Code:

```bash
claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'
```

see [mcp/README.md](mcp/README.md) for local setup and usage details.

## api

```
GET /search?q=<query>&mode=keyword|semantic|hybrid&platform=<platform>&tag=<tag>&since=<date>&author=<did|handle>&format=v2
GET /similar?uri=<at-uri>&format=v2
GET /tags
GET /popular
GET /stats
GET /health
```

search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). each result includes a `platform` field (leaflet, pckt, offprint, greengale, whitewind, or other). use `format=v2` for a wrapped response with `total`, `hasMore`, and `results` fields.

**modes**: `keyword` (default) uses FTS5 with BM25 + recency scoring. `semantic` uses voyage embeddings + [turbopuffer](https://turbopuffer.com) ANN. `hybrid` merges both via reciprocal rank fusion.
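
to make the `hybrid` mode concrete, here is a minimal sketch of reciprocal rank fusion in Python. it is an illustration only, not the backend's code (which is written in Zig): the function name, the document ids, and the constant `k=60` (a commonly used default) are all assumptions, and the real backend may weight or tune things differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1). k=60 is an assumed default, not a value taken
    from pub-search.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the keyword and semantic searches.
keyword_hits = ["doc-a", "doc-b", "doc-c"]
semantic_hits = ["doc-b", "doc-d", "doc-a"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

documents that rank well in both lists (`doc-b`, `doc-a`) rise above documents that appear in only one, without needing to put BM25 scores and vector distances on a common scale.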

**ranking**: keyword results combine BM25 text relevance with a recency score. text relevance is primary, but recent documents get a boost (~1 point per 30 days). the `since` parameter filters to documents created after the given ISO date (e.g., `since=2025-01-01`).

`/similar` uses [Voyage AI](https://voyageai.com) embeddings with [turbopuffer](https://turbopuffer.com) ANN search.

## configuration

the backend is fully configurable via environment variables:

| variable | default | description |
|----------|---------|-------------|
| `APP_NAME` | `leaflet-search` | name shown in startup logs |
| `DASHBOARD_URL` | `https://pub-search.waow.tech/dashboard.html` | redirect target for `/dashboard` |
| `TAP_HOST` | `leaflet-search-tap.fly.dev` | tap websocket host |
| `TAP_PORT` | `443` | tap websocket port |
| `PORT` | `3000` | HTTP server port |
| `TURSO_URL` | - | Turso database URL (required) |
| `TURSO_TOKEN` | - | Turso auth token (required) |
| `VOYAGE_API_KEY` | - | Voyage AI API key (for embeddings) |

the backend indexes multiple ATProto platforms - currently `pub.leaflet.*` and `site.standard.*` collections. platform is stored per-document and returned in search results.

## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)

- [Fly.io](https://fly.io) hosts [Zig](https://ziglang.org) search API and content indexing
- [Turso](https://turso.tech) cloud SQLite (source of truth) + local read replica (FTS queries)
- [turbopuffer](https://turbopuffer.com) ANN vector search
- [Voyage AI](https://voyageai.com) embeddings (voyage-4-lite, 1024 dims)
- [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
- [Cloudflare Pages](https://pages.cloudflare.com) static frontend

## embeddings

documents are embedded using Voyage AI's `voyage-4-lite` model (1024 dimensions). the backend automatically generates embeddings for new documents via a background worker, so no manual backfill is needed. similarity search uses turbopuffer's ANN index for fast nearest-neighbor queries.
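
for intuition about what the ANN index approximates, here is a small exact (brute-force) nearest-neighbor sketch in Python. it is not how turbopuffer works internally and does not call its API; the 3-dimensional vectors, the document ids, and the choice of cosine similarity as the metric are assumptions made for illustration (real embeddings here have 1024 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, top_k=2):
    """Return the ids of the top_k vectors most similar to the query.

    This scans every vector, which is exact but O(n); an ANN index
    trades a little accuracy for much faster lookups.
    """
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical toy embeddings keyed by made-up document ids.
index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}

neighbors = nearest_neighbors([1.0, 0.0, 0.0], index)
print(neighbors)
```

a `/similar`-style lookup would presumably also drop the query document itself from the results; that step is left out here to keep the sketch short.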