how pub search works#
a search engine for content published on the AT Protocol — the open network behind Bluesky. it indexes posts from publishing platforms like leaflet, pckt, offprint, greengale, and whitewind, all of which use the standard.site schema.
live at pub-search.waow.tech
the big picture#
ATProto firehose (every post, everywhere)
↓ filtered by collection
tap (firehose consumer)
↓ documents + publications
backend (zig)
├── turso (cloud sqlite — source of truth)
├── local sqlite replica (fast keyword search via FTS5)
├── voyage AI embeddings → turbopuffer (semantic search)
└── HTTP API
↓
static frontend (cloudflare pages)
content flows in one direction: the firehose broadcasts every AT Protocol event in real-time, tap filters for publishing-related records, and the backend indexes them. a reconciler periodically verifies documents still exist at their source, catching deletions missed while the tap was down.
how searching works#
there are three search modes, each using different technology:
keyword search#
uses SQLite FTS5 — a built-in full-text search engine. when a document is indexed, FTS5 builds an inverted index (a map from every word to every document containing it). queries use BM25 ranking — a standard relevance scoring algorithm that considers term frequency and document length. recent documents get a small boost.
this is not something custom — FTS5 is a well-established tool built into SQLite. the custom part is building the index (deciding what to index, how to tokenize, how to rank) and the query syntax (OR between terms for recall, prefix matching on the last word for a type-ahead feel).
keyword search runs against a local SQLite replica on the same machine as the backend, not over the network to the database. this keeps latency around ~9ms.
semantic search#
uses Voyage AI embeddings (voyage-4-lite, 1024 dimensions) to convert text into vectors — arrays of numbers that capture meaning. similar texts produce similar vectors, even if they don't share any words.
these vectors are stored in turbopuffer, a vector database optimized for approximate nearest-neighbor (ANN) search. when you search semantically, your query is embedded into a vector, and turbopuffer finds the documents whose vectors are closest.
this is how a search for "loosely about cooking" can find a post titled "my grandmother's kitchen" — keyword search would miss it entirely because the words don't overlap, but the meaning is close.
hybrid search#
runs both keyword and semantic in parallel, then merges results using reciprocal rank fusion (RRF, k=60). documents found by both methods rank highest. each result is annotated with its source: "keyword", "semantic", or "keyword+semantic".
the content extraction problem#
every platform on standard.site stores document content differently. this is the most fiddly part of the system.
- pckt, offprint, greengale provide a
textContentfield with pre-flattened plaintext — easy - leaflet omits
textContentto save record size. content lives nested insidecontent.pages[].blocks[].block.plaintext— requires block-by-block extraction - whitewind stores markdown directly in a
contentstring field
the backend handles all of this in the content extraction layer, producing a uniform plaintext blob for indexing regardless of source platform.
what's custom vs off-the-shelf#
| component | off-the-shelf | custom |
|---|---|---|
| full-text matching | SQLite FTS5 (BM25 ranking, inverted index) | query construction, tokenization rules, recency scoring |
| vector similarity | Voyage AI (embeddings), turbopuffer (ANN search) | hybrid fusion, result merging, snippet extraction |
| firehose sync | tap (from bluesky-social/indigo) | content extraction per platform, deduplication |
| data storage | Turso (cloud SQLite), local SQLite replica | schema design, sync logic, migration handling |
| frontend | Cloudflare Pages (hosting) | the entire UI and search experience |
the tools are popular and well-established. the assembly — wiring the firehose to content extraction to multi-modal search across heterogeneous publishing platforms — is very custom.
further reading#
- search-architecture.md — FTS5 details, scaling considerations, future options
- content-extraction.md — how content is extracted from each platform
- api.md — API endpoint reference
- tap.md — firehose consumer setup, debugging, memory tuning
- reconciliation.md — stale document detection and cleanup
- turso-hrana.md — Turso's HTTP protocol for database queries
- performance-saga.md — a debugging story about latency spikes