how pub search works#

a search engine for content published on the AT Protocol — the open network behind Bluesky. it indexes posts from publishing platforms like leaflet, pckt, offprint, greengale, and whitewind, all of which use the standard.site schema.

live at pub-search.waow.tech

the big picture#

ATProto firehose (every post, everywhere)
     ↓ filtered by collection
tap (firehose consumer)
     ↓ documents + publications
backend (zig)
     ├── turso (cloud sqlite — source of truth)
     ├── local sqlite replica (fast keyword search via FTS5)
     ├── voyage AI embeddings → turbopuffer (semantic search)
     └── HTTP API
           ↓
static frontend (cloudflare pages)

content flows in one direction: the firehose broadcasts every AT Protocol event in real-time, tap filters for publishing-related records, and the backend indexes them. a reconciler periodically verifies documents still exist at their source, catching deletions missed while the tap was down.

how searching works#

there are three search modes, each using different technology:

keyword search#

uses SQLite FTS5 — a built-in full-text search engine. when a document is indexed, FTS5 builds an inverted index (a map from every word to every document containing it). queries use BM25 ranking — a standard relevance scoring algorithm that considers term frequency and document length. recent documents get a small boost.

this is not something custom — FTS5 is a well-established tool built into SQLite. the custom part is building the index (deciding what to index, how to tokenize, how to rank) and the query syntax (OR between terms for recall, prefix matching on the last word for a type-ahead feel).

keyword search runs against a local SQLite replica on the same machine as the backend, not over the network to the database. this keeps latency around ~9ms.

semantic search#

uses Voyage AI embeddings (voyage-4-lite, 1024 dimensions) to convert text into vectors — arrays of numbers that capture meaning. similar texts produce similar vectors, even if they don't share any words.

these vectors are stored in turbopuffer, a vector database optimized for approximate nearest-neighbor (ANN) search. when you search semantically, your query is embedded into a vector, and turbopuffer finds the documents whose vectors are closest.

this is how a search for "loosely about cooking" can find a post titled "my grandmother's kitchen" — keyword search would miss it entirely because the words don't overlap, but the meaning is close.

hybrid search#

runs both keyword and semantic in parallel, then merges results using reciprocal rank fusion (RRF, k=60). documents found by both methods rank highest. each result is annotated with its source: "keyword", "semantic", or "keyword+semantic".

the content extraction problem#

every platform on standard.site stores document content differently. this is the most fiddly part of the system.

pckt, offprint, greengale provide a textContent field with pre-flattened plaintext — easy
leaflet omits textContent to save record size. content lives nested inside content.pages[].blocks[].block.plaintext — requires block-by-block extraction
whitewind stores markdown directly in a content string field

the backend handles all of this in the content extraction layer, producing a uniform plaintext blob for indexing regardless of source platform.

what's custom vs off-the-shelf#

component	off-the-shelf	custom
full-text matching	SQLite FTS5 (BM25 ranking, inverted index)	query construction, tokenization rules, recency scoring
vector similarity	Voyage AI (embeddings), turbopuffer (ANN search)	hybrid fusion, result merging, snippet extraction
firehose sync	tap (from bluesky-social/indigo)	content extraction per platform, deduplication
data storage	Turso (cloud SQLite), local SQLite replica	schema design, sync logic, migration handling
frontend	Cloudflare Pages (hosting)	the entire UI and search experience

the tools are popular and well-established. the assembly — wiring the firehose to content extraction to multi-modal search across heterogeneous publishing platforms — is very custom.