search architecture#
current state, rationale, and future options.
current: SQLite FTS5#
keyword search uses SQLite's FTS5 on a local read replica, synced from Turso (the source of truth).
why FTS5 works for now#
- scale: ~11k documents. FTS5 handles this trivially.
- latency: keyword p50 ~9ms (local SQLite replica), semantic p50 ~345ms (voyage + turbopuffer), hybrid p50 ~360ms.
- cost: $0. included with Turso free tier.
- ops: zero. no separate service to run.
- simplicity: Turso as source of truth, local SQLite read replica for FTS queries.
how it works#
```
user query: "crypto-casino"
        ↓
buildFtsQuery(): "crypto OR casino*"
        ↓
FTS5 MATCH query with BM25 + recency decay (on local SQLite replica)
        ↓
results with snippet()
```
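the pipeline can be exercised end-to-end with any SQLite build that ships FTS5 (CPython's bundled SQLite usually does). the schema below is illustrative, not the real one from search.zig, and `bm25()` is used explicitly since it is what the default `rank` column reports:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# illustrative schema; the real tables live in backend/src/server/search.zig
conn.execute("""
    CREATE VIRTUAL TABLE documents_fts USING fts5(
        title, content, days_old UNINDEXED, tokenize = 'unicode61'
    )
""")
conn.executemany(
    "INSERT INTO documents_fts (title, content, days_old) VALUES (?, ?, ?)",
    [
        ("old crypto post", "coverage of a crypto casino scheme", 300),
        ("fresh crypto post", "coverage of a crypto casino scheme", 1),
        ("gardening", "how to grow tomatoes", 2),
    ],
)

# MATCH + BM25 ranking nudged by the recency decay,
# with snippet() highlighting hits in the content column
rows = conn.execute(
    """
    SELECT title, snippet(documents_fts, 1, '[', ']', '...', 8)
    FROM documents_fts
    WHERE documents_fts MATCH ?
    ORDER BY bm25(documents_fts) + (days_old / 30.0)
    """,
    ("crypto OR casino*",),
).fetchall()
print(rows[0][0])
```

with near-identical BM25 scores, the decay term pushes the fresh document above the 300-day-old one.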
key decisions (see search-syntax.md for the user-facing reference):
- OR between terms for better recall (deliberate, see commit 35ad4b5)
- quoted phrases passed through to FTS5 for exact matching
- prefix match on last word for type-ahead feel (bare words only, not phrases)
- unicode61 tokenizer splits on non-alphanumeric (we match this in buildFtsQuery)
- recency decay boosts recent docs: `ORDER BY rank + (days_old / 30)`
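a rough Python sketch of that query construction (the real buildFtsQuery lives in search.zig; the details here are approximations of the decisions listed above):

```python
import re

def build_fts_query(user_query: str) -> str:
    """Approximate the FTS5 query construction described above:
    quoted phrases pass through verbatim, bare words are split on
    non-alphanumerics (mirroring unicode61), terms are OR-ed for
    recall, and the last bare word gets a * for prefix matching."""
    # pull out quoted phrases first so they survive intact
    phrases = re.findall(r'"[^"]+"', user_query)
    rest = re.sub(r'"[^"]+"', " ", user_query)
    words = re.findall(r"[0-9A-Za-z]+", rest)
    if words:
        words[-1] += "*"  # prefix match on the final word for type-ahead feel
    return " OR ".join(phrases + words)

print(build_fts_query("crypto-casino"))  # crypto OR casino*
```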
what's coupled to FTS5#
all in backend/src/server/search.zig:
| component | FTS5-specific |
|---|---|
| 14 query definitions | MATCH, snippet(), ORDER BY rank |
| buildFtsQuery() | constructs FTS5 syntax |
| schema | documents_fts, publications_fts virtual tables |
what's already decoupled#
- result types (`SearchResultJson`, `Doc`, `Pub`)
- similarity search (uses voyage-4-lite embeddings + turbopuffer ANN, not FTS5)
- hybrid mode (merges keyword + semantic via reciprocal rank fusion, k=60)
- search-time dedup by `(did, title)`, which collapses cross-platform duplicates
- ingestion-time dedup by content hash, which prevents duplicates at write time
- caching logic
- HTTP layer (server/mod.zig just calls `search()`)
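reciprocal rank fusion with k=60 is simple enough to show in full; a minimal sketch, with result ids standing in for the real result types:

```python
def rrf_merge(keyword_ids: list[str], semantic_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists via reciprocal rank fusion: each result
    scores 1 / (k + rank) per list it appears in, and the merged order
    is by descending combined score."""
    scores: dict[str, float] = {}
    for ranked in (keyword_ids, semantic_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge(["a", "b", "c"], ["b", "d"])
print(merged[0])  # "b" wins: it appears in both lists
```

k=60 damps the difference between adjacent ranks, so appearing in both lists matters more than being first in one.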
known limitations#
- no typo tolerance: "leafet" won't find "leaflet"
- no relevance tuning: can't boost title vs content
- single writer: SQLite write lock
- no horizontal scaling: single database
these aren't problems at current scale.
future: if we need to scale#
when to consider switching#
- search latency consistently >100ms
- write contention from indexing
- need typo tolerance or better relevance
- millions of documents
recommended: Elasticsearch#
Elasticsearch is the battle-tested choice for production search:
- proven at massive scale (Wikipedia, GitHub, Stack Overflow)
- rich query DSL, analyzers, aggregations
- typo tolerance via fuzzy matching
- horizontal scaling built-in
- extensive tooling and community
trade-offs:
- operational complexity (JVM, cluster management)
- resource hungry (~2GB+ RAM minimum)
- cost: $50-500/month depending on scale
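for a feel of what the typo tolerance and relevance tuning look like, here is an Elasticsearch query body sketched as a Python dict; the field names are hypothetical, not this project's schema:

```python
import json

# hypothetical ES query: "leafet" would still match "leaflet" via
# edit-distance fuzziness, with title matches weighted over content
query = {
    "query": {
        "multi_match": {
            "query": "leafet",
            "fields": ["title^2", "content"],  # ^2 boosts title relevance
            "fuzziness": "AUTO",               # typo tolerance by edit distance
        }
    }
}
print(json.dumps(query, indent=2))
```

both capabilities address limitations listed above that FTS5 lacks out of the box.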
alternatives considered#
Meilisearch/Typesense: simpler, lighter, great defaults. good for straightforward search but less proven at scale. would work fine for this use case but Elasticsearch has more headroom.
Algolia: fully managed, excellent but expensive. makes sense if you want zero ops.
PostgreSQL full-text: if already on Postgres. not as good as FTS5 or Elasticsearch but one less system.
migration path#
- keep Turso as source of truth
- add Elasticsearch as search index
- sync documents to ES on write (async)
- point `/search` at Elasticsearch
- keep `/similar` on turbopuffer (vector search)
the search() function would change from SQL queries to ES client calls. result types stay the same. HTTP layer unchanged.
estimated effort: 1-2 days to swap search backend.
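the swap stays mechanical because of that decoupling; a shape sketch in Python (the real code is Zig, and these names and stub results are illustrative only):

```python
from typing import Protocol

class SearchBackend(Protocol):
    def search(self, query: str, limit: int) -> list[dict]: ...

class Fts5Backend:
    """Current: FTS5 MATCH queries against the local SQLite replica."""
    def search(self, query: str, limit: int) -> list[dict]:
        return [{"id": "doc1", "source": "fts5"}][:limit]  # stand-in result

class ElasticsearchBackend:
    """Future: same signature, ES client calls instead of SQL."""
    def search(self, query: str, limit: int) -> list[dict]:
        return [{"id": "doc1", "source": "es"}][:limit]  # stand-in result

def handle_search(backend: SearchBackend, query: str) -> list[dict]:
    # the HTTP layer only depends on the shared result shape,
    # so it is unchanged by the backend swap
    return backend.search(query, limit=10)
```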
vector search scaling#
similarity search currently uses voyage-4-lite embeddings (1024 dims) with turbopuffer ANN index. this handles ~11k docs well. at larger scale:
- Elasticsearch: has vector search (dense_vector + kNN)
- dedicated vector DB: Qdrant, Pinecone, Weaviate
- pgvector: if on Postgres
could consolidate text + vector in Elasticsearch, or keep them separate. turbopuffer scales well so may not need to change.
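for intuition, this is what the ANN index approximates, in exact brute-force form; at ~11k docs even the full scan is cheap (a sketch, not turbopuffer's actual algorithm):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 5) -> list[str]:
    """Exact nearest neighbors by cosine similarity; an ANN index
    approximates this ranking without scanning every document."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]

docs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(top_k([1.0, 0.1], docs, k=2))  # ['a', 'b']
```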
summary#
| scale | recommendation |
|---|---|
| <10k docs | keep FTS5 (current) |
| 10k-100k docs | still probably fine, monitor latency |
| 100k+ docs | consider Elasticsearch |
| millions + sub-ms latency | Elasticsearch cluster + caching layer |
we're in the "keep FTS5" zone. the code is structured to swap later if needed.