# performance saga (feb 2026)

## what happened

attempted to add a vector similarity search feature using voyage-3-lite embeddings (512 dims) stored in turso with a DiskANN index. the new embedding model produced vectors with a different shape than what was stored, turso performance degraded badly, and multiple attempts to back out the changes failed to restore performance.

## the problems we found and fixed

### 1. vector similarity saturating turso

the `/similar` endpoint was doing brute-force `vector_distance_cos` queries on turso. each call took 10-15 seconds and blocked turso for all other queries. this meant keyword search (which falls back to turso when local isn't ready) took 30-57 seconds.

**fix**: neutered `/similar` to return `[]` immediately. removed all vector similarity code from search.zig (~150 lines). commit `d179d20`.

**result**: search dropped from 19.6s avg / 185s max → sub-second.

### 2. sqlite read/write lock contention

a single sqlite connection (`conn`) was shared between search queries and sync writes, protected by a mutex. during sync batch writes, search queries blocked waiting for the lock.

**fix**: added a separate read-only connection (`read_conn`) opened with `ReadOnly` flags. search queries use `read_conn` (no mutex needed). writes still use `conn` with mutex. WAL mode guarantees readers never block on writers. commit `1268461`.

**result**: `db.local.query` consistently 0-0.4ms.

### 3. missing instrumentation

had no visibility into the HTTP connection lifecycle. couldn't tell where time was spent between client request and server processing.

**fix**: added `http.request` span wrapping the full request lifecycle, `queue_ms` tracking pool queueing time, `receive_ms` tracking `receiveHead()` latency, and slow-receive warnings.

### 4. missing publication_uri index

the base_path FTS query joins `documents → publications → publications_fts`.
without an index on `documents.publication_uri`, sqlite did a full table scan (5752 rows) for each publication match. for "test" (many matches), this caused ~600ms delays during iteration.

**fix**: added `CREATE INDEX IF NOT EXISTS idx_documents_publication_uri ON documents(publication_uri)` in LocalDb.zig createSchema().

**result**: "test" query dropped from ~1.5s → ~100ms.

### 5. HTTP write buffer too small

`HTTP_BUF_SIZE` was 8192 (8KB). for large responses (32KB for "test"), zig's HTTP server needed multiple flush cycles to write the response, each involving a syscall.

**fix**: increased `HTTP_BUF_SIZE` to 65536 (64KB). a 32KB response now fits in a single flush.

### 6. sqlite read connection tuning

added `PRAGMA mmap_size=268435456` (256MB) and `PRAGMA cache_size=-20000` (20MB) on the read connection so FTS5 index pages stay in memory.

## what we know now (instrumented)

added `search.iterate.*` spans around each iteration loop in `searchLocal()`. this revealed that server-side processing is fast:

- `db.local.query` (prepare): 0.2ms per query
- `search.iterate.docs_fts` (40 rows + snippet + JSON): **2ms**
- `search.iterate.base_path`: 0.3ms
- `search.iterate.pubs_fts`: 0.3ms
- total actual work: **~3ms**

but `http.search` intermittently shows 1-3 seconds. the gap occurs between two consecutive lines of code (after `local.query()` returns, before the iteration span starts). there is literally nothing in between — no I/O, no allocations, no function calls.

### what we ruled out

- **snippet() overhead**: replaced with `substr(d.content, 1, 200)` — spikes still occurred. reverted. snippet is not the cause.
- **iteration overhead**: spans prove all 3 iteration loops complete in ~3ms total
- **logfire span.end() blocking**: logfire-zig uses `BatchSpanProcessor` (async, 500ms interval). span.end() just queues — no synchronous I/O.
- **query preparation**: db.local.query spans show 0.2ms consistently

### what it actually is

**CPU starvation on shared-cpu-1x Fly VM.** the thread gets preempted by the hypervisor for 1-2 seconds between two consecutive lines of code. the VM runs:

- 16 HTTP worker threads
- sync thread (turso → local batch writes)
- tap consumer thread (firehose processing)
- stats buffer thread (periodic turso flush)
- activity tracker thread
- BatchSpanProcessor thread (logfire export every 500ms, involves TLS + protobuf)

on a shared CPU with fractional allocation, all these threads compete. the BatchSpanProcessor's 500ms TLS flush is particularly suspect — TLS is CPU-intensive.

## possible next steps

1. **upgrade VM**: move to `performance-1x` (dedicated CPU) — this is the real fix
2. **reduce thread contention**: lower HTTP workers from 16, or reduce logfire export frequency
3. **reduce logfire spans in hot path**: fewer spans = less BatchSpanProcessor work = less CPU contention

## architecture (current)

```
client → fly proxy (TLS termination) → app (port 3000)
         ├── HTTP thread pool (16 workers)
         ├── local SQLite (read_conn for search, conn+mutex for writes)
         ├── turso client (fallback for unsupported queries)
         ├── sync thread (turso → local, full on startup + periodic incremental)
         ├── tap consumer (firehose → turso)
         ├── embedder (voyage-4-lite → turbopuffer, background)
         ├── stats buffer (periodic flush to turso)
         └── activity tracker
```

search path: `handleSearch` → `searchLocal` (3 FTS queries on read_conn, ~3ms total) → `sendJson` → `respond()` + `flush()`

hot path timing (when not CPU-starved): ~3-6ms end to end
hot path timing (when CPU-starved): 1-3 seconds (thread preempted between any two operations)

## resolution: otel-zig BatchSpanProcessor mutex contention

the "CPU starvation" theory above was wrong.
Fly CPU utilization was at ~0-5% during the stalls — the CPU was idle, not saturated. the thread wasn't being preempted by the hypervisor; it was **blocked on a mutex**.

### root cause

otel-zig's `BatchSpanProcessor` (`src/sdk/trace/batch_span_processor.zig`) has a background thread that periodically exports queued spans to the logfire backend via OTLP/HTTP. the `exportThreadFn` called `exportBatchLocked()`, which did the full HTTP export (TLS handshake + protobuf serialization + POST) **while holding the mutex**. the same mutex is acquired by `onEnd()`, which is called by every `span.end()`.

so every 500ms, the background thread would:

1. lock mutex
2. serialize all queued spans to protobuf
3. POST to logfire via HTTPS (1-2 seconds on a fly VM)
4. unlock mutex

any application thread calling `span.end()` during steps 2-3 would block for the entire duration of the HTTP export. this explains the 1-3 second stalls appearing between arbitrary consecutive lines of code — the stall was in the `defer span.end()` of whichever logfire span happened to coincide with a batch flush.

### how we confirmed it

1. added `search.iterate.*` spans — proved all actual search work takes ~3ms
2. the 1-3s gap appeared between the `db.local.query` span ending and the next span starting (i.e., inside `span.end()`)
3. disabled logfire entirely (`logfire.configure()` commented out) — stalls disappeared completely, all requests 75-110ms
4. Fly CPU graph showed ~0-5% utilization during stalls — ruled out CPU contention, pointed to lock contention

### the fix

in `otel-zig/src/sdk/trace/batch_span_processor.zig`, changed `exportThreadFn` to drain the queue under the lock, then release the lock before doing the HTTP export — the same pattern that `forceFlush()` already used correctly.
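the drain-then-export pattern is worth spelling out. here's a minimal sketch in Python (the actual fix is in Zig; class and method names here are illustrative, not otel-zig's API). the key point is that `on_end` only ever contends with a cheap drain, never with network I/O:

```python
import threading

class BatchProcessor:
    """Sketch of the corrected batch export loop: hold the lock only to
    drain the queue, then do the slow export with the lock released.
    Illustrative names only -- not otel-zig's actual API."""

    def __init__(self, export_fn, interval=0.5):
        self.lock = threading.Lock()
        self.queue = []
        self.export_fn = export_fn  # e.g. the OTLP/HTTP POST (can be slow)
        self.interval = interval    # export interval (500ms in otel-zig)

    def on_end(self, span):
        # called from every span.end(); must never block on network I/O.
        # worst case it waits for the swap below, which is O(1).
        with self.lock:
            self.queue.append(span)

    def export_once(self):
        # lock -> drain -> unlock, THEN export. the broken version did
        # the export_fn call inside this `with` block, so every span.end()
        # during a flush blocked for the full HTTP round trip.
        with self.lock:
            batch, self.queue = self.queue, []
        if batch:
            self.export_fn(batch)  # slow part runs lock-free

    def export_loop(self, stop: threading.Event):
        # background thread body: flush every `interval` seconds
        while not stop.is_set():
            stop.wait(self.interval)
            self.export_once()
```

under this pattern, an `export_fn` that takes 1-2 seconds no longer shows up as a stall in application threads — they only wait for the queue swap.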
before:
```
exportThreadFn: lock → export (HTTP POST, 1-2s) → unlock
onEnd:          lock (BLOCKED for 1-2s) → enqueue → unlock
```

after:
```
exportThreadFn: lock → drain queue → unlock → export (HTTP POST) → done
onEnd:          lock (instant) → enqueue → unlock
```

changes pushed through the dependency chain:

- `otel-zig` (zzstoatzz/otel-zig trunk): fix in `batch_span_processor.zig`
- `logfire-zig` (zzstoatzz.io/logfire-zig main): bumped otel-zig hash
- `leaflet-search` backend: bumped logfire-zig hash, re-enabled logfire

### final results

with logfire fully enabled and the fix applied:

- hello (21KB response): **75-112ms** consistently, 10/10 requests
- test (32KB response): **76-110ms** consistently, 10/10 requests
- zero multi-second stalls
- full observability retained
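as a footnote, the SQLite-side fixes from earlier (separate read-only connection, WAL mode, the `publication_uri` index, and the mmap/cache pragmas) compose into one pattern. a sketch using Python's stdlib `sqlite3` — the real code is Zig against the SQLite C API, and the table shape here is simplified:

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "local.db")

# write connection: WAL mode so readers never block on writers.
# (in the real server this is `conn`, guarded by a mutex.)
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE IF NOT EXISTS documents (uri TEXT, publication_uri TEXT)")
# the index that was missing in problem 4 -- turns per-match full
# table scans into index lookups
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_documents_publication_uri "
    "ON documents(publication_uri)"
)
conn.execute("INSERT INTO documents VALUES ('at://doc/1', 'at://pub/1')")
conn.commit()

# read-only connection: no mutex needed, tuned per problem 6 so
# index pages stay in memory
read_conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
read_conn.execute("PRAGMA mmap_size=268435456")  # 256MB
read_conn.execute("PRAGMA cache_size=-20000")    # negative = KB, ~20MB

rows = read_conn.execute(
    "SELECT uri FROM documents WHERE publication_uri = 'at://pub/1'"
).fetchall()
```

opening the reader with `mode=ro` also makes accidental writes through the search path fail loudly instead of silently contending for the write lock.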