commits
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
startup was doing a full sync (~80 min for 12k docs) on every deploy, blocking
new content from appearing in search. the full sync existed to clean
up stale docs deleted from Turso.
now: incremental sync on startup (seconds), with tombstone queries to
handle deletions. fullSync only runs on first-ever boot (no last_sync).
tombstones table was already populated by deleteDocument/deletePublication
but never queried during sync.
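
a minimal sketch of the incremental path described above. the table and column names (documents.updated_at, tombstones.deleted_at) are assumptions for illustration; the real schema is not shown in this log, and both sides are plain sqlite3 connections here rather than Turso.

```python
import sqlite3

def incremental_sync(remote: sqlite3.Connection, local: sqlite3.Connection, last_sync: int) -> int:
    """Copy rows changed since last_sync, then apply deletions recorded as tombstones."""
    # Upsert everything that changed on the remote since the last sync.
    changed = remote.execute(
        "SELECT uri, title, updated_at FROM documents WHERE updated_at > ?",
        (last_sync,),
    ).fetchall()
    local.executemany(
        "INSERT OR REPLACE INTO documents (uri, title, updated_at) VALUES (?, ?, ?)",
        changed,
    )
    # Tombstones are the only record that a row once existed, so query them too.
    dead = remote.execute(
        "SELECT uri FROM tombstones WHERE deleted_at > ?",
        (last_sync,),
    ).fetchall()
    local.executemany("DELETE FROM documents WHERE uri = ?", dead)
    local.commit()
    return len(changed) + len(dead)
```

INSERT OR REPLACE only updates in place if uri is a primary key or has a unique index.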
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The blanket bridgy-fed filter was added because we couldn't build links
(empty base_path). Now that the indexer resolves base_path from HTTP
site URLs in publication_uri, bridgy-fed documents can get working links
like any other standard.site content.
Removes isBridgyFed, resolvePdsIsBridgy, and PdsCache (no longer needed).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
base_path values like "tedium.co/" combined with paths like "/some-post"
produce double-slash URLs ("https://tedium.co//some-post") that 404.
The trailing slash comes from publication URLs like "https://tedium.co/"
where stripUrlScheme preserved it. Fix in three places:
- tap.zig: stripUrlScheme strips trailing slash
- indexer.zig: HTTP fallback strips trailing slash
- indexer.zig: normalize base_path after all resolution (catches values
already stored in publications table with trailing slashes)
Backfill: RTRIM(base_path, '/') on both documents and publications.
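
for illustration, the normalization can be sketched in Python (the actual code is in tap.zig and indexer.zig and is not reproduced here). strip_url_scheme mirrors the name used above; build_url is a hypothetical helper showing how the frontend link would be joined.

```python
def strip_url_scheme(url: str) -> str:
    """Drop a leading http(s):// and any trailing slashes."""
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    return url.rstrip("/")

def build_url(base_path: str, path: str) -> str:
    """Join base_path and path with exactly one slash between them."""
    # Normalizing here as well catches base_path values stored before the fix.
    return "https://" + base_path.rstrip("/") + "/" + path.lstrip("/")
```

with this, build_url("tedium.co/", "/some-post") gives "https://tedium.co/some-post" instead of the double-slash form.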
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
f2c6d29 intended to keep local FTS serving during re-sync by only
calling setReady(false) on first-ever sync. But is_ready initializes
to false, so the has_data branch needed an explicit setReady(true).
Without it, every deploy caused ~60min of turso fallback (3-5s per
search) until fullSync completed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
standard.site documents store the origin URL (e.g., "https://attoshi.com")
in publication_uri, but the indexer only resolves base_path from AT-URIs
via the publications table. When publication_uri is an HTTP URL, the lookup
fails silently and base_path stays empty, breaking frontend links.
Add a fallback: if base_path is still empty and publication_uri starts with
http(s)://, strip the scheme and use the remainder as base_path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
background worker verifies documents still exist at their source PDS
via com.atproto.repo.getRecord. catches deletions missed while the tap
was down (firehose delete events are ephemeral and never replayed).
also fixes the forward path: firehose deletes now clean turbopuffer
vectors in addition to turso records.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
accessible explainer of how the search engine works — covers keyword
(FTS5), semantic (voyage + turbopuffer), hybrid (RRF), content
extraction challenges, and what's custom vs off-the-shelf.
also fixes ~25k → actual count and v2 format (offset → hasMore) in
root README.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- fix FTS5 description: local replica, not Turso directly
- fix doc count (~11k not ~25k), query count (14 not 10)
- fix /similar: uses turbopuffer, not Turso
- remove phantom /platforms endpoint from api.md
- fix v2 format (hasMore not offset), timing keys (search_keyword etc)
- add coverImage/handle fields, clarify hybrid-only score/source
- fix tap examples: wget not curl (container has no curl)
- add missing tap env vars (TAP_RELAY_URL, cursor/timeout settings)
- fix voyage model in perf saga (voyage-3-lite 512 dims)
- add embedder thread to architecture diagram
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fullSync now tracks synced URIs in a temp table and deletes local docs
that no longer exist in Turso. Startup always does a full sync instead
of incremental, so deleted docs get cleaned up on every deploy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- delete 31 stale docs from turso (16 dead offprint links, 18 .test domains, 3 overlap)
- add .test domain filter in tap.zig (processPublication) and indexer.zig (insertDocument)
- revert offprint URL slug construction — old docs purged, only /a/ format remains
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous deploy caused a full re-sync which set local DB to
not-ready, forcing all searches through Turso (10-60s response times).
- Only set not-ready on first-ever sync (empty DB)
- Skip DELETE when re-syncing — INSERT OR REPLACE updates in place
- Add ALTER TABLE migration for cover_image on existing local DBs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Semantic and similar searches used tpuf.QueryResult which lacks
cover_image. Now fetchLocalExtras() fetches both snippets and
cover images from local SQLite in a single query per URI. Hybrid
search also falls back to local DB for semantic-only cover images.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add cover_image column to local SQLite schema and sync queries
- Replace 3-button theme toggle with single cycling icon (dark/light/system)
- Make platform badges clickable as search filters
- Fix cover image hover jump with position compensation
- Move cover thumbnail to right-side absolute positioning
- Add backfill-cover-images script for populating existing docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- stats/dashboard links use /dashboard.html (same origin) instead of
going through backend redirect to dead leaflet-search.pages.dev URL
- theme toggle moved from cramped header to bottom of page where it
doesn't clutter the title line
- dashboard "back" and title links use relative paths (same origin)
- backend DASHBOARD_URL default updated to pub-search.waow.tech
- localStorage theme now shared between search and stats pages
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- 3-way theme toggle (dark/light/system) with localStorage persistence
and flash prevention on both search and dashboard pages
- extract cover image blob CID from document records (coverImage field
for pckt/offprint/greengale, first image block fallback for leaflet)
- add cover_image column to documents table, pass through indexer/search
- render 32x32 thumbnails in search results via bsky CDN, graceful
fallback when image unavailable
- convert all hardcoded colors to CSS custom properties
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the homepage was missing og:image tags, so link previews showed no image.
also fixed CLAUDE.md deploy command — must run from inside site/ dir
to include the Functions bundle.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- add limit param to get_tags (default 10, was returning all ~100 tags)
- strip timing data from get_stats (36 latency numbers most callers
don't need — use the dashboard for that)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- expand Stats type with embeddings, searches, errors, started_at,
cache_hits, cache_misses, and per-endpoint timing metrics
- update backfill-embeddings to use voyage-4-lite with output_dimension
- update rebuild-documents-table to use F32_BLOB(1024)
- add regression tests for full Stats model
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
generate per-query 1200x630 PNG images via workers-og with dark terminal
aesthetic, filter chips with type-matched colors, top 3 result titles,
and result count. rewrite meta tag injection to use HTMLRewriter with
support for all 5 URL params (q, tag, platform, since, mode).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
20 concurrent PLC lookups, 10 concurrent turso deletes.
What took ~15min sequential now finishes in ~40s.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PLC directory lookup in tap worker to detect brid.gy-hosted DIDs and
skip indexing their documents/publications. Results cached per worker
lifetime, fails open on errors.
Purge script queries turso for platform='other' DIDs, resolves PDS,
and batch-deletes bridgy fed content (documents, tags, FTS entries).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
created_at is the document's publication date (set by the author),
which can be in the future. indexed_at is when we actually indexed
it, which is what the timeline should show. also caps at today.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
reflects voyage-4-lite + turbopuffer, hybrid search mode, whitewind
platform, content dedup, format=v2, and corrected code paths.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
add content_hash (wyhash of title+content) to documents table. on
ingest, skip documents where the same author already has identical
content under a different rkey (cross-platform publishing dedup).
frontend: add date filter (any/week/month/year) with since param,
URL state sync, and active filter bar.
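
a rough sketch of the ingest-time dedup check. Python has no wyhash in the standard library, so this swaps in blake2b purely for illustration; the NUL separator between title and content and the in-memory `seen` map are also assumptions (the real check presumably queries the documents table).

```python
import hashlib

def content_hash(title: str, content: str) -> str:
    """Stand-in for the wyhash of title+content (blake2b, 64-bit digest)."""
    h = hashlib.blake2b(digest_size=8)
    h.update(title.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    h.update(content.encode("utf-8"))
    return h.hexdigest()

def should_skip(seen: dict, did: str, rkey: str, title: str, content: str) -> bool:
    """True if this author already has identical content under a different rkey."""
    key = (did, content_hash(title, content))
    # setdefault records the first rkey seen for this (author, hash) pair.
    first_rkey = seen.setdefault(key, rkey)
    return first_rkey != rkey
```

re-ingesting the same rkey is not skipped, so updates to an existing record still go through.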
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- backfill-pds now handles com.whtwnd.blog.entry collection
- extracts markdown content from whitewind's content field
- sets platform to "whitewind", skips visibility:"author" entries
- prefers publishedAt over createdAt for date extraction
- update tangled.sh URLs to tangled.org in build.zig.zon
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- platform filter button in search UI
- homepage platform links
- MCP server Platform type and tool descriptions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The /stats endpoint was missing the started_at field entirely. Also
added diagnostic logging to refreshCachedStats so turso query failures
are visible instead of silently swallowed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WhiteWind blog entries use three visibility values: "public", "url",
and "author". "url" means publicly accessible via link. Our filter
was dropping everything except "public", which meant every WhiteWind
entry with visibility "url" was silently discarded. Now only "author"
entries are skipped.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
getStatsLocal() required stats_buffer cache to be initialized, but
refreshCachedStats was only called at startup (which could fail) and
when search deltas were non-zero. If init failed and no searches
happened, the cache stayed uninitialized and /stats returned all zeros.
- always refresh cache in sync loop (not just when deltas exist)
- getStatsLocal() no longer fails when cache isn't initialized — returns
local counts with 0 for cached fields instead of aborting entirely
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fetchLocal() queried COUNT(*) WHERE embedded_at IS NOT NULL, but the
local schema never had that column. Every dashboard request failed,
falling through to turso batch (which also returned zeros).
- add embedded_at column migration to LocalDb schema
- sync embedded_at from turso in full and incremental sync
- add logfire warnings when fetchLocal fails or turso batch returns no rows
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
root cause: processMessage (which writes to turso via HTTP) ran
synchronously in the websocket readLoop callback. when turso was
slow or hung, the readLoop blocked — no messages read, no ACKs
sent, TAP outbox grew unboundedly (4222 events stuck).
fix: send ACK immediately upon receipt, push message data to a
bounded queue, process in a separate worker thread. readLoop
never blocks on turso. if turso is slow, queue fills and oldest
messages are dropped (already ACK'd, indexing is idempotent via
ON CONFLICT DO UPDATE).
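
the shape of the fix, sketched in Python rather than Zig. DropOldestQueue and on_message are hypothetical names, and the capacity is whatever the caller picks; the point is only that the read path ACKs first and never waits on the database.

```python
import threading
from collections import deque

class DropOldestQueue:
    """Bounded queue: push never blocks, the oldest item is evicted when full."""

    def __init__(self, capacity: int):
        self._items = deque(maxlen=capacity)
        self._cond = threading.Condition()
        self.dropped = 0

    def push(self, item) -> None:
        with self._cond:
            if len(self._items) == self._items.maxlen:
                self.dropped += 1  # deque(maxlen=...) discards the oldest on append
            self._items.append(item)
            self._cond.notify()

    def pop(self, timeout=None):
        """Return the next item, or None if nothing arrives within timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout):
                return None
            return self._items.popleft()

def on_message(msg_id, payload, send_ack, queue: DropOldestQueue) -> None:
    """Read-loop callback: ACK immediately, then hand off to the worker."""
    send_ack(msg_id)
    queue.push(payload)
```

a worker thread would loop on queue.pop() and do the slow write there.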
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tracks ack_count and no_id_count to determine whether extractMessageId
returns null (no ACK sent) or ACKs are sent but not received by TAP.
Logs first 3 ACK payloads and first 5 no-id messages.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend support is in place but no WhiteWind documents have been
indexed yet. Remove the filter button and footer link until we
have actual results to show.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- add com.whtwnd.blog.entry to tap collection filters and document routing
- add content-as-string fallback in extractor (whitewind stores markdown in content field)
- add visibility filter to skip non-public whitewind entries
- add whitewind platform to frontend (filter button, URL pattern, config)
- add stats link to header
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Related items (for the top result) were staying in the DOM when
"load more" was clicked, causing new results to appear below them.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
base_path queries (publication name matches) were bypassing the since
filter, leaking old results. added since-aware turso query variants
and post-fetch date filtering in searchLocal.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- deduplicate search results by (did, title) to collapse cross-platform
duplicates (same content published to multiple ATProto apps)
- add date filter buttons (any/week/month/year) wired to since param
- load more button shows remaining count from v2 total
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- content previews for semantic/similar results via local SQLite lookup
- RRF score field in hybrid search results
- opt-in v2 response wrapper (?format=v2) with total/hasMore metadata
- pagination via limit/offset params with "load more" in frontend
- all consumers (frontend, MCP) handle both v1 and v2 formats
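
a small sketch of what the v2 wrapper might compute. total and hasMore come from the description above; the "results" key and the wrap_v2 name are assumptions.

```python
def wrap_v2(results: list, total: int, offset: int) -> dict:
    """Wrap one page of results with pagination metadata."""
    return {
        "results": results,
        "total": total,
        # More pages exist if this page does not reach the end of the result set.
        "hasMore": offset + len(results) < total,
    }
```

a "load more" button can then show total minus the number of results already rendered.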
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the 0.6 cosine distance cutoff was filtering out all results for
oblique/indirect queries (e.g. "guy from south africa with lots of kids"
→ elon musk). tpuf already returns results sorted by distance, so
natural ordering handles relevance without an arbitrary cutoff.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
measured actual distance distributions across 8 test queries using
scripts/measure-distances. voyage-4-lite 1024d best matches range
0.32-0.51, and the 0.5 threshold completely killed queries like
"community builders" (best=0.506) and "atproto federation" (best=0.505).
0.6 captures all clearly relevant top results while cutting off noise
that starts around 0.61+.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit 704eb7427ddae23ee2d8e659df332f790341ae0b.
voyage-4-lite 1024d produces tighter cosine distance ranges than
voyage-3-lite 512d, so the old 0.5 threshold was filtering out all
results for many queries (e.g. "community builders" returned 0 results).
Raise to 0.75 to let tpuf's natural ranking handle quality.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- add URI dedup in searchSemantic() (same doc appeared twice from tpuf)
- rewrite scripts/rebuild-vector-index for tpuf namespace reset workflow
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
voyage-3-lite (512 dims) produced poor semantic search quality — only 4
results for "consciousness" vs 39 on greengale.app. voyage-4-lite was
released Jan 2026 with significantly better retrieval accuracy.
- model: voyage-3-lite → voyage-4-lite
- dims: 512 → 1024
- explicit output_dimension parameter for Matryoshka support
- tpuf namespace deleted, embedded_at cleared for full re-embed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
single-request hybrid mode merges keyword (FTS5) and semantic (voyage +
tpuf) results using Reciprocal Rank Fusion scoring. adds mode toggle to
frontend, source badges on results, per-mode latency tracking, and
embeddings count on dashboard.
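
for reference, Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) over the lists it appears in. a minimal sketch follows; k=60 is the commonly used default and an assumption here, not a value taken from this codebase.

```python
def rrf_merge(keyword: list, semantic: list, k: int = 60) -> list:
    """Merge two ranked lists of URIs into (uri, score) pairs, best first."""
    scores = {}
    for ranking in (keyword, semantic):
        for rank, uri in enumerate(ranking, start=1):
            scores[uri] = scores.get(uri, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by URI so the order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

a document found by both modes outranks one found by only one of them at the same position, which is the behavior hybrid mode wants.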
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- embedder: skip docs with content < 50 chars or test titles
- searchSemantic: over-fetch 40, filter dist > 0.5 + empty titles, cap at 20
- frontend: remove mode toggle (keep backend support for when quality is ready)
- scripts: add cleanup-vector-index to purge junk vectors from tpuf
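
the searchSemantic post-filter, sketched in Python. the hit shape (dicts with "title" and "dist") is an assumption; the numbers are the ones listed above.

```python
OVER_FETCH = 40      # ask the vector index for more hits than will be returned
MAX_DISTANCE = 0.5   # drop anything farther than this cosine distance
CAP = 20             # final result limit

def filter_semantic(hits: list, max_distance: float = MAX_DISTANCE, cap: int = CAP) -> list:
    """Keep close, titled hits in their original (distance-sorted) order."""
    kept = [h for h in hits if h["dist"] <= max_distance and h["title"].strip()]
    return kept[:cap]
```

over-fetching matters because the filter can remove a large share of the raw hits.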
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- tpuf.zig: embedQuery() calls Voyage API with input_type="query" for asymmetric search
- search.zig: SearchMode enum, searchSemantic() dispatches to tpuf, keyword path untouched
- server.zig: parse mode query param, pass to search
- site: mode toggle (keyword/semantic/hybrid), hybrid shows keyword instantly + appends semantic
- docs: document mode parameter on /search endpoint
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
getVectorById was using the deprecated include_vectors parameter which
the v2 API rejects, causing /similar to always return empty.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
turbopuffer has a 64-byte ID limit but AT-URIs are 60-96 bytes.
use SHA256 truncated to 128 bits (32 hex chars) as tpuf document ID.
store full URI as metadata attribute for result serialization.
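
a sketch of the ID derivation. taking the first 32 hex characters of the digest is an assumption about which 128 bits are kept, and the "uri" attribute name is illustrative.

```python
import hashlib

def tpuf_id(uri: str) -> str:
    """SHA-256 of the AT-URI, truncated to 128 bits (32 hex characters)."""
    return hashlib.sha256(uri.encode("utf-8")).hexdigest()[:32]

def tpuf_row(uri: str) -> dict:
    """Row shape: short hashed ID plus the full URI kept as an attribute."""
    return {"id": tpuf_id(uri), "uri": uri}
```

32 characters fits well under the 64-byte limit regardless of how long the URI is.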
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>