implement multimodal early fusion: combine filename text with image embeddings
research findings:
- voyage-multimodal-3 uses unified transformer encoder for text + images
- 41.44% improvement on retrieval with combined modalities
- no cross-modal "pollution": combining modalities is the model's intended usage
changes:
1. ingestion script: prepend filename text to content array
- convert "bufo-jumping-on-bed.png" -> "bufo jumping on bed"
- send as {"type": "text"} + {"type": "image_base64"} in same request
- model creates single unified embedding capturing both modalities
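A minimal sketch of the ingestion change. The helper names (`filename_to_text`, `build_multimodal_input`) are illustrative, not existing code; the content-array shape follows the `{"type": "text"}` / `{"type": "image_base64"}` request format described above.

```python
import base64
import re
from pathlib import Path


def filename_to_text(filename: str) -> str:
    # "bufo-jumping-on-bed.png" -> "bufo jumping on bed":
    # drop the extension, then replace hyphens/underscores with spaces.
    stem = Path(filename).stem
    return re.sub(r"[-_]+", " ", stem).strip()


def build_multimodal_input(filename: str, image_bytes: bytes) -> dict:
    # Pair the filename text with the image in one content array so the
    # model produces a single unified embedding for both modalities.
    return {
        "content": [
            {"type": "text", "text": filename_to_text(filename)},
            {
                "type": "image_base64",
                "image_base64": base64.b64encode(image_bytes).decode("ascii"),
            },
        ]
    }
```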
2. search logic: simplify by removing BM25 and RRF fusion
- early fusion embeddings already contain semantic text meaning
- rely entirely on vector search with unified embeddings
- remove the bm25_query method from the turbopuffer client
- drop the RRF score calculation entirely
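The simplified search path reduces to one embed call and one vector query. A sketch with the client calls injected as callables (hypothetical shapes, not the real turbopuffer client API) to show what replaces the BM25 + RRF merge:

```python
from typing import Any, Callable


def search(
    query: str,
    embed: Callable[[str], list[float]],
    vector_query: Callable[[list[float], int], list[dict[str, Any]]],
    top_k: int = 10,
) -> list[dict[str, Any]]:
    # Single-path search: embed the query once, run one vector search.
    # No separate BM25 pass and no RRF merge of two ranked lists,
    # since the unified embeddings already carry the text semantics.
    return vector_query(embed(query), top_k)
```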
benefits:
- simpler codebase (removed ~80 lines of RRF fusion logic)
- better semantic understanding (text + visual unified)
- fewer api calls (no separate BM25 search)
- research-validated approach
next step: re-run ingestion to regenerate all embeddings
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>