search for standard sites pub-search.waow.tech
search zig blog atproto
at main 111 lines 3.9 kB view raw view rendered
1# content extraction for site.standard.document 2 3lessons learned from implementing cross-platform content extraction. 4 5## the problem 6 7[eli mallon raised this question](https://bsky.app/profile/iame.li/post/3md4s4vm2os2y): 8 9> The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint. 10 11short answer: yes. but once you handle `content.pages` extraction, it's straightforward. 12 13## textContent: platform-dependent 14 15`site.standard.document` has a `textContent` field for pre-flattened plaintext: 16 17```json 18{ 19 "title": "my post", 20 "textContent": "the full text content, ready for indexing...", 21 "content": { 22 "$type": "blog.pckt.content", 23 "items": [ /* platform-specific blocks */ ] 24 } 25} 26``` 27 28**pckt, offprint, greengale** populate `textContent`. extraction is trivial. 29 30**leaflet** intentionally leaves `textContent` null to avoid inflating record size. content lives in `content.pages[].blocks[].block.plaintext`. 31 32## extraction strategy 33 34priority order (in `extractor.zig`): 35 361. `textContent` - use if present 372. `pages` - top-level blocks (pub.leaflet.document) 383. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content) 39 40```zig 41// try textContent first 42if (zat.json.getString(record, "textContent")) |text| { 43 return text; 44} 45 46// fall back to block parsing 47const pages = zat.json.getArray(record, "pages") orelse 48 zat.json.getArray(record, "content.pages"); 49``` 50 51the key insight: if you extract from `content.pages` correctly, you're good. no need for extra network calls. 52 53## deduplication 54 55documents can appear in both collections with identical `(did, rkey)`: 56- `site.standard.document` 57- `pub.leaflet.document` 58 59handle with `ON CONFLICT`: 60 61```sql 62INSERT INTO documents (uri, ...) 63ON CONFLICT(uri) DO UPDATE SET ... 64``` 65 66note: leaflet is phasing out `pub.leaflet.document` records, keeping old ones for backwards compat. 67 68## platform detection 69 70collection name doesn't indicate platform for `site.standard.*` records. detection order: 71 721. **basePath** - infer from publication basePath: 73 74| basePath contains | platform | 75|-------------------|----------| 76| `leaflet.pub` | leaflet | 77| `pckt.blog` | pckt | 78| `offprint.app` | offprint | 79| `greengale.app` | greengale | 80 812. **content.$type** - fallback for custom domains (e.g., `cailean.journal.ewancroft.uk`): 82 83| content.$type starts with | platform | 84|---------------------------|----------| 85| `pub.leaflet.` | leaflet | 86 873. if neither matches → `other` 88 89## whitewind 90 91[WhiteWind](https://whtwnd.com) (`com.whtwnd.blog.entry`) stores content as markdown in the `content` field (a string, not a blocks structure). extraction is trivial — just use the string directly. author-only posts (`visibility: "author"`) are skipped. 92 93## deduplication 94 95two layers prevent duplicate results: 96 971. **ingestion-time**: content hash (wyhash of `title + \x00 + content`) per author. if the same author publishes identical content across platforms (different rkeys), only the first is indexed. 982. **search-time**: `(did, title)` dedup collapses any remaining duplicates in results (e.g. records indexed before content hash was added). 99 100## summary 101 102- **pckt/offprint/greengale**: use `textContent` directly 103- **leaflet**: extract from `content.pages[].blocks[].block.plaintext` 104- **whitewind**: use `content` string directly (markdown) 105- **deduplication**: content hash at ingestion + `(did, title)` at search time 106- **platform**: infer from basePath, fallback to content.$type for custom domains 107 108## code references 109 110- `backend/src/ingest/extractor.zig` - content extraction logic, content_type field 111- `backend/src/ingest/indexer.zig` - platform detection from basePath + content_type, content hash dedup