docs/content-extraction.md at main · zzstoatzz.io/leaflet-search

zzstoatzz.io / leaflet-search
fork atom
search for standard sites pub-search.waow.tech
search zig blog atproto
fork atom
leaflet-search / docs / content-extraction.md
at main 111 lines 3.9 kB view raw view rendered
wrap content
zzstoatzz.io docs: update README, API docs, and architecture for current state 4w ago
0fdc8c42
  1# content extraction for site.standard.document
  2
  3lessons learned from implementing cross-platform content extraction.
  4
  5## the problem
  6
  7[eli mallon raised this question](https://bsky.app/profile/iame.li/post/3md4s4vm2os2y):
  8
  9> The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.
 10
 11short answer: yes. but once you handle `content.pages` extraction, it's straightforward.
 12
 13## textContent: platform-dependent
 14
 15`site.standard.document` has a `textContent` field for pre-flattened plaintext:
 16
 17```json
 18{
 19  "title": "my post",
 20  "textContent": "the full text content, ready for indexing...",
 21  "content": {
 22    "$type": "blog.pckt.content",
 23    "items": [ /* platform-specific blocks */ ]
 24  }
 25}
 26```
 27
 28**pckt, offprint, greengale** populate `textContent`. extraction is trivial.
 29
 30**leaflet** intentionally leaves `textContent` null to avoid inflating record size. content lives in `content.pages[].blocks[].block.plaintext`.
 31
 32## extraction strategy
 33
 34priority order (in `extractor.zig`):
 35
 361. `textContent` - use if present
 372. `pages` - top-level blocks (pub.leaflet.document)
 383. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content)
 39
 40```zig
 41// try textContent first
 42if (zat.json.getString(record, "textContent")) |text| {
 43    return text;
 44}
 45
 46// fall back to block parsing
 47const pages = zat.json.getArray(record, "pages") orelse
 48    zat.json.getArray(record, "content.pages");
 49```
 50
 51the key insight: if you extract from `content.pages` correctly, you're good. no need for extra network calls.
 52
 53## deduplication
 54
 55documents can appear in both collections with identical `(did, rkey)`:
 56- `site.standard.document`
 57- `pub.leaflet.document`
 58
 59handle with `ON CONFLICT`:
 60
 61```sql
 62INSERT INTO documents (uri, ...)
 63ON CONFLICT(uri) DO UPDATE SET ...
 64```
 65
 66note: leaflet is phasing out `pub.leaflet.document` records, keeping old ones for backwards compat.
 67
 68## platform detection
 69
 70collection name doesn't indicate platform for `site.standard.*` records. detection order:
 71
 721. **basePath** - infer from publication basePath:
 73
 74| basePath contains | platform |
 75|-------------------|----------|
 76| `leaflet.pub` | leaflet |
 77| `pckt.blog` | pckt |
 78| `offprint.app` | offprint |
 79| `greengale.app` | greengale |
 80
 812. **content.$type** - fallback for custom domains (e.g., `cailean.journal.ewancroft.uk`):
 82
 83| content.$type starts with | platform |
 84|---------------------------|----------|
 85| `pub.leaflet.` | leaflet |
 86
 873. if neither matches → `other`
 88
 89## whitewind
 90
 91[WhiteWind](https://whtwnd.com) (`com.whtwnd.blog.entry`) stores content as markdown in the `content` field (a string, not a blocks structure). extraction is trivial — just use the string directly. author-only posts (`visibility: "author"`) are skipped.
 92
 93## deduplication
 94
 95two layers prevent duplicate results:
 96
 971. **ingestion-time**: content hash (wyhash of `title + \x00 + content`) per author. if the same author publishes identical content across platforms (different rkeys), only the first is indexed.
 982. **search-time**: `(did, title)` dedup collapses any remaining duplicates in results (e.g. records indexed before content hash was added).
 99
100## summary
101
102- **pckt/offprint/greengale**: use `textContent` directly
103- **leaflet**: extract from `content.pages[].blocks[].block.plaintext`
104- **whitewind**: use `content` string directly (markdown)
105- **deduplication**: content hash at ingestion + `(did, title)` at search time
106- **platform**: infer from basePath, fallback to content.$type for custom domains
107
108## code references
109
110- `backend/src/ingest/extractor.zig` - content extraction logic, content_type field
111- `backend/src/ingest/indexer.zig` - platform detection from basePath + content_type, content hash dedup