# content extraction for site.standard.document

lessons learned from implementing cross-platform content extraction.

## the problem

[eli mallon raised this question](https://bsky.app/profile/iame.li/post/3md4s4vm2os2y):

> The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.

short answer: yes. but once you handle `content.pages` extraction, it's straightforward.

## textContent: platform-dependent

`site.standard.document` has a `textContent` field for pre-flattened plaintext:

```json
{
  "title": "my post",
  "textContent": "the full text content, ready for indexing...",
  "content": {
    "$type": "blog.pckt.content",
    "items": [ /* platform-specific blocks */ ]
  }
}
```

**pckt, offprint, greengale** populate `textContent`. extraction is trivial.

**leaflet** intentionally leaves `textContent` null to avoid inflating record size. content lives in `content.pages[].blocks[].block.plaintext`.

## extraction strategy

priority order (in `extractor.zig`):

1. `textContent` - use if present
2. `pages` - top-level blocks (pub.leaflet.document)
3. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content)

```zig
// try textContent first
if (zat.json.getString(record, "textContent")) |text| {
    return text;
}

// fall back to block parsing
const pages = zat.json.getArray(record, "pages") orelse
    zat.json.getArray(record, "content.pages");
```

the key insight: if you extract from `content.pages` correctly, you're good. no need for extra network calls.
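
a minimal sketch of the block walk itself, assuming the record is already parsed into a `std.json.Value` and that blocks follow exactly the `pages[].blocks[].block.plaintext` path described above. `field`, `walk`, and `leafletPlaintext` are hypothetical names, not the actual `extractor.zig` API, and nested block containers are not handled:

```zig
const std = @import("std");

/// hypothetical helper: returns the named field if `value` is a JSON object.
fn field(value: std.json.Value, name: []const u8) ?std.json.Value {
    return switch (value) {
        .object => |obj| obj.get(name),
        else => null,
    };
}

/// walks pages -> blocks -> block.plaintext and returns the number of bytes
/// needed (each block's text plus one newline). if `out` is non-null, the
/// text is also copied into it.
fn walk(record: std.json.Value, out: ?[]u8) usize {
    // top-level `pages` first, then nested `content.pages`
    const pages = field(record, "pages") orelse blk: {
        const content = field(record, "content") orelse return 0;
        break :blk field(content, "pages") orelse return 0;
    };
    if (pages != .array) return 0;

    var n: usize = 0;
    for (pages.array.items) |page| {
        const blocks = field(page, "blocks") orelse continue;
        if (blocks != .array) continue;
        for (blocks.array.items) |entry| {
            const block = field(entry, "block") orelse continue;
            const text = field(block, "plaintext") orelse continue;
            if (text != .string) continue; // skip blocks without plaintext
            if (out) |buf| {
                @memcpy(buf[n..][0..text.string.len], text.string);
                buf[n + text.string.len] = '\n';
            }
            n += text.string.len + 1;
        }
    }
    return n;
}

/// joins every block's plaintext, one block per line. caller owns the result.
pub fn leafletPlaintext(allocator: std.mem.Allocator, record: std.json.Value) ![]u8 {
    const buf = try allocator.alloc(u8, walk(record, null));
    _ = walk(record, buf);
    return buf;
}
```

the two-pass walk (count, then copy) is only there to keep the sketch free of a growable buffer; a real extractor would more likely append to one.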

## deduplication

documents can appear in both collections with identical `(did, rkey)`:
- `site.standard.document`
- `pub.leaflet.document`

handle with `ON CONFLICT`:

```sql
INSERT INTO documents (uri, ...)
ON CONFLICT(uri) DO UPDATE SET ...
```

note: leaflet is phasing out `pub.leaflet.document` records, keeping old ones for backwards compat.

## platform detection

collection name doesn't indicate platform for `site.standard.*` records. detection order:

1. **basePath** - infer from publication basePath:

| basePath contains | platform |
|-------------------|----------|
| `leaflet.pub` | leaflet |
| `pckt.blog` | pckt |
| `offprint.app` | offprint |
| `greengale.app` | greengale |

2. **content.$type** - fallback for custom domains (e.g., `cailean.journal.ewancroft.uk`):

| content.$type starts with | platform |
|---------------------------|----------|
| `pub.leaflet.` | leaflet |

3. if neither matches → `other`
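
the detection order above can be sketched as a small lookup. this is an illustration only: `Platform`, `Rule`, and `detectPlatform` are hypothetical names, and the actual logic in `indexer.zig` may be structured differently:

```zig
const std = @import("std");

pub const Platform = enum { leaflet, pckt, offprint, greengale, other };

const Rule = struct { needle: []const u8, platform: Platform };

/// basePath substrings from the first table, checked in order.
const base_path_rules = [_]Rule{
    .{ .needle = "leaflet.pub", .platform = .leaflet },
    .{ .needle = "pckt.blog", .platform = .pckt },
    .{ .needle = "offprint.app", .platform = .offprint },
    .{ .needle = "greengale.app", .platform = .greengale },
};

pub fn detectPlatform(base_path: ?[]const u8, content_type: ?[]const u8) Platform {
    // 1. basePath substring match
    if (base_path) |bp| {
        for (base_path_rules) |rule| {
            if (std.mem.indexOf(u8, bp, rule.needle) != null) return rule.platform;
        }
    }
    // 2. content.$type prefix, for publications on custom domains
    if (content_type) |ct| {
        if (std.mem.startsWith(u8, ct, "pub.leaflet.")) return .leaflet;
    }
    // 3. nothing matched
    return .other;
}
```

with this sketch, a custom domain like `cailean.journal.ewancroft.uk` misses every basePath rule and is only classified as leaflet if its `content.$type` starts with `pub.leaflet.`.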

## whitewind

[WhiteWind](https://whtwnd.com) (`com.whtwnd.blog.entry`) stores content as markdown in the `content` field (a string, not a blocks structure). extraction is trivial: just use the string directly. author-only posts (`visibility: "author"`) are skipped.
## deduplication

two layers prevent duplicate results:

1. **ingestion-time**: content hash (wyhash of `title + \x00 + content`) per author. if the same author publishes identical content across platforms (different rkeys), only the first is indexed.
2. **search-time**: `(did, title)` dedup collapses any remaining duplicates in results (e.g. records indexed before content hash was added).
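
a sketch of the ingestion-time hash using the standard library's `std.hash.Wyhash`. the function name and the seed value (`0`) are assumptions; the source only states that the input is `title + \x00 + content`:

```zig
const std = @import("std");

/// hypothetical helper: wyhash over title, a NUL separator, then content.
/// the seed (0) is an assumption, not taken from the actual indexer.
pub fn contentHash(title: []const u8, content: []const u8) u64 {
    var hasher = std.hash.Wyhash.init(0);
    hasher.update(title);
    hasher.update("\x00"); // separator keeps ("ab", "c") distinct from ("a", "bc")
    hasher.update(content);
    return hasher.final();
}
```

since the dedup is per author, the lookup key would presumably be the pair `(did, hash)` rather than the hash alone, so two different authors publishing the same text are both indexed.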

## summary

- **pckt/offprint/greengale**: use `textContent` directly
- **leaflet**: extract from `content.pages[].blocks[].block.plaintext`
- **whitewind**: use `content` string directly (markdown)
- **deduplication**: content hash at ingestion + `(did, title)` at search time
- **platform**: infer from basePath, fallback to content.$type for custom domains

## code references

- `backend/src/ingest/extractor.zig` - content extraction logic, content_type field
- `backend/src/ingest/indexer.zig` - platform detection from basePath + content_type, content hash dedup