# content extraction for site.standard.document
lessons learned from implementing cross-platform content extraction.
## the problem
eli mallon raised this question:
> The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a `$type` field of "pub.leaflet.content". So if I were writing a renderer for `site.standard.document` records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.
short answer: yes. but once you handle `content.pages` extraction, it's straightforward.
## textContent: platform-dependent
`site.standard.document` has a `textContent` field for pre-flattened plaintext:

```json
{
  "title": "my post",
  "textContent": "the full text content, ready for indexing...",
  "content": {
    "$type": "blog.pckt.content",
    "items": [ /* platform-specific blocks */ ]
  }
}
```
pckt, offprint, greengale populate `textContent`. extraction is trivial.
leaflet intentionally leaves `textContent` null to avoid inflating record size. its content lives in `content.pages[].blocks[].block.plaintext`.
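for illustration, a minimal python sketch of walking that nested path. the function name and the dict-based record shape are assumptions for the sketch — the actual extraction is done in zig:

```python
def extract_leaflet_text(record: dict) -> str:
    """Concatenate plaintext from content.pages[].blocks[].block.plaintext."""
    pages = record.get("content", {}).get("pages", [])
    parts = []
    for page in pages:
        for block in page.get("blocks", []):
            # each entry wraps the actual block under a "block" key
            text = block.get("block", {}).get("plaintext")
            if text:
                parts.append(text)
    return "\n".join(parts)
```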
## extraction strategy
priority order (in `extractor.zig`):

1. `textContent` - use if present
2. `pages` - top-level blocks (pub.leaflet.document)
3. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content)
```zig
// try textContent first
if (zat.json.getString(record, "textContent")) |text| {
    return text;
}

// fall back to block parsing
const pages = zat.json.getArray(record, "pages") orelse
    zat.json.getArray(record, "content.pages");
```
the key insight: if you extract from content.pages correctly, you're good. no need for extra network calls.
## cross-collection deduplication
documents can appear in both collections with identical (did, rkey):
- `site.standard.document`
- `pub.leaflet.document`
handle with ON CONFLICT:

```sql
INSERT INTO documents (uri, ...)
ON CONFLICT(uri) DO UPDATE SET ...
```
note: leaflet is phasing out pub.leaflet.document records, keeping old ones for backwards compat.
## platform detection
collection name doesn't indicate platform for site.standard.* records. detection order:
- **basePath** - infer from publication basePath:

  | basePath contains | platform |
  |---|---|
  | leaflet.pub | leaflet |
  | pckt.blog | pckt |
  | offprint.app | offprint |
  | greengale.app | greengale |

- **content.$type** - fallback for custom domains (e.g., cailean.journal.ewancroft.uk):

  | content.$type starts with | platform |
  |---|---|
  | pub.leaflet. | leaflet |

- if neither matches → `other`
## whitewind
WhiteWind (`com.whtwnd.blog.entry`) stores content as markdown in the `content` field (a string, not a blocks structure). extraction is trivial — just use the string directly. author-only posts (`visibility: "author"`) are skipped.
## deduplication
two layers prevent duplicate results:
- **ingestion-time**: content hash (wyhash of `title + "\x00" + content`) per author. if the same author publishes identical content across platforms (different rkeys), only the first is indexed.
- **search-time**: `(did, title)` dedup collapses any remaining duplicates in results (e.g. records indexed before content hash was added).
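a sketch of the ingestion-time layer. the real pipeline hashes with wyhash in zig; blake2b from python's stdlib is a stand-in here, and the function names are made up:

```python
import hashlib

# (did, content-hash) pairs already indexed
seen: set[tuple[str, str]] = set()

def content_key(did: str, title: str, content: str) -> tuple[str, str]:
    # real pipeline uses wyhash; blake2b is a stand-in
    digest = hashlib.blake2b((title + "\x00" + content).encode()).hexdigest()
    return (did, digest)

def should_index(did: str, title: str, content: str) -> bool:
    """True the first time an author publishes this exact content."""
    key = content_key(did, title, content)
    if key in seen:
        return False
    seen.add(key)
    return True
```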
## summary
- pckt/offprint/greengale: use `textContent` directly
- leaflet: extract from `content.pages[].blocks[].block.plaintext`
- whitewind: use `content` string directly (markdown)
- deduplication: content hash at ingestion + `(did, title)` at search time
- platform: infer from basePath, fallback to content.$type for custom domains
## code references
- `backend/src/ingest/extractor.zig` - content extraction logic, content_type field
- `backend/src/ingest/indexer.zig` - platform detection from basePath + content_type, content hash dedup