search for standard sites pub-search.waow.tech

content extraction for site.standard.document#

lessons learned from implementing cross-platform content extraction.

the problem#

eli mallon raised this question:

The site.standard.document "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.

short answer: yes. but once you handle content.pages extraction, it's straightforward.

textContent: platform-dependent#

site.standard.document has a textContent field for pre-flattened plaintext:

{
  "title": "my post",
  "textContent": "the full text content, ready for indexing...",
  "content": {
    "$type": "blog.pckt.content",
    "items": [ /* platform-specific blocks */ ]
  }
}

pckt, offprint, greengale populate textContent. extraction is trivial.

leaflet intentionally leaves textContent null to avoid inflating record size. content lives in content.pages[].blocks[].block.plaintext.
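for concreteness, a leaflet-style record might look roughly like this (a sketch inferred from the content.pages[].blocks[].block.plaintext path; the exact field set may differ):

```json
{
  "title": "my post",
  "textContent": null,
  "content": {
    "$type": "pub.leaflet.content",
    "pages": [
      {
        "blocks": [
          { "block": { "plaintext": "first paragraph..." } },
          { "block": { "plaintext": "second paragraph..." } }
        ]
      }
    ]
  }
}
```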

extraction strategy#

priority order (in extractor.zig):

  1. textContent - use if present
  2. pages - top-level blocks (pub.leaflet.document)
  3. content.pages - nested blocks (site.standard.document with pub.leaflet.content)

// try textContent first
if (zat.json.getString(record, "textContent")) |text| {
    return text;
}

// fall back to block parsing
const pages = zat.json.getArray(record, "pages") orelse
    zat.json.getArray(record, "content.pages");

the key insight: if you extract from content.pages correctly, you're good. no need for extra network calls.
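the same priority order, sketched in python (a helper-free version walking plain dicts, not the actual extractor.zig code):

```python
def extract_text(record: dict) -> str:
    # 1. pre-flattened textContent wins when present
    text = record.get("textContent")
    if text:
        return text

    # 2. top-level pages (pub.leaflet.document), else
    # 3. nested content.pages (site.standard.document with pub.leaflet.content)
    pages = record.get("pages") or record.get("content", {}).get("pages") or []

    parts = []
    for page in pages:
        for entry in page.get("blocks", []):
            plaintext = entry.get("block", {}).get("plaintext")
            if plaintext:
                parts.append(plaintext)
    return "\n".join(parts)
```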

record deduplication#

documents can appear in both collections with identical (did, rkey):

  • site.standard.document
  • pub.leaflet.document

handle with ON CONFLICT:

INSERT INTO documents (uri, ...)
ON CONFLICT(uri) DO UPDATE SET ...

note: leaflet is phasing out pub.leaflet.document records, keeping old ones for backwards compat.
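a minimal sqlite demonstration of the upsert mechanics (the table shape here is illustrative, not the real schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (uri TEXT PRIMARY KEY, title TEXT)")

upsert = (
    "INSERT INTO documents (uri, title) VALUES (?, ?) "
    "ON CONFLICT(uri) DO UPDATE SET title = excluded.title"
)

# the same document arriving twice maps onto one row
db.execute(upsert, ("at://did:plc:x/site.standard.document/abc", "first"))
db.execute(upsert, ("at://did:plc:x/site.standard.document/abc", "updated"))

rows = db.execute("SELECT title FROM documents").fetchall()
# one row survives, carrying the latest title
```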

platform detection#

collection name doesn't indicate platform for site.standard.* records. detection order:

  1. basePath - infer from the publication basePath:

     basePath contains    platform
     leaflet.pub          leaflet
     pckt.blog            pckt
     offprint.app         offprint
     greengale.app        greengale

  2. content.$type - fallback for custom domains (e.g., cailean.journal.ewancroft.uk):

     content.$type starts with    platform
     pub.leaflet.                 leaflet

  3. if neither matches → other
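the detection order above as a python sketch (mappings copied from the tables; the function name is illustrative, not the indexer.zig API):

```python
BASEPATH_PLATFORMS = {
    "leaflet.pub": "leaflet",
    "pckt.blog": "pckt",
    "offprint.app": "offprint",
    "greengale.app": "greengale",
}

TYPE_PREFIXES = {
    "pub.leaflet.": "leaflet",
}

def detect_platform(base_path: str, content_type: str) -> str:
    # 1. known hostnames in the publication basePath
    for host, platform in BASEPATH_PLATFORMS.items():
        if host in base_path:
            return platform
    # 2. content.$type prefix, for custom domains
    for prefix, platform in TYPE_PREFIXES.items():
        if content_type.startswith(prefix):
            return platform
    # 3. neither matched
    return "other"
```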

whitewind#

WhiteWind (com.whtwnd.blog.entry) stores content as markdown in the content field (a string, not a blocks structure). extraction is trivial — just use the string directly. author-only posts (visibility: "author") are skipped.

content deduplication#

two layers prevent duplicate results:

  1. ingestion-time: content hash (wyhash of title + \x00 + content) per author. if the same author publishes identical content across platforms (different rkeys), only the first is indexed.
  2. search-time: (did, title) dedup collapses any remaining duplicates in results (e.g. records indexed before content hash was added).
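both layers, sketched in python. the real ingester uses wyhash; hashlib.blake2b stands in here since wyhash isn't in python's standard library:

```python
import hashlib

def content_hash(title: str, content: str) -> str:
    # stand-in for wyhash(title + "\x00" + content)
    return hashlib.blake2b((title + "\x00" + content).encode()).hexdigest()

def ingest(docs: list[dict]) -> list[dict]:
    # layer 1: per-author content hash dedup at ingestion time
    seen, indexed = set(), []
    for doc in docs:
        key = (doc["did"], content_hash(doc["title"], doc["text"]))
        if key in seen:
            continue  # same author, identical content on another platform
        seen.add(key)
        indexed.append(doc)
    return indexed

def dedup_results(results: list[dict]) -> list[dict]:
    # layer 2: (did, title) dedup at search time, for pre-hash records
    seen, out = set(), []
    for r in results:
        key = (r["did"], r["title"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```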

summary#

  • pckt/offprint/greengale: use textContent directly
  • leaflet: extract from content.pages[].blocks[].block.plaintext
  • whitewind: use content string directly (markdown)
  • deduplication: content hash at ingestion + (did, title) at search time
  • platform: infer from basePath, fallback to content.$type for custom domains

code references#

  • backend/src/ingest/extractor.zig - content extraction logic, content_type field
  • backend/src/ingest/indexer.zig - platform detection from basePath + content_type, content hash dedup