# content extraction for site.standard.document

lessons learned from implementing cross-platform content extraction.

## the problem

[eli mallon raised this question](https://bsky.app/profile/iame.li/post/3md4s4vm2os2y):

> The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.

short answer: yes. but once you handle `content.pages` extraction, it's straightforward.

## textContent: platform-dependent

`site.standard.document` has a `textContent` field for pre-flattened plaintext:

```json
{
  "title": "my post",
  "textContent": "the full text content, ready for indexing...",
  "content": {
    "$type": "blog.pckt.content",
    "items": [ /* platform-specific blocks */ ]
  }
}
```

**pckt, offprint, greengale** populate `textContent`. extraction is trivial.

**leaflet** intentionally leaves `textContent` null to avoid inflating record size. content lives in `content.pages[].blocks[].block.plaintext`.

## extraction strategy

priority order (in `extractor.zig`):

1. `textContent` - use if present
2. `pages` - top-level blocks (pub.leaflet.document)
3. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content)

```zig
// try textContent first
if (zat.json.getString(record, "textContent")) |text| {
    return text;
}

// fall back to block parsing
const pages = zat.json.getArray(record, "pages") orelse
    zat.json.getArray(record, "content.pages");
```

the key insight: if you extract from `content.pages` correctly, you're good. no need for extra network calls.

## deduplication across collections

documents can appear in both collections with identical `(did, rkey)`:

- `site.standard.document`
- `pub.leaflet.document`

handle with `ON CONFLICT`:

```sql
INSERT INTO documents (uri, ...)
ON CONFLICT(uri) DO UPDATE SET ...
```

note: leaflet is phasing out `pub.leaflet.document` records, keeping old ones for backwards compat.

## platform detection

collection name doesn't indicate platform for `site.standard.*` records. detection order:

1. **basePath** - infer from publication basePath:

   | basePath contains | platform |
   |-------------------|----------|
   | `leaflet.pub` | leaflet |
   | `pckt.blog` | pckt |
   | `offprint.app` | offprint |
   | `greengale.app` | greengale |

2. **content.$type** - fallback for custom domains (e.g., `cailean.journal.ewancroft.uk`):

   | content.$type starts with | platform |
   |---------------------------|----------|
   | `pub.leaflet.` | leaflet |

3. if neither matches → `other`

## whitewind

[WhiteWind](https://whtwnd.com) (`com.whtwnd.blog.entry`) stores content as markdown in the `content` field (a string, not a blocks structure). extraction is trivial: just use the string directly.

author-only posts (`visibility: "author"`) are skipped.

## content deduplication

two layers prevent duplicate results:

1. **ingestion-time**: content hash (wyhash of `title + \x00 + content`) per author. if the same author publishes identical content across platforms (different rkeys), only the first is indexed.
2. **search-time**: `(did, title)` dedup collapses any remaining duplicates in results (e.g. records indexed before the content hash was added).

## summary

- **pckt/offprint/greengale**: use `textContent` directly
- **leaflet**: extract from `content.pages[].blocks[].block.plaintext`
- **whitewind**: use `content` string directly (markdown)
- **deduplication**: content hash at ingestion + `(did, title)` at search time
- **platform**: infer from basePath, fallback to content.$type for custom domains

## code references

- `backend/src/ingest/extractor.zig` - content extraction logic, content_type field
- `backend/src/ingest/indexer.zig` - platform detection from basePath + content_type, content hash dedup
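for readers not fluent in zig, the extraction priority (textContent → pages → content.pages, walking `blocks[].block.plaintext`) can be sketched in python. this is illustrative only; `extract_text` is a hypothetical name, not the actual API:

```python
def extract_text(record: dict) -> str | None:
    """Flatten a document record to plaintext, trying textContent first."""
    # 1. pre-flattened plaintext (pckt, offprint, greengale)
    text = record.get("textContent")
    if isinstance(text, str) and text:
        return text

    # 2. top-level pages (pub.leaflet.document), then
    # 3. nested pages (site.standard.document with pub.leaflet.content)
    pages = record.get("pages") or record.get("content", {}).get("pages")
    if not pages:
        return None

    # leaflet content lives in content.pages[].blocks[].block.plaintext
    parts = []
    for page in pages:
        for block in page.get("blocks", []):
            plaintext = block.get("block", {}).get("plaintext")
            if plaintext:
                parts.append(plaintext)
    return "\n".join(parts) or None
```

a leaflet-style record with no `textContent` falls through to the block walk; a pckt-style record returns `textContent` without touching `content` at all.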
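the platform detection order (basePath substring match, then content.$type prefix, then `other`) is a small lookup. a python sketch, assuming the mappings from the tables above (`detect_platform` and the table names are hypothetical):

```python
# basePath substring → platform (first detection pass)
BASEPATH_PLATFORMS = {
    "leaflet.pub": "leaflet",
    "pckt.blog": "pckt",
    "offprint.app": "offprint",
    "greengale.app": "greengale",
}

# content.$type prefix → platform (fallback for custom domains)
CONTENT_TYPE_PLATFORMS = {
    "pub.leaflet.": "leaflet",
}

def detect_platform(base_path: str | None, content_type: str | None) -> str:
    """Infer platform from basePath, falling back to content.$type."""
    if base_path:
        for needle, platform in BASEPATH_PLATFORMS.items():
            if needle in base_path:
                return platform
    if content_type:
        for prefix, platform in CONTENT_TYPE_PLATFORMS.items():
            if content_type.startswith(prefix):
                return platform
    return "other"
```

a custom domain like `cailean.journal.ewancroft.uk` matches no basePath needle, so it's caught by the `pub.leaflet.` content-type prefix instead.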
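the ingestion-time dedup keys on `(author, hash(title + "\x00" + content))`. a minimal python sketch of that idea, with stdlib `blake2b` standing in for the wyhash the zig code actually uses (function names are illustrative):

```python
import hashlib

def content_hash(title: str, content: str) -> str:
    """Dedup key material: hash of title + NUL separator + content.
    (the real implementation uses wyhash; blake2b stands in here)"""
    data = (title + "\x00" + content).encode("utf-8")
    return hashlib.blake2b(data, digest_size=8).hexdigest()

def index_if_new(seen: set, did: str, title: str, content: str) -> bool:
    """Return True (and remember the key) only for unseen (author, content) pairs."""
    key = (did, content_hash(title, content))
    if key in seen:
        return False  # same author, same content, different rkey: skip
    seen.add(key)
    return True
```

the NUL separator matters: without it, `("ab", "c")` and `("a", "bc")` would hash identically.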