semantic percolation via named entity recognition#

informed by hailey's post on bluesky trending topics.

the insight#

bluesky's trending topics system doesn't use embeddings on raw text (too noisy - similar topics drift apart, unrelated posts cluster). instead it uses named entity recognition (NER) to extract structured entities (PERSON, ORG, PRODUCT, EVENT, etc.) which dramatically reduces surface area.

from hailey's post:

Immediately, we have reduced the "surface area" that we're working with significantly, which in turn makes it significantly easier to decide what a given thing is about.

this is the key. rather than treating posts as points in high-dimensional embedding space, extract the things they're about and work with those.

applying this to coral#

current state: events hash to random grid positions. pure entropy visualization.

potential state: entities extracted from posts map to grid positions. posts about the same thing occupy the same region. when enough posts mention an entity, that region becomes occupied. when related entities cluster, they can percolate.

what "percolation" could mean here#

in classical site percolation, each site is occupied independently with probability p. once p crosses the critical threshold p_c, a spanning cluster emerges.
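as a reference point, the classical setup can be sketched in a few lines. this is a minimal illustration, not coral's code: the function names, the top-to-bottom spanning criterion, and the 4-neighbor adjacency are all assumptions made for the example.

```python
import random
from collections import deque


def random_grid(n, p, rng):
    """n x n grid where each site is occupied independently with probability p."""
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]


def spans_top_to_bottom(grid):
    """True if occupied sites form a 4-neighbor path from the top row to the bottom row."""
    n = len(grid)  # assumes a square grid
    queue = deque((0, c) for c in range(n) if grid[0][c])
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        if r == n - 1:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def spanning_fraction(n, p, trials, seed=0):
    """Fraction of random grids at occupation probability p that contain a spanning cluster."""
    rng = random.Random(seed)
    hits = sum(spans_top_to_bottom(random_grid(n, p, rng)) for _ in range(trials))
    return hits / trials
```

sweeping p with `spanning_fraction` on a large grid shows the fraction rising sharply near p ≈ 0.593, the known site-percolation threshold for the square lattice.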

in semantic percolation:

sites = entity-position mappings (or entity pairs, or entity-context combinations)
occupation = enough posts mentioning that entity within a time window
adjacency = entities that co-occur in posts, or are semantically related
percolation = a connected cluster of active entities that spans the grid or exceeds a size threshold

the question: when a topic "trends", is it percolating? is there a phase transition structure in how ideas spread?

what we need#

1. entity extraction#

tool: spaCy with en_core_web_trf model

from hailey's post:

handles full firehose throughput on a single consumer GPU
works well with and without casing (important for social media)
outperforms BERT-based NER models

entity types to keep: PERSON, ORG, GPE (places), PRODUCT, EVENT, WORK_OF_ART, FAC (buildings/landmarks)

entity types to ignore: DATE, TIME, MONEY, QUANTITY, ORDINAL, CARDINAL
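a minimal sketch of the keep/ignore filter, assuming entities arrive as objects with `.text` and `.label_` (which is what spaCy's `doc.ents` yields). `keep_entities`, `extract`, and `load_pipeline` are hypothetical names for this example. spaCy is imported lazily so the filter itself has no dependencies.

```python
# labels to keep, taken from the list above; everything else is dropped
KEEP_LABELS = frozenset(
    {"PERSON", "ORG", "GPE", "PRODUCT", "EVENT", "WORK_OF_ART", "FAC"}
)


def keep_entities(ents, keep=KEEP_LABELS):
    """Return (text, label) pairs for entities whose label is in `keep`.

    `ents` is any iterable of objects exposing .text and .label_,
    such as the spans in spaCy's doc.ents.
    """
    return [(ent.text, ent.label_) for ent in ents if ent.label_ in keep]


def extract(text, nlp):
    """Run a loaded pipeline over `text` and return the kept entities."""
    return keep_entities(nlp(text).ents)


def load_pipeline(model="en_core_web_trf"):
    """Load a spaCy pipeline; requires spaCy and the model to be installed."""
    import spacy  # lazy import: only needed when a real pipeline is loaded

    return spacy.load(model)
```

usage would look like `extract("some post text", load_pipeline())`. numeric and temporal labels (DATE, CARDINAL, and so on) fall out automatically because they are not in `KEEP_LABELS`.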

2. entity → position mapping#

this is the hard design question. options:

a) hash-based (simple)

hash entity text to grid coordinates
same entity always maps to same position
pros: deterministic, simple
cons: no spatial semantics, unrelated entities might be adjacent

b) embedding-based (complex)

embed entities, project to 2D via UMAP/t-SNE
similar entities cluster spatially
pros: semantically meaningful layout
cons: need stable projection, new entities shift everything

c) category-based (structured)

divide grid into regions by entity type (PERSON zone, ORG zone, etc.)
within region, hash or embed
pros: visual structure, interpretable
cons: arbitrary boundaries

d) wikidata-based (rich)

hailey mentions spaCy can integrate with wikidata
entities get canonical IDs, relationships, categories
could use wikidata's knowledge graph for adjacency
pros: ground truth, rich structure
cons: complexity, coverage gaps

for v1, probably option (a) with the grid divided into type-based regions (c). simple, deterministic, interpretable.
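the v1 idea (hash within a type-based region) could look roughly like this. everything concrete here is an assumption for illustration: the 64x64 grid, equal-width vertical bands per label, and blake2b as a stand-in for whatever stable hash the backend uses. python's built-in `hash()` is avoided because it is salted per process.

```python
import hashlib

GRID_W, GRID_H = 64, 64  # hypothetical grid size
LABELS = ("PERSON", "ORG", "GPE", "PRODUCT", "EVENT", "WORK_OF_ART", "FAC")


def region_for(label):
    """Vertical band [x0, x1) for a label; the last band absorbs the remainder."""
    i = LABELS.index(label)
    band = GRID_W // len(LABELS)
    x0 = i * band
    x1 = GRID_W if i == len(LABELS) - 1 else x0 + band
    return x0, x1


def entity_position(text, label):
    """Deterministically map an entity to (x, y) inside its label's band."""
    digest = hashlib.blake2b(text.lower().encode("utf-8"), digest_size=8).digest()
    h = int.from_bytes(digest, "little")
    x0, x1 = region_for(label)
    x = x0 + h % (x1 - x0)   # low bits choose the column within the band
    y = (h >> 32) % GRID_H   # high bits choose the row
    return x, y
```

lowercasing before hashing means "Bluesky" and "bluesky" land on the same cell. collisions between unrelated entities are still possible, which is the known cost of option (a).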

3. decay and windowing#

current system: N events in rolling window, cells decay by generation.

semantic version: entity must receive K mentions within time window W to be "occupied". occupation decays as mentions age out.

this naturally creates the fluctuation around threshold we want - entities trend up, fade down, occasionally cross into percolation.
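a sketch of the K-mentions-within-W rule, assuming timestamps are passed in by the caller and arrive in non-decreasing order. `MentionWindow` is a hypothetical name, not an existing coral type.

```python
from collections import defaultdict, deque


class MentionWindow:
    """Tracks mention timestamps per entity and reports occupation."""

    def __init__(self, k, window_seconds):
        self.k = k
        self.window = window_seconds
        self.mentions = defaultdict(deque)  # entity -> timestamps, oldest first

    def _expire(self, entity, now):
        """Drop mentions older than the window; forget entities with none left."""
        q = self.mentions[entity]
        while q and now - q[0] > self.window:
            q.popleft()
        if not q:
            del self.mentions[entity]

    def add(self, entity, now):
        """Record one mention of `entity` at time `now`."""
        self.mentions[entity].append(now)
        self._expire(entity, now)

    def is_occupied(self, entity, now):
        """True if the entity has at least k mentions within the last window."""
        if entity not in self.mentions:
            return False
        self._expire(entity, now)
        return len(self.mentions.get(entity, ())) >= self.k
```

occupation switches off on its own as mentions age out, so no separate decay pass is needed in this version.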

4. adjacency definition#

classical percolation: 4-neighbor or 8-neighbor on grid.

semantic options:

grid neighbors (same as now, but positions are entity-derived)
co-occurrence (entities mentioned in same post are adjacent)
reply chains (entities in reply threads are connected)

co-occurrence is interesting because it creates semantic edges, not spatial ones. but then we're no longer doing site percolation on a lattice - we're doing percolation on a graph, where active entities act as sites and co-occurrences act as bonds.

hybrid approach: grid for visualization, but spanning detection uses co-occurrence graph. "percolation" = connected component in co-occurrence graph reaches threshold size.
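the spanning check in the hybrid approach reduces to tracking connected components. a minimal union-find sketch follows, with hypothetical names. it deliberately ignores edge decay, since union-find cannot remove edges; one option would be to rebuild the structure per time window.

```python
from itertools import combinations


class CooccurrenceGraph:
    """Union-find over entities; entities in the same post get connected."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def _find(self, x):
        """Return the component root of x, registering x if it is new."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def _union(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger
        self.size[ra] += self.size[rb]

    def add_post(self, entities):
        """Connect every pair of distinct entities mentioned in one post."""
        ents = sorted(set(entities))
        for e in ents:
            self._find(e)
        for a, b in combinations(ents, 2):
            self._union(a, b)

    def largest_component(self):
        roots = (r for r in self.parent if self.parent[r] == r)
        return max((self.size[r] for r in roots), default=0)

    def percolates(self, threshold):
        """True once the largest connected component reaches `threshold` entities."""
        return self.largest_component() >= threshold
```

the threshold itself is a free parameter here. choosing it (absolute size, or a fraction of active entities) is part of the open question about what the phase transition is.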

5. infrastructure changes#

good news: turbostream already gives us post content. looking at music-atmosphere-feed:

const post_record = commit.object.get("record") orelse return error.NotAPost;
const hydrated = record_obj.get("hydrated_metadata");

turbostream provides:

commit.record - full post record including text field
hydrated_metadata - quoted/parent post content

coral's turbostream.zig already receives this - we just discard it and only use did + rkey for hashing. the text is right there.

backend additions needed:

extract record.text from turbostream messages (trivial - data already available)
spaCy integration for NER (python sidecar or API)
entity → position mapping
co-occurrence tracking (optional)

spaCy integration options:

1. python sidecar service - zig sends text over unix socket or HTTP, python returns entities
2. rewrite backend in python (simpler integration, but slower websocket handling)
3. external NER API (cost, latency, but zero infra)
4. pure zig via ONNX runtime - export model to ONNX, use onnxruntime.zig + C++ tokenizer

benchmarked (see bench/):

measured firehose rate: ~60 posts/sec
spaCy en_core_web_sm on CPU: ~1200 docs/sec
spaCy en_core_web_md on CPU: ~1350 docs/sec
headroom: ~20x (1200 docs/sec ÷ 60 posts/sec) - python sidecar is viable

decision: proceeding with option 1 (python sidecar). 20x headroom means IPC overhead won't matter. option 4 (pure zig/ONNX) is possible but requires more integration work for marginal benefit - revisit if python becomes a bottleneck.

open questions#

entity stability: how do we handle entity variations? "Bluesky", "bluesky", "bsky" - same entity? need normalization or clustering.
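what's the "phase transition"?: in random percolation, p_c depends on the lattice (about 0.593 for site percolation on a square grid), while the critical exponents are universal. in semantic percolation, is there a natural threshold where topic attention suddenly connects into a spanning cluster?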
visualization: do we show raw entity activity, or interpret it? (e.g., label regions, show entity names on hover)
multiple languages: spaCy has models for other languages. do we need multi-language support? (probably yes for ATProto)
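for the entity stability question, a first pass could be plain string normalization plus a small alias table. this is only a sketch: `ALIASES` is a hypothetical hand-maintained mapping, and anything beyond case and whitespace (clustering, wikidata IDs) is out of scope here.

```python
import unicodedata

# hypothetical alias table: maps a normalized variant to its canonical key
ALIASES = {"bsky": "bluesky"}


def normalize_entity(text, aliases=ALIASES):
    """Collapse case, unicode form, whitespace, and leading @/# into one key."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    key = " ".join(folded.split())  # trim and collapse internal whitespace
    key = key.lstrip("@#")          # treat handles and hashtags like plain names
    return aliases.get(key, key)
```

with this, "Bluesky", "bluesky", and "bsky" all map to the key "bluesky". variants not covered by the table stay separate entities.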

implementation status#

phase 1: python NER sidecar ✓#

ner/server.py - HTTP service with POST /extract endpoint
returns {entities: [{text, label, start, end}]}
uses en_core_web_sm for speed (~1200 docs/sec)
filters to useful entity types (PERSON, ORG, GPE, PRODUCT, EVENT, WORK_OF_ART, FAC, NORP, LOC)

phase 2: zig integration ✓#

backend/src/http.zig - POST /entity endpoint
accepts {entities: [{text, label}]} and feeds to lattice
hash entities to grid positions (wyhash on lowercase text)

phase 3: python bridge ✓#

ner/bridge.py - turbostream -> spaCy -> Zig
connects to turbostream, extracts post text from commit.record.text
batches entities (20 per batch, 1 sec timeout)
POSTs to Zig's /entity endpoint
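the batching rule (flush at 20 entities or after 1 second) can be sketched independently of the transport. `EntityBatcher` and its `send` callback are hypothetical names. the payload shape follows the `{entities: [{text, label}]}` format the /entity endpoint accepts, and the clock is injectable so the timeout can be exercised without sleeping.

```python
import time


class EntityBatcher:
    """Buffers entities and hands them to `send` by size or by age."""

    def __init__(self, send, max_size=20, max_age=1.0, clock=time.monotonic):
        self.send = send          # callable taking one payload dict
        self.max_size = max_size
        self.max_age = max_age    # seconds since the first item in the batch
        self.clock = clock
        self.items = []
        self.started = None

    def add(self, entity):
        """Queue one {text, label} dict and flush if the batch is due."""
        if not self.items:
            self.started = self.clock()
        self.items.append(entity)
        self.flush_if_due()

    def flush_if_due(self):
        """Flush when the batch is full or its oldest item is too old."""
        if not self.items:
            return
        full = len(self.items) >= self.max_size
        stale = self.clock() - self.started >= self.max_age
        if full or stale:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single payload."""
        if self.items:
            batch, self.items = self.items, []
            self.send({"entities": batch})
```

the timeout only fires when something calls `flush_if_due`, so a consumer loop would need to call it periodically while the stream is idle.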

stability learnings:

httpx keep-alive doesn't play well with Zig's std.http.Server
solution: max_keepalive_connections=0 forces fresh connections
latency ~41ms/message (acceptable, 20x headroom over firehose rate)
0 dropped batches with this config

remaining work#

~~add flag to disable Zig's turbostream consumer in NER mode~~ → removed entirely, python bridge is sole consumer
type-aware hashing (include entity label in hash) → deferred, not needed
frontend entity visualization (grid.js shows trending entities with tooltips)
co-occurrence tracking (entity_graph.zig EdgeSet with timestamps)

references#

hailey's trending topics post
spaCy - NER library
en_core_web_trf - transformer-based english model
wikidata - structured knowledge base