
semantic percolation via named entity recognition#

informed by hailey's post on bluesky trending topics.

the insight#

bluesky's trending topics system doesn't use embeddings on raw text (too noisy - similar topics drift apart, unrelated posts cluster). instead it uses named entity recognition (NER) to extract structured entities (PERSON, ORG, PRODUCT, EVENT, etc.) which dramatically reduces surface area.

from hailey's post:

> Immediately, we have reduced the "surface area" that we're working with significantly, which in turn makes it significantly easier to decide what a given thing is about.

this is the key. rather than treating posts as points in high-dimensional embedding space, extract the things they're about and work with those.

applying this to coral#

current state: events hash to random grid positions. pure entropy visualization.

potential state: entities extracted from posts map to grid positions. posts about the same thing occupy the same region. when enough posts mention an entity, that region becomes occupied. when related entities cluster, they can percolate.

what "percolation" could mean here#

in classical percolation, each site is occupied independently with probability p. at the critical threshold p_c, a spanning cluster first emerges.
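to make the classical picture concrete, here is a minimal sketch of site percolation on a square grid: occupy each site independently with probability p, then look for a top-to-bottom spanning cluster using 4-neighbor adjacency. the function names, grid size, and seed are assumptions for illustration, not coral's code.

```python
import random
from collections import deque


def random_grid(n, p, rng):
    """Occupy each of the n x n sites independently with probability p."""
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]


def spans(grid):
    """True if occupied sites connect the top row to the bottom row (4-neighbor)."""
    n = len(grid)
    seen = set()
    queue = deque()
    # Start a breadth-first search from every occupied site in the top row.
    for c in range(n):
        if grid[0][c]:
            seen.add((0, c))
            queue.append((0, c))
    while queue:
        r, c = queue.popleft()
        if r == n - 1:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def spanning_fraction(n, p, trials, seed=0):
    """Fraction of random grids at occupation probability p that span."""
    rng = random.Random(seed)
    return sum(spans(random_grid(n, p, rng)) for _ in range(trials)) / trials
```

sweeping p with `spanning_fraction` should show the spanning probability jumping from near 0 to near 1 around p ≈ 0.593, the site percolation threshold of the square lattice. the jump gets sharper as the grid grows.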

in semantic percolation:

  • sites = entity-position mappings (or entity pairs, or entity-context combinations)
  • occupation = enough posts mentioning that entity within a time window
  • adjacency = entities that co-occur in posts, or are semantically related
  • percolation = a connected cluster of active entities that grows past some size threshold

the question: when a topic "trends", is it percolating? is there a phase transition structure in how ideas spread?

what we need#

1. entity extraction#

tool: spaCy with en_core_web_trf model

from hailey's post:

  • handles full firehose throughput on a single consumer GPU
  • works well with and without casing (important for social media)
  • outperforms BERT-based NER models

entity types to keep: PERSON, ORG, GPE (places), PRODUCT, EVENT, WORK_OF_ART, FAC (buildings/landmarks)

entity types to ignore: DATE, TIME, MONEY, QUANTITY, ORDINAL, CARDINAL
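a minimal sketch of the keep/ignore filter, assuming a spaCy pipeline object is passed in as `nlp`. `doc.ents`, `ent.text`, and `ent.label_` are spaCy's own attributes; the function names and the shape of the return value are assumptions.

```python
# Labels worth tracking; everything else (DATE, TIME, MONEY, ...) is dropped.
KEEP_LABELS = {"PERSON", "ORG", "GPE", "PRODUCT", "EVENT", "WORK_OF_ART", "FAC"}


def keep_entities(pairs):
    """Filter (text, label) pairs down to the labels in KEEP_LABELS."""
    return [(text, label) for text, label in pairs if label in KEEP_LABELS]


def extract(nlp, text):
    """Run a spaCy pipeline over text and return the kept (text, label) pairs."""
    doc = nlp(text)
    return keep_entities((ent.text, ent.label_) for ent in doc.ents)


# Usage, assuming spaCy and the model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_trf")
#   extract(nlp, "Apple announced a new iPhone in Cupertino on Tuesday")
```

passing `nlp` in rather than loading it inside the function keeps the filter independent of which model is used, so swapping `en_core_web_trf` for a smaller model only changes the caller.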

2. entity → position mapping#

this is the hard design question. options:

a) hash-based (simple)

  • hash entity text to grid coordinates
  • same entity always maps to same position
  • pros: deterministic, simple
  • cons: no spatial semantics, unrelated entities might be adjacent

b) embedding-based (complex)

  • embed entities, project to 2D via UMAP/t-SNE
  • similar entities cluster spatially
  • pros: semantically meaningful layout
  • cons: need stable projection, new entities shift everything

c) category-based (structured)

  • divide grid into regions by entity type (PERSON zone, ORG zone, etc.)
  • within region, hash or embed
  • pros: visual structure, interpretable
  • cons: arbitrary boundaries

d) wikidata-based (rich)

  • hailey mentions spaCy can integrate with wikidata
  • entities get canonical IDs, relationships, categories
  • could use wikidata's knowledge graph for adjacency
  • pros: ground truth, rich structure
  • cons: complexity, coverage gaps

for v1, probably option (a) with the grid divided into type-based regions (c). simple, deterministic, interpretable.
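a sketch of what (a) + (c) could look like: the entity label picks a vertical band of the grid, and a hash of the lowercased text picks a cell inside that band. the grid size, the band layout, and the use of blake2b (as a stand-in for whatever hash the backend ends up using) are all assumptions.

```python
import hashlib

GRID_W, GRID_H = 128, 128  # assumed grid size
LABELS = ["PERSON", "ORG", "GPE", "PRODUCT", "EVENT", "WORK_OF_ART", "FAC"]


def _hash64(s):
    """Stable 64-bit hash of a string (unlike Python's built-in hash())."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def entity_position(text, label):
    """Map an entity to (x, y): the label picks a band, the hash picks a cell in it.

    Raises ValueError for labels not listed in LABELS.
    """
    band = LABELS.index(label)
    x0 = band * GRID_W // len(LABELS)
    x1 = (band + 1) * GRID_W // len(LABELS)
    width = x1 - x0
    h = _hash64(text.lower())
    x = x0 + h % width
    y = (h // width) % GRID_H
    return x, y
```

the same entity always lands on the same cell regardless of casing, and all PERSON entities stay left of all ORG entities. neighbors inside a band are still arbitrary, which is the known weakness of option (a).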

3. decay and windowing#

current system: N events in rolling window, cells decay by generation.

semantic version: entity must receive K mentions within time window W to be "occupied". occupation decays as mentions age out.

this naturally creates the fluctuation around threshold we want - entities trend up, fade down, occasionally cross into percolation.
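the K-mentions-in-window-W rule can be sketched with one deque of timestamps per entity. this assumes timestamps arrive in non-decreasing order; the class and method names are hypothetical.

```python
from collections import deque


class MentionWindow:
    """Tracks mentions per entity; an entity is occupied with >= k mentions in the window."""

    def __init__(self, k, window):
        self.k = k            # mentions required for occupation
        self.window = window  # window length, in the same units as the timestamps
        self._times = {}      # entity -> deque of mention timestamps

    def mention(self, entity, now):
        self._times.setdefault(entity, deque()).append(now)

    def count(self, entity, now):
        """Number of mentions newer than now - window; older ones are dropped."""
        times = self._times.get(entity)
        if times is None:
            return 0
        while times and times[0] <= now - self.window:
            times.popleft()
        if not times:
            del self._times[entity]  # forget entities that have fully aged out
        return len(times)

    def occupied(self, entity, now):
        return self.count(entity, now) >= self.k
```

with k=3 and a 60 second window, an entity mentioned at t=0, 10, 20 is occupied until t=60, when the first mention ages out and the count drops to 2.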

4. adjacency definition#

classical percolation: 4-neighbor or 8-neighbor on grid.

semantic options:

  • grid neighbors (same as now, but positions are entity-derived)
  • co-occurrence (entities mentioned in same post are adjacent)
  • reply chains (entities in reply threads are connected)

co-occurrence is interesting because it creates semantic edges, not spatial ones. but then we're not doing site percolation on a lattice - we're doing something more like bond percolation on a graph.

hybrid approach: grid for visualization, but spanning detection uses co-occurrence graph. "percolation" = connected component in co-occurrence graph reaches threshold size.
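a sketch of the hybrid's detection half: build an undirected co-occurrence graph from the entity sets of recent posts, then test whether the largest connected component reaches a threshold size. it assumes the posts passed in have already been restricted to the active time window; the function names are hypothetical.

```python
from itertools import combinations


def cooccurrence_graph(posts):
    """posts: iterable of entity collections, one per post. Returns {entity: set(neighbors)}."""
    adj = {}
    for entities in posts:
        unique = set(entities)
        for e in unique:
            adj.setdefault(e, set())  # entities with no co-occurrences are still nodes
        for a, b in combinations(unique, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def largest_component(adj):
    """Return the node set of the largest connected component (depth-first search)."""
    seen = set()
    best = set()
    for start in adj:
        if start in seen:
            continue
        comp = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in comp:
                    comp.add(nb)
                    stack.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def percolates(posts, threshold):
    """True if some connected cluster of co-occurring entities has >= threshold members."""
    return len(largest_component(cooccurrence_graph(posts))) >= threshold
```

for example, posts mentioning {A, B}, {B, C}, and {D} give a largest component {A, B, C}, so `percolates` is true at threshold 3 and false at 4.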

5. infrastructure changes#

good news: turbostream already gives us post content. looking at music-atmosphere-feed:

```zig
const post_record = commit.object.get("record") orelse return error.NotAPost;
const hydrated = record_obj.get("hydrated_metadata");
```

turbostream provides:

  • commit.record - full post record including text field
  • hydrated_metadata - quoted/parent post content

coral's turbostream.zig already receives this - we just discard it and only use did + rkey for hashing. the text is right there.

backend additions needed:

  • extract record.text from turbostream messages (trivial - data already available)
  • spaCy integration for NER (python sidecar or API)
  • entity → position mapping
  • co-occurrence tracking (optional)

spaCy integration options:

  1. python sidecar service - zig sends text over unix socket or HTTP, python returns entities
  2. rewrite backend in python (simpler integration, but slower websocket handling)
  3. external NER API (cost, latency, but zero infra)
  4. pure zig via ONNX runtime - export model to ONNX, use onnxruntime.zig + C++ tokenizer

benchmarked (see bench/):

  • measured firehose rate: ~60 posts/sec
  • spaCy en_core_web_sm on CPU: ~1200 docs/sec
  • spaCy en_core_web_md on CPU: ~1350 docs/sec
  • headroom: ~20x (~1200 docs/sec against ~60 posts/sec) - python sidecar is viable

decision: proceeding with option 1 (python sidecar). 20x headroom means IPC overhead won't matter. option 4 (pure zig/ONNX) is possible but requires more integration work for marginal benefit - revisit if python becomes a bottleneck.

open questions#

  1. entity stability: how do we handle entity variations? "Bluesky", "bluesky", "bsky" - same entity? need normalization or clustering.

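  2. what's the "phase transition"?: in random percolation, p_c depends on the lattice (about 0.593 for site percolation on a square grid), while the critical behavior near it is universal. in semantic percolation, is there a natural threshold where topic attention suddenly connects into a spanning cluster?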

  3. visualization: do we show raw entity activity, or interpret it? (e.g., label regions, show entity names on hover)

  4. multiple languages: spaCy has models for other languages. do we need multi-language support? (probably yes for ATProto)

implementation status#

phase 1: python NER sidecar ✓#

  • ner/server.py - HTTP service with POST /extract endpoint
  • returns {entities: [{text, label, start, end}]}
  • uses en_core_web_sm for speed (~1200 docs/sec)
  • filters to useful entity types (PERSON, ORG, GPE, PRODUCT, EVENT, WORK_OF_ART, FAC, NORP, LOC)

phase 2: zig integration ✓#

  • backend/src/http.zig - POST /entity endpoint
  • accepts {entities: [{text, label}]} and feeds to lattice
  • hash entities to grid positions (wyhash on lowercase text)

phase 3: python bridge ✓#

  • ner/bridge.py - turbostream -> spaCy -> Zig
  • connects to turbostream, extracts post text from commit.record.text
  • batches entities (20 per batch, 1 sec timeout)
  • POSTs to Zig's /entity endpoint
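the batching rule (flush at 20 entities or after 1 second, whichever comes first) can be sketched as below. this is not ner/bridge.py itself: the class name, the `send` callback, and measuring the timeout from the first entity in the batch are assumptions.

```python
import time


class EntityBatcher:
    """Collects entities and hands them to `send` in batches."""

    def __init__(self, send, max_size=20, max_wait=1.0, clock=time.monotonic):
        self.send = send          # callable taking a list of entities
        self.max_size = max_size  # flush when this many entities are buffered
        self.max_wait = max_wait  # flush when the oldest entity has waited this long (seconds)
        self.clock = clock        # injectable for testing
        self._items = []
        self._first_at = None

    def add(self, entity):
        if not self._items:
            self._first_at = self.clock()
        self._items.append(entity)
        if len(self._items) >= self.max_size:
            self.flush()

    def tick(self):
        """Call periodically; flushes a partial batch that has waited too long."""
        if self._items and self.clock() - self._first_at >= self.max_wait:
            self.flush()

    def flush(self):
        if not self._items:
            return
        batch, self._items = self._items, []
        self._first_at = None
        self.send(batch)
```

the timeout matters during quiet periods: without `tick`, a batch of 19 entities would sit in the buffer until the 20th arrived.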

stability learnings:

  • httpx keep-alive doesn't play well with Zig's std.http.Server
  • solution: max_keepalive_connections=0 forces fresh connections
  • latency ~41ms/message (acceptable given the ~20x NER headroom over the firehose rate)
  • 0 dropped batches with this config
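the keep-alive workaround looks roughly like this. `httpx.Limits(max_keepalive_connections=0)` is httpx's own setting; the base URL, port, timeout, and function names are assumptions, and the payload shape follows the /entity endpoint described above.

```python
def make_client(base_url="http://127.0.0.1:3000"):
    """Build an httpx client that never reuses connections (port is an assumption)."""
    # Imported lazily so the helpers below can be used without httpx installed.
    import httpx

    # Zero keep-alive slots: every request opens a fresh connection.
    limits = httpx.Limits(max_keepalive_connections=0)
    return httpx.Client(base_url=base_url, limits=limits, timeout=5.0)


def post_entities(client, batch):
    """POST a batch of (text, label) pairs to the /entity endpoint."""
    payload = {"entities": [{"text": text, "label": label} for text, label in batch]}
    resp = client.post("/entity", json=payload)
    resp.raise_for_status()
    return resp.status_code
```

fresh connections cost a TCP handshake per batch, which is cheap on localhost and small next to the measured per-message latency.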

remaining work#

  • add flag to disable Zig's turbostream consumer in NER mode → removed entirely, python bridge is sole consumer
  • type-aware hashing (include entity label in hash) → deferred, not needed
  • frontend entity visualization (grid.js shows trending entities with tooltips)
  • co-occurrence tracking (entity_graph.zig EdgeSet with timestamps)

references#