semantic percolation refactor#
summary#
Replace the current random-hash lattice with a co-occurrence graph model where:
- Nodes = named entities
- Edges = co-occurrence (entities mentioned in same post)
- Activity = mention rate (count / time window)
- Percolation = giant connected component among active entities
This follows the heterogeneous percolation model from Xie et al. 2021 and the NER approach from Hailey's trending topics.
conceptual model#
clusters = topics: entities that get discussed together form clusters. "Trump", "WhiteHouse", "GOP" cluster together because posts about Trump mention them together. "Elon", "Tesla", "SpaceX" form a separate cluster.
percolation = discourse unification: when a real-world event spans topics (Trump and Elon at White House for Tesla), posts start mentioning entities from both clusters together. edges form between clusters. they merge. discourse percolates into a unified conversation.
this is the thing we're trying to visualize: the moment when separate topic-clusters merge into one.
arbitrary choices#
these are parameters and design decisions made arbitrarily to get the system running. each should be revisited as we learn from observing the system. think of this like setting up a Monte Carlo simulation - we make choices to define the rules, then watch what emerges.
1. edge definition: same-post co-occurrence#
choice: two entities are adjacent if they appeared in the same post.
alternatives considered:
- reply/quote chains (entities in conversation threads)
- temporal proximity (entities mentioned within N seconds)
- semantic similarity (embedding distance)
why this choice: simplest to implement, captures "discussed together." a post mentioning Trump and Elon creates an edge because the author chose to discuss them together.
status: ARBITRARY. revisit if clusters don't match intuition about topics.
future consideration: temporal co-activity could also create edges. if two entities are both spiking at the same time (even without same-post co-occurrence), they may be part of the same event. e.g., during an earthquake, "earthquake" and "LA" might both spike without always appearing in the same post. the temporal burstiness itself is signal. this would be: "entities that trend together are connected." not yet implemented.
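the same-post rule can be sketched in a few lines: every unordered pair of entities in one post becomes an edge. this is a minimal illustration, not the Zig implementation; the function name is assumed.

```python
from itertools import combinations

def cooccurrence_edges(entities):
    """Yield one undirected edge per unordered pair of entities in a post.

    Duplicates are collapsed first so repeated mentions of the same
    entity don't create self-edges.
    """
    unique = sorted(set(e.lower() for e in entities))
    yield from combinations(unique, 2)

# a post mentioning Trump, Elon, and Tesla together creates three edges
edges = list(cooccurrence_edges(["Trump", "Elon", "Tesla", "Trump"]))
```

note the O(k²) pair expansion per post: with spaCy typically extracting a handful of entities per post this stays cheap.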
2. activity threshold: rate-based with hysteresis#
choice: entity is "active" based on smoothed rate with separate enter/exit thresholds
- window: 5 minutes (30 buckets × 10s)
- ACTIVITY_THRESHOLD_ENTER: 0.01/sec (~3 per 5 min) to become active
- ACTIVITY_THRESHOLD_EXIT: 0.005/sec (~1.5 per 5 min) to stay active
- smoothed_rate: EWMA with SMOOTHING_ALPHA (default 0.25)
hysteresis logic:
```
if is_active:
    is_active = smoothed_rate >= EXIT_THRESHOLD   # harder to leave
else:
    is_active = smoothed_rate >= ENTER_THRESHOLD  # harder to enter
```
why this choice: prevents entities from flickering in/out of the active list. once active, they stay active until activity drops significantly.
status: IMPLEMENTED. thresholds tuned empirically - initial 0.1/sec was too high, nothing was active.
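a runnable sketch of the EWMA smoothing plus hysteresis, using the document's threshold values; the function name and update order are assumptions, not the actual Zig code.

```python
ENTER_THRESHOLD = 0.01   # mentions/sec to become active
EXIT_THRESHOLD = 0.005   # mentions/sec to stay active
SMOOTHING_ALPHA = 0.25   # EWMA weight on the newest rate sample

def update_entity(smoothed_rate, is_active, raw_rate):
    """Smooth the raw mention rate, then apply the enter/exit hysteresis."""
    smoothed_rate = SMOOTHING_ALPHA * raw_rate + (1 - SMOOTHING_ALPHA) * smoothed_rate
    if is_active:
        is_active = smoothed_rate >= EXIT_THRESHOLD   # harder to leave
    else:
        is_active = smoothed_rate >= ENTER_THRESHOLD  # harder to enter
    return smoothed_rate, is_active
```

an entity hovering at 0.007/sec stays active once it has crossed 0.01, but never activates from below - that asymmetry is what stops the flickering.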
3. percolation definition: NEEDS EMPIRICAL CALIBRATION#
current: show largest_cluster / active_entities as percentage, highlight when >50%
what we learned from the papers:
Newman-Ziff (lattice percolation):
- on regular lattices, "percolation" = cluster wraps around periodic boundaries
- p_c ≈ 0.593 for 2D square lattice
- doesn't apply to us: our graph has no spatial structure. entities are hashed to positions arbitrarily - there's no "boundary to wrap around"
Xie et al (social network percolation):
- models info spread as site percolation on directed follower graphs
- order parameter P_∞ = cascade_size / total_nodes (giant out-component)
- real networks percolate at ~1/10th the uniform-theory threshold
- why? heterogeneous activity: m ~ k_o^α (active users get more followers)
- closer to us but still different: they have a static network and vary retweet probability β analytically
the fundamental mismatch:
- classical: fixed lattice, vary p, transition at p_c
- xie: static follower graph, vary β, analyze GOUT
- us: dynamically growing co-occurrence graph with no controllable parameter
our graph is neither:
- NOT a regular lattice (no spatial structure)
- NOT a static random graph (edges come from co-occurrence events)
- a dynamically evolving undirected graph where edges = "discussed together"
practical approach - empirical calibration:
- collect ratio time series over days/weeks
- look for sharp jumps in the ratio
- correlate jumps with real-world events that span topics
- use that to determine what "discourse unification" looks like in practice
- the 50% threshold is arbitrary - real threshold might be 30% or 70%
possible future definitions (need data to validate):
- track cluster merging events: "percolation" = when two clusters of size > N merge
- null model comparison: "percolation" = observed ratio >> random graph expectation
- cluster size distribution shape: look for power-law vs exponential cutoff
status: ARBITRARY. showing percentage is honest. 50% threshold is placeholder. need empirical data to calibrate what "unified discourse" actually looks like.
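the calibration step above amounts to logging the ratio and flagging sharp jumps. a toy sketch, where `min_jump` is an illustrative sensitivity knob, not a tuned value:

```python
def percolation_jumps(ratios, min_jump=0.2):
    """Flag indices where the largest_cluster / active_entities ratio
    jumps sharply between consecutive samples.
    """
    return [i for i in range(1, len(ratios))
            if ratios[i] - ratios[i - 1] >= min_jump]

# synthetic ratio samples: a quiet period, then a merge event
series = [0.12, 0.15, 0.14, 0.55, 0.60]
jumps = percolation_jumps(series)
```

in practice the flagged timestamps would be cross-referenced against known real-world events to decide where the real threshold sits.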
4. entity position: hash-based#
choice: hash(lowercase(entity_text)) → (x, y) on 128x128 grid
alternatives considered:
- embedding-based (UMAP/t-SNE projection)
- category-based (PERSON zone, ORG zone, etc.)
- force-directed (edges pull entities together)
why this choice: deterministic, stable, simple. same entity always appears in same place. no semantic meaning to position - it's purely for visualization.
status: ARBITRARY. grid is aesthetic choice. position has no meaning. could do force-directed layout where clusters visually clump.
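the position rule is just: stable hash of the lowercased text, split into x and y. a sketch; the choice of md5 here is illustrative - any stable hash works, and the Zig code may use a different one.

```python
import hashlib

GRID_SIZE = 128

def entity_position(text):
    """Deterministic (x, y) cell from a stable hash of the entity text.

    Purely for display: the same entity always lands on the same cell,
    and position carries no semantic meaning.
    """
    h = hashlib.md5(text.lower().encode("utf-8")).digest()
    x = int.from_bytes(h[0:4], "little") % GRID_SIZE
    y = int.from_bytes(h[4:8], "little") % GRID_SIZE
    return x, y
```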
5. edge decay: 30-minute window#
choice: edges have timestamps, only edges seen within EDGE_DECAY_MS (30 min) participate in clustering.
implementation:
- EdgeSet.last_seen[i] tracks when each edge was last observed
- EdgeSet.counts[i] tracks co-occurrence count per edge
- clustering only unions edges where (now - edge_ts) < EDGE_DECAY_MS
- old edges remain in memory but don't contribute to clusters
why this choice: allows clusters to dissolve when topics stop being discussed together. prevents permanent fusion of unrelated clusters that happened to co-occur once.
status: IMPLEMENTED. 30-minute decay window is arbitrary but seems reasonable for "current conversation" scope.
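the decay filter reduces to a timestamp comparison at clustering time. a minimal sketch (dict-based, unlike the fixed-size arrays in EdgeSet):

```python
EDGE_DECAY_MS = 1_800_000  # 30 min

def recent_edges(edges, now_ms):
    """Keep only edges observed within the decay window.

    `edges` maps (a, b) pairs to last-seen timestamps in ms; stale edges
    stay in the dict (mirroring the in-memory behavior) but are excluded
    from clustering.
    """
    return [pair for pair, last_seen in edges.items()
            if now_ms - last_seen < EDGE_DECAY_MS]
```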
6. cluster algorithm: union-find#
choice: standard union-find with path compression, recomputed on active entities periodically
why this choice: O(N) for cluster detection, well-understood, from Newman-Ziff paper
status: NOT ARBITRARY. this is the right algorithm for the job.
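for reference, the weighted union-find with path compression from Newman-Ziff, transliterated to Python (the Zig version should be structurally identical):

```python
def find(parent, i):
    """Find the cluster root, compressing the path on the way back."""
    root = i
    while parent[root] != root:
        root = parent[root]
    while parent[i] != root:  # point every node on the path at the root
        parent[i], i = root, parent[i]
    return root

def union(parent, size, a, b):
    """Union by size: the smaller cluster's root points at the larger's."""
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return
    if size[ra] < size[rb]:
        ra, rb = rb, ra
    parent[rb] = ra
    size[ra] += size[rb]

parent = list(range(5))
size = [1] * 5
union(parent, size, 0, 1)
union(parent, size, 1, 2)
```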
baseline and trend detection#
the supernode problem#
entities like "Trump", "MAGA", "GOP" appear in nearly every political post. with count-based ranking, they always dominate trending - even when nothing interesting is happening with them. they're "always on" supernodes.
solution: baseline EMA + trend score#
each entity tracks a baseline rate using exponential moving average:
baseline = alpha * current_rate + (1 - alpha) * baseline
where alpha = 0.05 (slow adaptation, ~100 updates to converge).
trend score measures how much hotter than baseline:
trend = (current_rate - baseline) / baseline
- trend > 0: entity is above its baseline (rising)
- trend ≈ 0: entity is at baseline (steady state)
- trend < 0: entity is below its baseline (declining)
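the two formulas above, as a runnable sketch; the zero-baseline guard is my addition (the document's formula divides by baseline directly):

```python
BASELINE_ALPHA = 0.05  # slow adaptation, ~100 updates to converge

def update_baseline(baseline, current_rate):
    """Exponential moving average of the entity's historical rate."""
    return BASELINE_ALPHA * current_rate + (1 - BASELINE_ALPHA) * baseline

def trend_score(current_rate, baseline):
    """Relative distance above baseline; guards against a zero baseline."""
    if baseline <= 0:
        return 0.0
    return (current_rate - baseline) / baseline
```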
supernode mitigation#
- degree-normalized edge weighting: at clustering time, raw pheromone weights are normalized by entity frequency: normalized = raw_weight / sqrt(smoothed_rate_a * smoothed_rate_b). this prevents supernodes from gluing unrelated clusters together. raw weights are preserved for edge rendering/sparks.
- TREND_CLUSTER_MIN: only entities with trend >= 0.02 can participate in clustering. supernodes with flat/negative trend get cluster_id = 0 and don't form clusters.
- UI ranking: frontend prioritizes trend over count. entities must have trend > 0 to appear in trending list.
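the degree normalization is a one-liner; this sketch adds an epsilon guard against zero rates, which is an assumption on my part:

```python
import math

def normalized_weight(raw_weight, rate_a, rate_b):
    """Divide a raw co-occurrence weight by the geometric mean of the two
    entities' smoothed rates, so always-on supernodes can't glue
    unrelated clusters together.
    """
    denom = math.sqrt(max(rate_a, 1e-9) * max(rate_b, 1e-9))
    return raw_weight / denom
```

a supernode pair with high rates gets its weight discounted relative to two quiet entities with the same raw weight.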
empirical validation#
2-minute audit (see docs/03-baseline-audit.md) confirmed:
- supernodes (Trump baseline=0.175, trend=-0.019) correctly identified as "always on"
- rising topics (Homan trend=+0.117) correctly surfaced
- ~99% of entities have trend near 0 (at baseline)
- only ~1% have trend > 0.3 (actually trending)
spam filtering via labeler#
ozone label stream#
the NER bridge subscribes to hailey's ozone labeler (ozone.hailey.at) which emits spam/suspicious labels for DIDs and post URIs.
filtered labels#
spam (hard filter): spam, shopping-spam, general-spam, reply-link-spam, inauth-fundraising, coordinated-abuse, men-facet-abuse
suspicious (also filtered): mass-follow-high, mass-follow-mid, elon-handle, new-acct-replies
implementation#
- label cache seeded on startup via queryLabels API
- real-time updates via subscribeLabels websocket (CBOR-encoded)
- posts from labeled DIDs/URIs are dropped before NER processing
- labels respect expiry timestamps and negation
architecture#
backend changes#
module: entity_graph.zig#
Core data structures (simplified):
const EntityId = u32;
const Timestamp = i64; // milliseconds since epoch
const ActivityBuckets = struct {
counts: [BUCKET_COUNT]u32, // mentions per bucket (unbounded)
bucket_epochs: [BUCKET_COUNT]i64, // which epoch each bucket belongs to
pub fn record(self: *ActivityBuckets, ts: Timestamp) void;
pub fn activityRate(self: *const ActivityBuckets, now: Timestamp) f32;
};
const EdgeSet = struct {
edges: [MAX_EDGES_PER_ENTITY]EntityId,
last_seen: [MAX_EDGES_PER_ENTITY]Timestamp, // for decay
counts: [MAX_EDGES_PER_ENTITY]u32, // co-occurrence count
count: u8,
pub fn add(self: *EdgeSet, id: EntityId, now: Timestamp) bool;
pub fn isRecent(self: *const EdgeSet, id: EntityId, now: Timestamp) bool;
pub fn getCount(self: *const EdgeSet, id: EntityId) u32;
};
const Entity = struct {
text: [64]u8,
text_len: u8,
label: [16]u8,
label_len: u8,
activity: ActivityBuckets,
edges: EdgeSet,
cluster_id: u32,
grid_x: u16,
grid_y: u16,
baseline_rate: f32, // EMA of historical activity
smoothed_rate: f32, // EMA for visibility decisions (faster)
is_active: bool, // hysteresis state
last_seen: Timestamp, // for eviction
pub fn trendScore(self: *const Entity, now: Timestamp) f32;
pub fn updateBaseline(self: *Entity, current_rate: f32) void;
pub fn updateHysteresis(self: *Entity) void;
};
const EntityGraph = struct {
entities: [MAX_ENTITIES]Entity,
count: u32,
users: [MAX_USERS]User,
user_count: u32,
parent: [MAX_ENTITIES]i32, // union-find
bridges: BridgeCounter, // cross-cluster co-occurrences
stats: GraphStats,
mutex: Mutex,
pub fn recordPost(self: *EntityGraph, entities: []EntityData, did_hash: ?u64) void;
pub fn updateClusters(self: *EntityGraph) void;
fn edgeWeight(self: *EntityGraph, a: EntityId, b: EntityId, now: Timestamp) f32;
};
modified: http.zig#
Change /entity handler to call entity_graph.recordPost() with the full entity list from each post (not individually).
modified: ws_server.zig#
Broadcasts entity graph state via /entity-graph endpoint (JSON):
{
"entities": [
{
"id": 0,
"text": "Trump",
"label": "PERSON",
"rate": 0.17,
"count": 52,
"trend": -0.019,
"baseline": 0.175,
"cluster": 123,
"cluster_score": 0.45,
"cluster_label": "Trump + Epstein",
"largest": false,
"x": 45,
"y": 78,
"edges": [{"t": 5, "a": 12.5}, {"t": 12, "a": 45.0}]
}
],
"stats": {
"total": 847,
"active": 42,
"clusters": 5,
"largest": 28,
"percolates": true,
"threshold": 0.01,
"users": 1523,
"activeUsers": 89,
"bridgeRate": 2.4
}
}
frontend changes#
modified: grid.js#
Replace lattice rendering with entity rendering:
function renderEntities(entities, stats) {
// clear canvas
ctx.fillStyle = EMPTY_COLOR;
ctx.fillRect(0, 0, width, height);
// draw each entity as a dot/circle
for (const entity of entities) {
const x = entity.x * cellSize;
const y = entity.y * cellSize;
const radius = Math.max(2, entity.rate * 10); // size by activity
// color by cluster membership
if (entity.cluster === stats.largestCluster) {
ctx.fillStyle = LARGEST_CLUSTER_COLOR;
} else {
ctx.fillStyle = LABEL_COLORS[entity.label] || DEFAULT_COLOR;
}
ctx.beginPath();
ctx.arc(x, y, radius, 0, Math.PI * 2);
ctx.fill();
}
}
modified: index.html#
Simplify to one grid view (remove 32x32 and 512x512 mini-grids).
Update stats display:
- "active entities" instead of "density"
- "semantic clusters" instead of "clusters"
- keep "spanning" indicator
- keep "crossings" (threshold crossings)
ner bridge changes#
ner/bridge.py#
- subscribes to turbostream (graze.social) for post stream
- subscribes to ozone labeler for spam/suspicious labels
- extracts record.text from each post
- runs spaCy NER (en_core_web_sm)
- filters to useful entity types, normalizes text
- POSTs all entities from one post together (preserves co-occurrence)
- includes post metadata for "top post per topic" feature
payload = {
"entities": [{"text": "Trump", "label": "PERSON"}, ...],
"did": "did:plc:...",
"post": {
"at_uri": "at://did:plc:.../app.bsky.feed.post/...",
"author_handle": "user.bsky.social",
"author_followers": 1234
}
}
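the filter/normalize step of the bridge can be sketched without the spaCy dependency by operating on (text, label) pairs as spaCy would return them. the label whitelist here is an assumption - the actual set of "useful" types lives in bridge.py; the NORP plural stripping mirrors the open-questions note below.

```python
# assumed whitelist; the real set is defined in bridge.py
USEFUL_LABELS = {"PERSON", "ORG", "GPE", "NORP", "EVENT"}

def normalize_entities(ents):
    """Filter spaCy-style (text, label) pairs down to useful types and
    normalize the text. NORP entities get a naive plural strip
    (Americans -> American); no fuzzy matching beyond that.
    """
    out = []
    for text, label in ents:
        if label not in USEFUL_LABELS:
            continue
        text = text.strip()
        if label == "NORP" and text.endswith("s"):
            text = text[:-1]
        out.append({"text": text, "label": label})
    return out
```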
migration plan#
- Phase 1: Add entity_graph.zig alongside lattice.zig ✓
  - core data structures implemented
  - /entity-graph endpoint added
- Phase 2: Modify NER bridge ✓
  - entities grouped by post
  - POST includes did + post metadata
- Phase 3: Update frontend ✓
  - grid.js renders trending entities
  - surprise-based ranking implemented (z-like vs baseline)
- Phase 4: Remove old code ✓
  - turbostream.zig removed (python bridge is sole consumer)
  - lattice.zig kept for percolation visualization (separate from entity graph)
parameters to tune#
| Parameter | Current Value | Notes |
|---|---|---|
| BUCKET_COUNT | 30 | Number of time buckets |
| BUCKET_DURATION_MS | 10,000 | 10s per bucket → 5 min total window |
| ACTIVITY_THRESHOLD_ENTER | 0.01/s | Rate to become active |
| ACTIVITY_THRESHOLD_EXIT | 0.005/s | Rate to stay active (hysteresis) |
| SMOOTHING_ALPHA | 0.25 | EWMA alpha for smoothed_rate (env: SMOOTHING_ALPHA) |
| EDGE_DECAY_MS | 1,800,000 | 30 min edge decay window |
| EDGE_WEIGHT_MIN | 0.05 | Min pheromone weight for clustering (env: EDGE_WEIGHT_MIN) |
| TREND_CLUSTER_MIN | 0.02 | Min trend to participate in clustering (env: TREND_CLUSTER_MIN) |
| EDGE_PHEROMONE_HALF_LIFE_MS | 600,000 | Edge weight half-life (env: EDGE_PHEROMONE_HALF_LIFE_MS) |
| EDGE_TOP_K | 5 | Top-K edges per node for clustering (env: EDGE_TOP_K) |
| EDGE_SAMPLE_WINDOW_MS | 10,000 | Sampling epoch window (env: EDGE_SAMPLE_WINDOW_MS) |
| EDGE_SAMPLE_MAX | 0.25 | Max exploration probability (env: EDGE_SAMPLE_MAX) |
| EDGE_SAMPLE_SCALE | 2.0 | Weight scale for sampling (env: EDGE_SAMPLE_SCALE) |
| SPANNING_THRESHOLD | 0.5 | Fraction for percolation indicator |
| MAX_ENTITIES | 1000 | Memory bound |
| MAX_EDGES_PER_ENTITY | 32 | Co-occurrence cap per entity |
| MAX_USERS | 2000 | User tracking capacity |
| GRID_SIZE | 128 | For position hashing |
infrastructure needs#
persistence (current)#
State is persisted to SQLite at /data/coral.db on a Fly volume.
persisted:
- entities (id, text, label, grid position, baseline_rate, last_seen)
- users (id, did_hash, last_seen)
- edges (entity_a, entity_b, last_seen)
- posts + entity_posts (top-post tracking)
- entity_baseline (EMA baselines)
not persisted:
- activity buckets (mention counts reset on restart)
- edge pheromone weights (recomputed from recent co-occurrences)
notes:
- activity reset is intentional for now; trends re‑establish quickly
- legacy mentions table removed (migration 007)
open questions#
- Edge decay: Should co-occurrence edges decay over time? → IMPLEMENTED (30 min decay, see arbitrary choice #5)
- Entity normalization: "Taylor Swift" vs "taylor swift" vs "Swift" - currently case-insensitive. NORP entities get plural stripping (Americans → American). no fuzzy matching beyond that.
- Visualization density: With 1000 entities on a 128x128 grid, collisions will happen. Currently not addressed - entities just overlap.
- Historical data: Do we want to show time series of percolation state? Would need to log crossings. Not implemented.
- Cluster visualization: UI shows individual entities ranked by trend + cluster_score boost. cluster_label shown in tooltip (e.g., "Trump + Epstein"). Edge drawing not implemented.
- What does "percolation" feel like? → partially answered by baseline audit. supernodes (Trump, MAGA) have high baseline but flat/negative trend. rising topics (Homan, Australia) have positive trend. UI now prioritizes trend over count.
theoretical background (from papers)#
Newman-Ziff 2000 - lattice percolation#
the algorithm we use is correct: weighted union-find with path compression runs in O(N) time. our implementation matches theirs.
what percolation means on lattices:
- occupy sites with probability p
- percolation = when a cluster "wraps around" periodic boundaries (spans the system)
- at p_c ≈ 0.593 (2D square lattice), spanning cluster emerges
- order parameter = probability of being in spanning cluster
why this doesn't directly apply to us:
- lattice has fixed spatial structure and boundaries
- "spanning" means connecting opposite edges
- our graph has no spatial structure - hash positions are arbitrary
- there's nothing to "span"
Xie et al 2021 - heterogeneous percolation on social networks#
their model:
- directed follower network (static structure)
- β = fraction of nodes who will retweet (occupation probability)
- cascade = giant out-component (GOUT) among occupied nodes
- P_∞ = cascade_size / total_nodes
key findings:
- uniform percolation predicts β_c ≈ 2.5% for Weibo
- actual cascades happen at β ≈ 0.25% (10x lower!)
- why? heterogeneous activity: m ~ k_o^α (active users have more followers)
- positive feedback: active users gain followers faster (Δk_o/Δt ~ k_o^σ m^τ)
threshold formula (Eq. 8):
β_c = Σ P(k_i, k_o)[1 - t_c^m(k_i,k_o)]
where t_c solves: Σ k_i k_o P(k_i, k_o)[1 - t_c^m(k_i,k_o)] = ⟨k⟩
critical exponent = 1 near threshold (mean-field behavior)
implications for coral:
- their order parameter (P_∞ = GOUT/N) could apply to us
- but they have a static network and vary β analytically
- we have a dynamic network and no controllable β
- we observe the system, don't simulate it
key insight: our system is harder to analyze#
- classical percolation: vary p, find p_c, phase transition
- xie approach: fix network, vary β, find β_c, phase transition

our situation:
- graph structure evolving (edges appear as posts arrive)
- "activity" evolving (mention rates change)
- no parameter we control
- we're always "somewhere on the curve" but the curve itself is changing
practical consequence: we probably can't predict a threshold a priori. we need to observe the system, look for sharp transitions, and calibrate empirically what "percolation" looks like in our specific domain.
references#
- Newman & Ziff 2000 - efficient percolation algorithm
- Xie et al. 2021 - heterogeneous percolation on social media
- Hailey's trending topics - NER for topic detection