semantic percolation refactor#
summary#
Replace the current random-hash lattice with a co-occurrence graph model where:
- Nodes = named entities
- Edges = co-occurrence (entities mentioned in same post)
- Activity = mention rate (count / time window)
- Percolation = giant connected component among active entities
This follows the heterogeneous percolation model from Xie et al. 2021 and the NER approach from Hailey's trending topics.
conceptual model#
clusters = topics: entities that get discussed together form clusters. "Trump", "WhiteHouse", "GOP" cluster together because posts about Trump mention them together. "Elon", "Tesla", "SpaceX" form a separate cluster.
percolation = discourse unification: when a real-world event spans topics (Trump and Elon at White House for Tesla), posts start mentioning entities from both clusters together. edges form between clusters. they merge. discourse percolates into a unified conversation.
this is the thing we're trying to visualize: the moment when separate topic-clusters merge into one.
arbitrary choices#
these are parameters and design decisions made arbitrarily to get the system running. each should be revisited as we learn from observing the system. think of this like setting up a Monte Carlo simulation - we make choices to define the rules, then watch what emerges.
1. edge definition: same-post co-occurrence#
choice: two entities are adjacent if they appeared in the same post.
alternatives considered:
- reply/quote chains (entities in conversation threads)
- temporal proximity (entities mentioned within N seconds)
- semantic similarity (embedding distance)
why this choice: simplest to implement, captures "discussed together." a post mentioning Trump and Elon creates an edge because the author chose to discuss them together.
status: ARBITRARY. revisit if clusters don't match intuition about topics.
future consideration: temporal co-activity could also create edges. if two entities are both spiking at the same time (even without same-post co-occurrence), they may be part of the same event. e.g., during an earthquake, "earthquake" and "LA" might both spike without always appearing in the same post. the temporal burstiness itself is signal. this would be: "entities that trend together are connected." not yet implemented.
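the same-post rule can be sketched in a few lines: every unordered pair of entities in one post becomes an edge. this is a minimal illustration, not the Zig implementation; the function name is assumed.

```python
from itertools import combinations

def cooccurrence_edges(entities):
    """Yield one undirected edge per unordered pair of entities in a post.

    Duplicates are collapsed first so repeated mentions of the same
    entity don't create self-edges.
    """
    unique = sorted(set(e.lower() for e in entities))
    yield from combinations(unique, 2)

# a post mentioning Trump, Elon, and Tesla together creates three edges
edges = list(cooccurrence_edges(["Trump", "Elon", "Tesla", "Trump"]))
```

note the O(k²) pair expansion per post: with spaCy typically extracting a handful of entities per post this stays cheap.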
2. activity threshold: rate-based with hysteresis#
choice: entity is "active" based on smoothed rate with separate enter/exit thresholds
- window: 5 minutes (30 buckets × 10s)
- ACTIVITY_THRESHOLD_ENTER: 0.01/sec (~3 per 5 min) to become active
- ACTIVITY_THRESHOLD_EXIT: 0.005/sec (~1.5 per 5 min) to stay active
- smoothed_rate: EWMA with SMOOTHING_ALPHA (default 0.25)
hysteresis logic:
```
if is_active:
    is_active = smoothed_rate >= EXIT_THRESHOLD   # harder to leave
else:
    is_active = smoothed_rate >= ENTER_THRESHOLD  # harder to enter
```
why this choice: prevents entities from flickering in/out of the active list. once active, they stay active until activity drops significantly.
status: IMPLEMENTED. thresholds tuned empirically - initial 0.1/sec was too high, nothing was active.
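a runnable sketch of the EWMA smoothing plus hysteresis, using the document's threshold values; the function name and update order are assumptions, not the actual Zig code.

```python
ENTER_THRESHOLD = 0.01   # mentions/sec to become active
EXIT_THRESHOLD = 0.005   # mentions/sec to stay active
SMOOTHING_ALPHA = 0.25   # EWMA weight on the newest rate sample

def update_entity(smoothed_rate, is_active, raw_rate):
    """Smooth the raw mention rate, then apply the enter/exit hysteresis."""
    smoothed_rate = SMOOTHING_ALPHA * raw_rate + (1 - SMOOTHING_ALPHA) * smoothed_rate
    if is_active:
        is_active = smoothed_rate >= EXIT_THRESHOLD   # harder to leave
    else:
        is_active = smoothed_rate >= ENTER_THRESHOLD  # harder to enter
    return smoothed_rate, is_active
```

an entity hovering at 0.007/sec stays active once it has crossed 0.01, but never activates from below - that asymmetry is what stops the flickering.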
3. percolation definition: NEEDS EMPIRICAL CALIBRATION#
current: show largest_cluster / active_entities as percentage, highlight when >50%
what we learned from the papers:
Newman-Ziff (lattice percolation):
- on regular lattices, "percolation" = cluster wraps around periodic boundaries
- p_c ≈ 0.593 for 2D square lattice
- doesn't apply to us: our graph has no spatial structure. entities are hashed to positions arbitrarily - there's no "boundary to wrap around"
Xie et al (social network percolation):
- models info spread as site percolation on directed follower graphs
- order parameter P_∞ = cascade_size / total_nodes (giant out-component)
- real networks percolate at ~1/10th the uniform-theory threshold
- why? heterogeneous activity: m ~ k_o^α (active users get more followers)
- closer to us but still different: they have a static network and vary retweet probability β analytically
the fundamental mismatch:
- classical: fixed lattice, vary p, transition at p_c
- xie: static follower graph, vary β, analyze GOUT
- us: dynamically growing co-occurrence graph with no controllable parameter
our graph is neither:
- NOT a regular lattice (no spatial structure)
- NOT a static random graph (edges come from co-occurrence events)
- a dynamically evolving undirected graph where edges = "discussed together"
practical approach - empirical calibration:
- collect ratio time series over days/weeks
- look for sharp jumps in the ratio
- correlate jumps with real-world events that span topics
- use that to determine what "discourse unification" looks like in practice
- the 50% threshold is arbitrary - real threshold might be 30% or 70%
possible future definitions (need data to validate):
- track cluster merging events: "percolation" = when two clusters of size > N merge
- null model comparison: "percolation" = observed ratio >> random graph expectation
- cluster size distribution shape: look for power-law vs exponential cutoff
status: ARBITRARY. showing percentage is honest. 50% threshold is placeholder. need empirical data to calibrate what "unified discourse" actually looks like.
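the calibration step above amounts to logging the ratio and flagging sharp jumps. a toy sketch, where `min_jump` is an illustrative sensitivity knob, not a tuned value:

```python
def percolation_jumps(ratios, min_jump=0.2):
    """Flag indices where the largest_cluster / active_entities ratio
    jumps sharply between consecutive samples.
    """
    return [i for i in range(1, len(ratios))
            if ratios[i] - ratios[i - 1] >= min_jump]

# synthetic ratio samples: a quiet period, then a merge event
series = [0.12, 0.15, 0.14, 0.55, 0.60]
jumps = percolation_jumps(series)
```

in practice the flagged timestamps would be cross-referenced against known real-world events to decide where the real threshold sits.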
4. entity position: hash-based#
choice: hash(lowercase(entity_text)) → (x, y) on 128x128 grid
alternatives considered:
- embedding-based (UMAP/t-SNE projection)
- category-based (PERSON zone, ORG zone, etc.)
- force-directed (edges pull entities together)
why this choice: deterministic, stable, simple. same entity always appears in same place. no semantic meaning to position - it's purely for visualization.
status: ARBITRARY. grid is aesthetic choice. position has no meaning. could do force-directed layout where clusters visually clump.
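the position rule is just: stable hash of the lowercased text, split into x and y. a sketch; the choice of md5 here is illustrative - any stable hash works, and the Zig code may use a different one.

```python
import hashlib

GRID_SIZE = 128

def entity_position(text):
    """Deterministic (x, y) cell from a stable hash of the entity text.

    Purely for display: the same entity always lands on the same cell,
    and position carries no semantic meaning.
    """
    h = hashlib.md5(text.lower().encode("utf-8")).digest()
    x = int.from_bytes(h[0:4], "little") % GRID_SIZE
    y = int.from_bytes(h[4:8], "little") % GRID_SIZE
    return x, y
```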
5. edge decay: 30-minute window#
choice: edges have timestamps, only edges seen within EDGE_DECAY_MS (30 min) participate in clustering.
implementation:
- EdgeSet.last_seen[i] tracks when each edge was last observed
- EdgeSet.counts[i] tracks co-occurrence count per edge
- clustering only unions edges where (now - edge_ts) < EDGE_DECAY_MS
- old edges remain in memory but don't contribute to clusters
why this choice: allows clusters to dissolve when topics stop being discussed together. prevents permanent fusion of unrelated clusters that happened to co-occur once.
status: IMPLEMENTED. 30-minute decay window is arbitrary but seems reasonable for "current conversation" scope.
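the decay filter reduces to a timestamp comparison at clustering time. a minimal sketch (dict-based, unlike the fixed-size arrays in EdgeSet):

```python
EDGE_DECAY_MS = 1_800_000  # 30 min

def recent_edges(edges, now_ms):
    """Keep only edges observed within the decay window.

    `edges` maps (a, b) pairs to last-seen timestamps in ms; stale edges
    stay in the dict (mirroring the in-memory behavior) but are excluded
    from clustering.
    """
    return [pair for pair, last_seen in edges.items()
            if now_ms - last_seen < EDGE_DECAY_MS]
```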
6. cluster algorithm: union-find#
choice: standard union-find with path compression, recomputed on active entities periodically
why this choice: O(N) for cluster detection, well-understood, from Newman-Ziff paper
status: NOT ARBITRARY. this is the right algorithm for the job.
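for reference, the weighted union-find with path compression from Newman-Ziff, transliterated to Python (the Zig version should be structurally identical):

```python
def find(parent, i):
    """Find the cluster root, compressing the path on the way back."""
    root = i
    while parent[root] != root:
        root = parent[root]
    while parent[i] != root:  # point every node on the path at the root
        parent[i], i = root, parent[i]
    return root

def union(parent, size, a, b):
    """Union by size: the smaller cluster's root points at the larger's."""
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return
    if size[ra] < size[rb]:
        ra, rb = rb, ra
    parent[rb] = ra
    size[ra] += size[rb]

parent = list(range(5))
size = [1] * 5
union(parent, size, 0, 1)
union(parent, size, 1, 2)
```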
baseline and trend detection#
the supernode problem#
entities like "Trump", "MAGA", "GOP" appear in nearly every political post. with count-based ranking, they always dominate trending - even when nothing interesting is happening with them. they're "always on" supernodes.
solution: baseline EMA + trend score#
each entity tracks a baseline rate using exponential moving average:
baseline = alpha * current_rate + (1 - alpha) * baseline
where alpha = 0.05 (slow adaptation, ~100 updates to converge).
trend score measures how much hotter than baseline:
trend = (current_rate - baseline) / baseline
- trend > 0: entity is above its baseline (rising)
- trend ≈ 0: entity is at baseline (steady state)
- trend < 0: entity is below its baseline (declining)
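the two formulas above, as a runnable sketch; the zero-baseline guard is my addition (the document's formula divides by baseline directly):

```python
BASELINE_ALPHA = 0.05  # slow adaptation, ~100 updates to converge

def update_baseline(baseline, current_rate):
    """Exponential moving average of the entity's historical rate."""
    return BASELINE_ALPHA * current_rate + (1 - BASELINE_ALPHA) * baseline

def trend_score(current_rate, baseline):
    """Relative distance above baseline; guards against a zero baseline."""
    if baseline <= 0:
        return 0.0
    return (current_rate - baseline) / baseline
```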
supernode mitigation#
- degree-normalized edge weighting: at clustering time, raw pheromone weights are normalized by entity frequency: normalized = raw_weight / sqrt(smoothed_rate_a * smoothed_rate_b). this prevents supernodes from gluing unrelated clusters together. raw weights are preserved for edge rendering/sparks.
- TREND_CLUSTER_MIN: only entities with trend >= 0.02 can participate in clustering. supernodes with flat/negative trend get cluster_id = 0 and don't form clusters.
- UI ranking: frontend prioritizes trend over count. entities must have trend > 0 to appear in trending list.
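the degree normalization is a one-liner; this sketch adds an epsilon guard against zero rates, which is an assumption on my part:

```python
import math

def normalized_weight(raw_weight, rate_a, rate_b):
    """Divide a raw co-occurrence weight by the geometric mean of the two
    entities' smoothed rates, so always-on supernodes can't glue
    unrelated clusters together.
    """
    denom = math.sqrt(max(rate_a, 1e-9) * max(rate_b, 1e-9))
    return raw_weight / denom
```

a supernode pair with high rates gets its weight discounted relative to two quiet entities with the same raw weight.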
empirical validation#
2-minute audit (see docs/03-baseline-audit.md) confirmed:
- supernodes (Trump baseline=0.175, trend=-0.019) correctly identified as "always on"
- rising topics (Homan trend=+0.117) correctly surfaced
- ~99% of entities have trend near 0 (at baseline)
- only ~1% have trend > 0.3 (actually trending)
spam filtering via labeler#
ozone label stream#
the NER bridge subscribes to hailey's ozone labeler (ozone.hailey.at) which emits spam/suspicious labels for DIDs and post URIs.
filtered labels#
spam (hard filter): spam, shopping-spam, general-spam, reply-link-spam, inauth-fundraising, coordinated-abuse, men-facet-abuse
suspicious (also filtered): mass-follow-high, mass-follow-mid, elon-handle, new-acct-replies
implementation#
- label cache seeded on startup via queryLabels API
- real-time updates via subscribeLabels websocket (CBOR-encoded)
- posts from labeled DIDs/URIs are dropped before NER processing
- labels respect expiry timestamps and negation
architecture#
backend changes#
module: entity_graph.zig#
Core data structures (simplified):
const EntityId = u32;
const Timestamp = i64; // milliseconds since epoch
const ActivityBuckets = struct {
counts: [BUCKET_COUNT]u32, // mentions per bucket (unbounded)
bucket_epochs: [BUCKET_COUNT]i64, // which epoch each bucket belongs to
pub fn record(self: *ActivityBuckets, ts: Timestamp) void;
pub fn activityRate(self: *const ActivityBuckets, now: Timestamp) f32;
};
const EdgeSet = struct {
edges: [MAX_EDGES_PER_ENTITY]EntityId,
last_seen: [MAX_EDGES_PER_ENTITY]Timestamp, // for decay
counts: [MAX_EDGES_PER_ENTITY]u32, // co-occurrence count
count: u8,
pub fn add(self: *EdgeSet, id: EntityId, now: Timestamp) bool;
pub fn isRecent(self: *const EdgeSet, id: EntityId, now: Timestamp) bool;
pub fn getCount(self: *const EdgeSet, id: EntityId) u32;
};
const Entity = struct {
text: [64]u8,
text_len: u8,
label: [16]u8,
label_len: u8,
activity: ActivityBuckets,
edges: EdgeSet,
cluster_id: u32,
grid_x: u16,
grid_y: u16,
baseline_rate: f32, // EMA of historical activity
smoothed_rate: f32, // EMA for visibility decisions (faster)
is_active: bool, // hysteresis state
last_seen: Timestamp, // for eviction
pub fn trendScore(self: *const Entity, now: Timestamp) f32;
pub fn updateBaseline(self: *Entity, current_rate: f32) void;
pub fn updateHysteresis(self: *Entity) void;
};
const EntityGraph = struct {
entities: [MAX_ENTITIES]Entity,
count: u32,
users: [MAX_USERS]User,
user_count: u32,
parent: [MAX_ENTITIES]i32, // union-find
bridges: BridgeCounter, // cross-cluster co-occurrences
stats: GraphStats,
mutex: Mutex,
pub fn recordPost(self: *EntityGraph, entities: []EntityData, did_hash: ?u64) void;
pub fn updateClusters(self: *EntityGraph) void;
fn edgeWeight(self: *EntityGraph, a: EntityId, b: EntityId, now: Timestamp) f32;
};
modified: http.zig#
Change /entity handler to call entity_graph.recordPost() with the full entity list from each post (not individually).
modified: ws_server.zig#
Broadcasts entity graph state via /entity-graph endpoint (JSON):
{
"entities": [
{
"id": 0,
"text": "Trump",
"label": "PERSON",
"rate": 0.17,
"count": 52,
"trend": -0.019,
"baseline": 0.175,
"cluster": 123,
"cluster_score": 0.45,
"cluster_label": "Trump + Epstein",
"largest": false,
"x": 45,
"y": 78,
"edges": [{"t": 5, "a": 12.5}, {"t": 12, "a": 45.0}]
}
],
"stats": {
"total": 847,
"active": 42,
"clusters": 5,
"largest": 28,
"percolates": true,
"threshold": 0.01,
"users": 1523,
"activeUsers": 89,
"bridgeRate": 2.4
}
}
frontend changes#
modified: grid.js#
Replace lattice rendering with entity rendering:
function renderEntities(entities, stats) {
// clear canvas
ctx.fillStyle = EMPTY_COLOR;
ctx.fillRect(0, 0, width, height);
// draw each entity as a dot/circle
for (const entity of entities) {
const x = entity.x * cellSize;
const y = entity.y * cellSize;
const radius = Math.max(2, entity.rate * 10); // size by activity
// color by cluster membership
if (entity.cluster === stats.largestCluster) {
ctx.fillStyle = LARGEST_CLUSTER_COLOR;
} else {
ctx.fillStyle = LABEL_COLORS[entity.label] || DEFAULT_COLOR;
}
ctx.beginPath();
ctx.arc(x, y, radius, 0, Math.PI * 2);
ctx.fill();
}
}
modified: index.html#
Simplify to one grid view (remove 32x32 and 512x512 mini-grids).
Update stats display:
- "active entities" instead of "density"
- "semantic clusters" instead of "clusters"
- keep "spanning" indicator
- keep "crossings" (threshold crossings)
ner bridge changes#
ner/bridge.py#
- subscribes to turbostream (graze.social) for post stream
- subscribes to ozone labeler for spam/suspicious labels
- extracts record.text from each post
- runs spaCy NER (en_core_web_sm)
- filters to useful entity types, normalizes text
- POSTs all entities from one post together (preserves co-occurrence)
- includes post metadata for "top post per topic" feature
payload = {
"entities": [{"text": "Trump", "label": "PERSON"}, ...],
"did": "did:plc:...",
"post": {
"at_uri": "at://did:plc:.../app.bsky.feed.post/...",
"author_handle": "user.bsky.social",
"author_followers": 1234
}
}
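the filter/normalize step of the bridge can be sketched without the spaCy dependency by operating on (text, label) pairs as spaCy would return them. the label whitelist here is an assumption - the actual set of "useful" types lives in bridge.py; the NORP plural stripping mirrors the open-questions note below.

```python
# assumed whitelist; the real set is defined in bridge.py
USEFUL_LABELS = {"PERSON", "ORG", "GPE", "NORP", "EVENT"}

def normalize_entities(ents):
    """Filter spaCy-style (text, label) pairs down to useful types and
    normalize the text. NORP entities get a naive plural strip
    (Americans -> American); no fuzzy matching beyond that.
    """
    out = []
    for text, label in ents:
        if label not in USEFUL_LABELS:
            continue
        text = text.strip()
        if label == "NORP" and text.endswith("s"):
            text = text[:-1]
        out.append({"text": text, "label": label})
    return out
```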
migration plan#
- Phase 1: Add entity_graph.zig alongside lattice.zig ✓
  - core data structures implemented
  - /entity-graph endpoint added
- Phase 2: Modify NER bridge ✓
  - entities grouped by post
  - POST includes did + post metadata
- Phase 3: Update frontend ✓
  - grid.js renders trending entities
  - surprise-based ranking implemented (z-like vs baseline)
- Phase 4: Remove old code ✓
  - turbostream.zig removed (python bridge is sole consumer)
  - lattice.zig kept for percolation visualization (separate from entity graph)
parameters to tune#
| Parameter | Current Value | Notes |
|---|---|---|
| BUCKET_COUNT | 30 | Number of time buckets |
| BUCKET_DURATION_MS | 10,000 | 10s per bucket → 5 min total window |
| ACTIVITY_THRESHOLD_ENTER | 0.01/s | Rate to become active |
| ACTIVITY_THRESHOLD_EXIT | 0.005/s | Rate to stay active (hysteresis) |
| SMOOTHING_ALPHA | 0.25 | EWMA alpha for smoothed_rate (env: SMOOTHING_ALPHA) |
| EDGE_DECAY_MS | 1,800,000 | 30 min edge decay window |
| EDGE_WEIGHT_MIN | 0.05 | Min pheromone weight for clustering (env: EDGE_WEIGHT_MIN) |
| TREND_CLUSTER_MIN | 0.02 | Min trend to participate in clustering (env: TREND_CLUSTER_MIN) |
| EDGE_PHEROMONE_HALF_LIFE_MS | 600,000 | Edge weight half-life (env: EDGE_PHEROMONE_HALF_LIFE_MS) |
| EDGE_TOP_K | 5 | Top-K edges per node for clustering (env: EDGE_TOP_K) |
| EDGE_SAMPLE_WINDOW_MS | 10,000 | Sampling epoch window (env: EDGE_SAMPLE_WINDOW_MS) |
| EDGE_SAMPLE_MAX | 0.25 | Max exploration probability (env: EDGE_SAMPLE_MAX) |
| EDGE_SAMPLE_SCALE | 2.0 | Weight scale for sampling (env: EDGE_SAMPLE_SCALE) |
| SPANNING_THRESHOLD | 0.5 | Fraction for percolation indicator |
| MAX_ENTITIES | 1000 | Memory bound |
| MAX_EDGES_PER_ENTITY | 32 | Co-occurrence cap per entity |
| MAX_USERS | 2000 | User tracking capacity |
| GRID_SIZE | 128 | For position hashing |
infrastructure needs#
persistence (current)#
State is persisted to SQLite at /data/coral.db on a Fly volume.
persisted:
- entities (id, text, label, grid position, baseline_rate, last_seen)
- users (id, did_hash, last_seen)
- edges (entity_a, entity_b, last_seen)
- posts + entity_posts (top-post tracking)
- entity_baseline (EMA baselines)
not persisted:
- activity buckets (mention counts reset on restart)
- edge pheromone weights (recomputed from recent co-occurrences)
notes:
- activity reset is intentional for now; trends re‑establish quickly
- legacy mentions table removed (migration 007)
open questions#
- Edge decay: Should co-occurrence edges decay over time? → IMPLEMENTED (30 min decay, see arbitrary choice #5)
- Entity normalization: "Taylor Swift" vs "taylor swift" vs "Swift" - currently case-insensitive. NORP entities get plural stripping (Americans → American). no fuzzy matching beyond that.
- Visualization density: With 1000 entities on a 128x128 grid, collisions will happen. Currently not addressed - entities just overlap.
- Historical data: Do we want to show time series of percolation state? Would need to log crossings. Not implemented.
- Cluster visualization: UI shows individual entities ranked by trend + cluster_score boost. cluster_label shown in tooltip (e.g., "Trump + Epstein"). Edge drawing not implemented.
- What does "percolation" feel like? → partially answered by baseline audit. supernodes (Trump, MAGA) have high baseline but flat/negative trend. rising topics (Homan, Australia) have positive trend. UI now prioritizes trend over count.
theoretical background (from papers)#
Newman-Ziff 2000 - lattice percolation#
the algorithm we use is correct: weighted union-find with path compression runs in O(N) time. our implementation matches theirs.
what percolation means on lattices:
- occupy sites with probability p
- percolation = when a cluster "wraps around" periodic boundaries (spans the system)
- at p_c ≈ 0.593 (2D square lattice), spanning cluster emerges
- order parameter = probability of being in spanning cluster
why this doesn't directly apply to us:
- lattice has fixed spatial structure and boundaries
- "spanning" means connecting opposite edges
- our graph has no spatial structure - hash positions are arbitrary
- there's nothing to "span"
Xie et al 2021 - heterogeneous percolation on social networks#
their model:
- directed follower network (static structure)
- β = fraction of nodes who will retweet (occupation probability)
- cascade = giant out-component (GOUT) among occupied nodes
- P_∞ = cascade_size / total_nodes
key findings:
- uniform percolation predicts β_c ≈ 2.5% for Weibo
- actual cascades happen at β ≈ 0.25% (10x lower!)
- why? heterogeneous activity: m ~ k_o^α (active users have more followers)
- positive feedback: active users gain followers faster (Δk_o/Δt ~ k_o^σ m^τ)
threshold formula (Eq. 8):
β_c = Σ P(k_i, k_o)[1 - t_c^m(k_i,k_o)]
where t_c solves: Σ k_i k_o P(k_i, k_o)[1 - t_c^m(k_i,k_o)] = ⟨k⟩
critical exponent = 1 near threshold (mean-field behavior)
implications for coral:
- their order parameter (P_∞ = GOUT/N) could apply to us
- but they have a static network and vary β analytically
- we have a dynamic network and no controllable β
- we observe the system, don't simulate it
key insight: our system is harder to analyze#
- classical percolation: vary p, find p_c, phase transition
- xie approach: fix network, vary β, find β_c, phase transition

our situation:
- graph structure evolving (edges appear as posts arrive)
- "activity" evolving (mention rates change)
- no parameter we control
- we're always "somewhere on the curve" but the curve itself is changing
practical consequence: we probably can't predict a threshold a priori. we need to observe the system, look for sharp transitions, and calibrate empirically what "percolation" looks like in our specific domain.
references#
- Newman & Ziff 2000 - efficient percolation algorithm
- Xie et al. 2021 - heterogeneous percolation on social media
- Hailey's trending topics - NER for topic detection