# coral

real-time semantic percolation from the Bluesky firehose.

[live demo](https://coral-8hh.pages.dev) | [source](https://tangled.sh/@zzstoatzz.io/coral)

## what it does

extracts named entities (people, organizations, places, events) from Bluesky posts and tracks how they cluster together. entities that get discussed together form edges. when a real-world event spans multiple topics, clusters merge - discourse percolates into a unified conversation.

## how it works

1. **NER bridge** consumes the [turbostream](https://graze.social) firehose, runs spaCy NER, extracts entities
2. **labeler integration** drops spam before it hits the graph (via [Hailey's labeler](https://labeler.hailey.at))
3. **entity graph** tracks co-occurrences (entities in same post = edge), computes clusters via union-find
4. **pheromone edges** - edge weights decay exponentially, reinforced on repeated co-occurrence (ant colony optimization inspired)
5. **surprise trending** - entities ranked by statistical surprise vs baseline (z-score-like), not raw counts
6. **frontend** visualizes entity activity, cluster structure, and firehose health

## theoretical background

the system draws from several sources:

**percolation theory** - we use the Newman-Ziff algorithm for efficient cluster detection. on the 2D square lattice, site percolation has a sharp phase transition at p_c ≈ 0.593. our graph isn't a lattice, so we calibrate empirically.

**heterogeneous activity** - [Xie et al. 2021](https://arxiv.org/abs/2103.02804) showed that real social networks percolate at ~1/10th the uniform-theory threshold due to heterogeneous user activity. we weight mentions by user activity rate following this insight.

**NER for topic detection** - inspired by [Hailey's trending topics](https://hailey.at/posts/3mcy5b5gfi222). rather than embeddings on raw text (too noisy), extract structured entities to reduce surface area.

**ATProto labeler system** - spam filtering via [com.atproto.label](https://docs.bsky.app/docs/advanced-guides/moderation). we subscribe to Hailey's labeler stream and drop posts from accounts labeled as spam before NER processing.
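
to make the pheromone and cluster ideas above concrete, here is a minimal Python sketch. it is not taken from the Zig backend: the class name `PheromoneGraph`, the 600-second `HALF_LIFE_S`, the `MIN_WEIGHT` cutoff, and the entity names are all assumptions for illustration. edge weights halve every half-life and gain 1.0 on each repeated co-occurrence, and clusters are found with union-find over the edges that are still above the cutoff.

```python
from itertools import combinations

HALF_LIFE_S = 600.0  # assumed half-life in seconds (the real one is configurable)
MIN_WEIGHT = 0.1     # assumed cutoff below which an edge no longer links entities


class PheromoneGraph:
    """Hypothetical co-occurrence graph with exponentially decaying edge weights."""

    def __init__(self):
        # (entity_a, entity_b) -> (weight at last update, time of last update)
        self.edges = {}

    @staticmethod
    def decayed(weight, last, now):
        # Exponential decay: the weight halves every HALF_LIFE_S seconds.
        return weight * 0.5 ** ((now - last) / HALF_LIFE_S)

    def observe_post(self, entities, now):
        # Every pair of distinct entities in the same post reinforces one edge.
        for a, b in combinations(sorted(set(entities)), 2):
            weight, last = self.edges.get((a, b), (0.0, now))
            self.edges[(a, b)] = (self.decayed(weight, last, now) + 1.0, now)

    def weight(self, a, b, now):
        weight, last = self.edges.get(tuple(sorted((a, b))), (0.0, now))
        return self.decayed(weight, last, now)

    def clusters(self, now):
        # Union-find over edges whose decayed weight is still above the cutoff.
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for (a, b), (weight, last) in self.edges.items():
            if self.decayed(weight, last, now) >= MIN_WEIGHT:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    parent[root_a] = root_b

        groups = {}
        for x in list(parent):
            groups.setdefault(find(x), set()).add(x)
        return sorted(groups.values(), key=len, reverse=True)


graph = PheromoneGraph()
graph.observe_post(["Tokyo", "earthquake"], now=0)
graph.observe_post(["Lakers", "NBA"], now=0)
before = graph.clusters(now=0)               # two separate clusters
graph.observe_post(["earthquake", "NBA"], now=10)
after = graph.clusters(now=10)               # one merged cluster
```

the sketch rebuilds clusters from scratch on each call because union-find has no cheap way to remove an edge once it has decayed away. whether the backend does the same is not stated here.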
## design decisions

these are documented as arbitrary choices to be revisited:

| decision | choice | why |
|----------|--------|-----|
| edge definition | same-post co-occurrence | simplest, captures "discussed together" |
| edge weights | pheromone decay (configurable half-life) | ant colony inspired, recent co-occurrences matter more |
| activity threshold | 0.01 mentions/sec (~3 per 5 min) | rate normalizes across quiet/busy periods |
| trending metric | surprise vs baseline (UI), trend ratio (backend) | anomaly detection, not popularity contest |
| percolation threshold | largest_cluster / active > 50% | placeholder, needs empirical calibration |
| entity position | hash(text) → (x, y) | deterministic, stable, **no semantic meaning yet** |
| user weighting | planned (currently off) | power users count more (Xie 2021) |

see [docs/02-semantic-percolation-plan.md](docs/02-semantic-percolation-plan.md) for full rationale.
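
a few of the choices above are small enough to sketch in Python. this is an illustration under stated assumptions, not the backend's code: the function names, the Poisson-style form of the surprise score, the floor of one expected mention, the 64-cell grid, and the use of SHA-256 are all hypothetical. only the 0.01 mentions/sec and 50% values come from the table.

```python
import hashlib
import math

ACTIVITY_THRESHOLD = 0.01    # mentions/sec, from the design table
PERCOLATION_THRESHOLD = 0.5  # largest_cluster / active, placeholder from the table


def is_active(mentions, window_s):
    # Compare a rate, not a raw count, so quiet and busy periods are comparable.
    return mentions / window_s >= ACTIVITY_THRESHOLD


def surprise(observed, baseline_rate, window_s):
    # Assumed z-like score for Poisson-ish counts:
    # (observed - expected) / sqrt(expected).
    # The floor of 1.0 is an assumption that keeps never-seen entities
    # from dividing by (nearly) zero.
    expected = max(baseline_rate * window_s, 1.0)
    return (observed - expected) / math.sqrt(expected)


def is_percolating(cluster_sizes, active_count):
    # True when the largest cluster holds more than half of the active entities.
    if active_count == 0:
        return False
    return max(cluster_sizes, default=0) / active_count > PERCOLATION_THRESHOLD


def position(text, grid=64):
    # Python's built-in hash() is randomized per process for strings,
    # so a cryptographic digest is used to get stable coordinates.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return digest[0] % grid, digest[1] % grid


# 3 mentions in 5 minutes sits exactly on the activity threshold.
active = is_active(3, 300)
# 30 mentions against a baseline that predicts 3 is a large positive surprise.
score = surprise(30, baseline_rate=0.01, window_s=300)
# A largest cluster of 6 out of 10 active entities crosses the 50% placeholder.
percolating = is_percolating([6, 2, 1], active_count=10)
```

ranking by a score like this favors an entity that jumps from 3 to 30 mentions over one that sits steadily at 300, which is the "anomaly detection, not popularity contest" idea from the table.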
## stack

- **ner** (python): turbostream consumer + spaCy NER + labeler gate → POST to backend
- **backend** (zig): entity graph + websocket server + SQLite persistence
- **site**: static html/css/js on cloudflare pages

## run locally

```bash
cd backend && zig build run          # backend (entity graph + websocket)
cd ner && uv run coral-bridge        # NER bridge (turbostream → spaCy → backend)
cd site && npx wrangler pages dev .  # frontend
```

## deploy

```bash
cd backend && fly deploy
cd ner && fly deploy
cd site && npx wrangler pages deploy . --project-name coral
```

## future work

ideas being explored (not commitments):

- **semantic positioning** - currently entities hash to arbitrary grid positions. could use embeddings to place semantically similar entities near each other, making the 2D layout a meaningful projection of topic space. unclear whether to embed entity names, representative posts, or cluster centroids.
- **temporal co-activity edges** - entities that spike together might be related even without same-post co-occurrence. "earthquake" and "LA" could both trend during an event without always appearing together.
- **percolation calibration** - the 50% threshold is arbitrary. need to correlate cluster merges with real-world events to understand what "discourse unification" actually looks like in the data.

## references

- Newman & Ziff, [Efficient Monte Carlo algorithm and high-precision results for percolation](https://arxiv.org/abs/cond-mat/0005264), Phys. Rev. Lett. 85 (2000)
- Xie et al., [Detecting and Modelling Real Percolation and Phase Transitions of Information on Social Media](https://arxiv.org/abs/2103.02804), Nature Human Behaviour (2021)
- Hailey, [Bluesky Trending Topics](https://hailey.at/posts/3mcy5b5gfi222) - NER approach for topic detection
- Stauffer & Aharony, *Introduction to Percolation Theory* - theoretical foundations
- [ATProto Labels](https://docs.bsky.app/docs/advanced-guides/moderation) - moderation architecture