# coral

real-time semantic percolation from the Bluesky firehose.

[live demo](https://coral-8hh.pages.dev) | [source](https://tangled.sh/@zzstoatzz.io/coral)

## what it does

extracts named entities (people, organizations, places, events) from Bluesky posts and tracks how they cluster together. entities that get discussed together form edges. when a real-world event spans multiple topics, clusters merge - discourse percolates into a unified conversation.

## how it works

1. **NER bridge** consumes the [turbostream](https://graze.social) firehose, runs spaCy NER, extracts entities
2. **labeler integration** drops spam before it hits the graph (via [Hailey's labeler](https://labeler.hailey.at))
3. **entity graph** tracks co-occurrences (entities in same post = edge), computes clusters via union-find
4. **pheromone edges** - edge weights decay exponentially, reinforced on repeated co-occurrence (ant colony optimization inspired)
5. **surprise trending** - entities ranked by statistical surprise vs baseline (z-score-like), not raw counts
6. **frontend** visualizes entity activity, cluster structure, and firehose health
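steps 3 and 4 above can be sketched roughly as follows. this is a minimal illustration in python, not the backend's actual (zig) implementation: the class name `PheromoneEdges`, the 300-second default half-life, and the unit `deposit` per co-occurrence are all assumptions made for the example.

```python
import math
from itertools import combinations

HALF_LIFE_S = 300.0  # assumed default; the real half-life is configurable


class PheromoneEdges:
    """Edge weights that decay exponentially and are reinforced on co-occurrence."""

    def __init__(self, half_life_s: float = HALF_LIFE_S) -> None:
        # weight halves every half_life_s seconds: w(t) = w0 * exp(-ln(2) * t / half_life)
        self.decay_rate = math.log(2) / half_life_s
        # (entity_a, entity_b) -> (weight at last update, time of last update)
        self.edges: dict[tuple[str, str], tuple[float, float]] = {}

    def weight(self, a: str, b: str, now: float) -> float:
        """Current weight of the edge, decayed lazily from its last update."""
        key = (a, b) if a <= b else (b, a)
        if key not in self.edges:
            return 0.0
        w, last = self.edges[key]
        return w * math.exp(-self.decay_rate * (now - last))

    def observe_post(self, entities: list[str], now: float, deposit: float = 1.0) -> None:
        """Every pair of distinct entities in one post reinforces their edge."""
        for a, b in combinations(sorted(set(entities)), 2):
            self.edges[(a, b)] = (self.weight(a, b, now) + deposit, now)


edges = PheromoneEdges(half_life_s=10.0)
edges.observe_post(["NASA", "Artemis"], now=0.0)
edges.observe_post(["Artemis", "NASA"], now=10.0)  # decayed 0.5 + fresh deposit 1.0
```

decaying lazily at read time avoids sweeping every edge on a timer; an edge only costs work when it is read or reinforced.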

## theoretical background

the system draws from several sources:

**percolation theory** - we use the Newman-Ziff algorithm (union-find based) for efficient cluster detection. site percolation on the 2D square lattice has a sharp phase transition at p_c ≈ 0.593. our graph isn't a lattice, so that value doesn't transfer and we calibrate empirically.
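a minimal python sketch of the union-find bookkeeping behind this kind of cluster detection, tracking the largest cluster incrementally as edges arrive (the core idea of Newman-Ziff). `ClusterTracker` and its method names are hypothetical, and the real backend is written in zig, so treat this as an illustration of the technique rather than the project's code.

```python
class ClusterTracker:
    """Union-find over entities that tracks the largest cluster as edges arrive."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}
        self.size: dict[str, int] = {}
        self.largest = 0

    def add(self, x: str) -> None:
        """Register an entity as its own singleton cluster if unseen."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            self.largest = max(self.largest, 1)

    def find(self, x: str) -> str:
        """Return the cluster root of x, halving the path along the way."""
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        """Merge the clusters of a and b (an edge between them)."""
        self.add(a)
        self.add(b)
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra  # attach the smaller tree under the larger one
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.largest = max(self.largest, self.size[ra])

    def largest_fraction(self) -> float:
        """Share of known entities that sit in the largest cluster."""
        return self.largest / len(self.parent) if self.parent else 0.0


tracker = ClusterTracker()
tracker.union("earthquake", "LA")
tracker.union("FEMA", "California")
tracker.union("LA", "California")  # two clusters merge into one
```

note that union-find only supports merging. once edges decay away, clusters cannot be split in place, so a sketch like this would have to be rebuilt from the surviving edges.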

**heterogeneous activity** - [Xie et al. 2021](https://arxiv.org/abs/2103.02804) showed that real social networks percolate at ~1/10th the uniform-theory threshold due to heterogeneous user activity. following this insight, we plan to weight mentions by user activity rate (currently off, see design decisions).

**NER for topic detection** - inspired by [Hailey's trending topics](https://hailey.at/posts/3mcy5b5gfi222). rather than embeddings on raw text (too noisy), extract structured entities to reduce surface area.

**ATProto labeler system** - spam filtering via [com.atproto.label](https://docs.bsky.app/docs/advanced-guides/moderation). we subscribe to Hailey's labeler stream and drop posts from accounts labeled as spam before NER processing.
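a rough python sketch of the gate described above. it assumes labels arrive as dicts shaped like `com.atproto.label.defs#label` (`uri` is the subject, `val` the label value, `neg` a negation flag), that account-level labels carry a DID in `uri`, and that the spam label value is literally `"spam"`. the `SpamGate` class is hypothetical and leaves out the subscription itself.

```python
class SpamGate:
    """Tracks accounts labeled as spam and gates posts before NER."""

    def __init__(self, spam_values: frozenset[str] = frozenset({"spam"})) -> None:
        self.spam_values = spam_values   # assumed label value(s) that mean spam
        self.flagged: set[str] = set()   # DIDs currently labeled as spam

    def apply_label(self, label: dict) -> None:
        """Update the flagged set from one label record."""
        if label.get("val") not in self.spam_values:
            return
        subject = label.get("uri", "")
        if not subject.startswith("did:"):
            return  # only account-level labels matter for this gate
        if label.get("neg", False):
            self.flagged.discard(subject)  # negation retracts an earlier label
        else:
            self.flagged.add(subject)

    def allows(self, author_did: str) -> bool:
        """True if a post from this author should continue on to NER."""
        return author_did not in self.flagged


gate = SpamGate()
gate.apply_label({"uri": "did:plc:example-spammer", "val": "spam"})
gate.apply_label({"uri": "did:plc:example-user", "val": "some-other-label"})
```

checking the gate before running NER means spam posts cost a set lookup instead of a full spaCy pass.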

<details>
<summary>design decisions</summary>

these are documented as arbitrary choices to be revisited:

| decision | choice | why |
|----------|--------|-----|
| edge definition | same-post co-occurrence | simplest, captures "discussed together" |
| edge weights | pheromone decay (configurable half-life) | ant colony inspired, recent co-occurrences matter more |
| activity threshold | 0.01 mentions/sec (~3 per 5 min) | rate normalizes across quiet/busy periods |
| trending metric | surprise vs baseline (UI), trend ratio (backend) | anomaly detection, not popularity contest |
| percolation threshold | largest_cluster / active > 50% | placeholder, needs empirical calibration |
| entity position | hash(text) → (x, y) | deterministic, stable, **no semantic meaning yet** |
| user weighting | planned (currently off) | power users count more (Xie 2021) |

see [docs/02-semantic-percolation-plan.md](docs/02-semantic-percolation-plan.md) for full rationale.
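
two rows of the table above, sketched in python under loud assumptions. the trending metric is only described as "z-like", so `surprise` uses a Poisson-style normalization with a smoothing term `eps`, which is a guess at the shape and not the project's formula. `entity_position` shows one way to get a deterministic hash-based layout; the SHA-256 choice and the 1000-cell grid are made up for the example.

```python
import hashlib
import math


def surprise(observed: float, baseline: float, eps: float = 1.0) -> float:
    """Z-like score: excess over baseline, scaled by the expected spread.

    For Poisson-like counts the standard deviation is sqrt(mean); eps
    keeps the score finite for entities with almost no baseline.
    """
    return (observed - baseline) / math.sqrt(baseline + eps)


def entity_position(text: str, grid: int = 1000) -> tuple[int, int]:
    """Deterministic (x, y) cell from the entity text; carries no semantics."""
    # hashlib is used because Python's built-in hash() for strings is
    # salted per process, which would move entities on every restart
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    x = int.from_bytes(digest[:4], "big") % grid
    y = int.from_bytes(digest[4:8], "big") % grid
    return x, y


# a small entity jumping from ~3 to 30 mentions outranks
# a large one drifting from 290 to 300
niche = surprise(observed=30, baseline=3)
popular = surprise(observed=300, baseline=290)
```

this is the sense in which the ranking is anomaly detection and not a popularity contest: the score rewards deviation from an entity's own baseline, so consistently loud entities do not dominate.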

</details>

## stack

- **ner** (python): turbostream consumer + spaCy NER + labeler gate → POST to backend
- **backend** (zig): entity graph + websocket server + SQLite persistence
- **site**: static html/css/js on cloudflare pages

## run locally

```bash
cd backend && zig build run                 # backend (entity graph + websocket)
cd ner && uv run coral-bridge               # NER bridge (turbostream → spaCy → backend)
cd site && npx wrangler pages dev .         # frontend
```

## deploy

```bash
cd backend && fly deploy
cd ner && fly deploy
cd site && npx wrangler pages deploy . --project-name coral
```

## future work

ideas being explored (not commitments):

- **semantic positioning** - currently entities hash to arbitrary grid positions. could use embeddings to place semantically similar entities near each other, making the 2D layout a meaningful projection of topic space. unclear whether to embed entity names, representative posts, or cluster centroids.

- **temporal co-activity edges** - entities that spike together might be related even without same-post co-occurrence. "earthquake" and "LA" could both trend during an event without always appearing together.

- **percolation calibration** - the 50% threshold is arbitrary. need to correlate cluster merges with real-world events to understand what "discourse unification" actually looks like in the data.

## references

- Newman & Ziff, [Efficient Monte Carlo algorithm and high-precision results for percolation](https://arxiv.org/abs/cond-mat/0005264), Phys. Rev. Lett. 85 (2000)
- Xie et al., [Detecting and Modelling Real Percolation and Phase Transitions of Information on Social Media](https://arxiv.org/abs/2103.02804), Nature Human Behaviour (2021)
- Hailey, [Bluesky Trending Topics](https://hailey.at/posts/3mcy5b5gfi222) - NER approach for topic detection
- Stauffer & Aharony, *Introduction to Percolation Theory* - theoretical foundations
- [ATProto Labels](https://docs.bsky.app/docs/advanced-guides/moderation) - moderation architecture