coral.waow.tech

docs: update failure modes with reader/worker fix

consolidate websocket failure modes #2 and #3 into a single section —
they were symptoms of the same root cause (blocked read path preventing
ping/pong handling). document the reader/worker queue pattern that
actually fixed it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

+42 -36
+29 -34
docs/04-architecture.md
···
105 105   
106 106   **result**: POST latency dropped from intermittent 2s+ timeouts to ~1.8ms. entity drops went from 5678/5820 to 0.
107 107   
108     - ### 2. turbostream websocket disconnecting every ~44 seconds
    108 + ### 2. turbostream websocket disconnects (blocked read path)
109 109   
110     - **symptom**: NER bridge logs show `websocket closed, reconnecting in 1s...` every ~44 seconds. bridge reconnects immediately but the constant disconnect/reconnect cycle tanks effective throughput to ~12/s instead of ~50/s.
    110 + **symptoms** (appeared in sequence as we iterated on fixes):
    111 + - `websocket closed` every ~44s, throughput ~12/s (client-side ping timeout)
    112 + - `code=1011 reason='keepalive ping timeout'` every ~2min (server-side ping timeout)
    113 + - `code=1006 reason=''` every ~60s (abnormal closure)
111 114   
112     - **root cause**: the `websockets` python library sends ping frames at `ping_interval` (30s) and expects a pong within `ping_timeout` (10s). fly.io's networking layer was silently dropping the websocket ping control frames. the pong never came back, so the client assumed the server was dead and closed the connection at 30 + 10 = 40s (close to the observed ~44s).
    115 + all three were symptoms of the same root cause.
113 116   
114     - **diagnosis**: turbostream worked perfectly from a local machine (~68 msg/s, no disconnects in 30s). the disconnection was specific to fly.io outbound websocket connections.
    117 + **root cause**: the `websockets` library processes ping/pong control frames inside `recv()`. the original code did `async for raw in ws:` then ran NER + HTTP POST inside the loop body. while processing, the library couldn't call `recv()`, so it couldn't respond to pings.
115 118   
116     - **fix**: disabled client-side pings.
    119 + this explains the "looks good for 2-5 minutes then degrades" pattern: processing latency accumulates until keepalive timeouts hit. toggling client-side pings changed which side timed out first, but didn't fix the underlying problem. even `run_in_executor()` for spaCy didn't help — `await`ing the result still blocked the read loop.
    120 + 
    121 + **fix**: decouple the websocket reader from processing. two concurrent asyncio tasks:
117 122   
118 123   ```python
119     - websockets.connect(
120     -     TURBOSTREAM_URL,
121     -     ping_interval=None,  # was 30
122     -     ping_timeout=None,   # was 10
123     -     close_timeout=5,
124     - )
125     - ```
    124 + queue = asyncio.Queue(maxsize=200)  # ~4 seconds buffer at 50 msgs/sec
126 125   
127     - turbostream's server handles keepalive on its end. disabling client pings removed the premature disconnect.
    126 + async def reader():
    127 +     """only recv() + enqueue. keeps pong responsive."""
    128 +     async for raw in ws:
    129 +         stats["messages"] += 1
    130 +         try:
    131 +             queue.put_nowait(raw)
    132 +         except asyncio.QueueFull:
    133 +             stats["queue_drops"] += 1
128 134   
129     - **result**: connection became stable (168+ seconds sustained, previously dying at ~44s). throughput went from ~12/s to ~50/s. also added close code/reason logging for future debugging:
    135 + async def worker():
    136 +     """consume from queue, do NER + POST."""
    137 +     while True:
    138 +         raw = await queue.get()
    139 +         # JSON parse → extract_post → spam check → NER (in executor) → POST
130     - 
131     - ```python
132     - except ConnectionClosed as e:
133     -     print(f"websocket closed: code={e.code} reason='{e.reason}', ...")
134 140   ```
135 141   
136     - ### 3. server-side keepalive timeout (event loop starvation)
    142 + the reader stays in `recv()` so websockets can handle ping/pong at all times. bounded queue with drop policy prevents unbounded backlog. client-side pings disabled (turbostream handles keepalive server-side).
137 143   
138     - **symptom**: NER bridge logs show `websocket closed: code=1011 reason='keepalive ping timeout'` every ~2 minutes. the turbostream server closes the connection because it never receives our pong responses.
    144 + **result**: zero disconnects, 0 drops, queue steady at 4-8/200, latency 3-10ms. previously dying every 60 seconds.
139 145   
140     - **root cause**: `extract_entities()` (spaCy NER) is a synchronous CPU-bound call that blocks the asyncio event loop. while it runs (~1-5ms per call, but at 50 msgs/sec with ~19 needing NER, the event loop is frequently blocked), the `websockets` library cannot send pong responses to the server's keepalive pings. the server times out waiting for our pong and closes the connection.
    146 + see `docs/05-websocket-stability.md` for the full debugging timeline.
141 147   
142     - **fix**: offload spaCy to a thread pool so the event loop stays free for ping/pong:
143     - 
144     - ```python
145     - loop = asyncio.get_running_loop()
146     - entities = await loop.run_in_executor(None, extract_entities, post["text"])
147     - ```
148     - 
149     - also re-enabled client-side pings (30s interval, 20s timeout) since the event loop is no longer starved.
150     - 
151     - **result**: the event loop can now respond to server pings immediately even while spaCy processes text in background threads.
152     - 
153     - ### 4. fly.io API instability
    148 + ### 3. fly.io API instability
154 149   
155 150   **symptom**: `fly machine restart` and `fly deploy` commands fail with DNS resolution errors (`lookup api.machines.dev: no such host`) or EOF errors on the release API.
156 151   
···
161 156   ### general lessons
162 157   
163 158   - **use fly internal networking** for app-to-app communication. `coral.internal:3000` over IPv6 is dramatically more reliable than routing through the public proxy at `coral.fly.dev`.
164     - - **don't block the asyncio event loop** with CPU-bound work. use `run_in_executor()` for synchronous calls like spaCy NER so the websockets library can handle ping/pong.
    159 + - **decouple websocket reads from processing**. `websockets` handles ping/pong inside `recv()`. if your loop body does slow work (NER, HTTP), use a reader task + queue + worker task so `recv()` is never stalled. `run_in_executor()` alone isn't enough — `await`ing it still blocks the read loop.
165 160   - **bind to `::` not `0.0.0.0`** if your fly app needs to accept connections from other fly apps on the internal network.
166 161   - **the firehose_rate EWMA** (`alpha=0.3`, updated every 2s) is a good health signal. threshold of 10/s (vs normal ~50/s) catches real degradation without false positives.
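as a sanity check on the pattern this diff documents, here is a minimal self-contained sketch of the reader/worker split. `fake_stream` stands in for the real websocket's `async for raw in ws:`, and the queue size, stats keys, and drop policy mirror the diff; everything else is illustrative, not the bridge's actual code:

```python
import asyncio

async def fake_stream(n: int):
    """stand-in for the websocket's `async for raw in ws:` (no network here)."""
    for i in range(n):
        await asyncio.sleep(0)  # yield to the event loop, like a real read would
        yield f"msg-{i}"

async def main() -> dict:
    queue: asyncio.Queue = asyncio.Queue(maxsize=200)  # bounded: backlog is capped
    stats = {"messages": 0, "queue_drops": 0, "processed": 0}

    async def reader():
        # only read + enqueue; the read path is never blocked by slow work,
        # so the websockets library could keep answering pings from recv()
        async for raw in fake_stream(500):
            stats["messages"] += 1
            try:
                queue.put_nowait(raw)
            except asyncio.QueueFull:
                stats["queue_drops"] += 1  # drop policy: shed load, stay connected
        await queue.put(None)  # sentinel: stream ended

    async def worker():
        # slow per-message work (parse, NER, POST) lives here, off the read path;
        # CPU-bound calls would still go through run_in_executor from this task
        while (raw := await queue.get()) is not None:
            stats["processed"] += 1

    await asyncio.gather(reader(), worker())
    return stats

stats = asyncio.run(main())
print(stats)  # processed + queue_drops always equals messages
```

the bounded queue plus `put_nowait` is the load-shedding choice: when the worker falls behind, messages are counted as drops instead of stalling the read loop.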
+13 -2
docs/05-websocket-stability.md
···
80 80   
81 81   new stats line includes `queue=N/200 queue_drops=N` to monitor backpressure.
82 82   
83    - ## if this still fails
   83 + ## result
   84 + 
   85 + deployed `fcee752`. zero websocket disconnects since the reader/worker split. the old code (pings ON + thread pool, no reader/worker split) was getting `code=1006` every ~60s on the same machine.
   86 + 
   87 + early observations after deploy:
   88 + - 0 disconnects (old code: every 60s)
   89 + - 0 drops, 0 queue drops
   90 + - queue steady at 4-8/200 (never under pressure)
   91 + - latency: 3-10ms (old code degraded to 71-158ms)
   92 + - firehose rate: ~58/s, `firehose_degraded: false`
   93 + 
   94 + ## if it fails again
84 95   
85    - if we still see 1011/1006 after decoupling:
   96 + if we see 1011/1006 after decoupling:
86 97   - the problem is infra (fly.io LB idle timeout, upstream keepalive policies, turbostream instability)
87 98   - test the same code from a non-fly.io host to isolate the variable
88 99   - check turbostream's server-side keepalive settings
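the `firehose_rate` EWMA called out in 04-architecture's general lessons (`alpha=0.3`, updated every 2s, degraded below 10/s) can be sketched like this; the sample counts and the `update_rate` helper are illustrative, not the bridge's actual code:

```python
ALPHA = 0.3        # smoothing factor from the docs
WINDOW_SECS = 2.0  # update cadence from the docs
THRESHOLD = 10.0   # msgs/sec; normal firehose rate is ~50/s

def update_rate(ewma, count_in_window):
    """fold one 2-second message count into the running rate estimate."""
    sample = count_in_window / WINDOW_SECS
    if ewma is None:
        return sample  # first sample seeds the average
    return ALPHA * sample + (1 - ALPHA) * ewma

# illustrative counts per 2s window: healthy traffic, then a sustained stall
rate = None
for count in [100, 96, 104, 20, 4, 2, 2, 2, 2]:
    rate = update_rate(rate, count)
    print(f"rate={rate:5.1f}/s degraded={rate < THRESHOLD}")
```

a single low window (the 20-count sample) doesn't flip the flag; only a sustained stall drags the average under 10/s, which is why this signal catches real degradation without false positives.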