atproto utils for zig zat.dev
atproto sdk zig

websocket read blocks forever when remote disappears without FIN/RST #5

closed, opened by zzstoatzz.io

problem

when the remote jetstream relay disappears without cleanly closing the TCP connection (e.g. server resize, network path break, firewall drop), JetstreamClient.subscribe hangs forever and never reconnects.

the reconnection logic in subscribe is solid — while (true) with exponential backoff, host rotation, cursor resumption — but it never gets a chance to run because connectAndRead never returns.

root cause

in the websocket client's readLoop, after handshake completes, readTimeout(0) is called (infinite blocking). the underlying recv() syscall blocks indefinitely. no SO_KEEPALIVE or TCP_USER_TIMEOUT is set on the socket either.

call chain: subscribe() → connectAndRead() → client.readLoop() → read() → recv(), which stays stuck forever.

observed behavior

bufo-bot was connected to jetstream.waow.tech. the relay was resized (no clean TCP teardown). the bot's posts_checked counter froze, but the process stayed alive (stats server thread kept it up). no reconnection attempted. required a manual fly machines restart to recover.

suggested fix

set SO_RCVTIMEO on the websocket connection after handshake in connectAndRead. the firehose is chatty enough that 60-90s of silence reliably indicates a dead connection. the websocket library already exposes readTimeout(ms) publicly:

// in connectAndRead, after handshake:
try client.readTimeout(60_000); // 60s read timeout

when the timeout fires, read() returns error.WouldBlock, readLoop returns, connectAndRead returns, and the existing reconnect loop in subscribe kicks in.
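to make the mechanism concrete, here is a minimal standalone sketch of what a receive timeout does at the socket level. it is not the websocket library's code: setReadTimeout is a hypothetical helper, the loopback listener stands in for a relay that has gone silent, and the 200 ms value is only there to keep the demo short. it assumes a POSIX target and a Zig 0.14-era standard library (older releases name the timeval fields tv_sec and tv_usec).

```zig
const std = @import("std");
const posix = std.posix;

/// Hypothetical helper (not part of zat or the websocket library):
/// sets SO_RCVTIMEO so a blocked read gives up after `ms` milliseconds.
fn setReadTimeout(fd: posix.socket_t, ms: u32) !void {
    const tv = posix.timeval{
        .sec = @intCast(ms / 1000),
        .usec = @intCast((ms % 1000) * 1000),
    };
    try posix.setsockopt(fd, posix.SOL.SOCKET, posix.SO.RCVTIMEO, std.mem.asBytes(&tv));
}

pub fn main() !void {
    // A loopback listener on an ephemeral port plays the silent relay.
    const addr = try std.net.Address.parseIp("127.0.0.1", 0);
    var server = try addr.listen(.{});
    defer server.deinit();

    const client = try std.net.tcpConnectToAddress(server.listen_address);
    defer client.close();
    const peer = try server.accept();
    defer peer.stream.close();

    try setReadTimeout(client.handle, 200);

    var buf: [64]u8 = undefined;
    // The peer never writes, so without the timeout this read would block forever.
    if (client.read(&buf)) |n| {
        std.debug.print("unexpected: read {d} bytes\n", .{n});
    } else |err| switch (err) {
        error.WouldBlock => std.debug.print("read timed out; caller can reconnect\n", .{}),
        else => return err,
    }
}
```

the point of the sketch is the error path: once the kernel gives up waiting, the blocked read surfaces as an ordinary error that the caller can turn into a reconnect.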

SO_KEEPALIVE could be layered on as belt-and-suspenders but SO_RCVTIMEO alone would fix this.
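for reference, enabling keepalive is a single socket option. the sketch below is an assumption about how it could be wired up, not existing zat code: enableKeepAlive is a hypothetical helper, and the demo applies it to a fresh unconnected socket only to show the call. same assumptions as before: POSIX target, Zig 0.14-era standard library.

```zig
const std = @import("std");
const posix = std.posix;

/// Hypothetical helper (not part of zat): turn on TCP keepalive probes for `fd`.
fn enableKeepAlive(fd: posix.socket_t) !void {
    const on: c_int = 1;
    try posix.setsockopt(fd, posix.SOL.SOCKET, posix.SO.KEEPALIVE, std.mem.asBytes(&on));
}

pub fn main() !void {
    // A throwaway TCP socket; in practice this would be the websocket's fd.
    const fd = try posix.socket(posix.AF.INET, posix.SOCK.STREAM, 0);
    defer posix.close(fd);

    try enableKeepAlive(fd);
    std.debug.print("SO_KEEPALIVE enabled\n", .{});
}
```

note that with stock Linux settings the first keepalive probe is only sent after two hours of idle time, so keepalive on its own would detect a dead relay far more slowly than a 60s read timeout unless the per-socket or system-wide keepalive intervals are also lowered.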

AT URI
at://did:plc:xbtmt2zjwlrfegqvch7fboei/sh.tangled.repo.issue/3mg7g4xydqn22