commits
when isAccountActive() returns false, query the PDS via
com.atproto.sync.getRepoStatus before dropping the commit/sync.
if the PDS reports the account as active, update the DB and pass
through — fixes stale deactivated status from missed #account events
(e.g. eurosky.social: 2,196 accounts incorrectly marked deactivated).
matches indigo's EnsureAccountActive → FetchAccountStatus pattern.
also fixes memory leak in pullHosts: toArrayList().items → written().
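the re-check path, as a rough python sketch — `AccountStore` and `fetch_repo_status` are hypothetical stand-ins, not the actual zlay API:

```python
class AccountStore:
    """minimal in-memory stand-in for the relay's account DB (hypothetical)."""
    def __init__(self):
        self.active = {}

    def is_active(self, did):
        return self.active.get(did, True)

    def set_active(self, did, flag):
        self.active[did] = flag


def should_pass_through(store, fetch_repo_status, did):
    """return True if the event should be processed despite a
    locally-deactivated status."""
    # fast path: local state already says active
    if store.is_active(did):
        return True
    # local DB says deactivated -- confirm with the PDS
    # (com.atproto.sync.getRepoStatus) before dropping the event,
    # in case we missed an #account event
    status = fetch_repo_status(did)
    if status.get("active"):
        store.set_active(did, True)  # heal the stale record
        return True
    return False
```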
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
content migrated to relay repo's docs/ops-changelog.md.
memory leak is fixed (zat v0.2.14), experiments are closed,
open items tracked centrally.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- zat v0.2.16: websocket.zig reads full HTTP body on TCP split writes
- collection index: index on both create and update ops (matches indigo)
- TODO.md: update done list, promote error frame cleanup to current
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uidForDidFromHost no longer auto-sets host_id. callers now resolve the
DID doc synchronously to verify the PDS endpoint matches the incoming
host before processing. events from non-authoritative hosts are dropped.
- validator.zig: add resolveHostAuthority(did, host_id) -> accept/migrate/reject
- frame_worker.zig: gate on host authority check for new/migrated accounts
- subscriber.zig: same check mirrored for direct subscriber path
- event_log.zig: uidForDidFromHost returns is_new without setting host_id
- #identity events exempt (any PDS can emit them, matching indigo)
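the decision table, sketched in python — a hypothetical simplification of validator.zig's resolveHostAuthority, not its real signature:

```python
def resolve_host_authority(frame_type, doc_pds_host, incoming_host, known_host):
    """decide accept/migrate/reject for an event based on which PDS the
    DID document names as authoritative (hypothetical simplification)."""
    if frame_type == "#identity":
        return "accept"            # any PDS may emit #identity (indigo parity)
    if doc_pds_host != incoming_host:
        return "reject"            # event came from a non-authoritative host
    if known_host is None or known_host == incoming_host:
        return "accept"            # new account, or same host as before
    return "migrate"               # authoritative host changed: migration
```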
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spec says t must not be present when op is -1. removes the t: "#info"
header from both buildErrorFrame and encodeInfoFrame.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- getHostStatus: 400 → 404 for missing host
- listReposByCollection: limit cap 1000 → 2000 (+ backing buffer)
- requestCrawl: all errors use XRPC {"error":"..","message":".."} format
- listRepos: always emit head and rev fields
- invalid websocket cursor: send #info InvalidCursor frame instead of
silently falling back to live stream
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit 690958a9e73a8d9df33b0c5606602fb40758538a.
C+ (6.2/10) vs indigo's B (7.8/10). tiered plan to close the gap:
lossless ingestion, persist-before-broadcast, validation tightening,
api conformance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
submit() now blocks when the target worker's queue is full, polling
shutdown every 100ms. cursor advances only after pool accepts the frame.
no more silent frame drops or premature cursor advancement.
new pool_backpressure metric tracks how often submit blocks >1ms.
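the submit loop, as a python sketch (hypothetical shape — the real submit() lives in the zig worker pool):

```python
import queue
import threading


def submit(worker_queue, frame, shutdown, poll_s=0.1):
    """block until the worker queue accepts the frame, polling shutdown
    every poll_s seconds. returns True once enqueued (only then may the
    cursor advance); False if shutdown fired first."""
    while not shutdown.is_set():
        try:
            worker_queue.put(frame, timeout=poll_s)
            return True          # pool accepted: safe to advance cursor
        except queue.Full:
            continue             # backpressure: keep waiting, never drop
    return False                 # shutting down; frame not accepted
```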
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
commits for the same DID can land on different workers when keyed by
host_id, causing chain break races (~727/sec). hash the DID from the
decoded payload ("repo" for #commit, "did" for #identity/#account/#sync)
so all events for the same account serialize on the same worker.
matches indigo's parallel scheduler which keys AddWork by repo DID.
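key extraction per frame type, sketched in python with a stand-in FNV-1a hash (python's builtin hash() is salted per process, so it can't serve as a stable partition key):

```python
def hash_did(did):
    """stable 32-bit FNV-1a hash (stand-in for the real hash function)."""
    h = 2166136261
    for b in did.encode():
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h


def partition_key(frame_type, payload, n_workers=16):
    """pick the worker index so that all events for one account land on
    the same worker (sketch). the DID lives under "repo" for #commit and
    under "did" for #identity / #account / #sync."""
    did = payload["repo"] if frame_type == "#commit" else payload["did"]
    return hash_did(did) % n_workers
```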
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
switch from dropping events that exceed per-host rate limits to blocking
until the window opens. this creates TCP backpressure that slows the
upstream PDS, matching indigo's waitForLimiter behavior. revert limits
to indigo defaults (50/s, 2500/hr, 20k/day) since they now trigger
backpressure rather than drops.
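the drop-to-block switch, as a minimal python sketch (injected clock/sleep are for testability; names are hypothetical):

```python
import time


class BlockingLimiter:
    """sketch: block the reader thread (and thus the TCP connection)
    instead of dropping events when the per-second limit is exhausted."""
    def __init__(self, per_second, clock=time.monotonic, sleep=time.sleep):
        self.per_second = per_second
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.count = 0

    def acquire(self):
        while True:
            now = self.clock()
            if now - self.window_start >= 1.0:
                self.window_start = now   # window rolled over
                self.count = 0
            if self.count < self.per_second:
                self.count += 1
                return                    # admitted
            # window exhausted: wait for it to open instead of dropping
            self.sleep(self.window_start + 1.0 - now)
```

blocking here stalls the websocket read loop, so the kernel's TCP receive window fills and the upstream PDS slows down on its own.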
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
after raising the per-second limit (50→200), per-tier metrics showed the
hourly limit became the new bottleneck: 91% of drops at 50min uptime. small PDS
hosts with ~61 accounts were hitting the 2561/hr ceiling.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
per-tier metrics showed 100% of rate limiting was the per-second tier
during startup catchup. hour and day limits contributed zero drops.
200/s still protects against abusive hosts while giving small PDS
hosts headroom during burst catchup.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
no more guessing which tier is actually throttling.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
integer-second timestamps made the per-second window degenerate:
elapsed was always 0, so prev_count was always weighted at 100%,
effectively halving throughput vs fixed windows. millisecond
precision gives proper sub-second interpolation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fixed-window counters penalize bursts at window boundaries. sliding
windows interpolate prev window count by time elapsed, smoothing
burst tolerance. reverts 3x base limit hack — same limits as indigo now.
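the interpolation in one function (python sketch, names hypothetical) — with integer-second timestamps, elapsed_ms truncates to 0 and the previous window is always weighted at 100%, which is exactly the degeneracy the millisecond fix removes:

```python
def sliding_count(prev_count, curr_count, elapsed_ms, window_ms=1000):
    """sliding-window estimate: weight the previous window's count by
    the fraction of it still inside the lookback window."""
    weight = max(0.0, (window_ms - elapsed_ms) / window_ms)
    return prev_count * weight + curr_count
```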
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rate limiter was using flat 2500/hour for all hosts regardless of size.
indigo scales by account count (2500 + N accounts per hour, etc).
large PDS hosts like bsky shards were getting rate-limited ~50%.
also gate malloc_trim behind Linux check so local builds work on macOS.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
toArrayList() leak is fixed in zat v0.2.14 — safe to reuse
connections now. resolver queue is saturated at 100K with
fresh TLS handshakes per resolve; persistent connections
should improve throughput significantly.
watch: queue length, skipped validations, memory slope.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
when requestCrawl re-validates an exhausted host via describeServer,
reset its status to 'active' and clear failed_attempts. without this,
exhausted hosts accumulate failures across cronjob cycles and
immediately re-exhaust on a single connection failure.
909 of 3140 tracked hosts were stuck in 'exhausted' state.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
new prometheus gauges for internal data structure capacities:
- validator_cache_map_cap, did_cache_map_cap, queued_set_map_cap
- evtbuf_cap, outbuf_cap, workers_count
switch mallinfo() to mallinfo2() for accurate multi-arena reporting
(size_t fields, all thread arenas instead of just main arena).
pure observability — zero behavioral change.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
build with -Duse_gpa=true to wrap all allocations in
GeneralPurposeAllocator. on clean shutdown (SIGTERM), GPA reports
every allocation that was never freed, with 8-frame stack traces.
zero overhead when disabled (default).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SmpAllocator RSS grew at ~670 MiB/hour vs c_allocator's ~290 MiB/hour.
disproves glibc fragmentation hypothesis — the leak is genuine.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
glibc malloc fragments under cross-thread alloc/free patterns:
~2750 subscriber threads alloc frame data, 16 worker threads free it.
glibc's per-thread arenas hold freed pages indefinitely, causing
~290 MiB/hour linear RSS growth.
SmpAllocator uses mmap/munmap directly with thread-local freelists
and cross-thread reclamation — no glibc malloc involvement.
also adds allocation audit doc and EXPERIMENTS.md for tracking.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
calls malloc_trim(0) every 10 minutes to ask glibc to return free
pages to the OS. if RSS drops after each trim, the linear growth is
malloc fragmentation rather than a true leak.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit f6f9d9b4caaa74211a7bc62c6116496e2d98bb0c.
glibc malloc was not returning freed arena pages to the OS under
high-churn multi-thread load (~700 frames/sec across 2250+ subscriber
threads + 16 worker threads), causing linear RSS growth from 0 to
~3.5 GiB over 12h despite correct defer arena.deinit() everywhere.
page_allocator uses mmap/munmap directly — pages are guaranteed to
return to the OS on deinit(). ArenaAllocator batches small allocs
into chunks, so actual mmap calls per frame are ~1-3 (negligible).
long-lived state (caches, DB, ring buffer) stays on c_allocator.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
phase 1: fix extractOps/checkCommitStructure to read firehose "path" field
instead of nonexistent "collection"/"rkey" — verify_commit_diff was dead code.
phase 2: add since/prevData chain continuity checks and future-rev rejection.
log-only + chain_breaks metric, no commits dropped yet.
phase 3: conditional upsert (WHERE rev < EXCLUDED.rev) on updateAccountState
to prevent concurrent workers from rolling back rev on same DID.
also: fix deploy docs in CLAUDE.md (justfile module syntax, kubeconfig path).
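the phase-3 guard's semantics, simulated in python (the real thing is the SQL `ON CONFLICT ... DO UPDATE ... WHERE rev < EXCLUDED.rev`; the dict stands in for the table):

```python
def upsert_account_rev(table, did, new_rev):
    """simulate the conditional upsert: a concurrent worker carrying a
    stale rev can never roll the stored rev backwards. TID revs compare
    lexicographically, so plain string comparison works here."""
    cur = table.get(did)
    if cur is None or cur < new_rev:
        table[did] = new_rev
        return True   # row written
    return False      # stale write ignored
```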
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CLAUDE.md loads into every conversation — keep it to guardrails only
(23 lines, down from 100). moved pg.zig API patterns, rocksdb-zig
traps, websocket.zig fork details, metrics gotchas, and deploy
operational tips to docs/gotchas.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
stash operational context from zat project memory into the zlay repo
where it belongs — pg.zig API patterns, websocket.zig fork details,
metrics gotchas (mallinfo overflow, File.reader API), and deploy
operational tips (minimal containers, helm SHA tag dance).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ReleaseSafe deployed successfully — ~1.1 GiB RSS at 2,255 hosts
(vs ~2.7 GiB debug). update all docs to reflect current state:
- README: remove "debug" default, remove "not in prod" caveat
- deployment.md: ReleaseSafe is production default, add frame pool
to memory tuning section, update resource usage
- design.md: 8 MiB stacks, updated RSS numbers, per-thread breakdown
- incident doc: current state shows successful ReleaseSafe deploy
- CLAUDE.md: operational context for AI-assisted development
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bump default_stack_size from 4 MiB to 8 MiB — TLS handshake path
needs ~134 KiB in ReleaseSafe (tls.Client.init 84 KiB + KeyShare.init
48 KiB). only touched pages count as RSS.
symbol analysis of cross-compiled ReleaseSafe binary shows reader
thread per-frame hot path uses ~2.7 KiB of stack. the thread pool
(f0c7baf) moved all heavy work off reader threads, so the 3.9 MiB
per-thread RSS from the incident should not recur.
also updates docs to reflect thread pool architecture (design.md,
deployment.md, README.md) and adds readiness analysis to incident doc.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the migration queue grew unbounded during warm-up because resolveLoop
only drained it when the DID queue was empty (which never happened with
~1000 new DIDs/sec vs ~40/sec drain rate). RSS grew ~17 MB/min.
like indigo and rsky, host validation now happens inline when resolving
a DID document — no separate migration queue. the resolve queue is
capped at 100K entries; dropped DIDs can be re-queued on future cache
misses since they're not added to the dedup set.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
separates lightweight reader threads from heavy frame processing workers.
reader threads keep cursor tracking, rate limiting, and frame type filtering;
heavy work (CBOR decode, validation, DB persist, broadcast) offloaded to a
configurable pool of N workers (default 16, env FRAME_WORKERS).
per-host ordering preserved via key-partitioned queues (host_id % N).
broadcast_order mutex unchanged, just fewer contenders.
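the partitioning scheme in miniature (python sketch, hypothetical names; real queues are bounded and drained by worker threads):

```python
from collections import deque


class WorkerPool:
    """key-partitioned pool sketch: one queue per worker, and events for
    the same key always route to the same queue, so per-host ordering
    survives the fan-out."""
    def __init__(self, n_workers=16):
        self.queues = [deque() for _ in range(n_workers)]

    def submit(self, host_id, frame):
        self.queues[host_id % len(self.queues)].append(frame)

    def drain(self, worker_idx):
        q = self.queues[worker_idx]
        out = []
        while q:
            out.append(q.popleft())
        return out
```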
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the queued_set dedup was accidentally removed in b96335c (LRU refactor).
without it, every cache miss for the same DID appends a fresh dupe to the
queue — 18K entries within 6 minutes. adds regression test.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8 MiB × 2,500 threads touches too many pages over time — RSS grew
monotonically to 8 GiB (3 OOM kills in 6 hours). 4 MiB is 2x the
proven debug floor. also updates incident doc to reflect current state.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
during startup with ~2,700 threads, mutex contention blocks the
single-threaded metrics server indefinitely. switch all metric-
gathering methods (validator queues, LRU caches, event buffer,
ring buffer) to tryLock — returns 0 instead of blocking when
the lock is contended. prometheus tolerates missing data points.
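the pattern, in python (sketch; `threading.Lock.acquire(blocking=False)` plays the role of zig's tryLock):

```python
import threading


def gauge_snapshot(lock, read_value):
    """non-blocking metrics read: report 0 when the lock is contended
    rather than stalling the scrape. prometheus tolerates a missing or
    zero sample; a wedged exporter it does not."""
    if lock.acquire(blocking=False):   # tryLock
        try:
            return read_value()
        finally:
            lock.release()
    return 0
```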
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the single-threaded metrics server on :3001 blocks on accept/read with
no timeout. if any client connects but doesn't complete the request
(e.g. during startup storm), the thread blocks forever and all
subsequent scrapes fail. sets SO_RCVTIMEO=5s on accepted connections.
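the equivalent in python, as a sketch — the zig code sets SO_RCVTIMEO directly; `settimeout` is python's portable counterpart, and the function name is hypothetical:

```python
import socket


def harden_accepted_conn(conn, seconds=5.0):
    """bound how long a read on an accepted connection may block, so a
    client that connects but never completes its request raises a
    timeout instead of wedging the single accept/read loop forever."""
    conn.settimeout(seconds)
    return conn
```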
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pullHosts pages through bsky.network API (~6 HTTPS requests), taking
30+ seconds. previously blocked start() → main() → HTTP server.
now everything (pullHosts + listActiveHosts + spawnWorker) runs in
the background thread, so the HTTP server and health probes come up
within seconds of process start.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
throttled batching (25/2s) made memory WORSE — extended the overlap
of threads in TLS handshake phase, reaching 8+ GiB. unthrottled
spawning produces a higher but shorter spike (~3 GiB for ~10s) as
many connections fail fast and free memory quickly. background thread
preserved so HTTP server + health probes start immediately.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
confirmed OOM kill (not stack overflow) via dmesg. ReleaseSafe uses
~3.9 MiB RSS/thread vs 0.4 MiB debug — 10x difference from inlining.
2,750 threads need ~10.7 GiB, far beyond container limit. throttled
startup and background spawning don't help with steady-state RSS.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
move host loading and throttled worker spawning into a dedicated
background thread so start() returns immediately. this lets the
HTTP server and health probes come up before all 2,750 hosts are
connected — matching indigo's ResubscribeAllHosts pattern where
Subscribe spawns a goroutine and returns instantly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
previous throttle (50/200ms) still OOM-killed — all 2,750 threads
accumulated within 11s, overwhelming the cgroup limit during
concurrent TLS handshakes. new rate: 25 per 2s spreads startup
over ~3.7 minutes, bounding concurrent handshakes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spawn workers in batches of 50 with 200ms pauses between batches.
all 2,750 hosts confirmed OOM-killed by cgroup at ~3 GiB RSS when
spawned simultaneously — thundering herd of TLS handshakes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- add /_healthz (trivial liveness, no DB) and /_readyz (DB check)
- consolidate all thread stack sizes to reference main.default_stack_size
- add build_info{git_sha,optimize} canary metric via build options
- embed git SHA and optimize mode at compile time
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- mallinfo: arena, in-use, free, mmap bytes
- /proc/self/status: VmHWM, RssAnon (in addition to existing Threads)
- resolve queue: queue length, dedup set count
- metrics buffer bumped to 64 KiB
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zig 0.15's ReleaseSafe inlines TLS/crypto safety checks, producing stack
frames that overflow a 4 MiB stack. Debug and ReleaseFast run at 4 MiB,
but ReleaseFast drops the safety checks (a double free surfaced only as
a glibc abort). 8 MiB with ReleaseSafe gives both safety and enough
stack space.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
subscriber, resolver, consumer, crawl, flush, and backfill threads
were hardcoded at 2 MiB while only GC/metrics used default_stack_size.
TLS handshake paths on some PDS hosts overflow at 2 MiB.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the migration queue had no dedup — unlike the resolve queue which uses
queued_set. every message from a DID with a host mismatch would dupe
the DID string from c_allocator and append to the queue. with thousands
of mismatched DIDs producing messages faster than the 4 resolver threads
could drain, the queue + duped strings grew without bound (~430 MiB/hr).
- add migration_pending set to dedup migration queue entries
- on confirmed migration: remove from pending (allow re-evaluation)
- on rejected migration: leave in pending (suppress re-queueing)
- evictKey (#identity events) clears pending for that DID
- add relay_validator_migration_pending prometheus metric
- convert validator cache and DID cache to proper LRU (from lru.zig)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the validator key cache was set to 500K entries but the pod OOMs at
~450K — eviction never triggers. indigo uses 5M with LRU+TTL eviction;
rsky uses 262K. 250K matches rsky and fits within the 3 GiB pod limit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_health on port 3000 competed with consumer WebSocket traffic, causing
k8s probe timeouts under load. also returned unconditional 200 without
actually checking database connectivity.
- _health now runs SELECT 1 against postgres, returns 500 on failure
- metrics port (3001) serves /_health alongside /metrics, with routing
- k8s probes move to port 3001 (in relay deploy config)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- drop commits where rev <= stored rev before persist (indigo ingest.go:114)
- verify DID→PDS host binding on first-seen accounts (async, reject on mismatch)
- on signature failure, evict cached key + re-resolve (sync spec guidance)
- add spec conformance tests for size limits and unknown frame types
- document deliberate policy divergences from indigo in design.md
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>