declarative relay deployment on hetzner relay.waow.tech
atproto

dashboard attribution panels, ops changelog, deploy recipe updates, indigo cache tuning

- add memory attribution + leak rate grafana panels, fix thread stack estimate (128K → 1M)
- add reconnect cronjob + publish-gpa recipe to justfile
- add ops-changelog.md documenting debugging sessions
- reduce indigo ident cache from 2M to 500K

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+268 -5
+113
docs/ops-changelog.md
# ops changelog

reverse-chronological log of operational changes, debugging sessions, and
deployment decisions. complements `context.md` (the big picture) and zlay's
`EXPERIMENTS.md` (memory leak experiments specifically).

---

## 2026-03-08 (later)

### fix: memory leak root cause found and fixed (zat v0.2.14)

**root cause**: `HttpTransport.fetch()` in zat called `aw.toArrayList().items`,
which transfers ownership of the internal buffer out of the `Allocating` writer,
resetting it to empty. the subsequent `defer aw.deinit()` then freed an empty
slice (a no-op). every HTTP request leaked one response-body-sized buffer.

**fix**: replace `aw.toArrayList().items` with `aw.written()`, which reads the
data without transferring ownership. `defer aw.deinit()` then properly frees
the buffer. one-line fix in zat `transport.zig:85`.

**how we found it**: added grafana attribution panels (memory attribution +
leak rate) showing malloc_in_use growing linearly while all tracked app
structures stayed stable. this proved the leak was inside glibc malloc, not
RocksDB/mmap/thread stacks. external review narrowed the search to the resolver
HTTP path; a code audit of `toArrayList()` vs `written()` semantics confirmed it.

**deploy**: bumped zlay to zat v0.2.14 (819dffe), deployed via `publish-remote`.

**verification**: watch RSS growth rate in grafana. should stabilize instead of
the previous ~411 MiB/hr linear climb. allow 1+ hours for confidence.

## 2026-03-08

### fix: exhausted hosts not reviving on requestCrawl (4574192)

**problem**: zlay had 2337 connected PDS hosts vs indigo's 2647. the DB showed
909 hosts stuck in `exhausted` status. the reconnect cronjob was running fine
(every 4h, ~2800 successful requestCrawl calls), but exhausted hosts accumulated
`failed_attempts` across cronjob cycles. a host that was temporarily down would
hit 15 failures, get marked exhausted, and then on the next cronjob cycle the
new worker would inherit the old failure count — one more failure and it was
immediately re-exhausted.

**fix**: reset `status='active'` and `failed_attempts=0` in `addHost()` after
a host passes `describeServer`. if a PDS is reachable right now, it deserves a
fresh start regardless of past failures.

**verification**: after deploy + a manual cronjob trigger, watch `connected_inbound`
in grafana. should climb from ~2337 toward ~2600+ as exhausted hosts revive.

### add memory attribution metrics (c3fd5f4)

**what**: new prometheus gauges for internal data structure capacities:
`relay_validator_cache_map_cap`, `relay_did_cache_map_cap`,
`relay_queued_set_map_cap`, `relay_evtbuf_cap`, `relay_outbuf_cap`,
`relay_workers_count`.

**why**: the memory leak investigation has eliminated glibc fragmentation
(the SmpAllocator experiment made it worse) and all per-frame arenas (the
page_allocator experiment had no effect). the leak is genuine
allocated-and-never-freed memory at ~290 MiB/hour. these metrics let us compute
`RSS - sum(known structures)` to see what's unaccounted for.

**note**: also attempted `mallinfo2()` for accurate all-arena malloc reporting,
but zig's bundled glibc headers don't include it. reverted to `mallinfo()`,
which only reports the main arena and whose `int` fields overflow past 2 GiB
(unreliable for multi-threaded programs).

### revert GPA experiment after OOM sawtooth

**what**: the GPA (GeneralPurposeAllocator) leak-detection build caused RSS to
grow at ~16 GiB/hour (55x worse than the base leak). it hit the 8 GiB OOM limit
every ~30 minutes, causing 7-8 crash/restart cycles over 3 hours.

**why it was so bad**: GPA tracks per-allocation metadata (size, stack trace,
mutex) for every alloc. at ~700 frames/sec × ~37 allocs/frame = ~26K tracked
allocations/sec, the metadata itself consumed far more memory than the actual
leak.

**lesson**: GPA needs either dramatically reduced load (a few PDS hosts) or a
much larger node (32+ GiB). see zlay `EXPERIMENTS.md` exp-002 for details.

## 2026-03-07

### exp-001: SmpAllocator — FAILED, reverted

replaced `std.heap.c_allocator` (glibc malloc) with `std.heap.smp_allocator`
(zig's mmap-based multi-threaded allocator). hypothesis: glibc per-thread arena
fragmentation was the root cause of RSS growth.

**result**: RSS grew at ~670 MiB/hour (worse than c_allocator's ~290 MiB/hour).
this definitively disproves glibc fragmentation as the cause. the leak is
genuinely allocated-and-never-freed memory.

reverted to c_allocator + restored `MALLOC_ARENA_MAX=2` and
`MALLOC_TRIM_THRESHOLD_=131072` env vars.

## 2026-03-06

### zlay submitted to relay leaderboard

zlay.waow.tech appeared on the community relay leaderboard at 99.13% coverage
(177,973 users, 1,177,207 events). relay.waow.tech (indigo) sat at 98.77%,
just below.

### memory leak investigation started

zlay's RSS grows linearly from ~1 GiB to the 8 GiB limit over ~12 hours, then
OOMs. investigation so far has ruled out per-frame arena churn, glibc
fragmentation, and all four dependencies (zat, websocket, rocksdb, pg.zig —
though pg.zig has a minor bug in `QueryRowUnsafe.deinit()` that can leak on
drain error).

see zlay `EXPERIMENTS.md` and `docs/allocation-audit.md` for the full trail.
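the `written()` vs `toArrayList()` distinction behind the zat fix is easy to trip over; a minimal sketch of the buggy and fixed patterns, assuming zig 0.15's `std.Io.Writer.Allocating` (names and surrounding code are illustrative, not zat's actual `transport.zig`):

```zig
const std = @import("std");

fn fetchBody(allocator: std.mem.Allocator) ![]u8 {
    var aw: std.Io.Writer.Allocating = .init(allocator);
    defer aw.deinit(); // frees aw's internal buffer, but only if aw still owns it

    // ... HTTP response body gets written into aw.writer here ...
    try aw.writer.writeAll("response body");

    // BUG: toArrayList() transfers ownership of the buffer out of `aw` and
    // resets it to empty, so the deferred deinit() frees an empty slice (a
    // no-op) and the body is never freed: one leaked buffer per request.
    //   const body = aw.toArrayList().items;

    // FIX: written() borrows the buffered bytes without taking ownership, so
    // the deferred deinit() still frees them; dupe before returning since the
    // slice must outlive `aw` here.
    return allocator.dupe(u8, aw.written());
}
```

the same shape applies anywhere an `Allocating` writer's contents are read inside the scope of its `defer deinit()`.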
+1 -1
indigo/deploy/relay-values.yaml
···
      # DATABASE_URL injected from secret via envFrom
      RELAY_PERSIST_DIR: /data
      RELAY_REPLAY_WINDOW: "2h"
-     RELAY_IDENT_CACHE_SIZE: "2000000"
+     RELAY_IDENT_CACHE_SIZE: "500000"
      LOG_LEVEL: "info"
      GOMEMLIMIT: "3GiB"
      GOMAXPROCS: "4"
+122 -4
zlay/deploy/zlay-dashboard.json
···
        ]
      },
      {
-       "title": "caches",
+       "title": "memory attribution",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
        "datasource": { "type": "prometheus", "uid": "prometheus" },
        "fieldConfig": {
          "defaults": {
+           "unit": "bytes",
+           "color": { "mode": "palette-classic" },
+           "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false, "stacking": { "mode": "normal" } }
+         },
+         "overrides": [
+           {
+             "matcher": { "id": "byName", "options": "RSS" },
+             "properties": [
+               { "id": "custom.fillOpacity", "value": 0 },
+               { "id": "custom.lineWidth", "value": 3 },
+               { "id": "custom.stacking", "value": { "mode": "none" } },
+               { "id": "color", "value": { "mode": "fixed", "fixedColor": "white" } }
+             ]
+           },
+           {
+             "matcher": { "id": "byName", "options": "unaccounted" },
+             "properties": [
+               { "id": "custom.fillOpacity", "value": 30 },
+               { "id": "color", "value": { "mode": "fixed", "fixedColor": "red" } }
+             ]
+           }
+         ]
+       },
+       "targets": [
+         { "expr": "max(relay_process_rss_bytes{job=\"zlay\"})", "legendFormat": "RSS", "refId": "A" },
+         { "expr": "max(relay_validator_cache_map_cap{job=\"zlay\"}) * 48", "legendFormat": "validator cache", "refId": "B" },
+         { "expr": "max(relay_did_cache_map_cap{job=\"zlay\"}) * 48", "legendFormat": "DID cache", "refId": "C" },
+         { "expr": "max(relay_queued_set_map_cap{job=\"zlay\"}) * 40", "legendFormat": "resolve set", "refId": "D" },
+         { "expr": "max(relay_outbuf_cap{job=\"zlay\"})", "legendFormat": "outbuf", "refId": "E" },
+         { "expr": "max(relay_evtbuf_cap{job=\"zlay\"}) * 256", "legendFormat": "evtbuf", "refId": "F" },
+         { "expr": "max(relay_workers_count{job=\"zlay\"}) * 1048576", "legendFormat": "thread stacks (~1M used each)", "refId": "G" },
+         { "expr": "max(relay_malloc_arena_bytes{job=\"zlay\"}) + max(relay_malloc_mmap_bytes{job=\"zlay\"})", "legendFormat": "malloc (arena+mmap)", "refId": "H" },
+         { "expr": "max(relay_process_rss_bytes{job=\"zlay\"}) - max(relay_malloc_arena_bytes{job=\"zlay\"}) - max(relay_malloc_mmap_bytes{job=\"zlay\"}) - max(relay_workers_count{job=\"zlay\"}) * 1048576", "legendFormat": "unaccounted", "refId": "I" }
+       ]
+     },
+     {
+       "title": "leak rate",
+       "type": "timeseries",
+       "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
+       "datasource": { "type": "prometheus", "uid": "prometheus" },
+       "fieldConfig": {
+         "defaults": {
+           "unit": "Bps",
+           "color": { "mode": "palette-classic" },
+           "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false }
+         },
+         "overrides": []
+       },
+       "targets": [
+         { "expr": "deriv(relay_process_rss_bytes{job=\"zlay\"}[10m])", "legendFormat": "RSS growth", "refId": "A" },
+         { "expr": "deriv(relay_malloc_in_use_bytes{job=\"zlay\"}[10m])", "legendFormat": "malloc in_use growth", "refId": "B" },
+         { "expr": "deriv(relay_malloc_arena_bytes{job=\"zlay\"}[10m])", "legendFormat": "malloc arena growth", "refId": "C" }
+       ]
+     },
+     {
+       "title": "caches",
+       "type": "timeseries",
+       "gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
+       "datasource": { "type": "prometheus", "uid": "prometheus" },
+       "fieldConfig": {
+         "defaults": {
            "color": { "mode": "palette-classic" },
            "custom": {
              "fillOpacity": 15,
···
      {
        "title": "errors",
        "type": "timeseries",
-       "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
+       "gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
        "datasource": { "type": "prometheus", "uid": "prometheus" },
        "fieldConfig": {
          "defaults": {
···
      {
        "title": "resolver",
        "type": "timeseries",
-       "gridPos": { "h": 8, "w": 8, "x": 0, "y": 24 },
+       "gridPos": { "h": 8, "w": 8, "x": 0, "y": 32 },
        "datasource": { "type": "prometheus", "uid": "prometheus" },
        "fieldConfig": {
          "defaults": {
···
      {
        "title": "disk usage",
        "type": "timeseries",
-       "gridPos": { "h": 8, "w": 8, "x": 16, "y": 24 },
+       "gridPos": { "h": 8, "w": 8, "x": 16, "y": 32 },
        "datasource": { "type": "prometheus", "uid": "prometheus" },
        "fieldConfig": {
          "defaults": {
+32
zlay/justfile
···
        --set grafana.adminPassword="${GRAFANA_ADMIN_PASSWORD:-prom-operator}" \
        --wait --timeout 5m
      kubectl apply -f deploy/zlay-servicemonitor.yaml
+     kubectl apply -f deploy/zlay-reconnect-cronjob.yaml

      echo "==> applying grafana ingress"
      sed "s|GRAFANA_DOMAIN_PLACEHOLDER|$ZLAY_METRICS_DOMAIN|g" ../shared/deploy/grafana-ingress.yaml \
···
      kubectl rollout status deployment/zlay -n zlay --timeout=120s

      echo "==> deployed ${IMAGE}"
+     DEPLOY
+
+ # build with GPA leak detection enabled (exp-002). SIGTERM to get leak report.
+ # usage: just zlay publish-gpa ReleaseSafe
+ publish-gpa optimize="ReleaseSafe":
+     #!/usr/bin/env bash
+     set -euo pipefail
+     ssh root@$(just server-ip) <<'DEPLOY'
+     set -euo pipefail
+     cd /opt/zlay
+     git pull --ff-only
+
+     TAG=$(git rev-parse --short HEAD)
+     IMAGE="atcr.io/zzstoatzz.io/zlay:{{ optimize }}-gpa-${TAG}"
+
+     echo "==> building binary (${TAG}, {{ optimize }}, GPA enabled)"
+     zig build -Doptimize={{ optimize }} -Duse_gpa=true -Dtarget=x86_64-linux-gnu
+
+     echo "==> building container image (${IMAGE})"
+     buildah bud -t "${IMAGE}" -f Dockerfile.runtime .
+
+     echo "==> importing into k3s containerd"
+     buildah push "${IMAGE}" docker-archive:/tmp/zlay.tar:"${IMAGE}"
+     ctr -n k8s.io images import /tmp/zlay.tar
+     rm -f /tmp/zlay.tar
+
+     echo "==> updating deployment image"
+     kubectl set image deployment/zlay -n zlay main="${IMAGE}"
+     kubectl rollout status deployment/zlay -n zlay --timeout=120s
+
+     echo "==> deployed ${IMAGE} (GPA enabled — SIGTERM to get leak report)"
      DEPLOY

  # --- status ---
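for reference, the `-Duse_gpa=true` flag in the recipe above presumably selects the allocator at build time; gating GPA behind such a flag generally looks like the sketch below (illustrative wiring only, not zlay's actual code; the `use_gpa` constant stands in for a value threaded through `b.addOptions()`):

```zig
const std = @import("std");

// assumption: in zlay this would come from build options, not a hardcoded const.
const use_gpa = true;

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    // deinit() returns .leak if any allocation was never freed, printing a
    // stack trace per leak. this is the "leak report" the recipe's SIGTERM
    // note refers to: it only fires if shutdown reaches this defer.
    defer if (gpa.deinit() == .leak) std.debug.print("leaks detected\n", .{});

    const allocator = if (use_gpa) gpa.allocator() else std.heap.c_allocator;

    const buf = try allocator.alloc(u8, 64);
    allocator.free(buf); // omitting this free would show up in the report
}
```

the per-allocation metadata GPA keeps for those reports is exactly what made the experiment blow past the OOM limit under production load.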