A fork of mtelver's day10 project

Add gap analysis and fresh docs design

Gap analysis comparing day10 to ocaml-docs-ci for docs.ocaml.org migration:
- Feature comparison matrix
- Identified critical gaps (epochs, change detection)
- Noted that OCluster distribution is not needed (single machine in practice)
- Revised timeline: 16 weeks instead of 22

Fresh docs design ("always fresh, always safe"):
- Always solve against current opam-repository (no stale cross-refs)
- Atomic package-level updates via directory swap
- Epoch transitions for major structural changes
- Build and docs phases independent (doc failures don't block builds)
- Retry with backoff, fail fast on errors
- Webhook trigger + cron fallback
- Zulip notifications on failures
- Permanent log retention

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

+964
+697
docs/GAP_ANALYSIS.md
# Gap Analysis: Replacing ocaml-docs-ci with day10

**Date:** 2026-02-03
**Purpose:** Comprehensive comparison of `day10` (OHC) and `ocaml-docs-ci` to identify features, gaps, and requirements for replacing ocaml-docs-ci as the documentation CI system for docs.ocaml.org.

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Architecture Overview](#architecture-overview)
3. [Feature Comparison Matrix](#feature-comparison-matrix)
4. [Detailed Gap Analysis](#detailed-gap-analysis)
5. [Ecosystem Integration](#ecosystem-integration)
6. [Implementation Roadmap](#implementation-roadmap)
7. [Risk Assessment](#risk-assessment)
8. [Recommendations](#recommendations)
9. [Appendix A: File Structure Comparison](#appendix-a-file-structure-comparison)
10. [Appendix B: Glossary](#appendix-b-glossary)
11. [Appendix C: Related Repositories](#appendix-c-related-repositories)

---

## Executive Summary

### Current State

| Aspect | day10 | ocaml-docs-ci |
|--------|-------|---------------|
| **Primary Purpose** | Health checking OPAM packages (build + docs) | CI pipeline for docs.ocaml.org |
| **Architecture** | Standalone CLI with fork-based parallelism | OCurrent-based reactive pipeline |
| **Container Runtime** | runc/OCI with overlay2 layers | OCluster (single machine in practice) |
| **Doc Generation** | Uses odoc_driver_voodoo | Uses voodoo-do + odoc_driver_voodoo |
| **State Management** | File-based (layer.json) | SQLite database + OCurrent cache |
| **Scalability** | Single machine, forked workers | Single machine (OCluster theoretical) |

### Key Findings

**Important Context:** While ocaml-docs-ci has OCluster infrastructure for theoretically distributed execution, **in practice it runs on a single machine**. This significantly reduces the gap between the two systems.

**day10 Strengths:**
- Simpler, more portable architecture (Linux/Windows/FreeBSD)
- Efficient overlay2-based incremental building
- Direct container control without orchestration overhead
- Standalone operation without external services
- Comparable parallelism model (fork-based vs single-machine OCluster)

**ocaml-docs-ci Strengths:**
- Production-proven for docs.ocaml.org
- Reactive pipeline with automatic rebuilding
- Rich monitoring and status APIs
- Epoch-based atomic updates
- Web UI for status visibility

### Migration Complexity: **MODERATE**

Since both systems effectively run on single machines, the gap is smaller than it might appear from the architecture diagrams. Both systems rely on the same core documentation toolchain (voodoo/odoc_driver_voodoo). The main gaps are in orchestration (reactive vs manual), state management, and deployment infrastructure (epochs).

---

## Architecture Overview

### day10 Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                          day10 CLI                          │
├─────────────────────────────────────────────────────────────┤
│  Commands: health-check | ci | batch | list | sync-docs     │
└─────────────────────┬───────────────────────────────────────┘
                      │
         ┌────────────┼────────────┐
         ▼            ▼            ▼
  ┌─────────────┐ ┌──────────┐ ┌──────────────┐
  │   Solver    │ │ Builder  │ │   Doc Gen    │
  │opam-0install│ │   runc   │ │ odoc_driver  │
  └─────────────┘ └──────────┘ └──────────────┘
         │            │            │
         └────────────┼────────────┘
                      ▼
         ┌────────────────────────┐
         │    Overlay2 Layers     │
         │      (cache_dir/)      │
         │  ├── base/fs           │
         │  ├── build-{hash}/     │
         │  ├── doc-{hash}/       │
         │  └── layer.json        │
         └────────────────────────┘
```

**Key Characteristics:**
- Single-machine execution with fork-based parallelism
- Layer-based caching with overlay2 filesystem
- Deterministic hash-based layer identification
- Direct runc container execution

### ocaml-docs-ci Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        ocaml-docs-ci                        │
│                     (OCurrent Pipeline)                     │
├─────────────────────────────────────────────────────────────┤
│  Stages: Track → Solve → Prep → Bless → Compile → Publish   │
└─────────────────────┬───────────────────────────────────────┘
                      │
    ┌─────────────────┼─────────────────┐
    ▼                 ▼                 ▼
┌─────────┐     ┌───────────┐    ┌──────────────┐
│ Solver  │     │ OCluster  │    │   Storage    │
│ Service │     │ (Workers) │    │    Server    │
│(Cap'n P)│     │           │    │ (SSH/rsync)  │
└─────────┘     └───────────┘    └──────────────┘
                                        │
                             ┌──────────┴──────────┐
                             ▼                     ▼
                    ┌─────────────────┐   ┌─────────────────┐
                    │     prep/       │   │     html/       │
                    │  (voodoo-prep)  │   │  (HTML output)  │
                    └─────────────────┘   └─────────────────┘
                                                   │
                                                   ▼
                                          ┌─────────────────┐
                                          │ docs.ocaml.org  │
                                          │ (epoch symlinks)│
                                          └─────────────────┘
```

**Key Characteristics:**
- OCluster infrastructure (but single-machine in practice)
- Reactive pipeline (rebuilds on changes)
- SQLite for state tracking
- Cap'n Proto for service communication
- Epoch-based atomic deployments

**Note:** Despite the distributed architecture in the diagram, ocaml-docs-ci currently runs all workers on a single machine, making it comparable to day10's fork-based approach.

---

## Feature Comparison Matrix

### Core Features

| Feature | day10 | ocaml-docs-ci | Gap Level |
|---------|-------|---------------|-----------|
| **Package Building** | ✅ Full | ✅ Full | None |
| **Documentation Generation** | ✅ odoc_driver_voodoo | ✅ voodoo + odoc_driver | None |
| **Dependency Solving** | ✅ opam-0install | ✅ opam-0install (service) | Minor |
| **Multiple OCaml Versions** | ✅ Configurable | ✅ Multiple tracked | None |
| **Blessing System** | ✅ Implemented | ✅ Implemented | None |
| **Incremental Building** | ✅ overlay2 layers | ✅ prep caching | Different approach |

### Orchestration & Scheduling

| Feature | day10 | ocaml-docs-ci | Gap Level |
|---------|-------|---------------|-----------|
| **Parallelism** | ✅ Fork-based (--fork N) | ✅ OCluster (single machine) | Similar |
| **Distributed Execution** | ❌ Single machine | ⚠️ Single machine (theory: multi) | None (in practice) |
| **Reactive Rebuilding** | ❌ Manual trigger | ✅ OCurrent reactive | **MAJOR GAP** |
| **Job Queuing** | ❌ None | ✅ OCluster scheduler | Minor |
| **Automatic Change Detection** | ❌ Manual | ✅ Git-based tracking | **MAJOR GAP** |

### State Management

| Feature | day10 | ocaml-docs-ci | Gap Level |
|---------|-------|---------------|-----------|
| **Build State Tracking** | ✅ layer.json files | ✅ SQLite database | Different |
| **Solution Caching** | ✅ Per-commit hash | ✅ Per-commit hash | Similar |
| **Pipeline History** | ❌ None | ✅ Full history in DB | **MAJOR GAP** |
| **Package Status Tracking** | ⚠️ Basic (JSON) | ✅ Full (DB + API) | **Moderate** |
| **Epoch Management** | ❌ None | ✅ Full (atomic updates) | **MAJOR GAP** |

### External Integrations

| Feature | day10 | ocaml-docs-ci | Gap Level |
|---------|-------|---------------|-----------|
| **opam-repository Tracking** | ✅ Local path | ✅ Git clone + tracking | Minor |
| **Storage Backend** | ✅ Local filesystem | ✅ SSH/rsync server | **Moderate** |
| **Web UI** | ❌ None | ✅ OCurrent web | **MAJOR GAP** |
| **API for Querying** | ❌ None | ✅ Cap'n Proto API | **MAJOR GAP** |
| **GitHub Integration** | ❌ None | ✅ Via opam-repo | Minor |

### Output & Publishing

| Feature | day10 | ocaml-docs-ci | Gap Level |
|---------|-------|---------------|-----------|
| **HTML Generation** | ✅ Full | ✅ Full | None |
| **Search Index** | ✅ Via odoc_driver | ✅ Via voodoo-gen | None |
| **Atomic Deployment** | ❌ None | ✅ Epoch symlinks | **MAJOR GAP** |
| **Valid Package List** | ❌ None | ✅ Published list | **Moderate** |
| **Sync to Remote** | ✅ sync-docs command | ✅ rsync integration | Similar |

### Platform Support

| Feature | day10 | ocaml-docs-ci | Gap Level |
|---------|-------|---------------|-----------|
| **Linux x86_64** | ✅ | ✅ | None |
| **Linux arm64** | ✅ | ✅ | None |
| **Windows** | ✅ containerd | ❌ Linux only | day10 ahead |
| **FreeBSD** | ✅ | ❌ | day10 ahead |
| **Multi-arch builds** | ✅ | ✅ | None |

---

## Detailed Gap Analysis

### 1. CRITICAL GAPS (Must Have)

#### 1.1 Reactive Pipeline / Change Detection

**ocaml-docs-ci has:**
- OCurrent-based reactive pipeline that automatically rebuilds when inputs change
- Git-based tracking of opam-repository commits
- Automatic detection of new/updated packages
- Dependency-aware rebuilding (if A changes, rebuild dependents)

**day10 lacks:**
- No automatic change detection
- Manual triggering required
- No concept of "pipeline" - just single-shot execution

**Implementation Options:**
1. **Add OCurrent integration** - Wrap day10 in OCurrent pipeline
2. **Implement custom watcher** - Poll opam-repo, track changes, trigger builds
3. **External orchestration** - Use GitHub Actions/Jenkins to trigger day10

**Recommended:** Option 1 or 3. Adding full OCurrent would be significant work but provides the richest feature set.

---

#### 1.2 ~~Distributed Execution~~ (Not a Real Gap)

**Reality check:** While ocaml-docs-ci has OCluster infrastructure, **it runs on a single machine in practice**. This means:

- Both systems effectively use single-machine parallelism
- day10's fork-based approach (`--fork N`) is comparable to ocaml-docs-ci's actual operation
- OCluster adds overhead without providing real distribution benefits in current deployment

**Conclusion:** This is **not a gap** for the migration. day10's existing parallelism model is sufficient.

**Future consideration:** If true distribution becomes needed, day10 could add OCluster support, but this is not required for feature parity with the current production system.
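
For the custom-watcher option in 1.1, the "track changes" step could start as a plain diff between two opam-repository commits. The following is a minimal sketch, not day10 functionality: `changed_packages` and `changed_since` are hypothetical helper names, and the only assumption about the repository is the usual `packages/<name>/<name>.<version>/` layout.

```bash
#!/usr/bin/env bash
# Sketch only: list packages touched between two opam-repository commits.
# Assumes the layout packages/<name>/<name>.<version>/...

# Reads changed file paths on stdin, prints unique <name>.<version> entries.
changed_packages() {
  awk -F/ '$1 == "packages" && NF >= 3 { print $3 }' | sort -u
}

# Hypothetical wrapper: diff two commits of a local clone.
# Usage: changed_since /opam-repo <old-commit> <new-commit>
changed_since() {
  git -C "$1" diff --name-only "$2" "$3" -- packages/ | changed_packages
}
```

This only finds packages whose own files changed. Dependency-aware rebuilding (rebuilding the dependents of a changed package) would need the solver's output on top of this list.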

---

#### 1.3 Epoch-Based Deployment

**ocaml-docs-ci has:**
- Epoch system for versioned artifact collections
- Atomic promotion via symlinks (html-current → html-live)
- Garbage collection of old epochs
- Safe rollback capability

**day10 lacks:**
- No epoch concept
- Direct file output
- No atomic update mechanism

**Implementation Required:**
- Add epoch directory management
- Implement symlink-based promotion
- Add epoch cleanup/GC functionality
- Support for `html-current` → `html-live` workflow

---

#### 1.4 Web UI & Monitoring

**ocaml-docs-ci has:**
- OCurrent-based web dashboard
- Real-time pipeline status
- Job logs viewable in browser
- Package-level status tracking

**day10 lacks:**
- No web interface
- CLI-only interaction
- No real-time monitoring

**Implementation Options:**
1. **Use OCurrent web** - If integrating with OCurrent
2. **Build custom web UI** - Separate web service reading day10 state
3. **Static status pages** - Generate HTML status reports

**Recommended:** Option 1 if using OCurrent, otherwise Option 3 for minimal viable monitoring.

---

#### 1.5 Remote API

**ocaml-docs-ci has:**
- Cap'n Proto RPC API for querying pipeline state
- Package status queries
- Pipeline health checks
- CLI client (ocaml-docs-ci-client)

**day10 lacks:**
- No remote API
- No programmatic access to state
- Cannot query status without reading files

**Implementation Options:**
1. **Add Cap'n Proto service** - Match ocaml-docs-ci interface
2. **REST API** - Simpler but different from existing ecosystem
3. **GraphQL** - Modern but overkill for this use case

**Recommended:** Option 1 for compatibility with existing tooling.

---

### 2. MODERATE GAPS (Should Have)

#### 2.1 Database-Backed State

**ocaml-docs-ci:** SQLite database tracking pipeline runs, package statuses, build history

**day10:** File-based state (layer.json, JSON outputs)

**Gap Impact:** Harder to query historical data, no pipeline-level tracking

**Implementation:** Add SQLite or similar for tracking builds over time

---

#### 2.2 Solver Service Architecture

**ocaml-docs-ci:** External solver service via Cap'n Proto, can run multiple solvers in parallel

**day10:** In-process solving, one solve at a time per fork

**Gap Impact:** Potentially slower for large solve operations

**Implementation:** Could extract solver to service, but current approach works

---

#### 2.3 Valid Package List Publishing

**ocaml-docs-ci:** Publishes list of successfully-built packages for ocaml.org filtering

**day10:** No concept of valid package list

**Implementation:** Add post-build step to generate/publish valid package manifest

---

### 3. MINOR GAPS (Nice to Have)

#### 3.1 Storage Server Integration

**ocaml-docs-ci:** SSH/rsync to remote storage server, automatic sync

**day10:** Local filesystem, manual sync-docs command

**Gap Impact:** Requires additional orchestration for remote deployment

---

#### 3.2 Multiple opam-repository Sources

**ocaml-docs-ci:** Tracks specific git repository with commit history

**day10:** Supports multiple local paths, no git tracking

**Gap Impact:** Cannot automatically detect new packages

---

### 4. DAY10 ADVANTAGES

Features day10 has that ocaml-docs-ci lacks:

| Feature | Benefit |
|---------|---------|
| **Windows Support** | Can build Windows packages |
| **FreeBSD Support** | Can build BSD packages |
| **Simpler Deployment** | No cluster infrastructure needed |
| **Layer-based Caching** | More efficient disk usage with overlay2 |
| **Standalone Operation** | Works without external services (OCluster, solver-service) |
| **Direct Container Control** | Lower latency, no scheduler overhead |
| **Equivalent Parallelism** | Fork-based model matches ocaml-docs-ci's actual single-machine operation |
| **Simpler Debugging** | No distributed system complexity to troubleshoot |

---

## Ecosystem Integration

### Voodoo Integration

Both day10 and ocaml-docs-ci use the same documentation toolchain:

```
                ┌─────────────────┐
                │   voodoo-prep   │
                │ (artifact prep) │
                └────────┬────────┘
                         │
         ┌───────────────┴───────────────┐
         ▼                               ▼
┌─────────────────┐             ┌──────────────────┐
│    voodoo-do    │             │odoc_driver_voodoo│
│ (compile/link)  │             │   (all-in-one)   │
└────────┬────────┘             └────────┬─────────┘
         │                               │
         └───────────────┬───────────────┘
                         ▼
                ┌─────────────────┐
                │   voodoo-gen    │
                │  (HTML output)  │
                └─────────────────┘
```

**day10 uses:** odoc_driver_voodoo (modern unified approach)
**ocaml-docs-ci uses:** Both voodoo-do and odoc_driver_voodoo

**Integration Status:** ✅ Compatible - both can produce compatible output

### OCluster Integration (Optional - Not Required for Parity)

**Note:** Since ocaml-docs-ci runs on a single machine in practice, OCluster integration is **not required** for feature parity. day10's existing fork-based parallelism provides equivalent functionality.

```
Current ocaml-docs-ci reality:
┌─────────────────────────────────────────────────────────────┐
│                     OCluster Scheduler                      │
│                      (Single Machine)                       │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
                    ┌───────────┐
                    │  Worker   │ ← All workers on same machine
                    │  (linux-  │
                    │  x86_64)  │
                    └───────────┘
```

**If future scaling is needed**, day10 could add OCluster:
1. Add `current_ocluster` dependency
2. Generate OBuilder specs from day10 build commands
3. Submit jobs via OCluster API
4. Collect results from worker output

But this is a **future enhancement**, not a migration requirement.

### Solver Service Integration

The solver-service repository provides a standalone solving service:

```
┌──────────────┐     Cap'n Proto     ┌────────────────┐
│    day10     │ ─────────────────── │ solver-service │
│   (client)   │       solve()       │    (server)    │
└──────────────┘                     └────────────────┘
```

**Current day10:** In-process opam-0install
**Migration option:** Use solver-service for consistency with ecosystem

---

## Implementation Roadmap

### Phase 1: Core Infrastructure (Weeks 1-4)

**Goal:** Establish foundation for docs.ocaml.org integration

| Task | Priority | Effort | Dependencies |
|------|----------|--------|--------------|
| 1.1 Add epoch management | P0 | Medium | None |
| 1.2 Implement valid package list | P0 | Low | None |
| 1.3 Add remote storage sync (SSH/rsync) | P0 | Medium | None |
| 1.4 SQLite state tracking | P1 | Medium | None |

**Deliverable:** day10 can produce epoch-structured output compatible with docs.ocaml.org

### Phase 2: Change Detection (Weeks 5-8)

**Goal:** Automatic rebuilding on opam-repository changes

| Task | Priority | Effort | Dependencies |
|------|----------|--------|--------------|
| 2.1 Git-based opam-repo tracking | P0 | Medium | None |
| 2.2 Change detection algorithm | P0 | High | 2.1 |
| 2.3 Dependency-aware rebuild | P1 | High | 2.2 |
| 2.4 Incremental solution updates | P1 | Medium | 2.2 |

**Deliverable:** day10 can detect and rebuild changed packages automatically

### Phase 3: ~~Distributed Execution~~ Skipped

**Not required:** Since ocaml-docs-ci runs on a single machine in practice, day10's existing fork-based parallelism (`--fork N`) provides equivalent functionality. OCluster integration can be added later if true distribution becomes necessary.

**Time saved:** 6 weeks

### Phase 3 (was 4): Monitoring & API (Weeks 9-12)

**Goal:** Production observability and integration

| Task | Priority | Effort | Dependencies |
|------|----------|--------|--------------|
| 3.1 Cap'n Proto API service | P1 | High | 1.4 |
| 3.2 Status query endpoints | P1 | Medium | 3.1 |
| 3.3 Web dashboard (or static pages) | P2 | Medium | 3.1 |
| 3.4 Health check endpoints | P2 | Low | 3.1 |

**Note:** API/monitoring is lower priority if day10 runs as a batch job (like ocaml-docs-ci in practice).

**Deliverable:** day10 provides status visibility (at minimum via static pages/JSON)

### Phase 4 (was 5): Migration & Cutover (Weeks 13-16)

**Goal:** Replace ocaml-docs-ci in production

| Task | Priority | Effort | Dependencies |
|------|----------|--------|--------------|
| 4.1 Parallel run comparison | P0 | Medium | All above |
| 4.2 Output compatibility validation | P0 | Medium | 4.1 |
| 4.3 Gradual traffic shift | P0 | Low | 4.2 |
| 4.4 Full cutover | P0 | Low | 4.3 |
| 4.5 ocaml-docs-ci deprecation | P2 | Low | 4.4 |

**Deliverable:** day10 is the production system for docs.ocaml.org

### Revised Timeline Summary

| Phase | Original | Revised | Savings |
|-------|----------|---------|---------|
| Core Infrastructure | Weeks 1-4 | Weeks 1-4 | - |
| Change Detection | Weeks 5-8 | Weeks 5-8 | - |
| Distributed Execution | Weeks 9-14 | Skipped | 6 weeks |
| Monitoring & API | Weeks 15-18 | Weeks 9-12 | - |
| Migration | Weeks 19-22 | Weeks 13-16 | - |
| **Total** | **22 weeks** | **16 weeks** | **6 weeks** |

---

## Risk Assessment

### High Risk

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Output format incompatibility | Low | High | Comprehensive comparison testing |
| Epoch management bugs | Medium | High | Extensive testing, staged rollout |

### Medium Risk

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Performance regression | Medium | Medium | Benchmark early, optimize iteratively |
| Change detection complexity | Medium | Medium | Start with simple polling approach |
| State tracking gaps | Medium | Medium | Design carefully, review with team |

### Low Risk

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Voodoo incompatibility | Low | High | Already using same tools |
| Platform regressions | Low | Low | Existing test coverage |
| Parallelism issues | Low | Low | Both systems use single-machine model |

**Note:** OCluster integration risk removed since it's not required for parity.

---

## Recommendations

### Immediate Actions

1. **Validate voodoo compatibility** - Confirm day10 and ocaml-docs-ci produce identical HTML output for the same package
2. **Design epoch system** - Document epoch structure and promotion workflow
3. **Prototype change detection** - Simple git-based tracking of opam-repository changes

### Architecture Decision

**Recommended Approach:** Incremental enhancement of day10

Since both systems run on single machines in practice, day10's architecture is well-suited for the task. The migration is simpler than the theoretical architecture comparison suggests.

**Key additions needed:**
1. **Epoch management** - For atomic deployments (similar to ocaml-docs-ci)
2. **Change detection** - Git-based tracking of opam-repository
3. **Valid package list** - For ocaml.org integration
4. **Status reporting** - JSON/static HTML for visibility

**Not needed for parity:**
- OCluster integration (single-machine in practice)
- Full OCurrent reactive pipeline (can use simpler cron/polling)
- Cap'n Proto API (if batch job model is acceptable)

### Simplest Migration Path

Rather than adding OCurrent complexity, consider a simpler operational model:

```bash
# Polling loop; the loop body could equally run from a cron job or systemd timer
while true; do
  git -C /opam-repo pull
  head=$(git -C /opam-repo rev-parse HEAD)
  # Empty on the first run, when no commit has been recorded yet
  last=$(cat /state/last-commit 2>/dev/null || true)
  if [ "$head" != "$last" ]; then
    # Promote and record the commit only when the batch run succeeds
    if day10 batch --cache-dir /cache --opam-repository /opam-repo \
        --html-output /data/html-current @changed-packages.json; then
      # Atomic promotion
      ln -sfn /data/html-current /data/html-live
      echo "$head" > /state/last-commit
    fi
  fi
  sleep 3600
done
```

This provides:
- Automatic change detection
- Incremental rebuilding
- Atomic deployments
- No additional infrastructure

### Alternative: OCurrent Wrapper

If reactive behavior and web UI are required, wrap day10 in OCurrent:

```ocaml
(* Hypothetical OCurrent pipeline using day10.
   track_opam_repo, solve, day10_build, day10_docs, publish_epoch and config
   are placeholders, not existing functions. The real Current.list_map also
   takes a first-class module for the element type as its first argument,
   omitted here for brevity. *)
let pipeline =
  let packages = track_opam_repo () in
  let solutions = Current.list_map solve packages in
  let builds = Current.list_map (day10_build ~config) solutions in
  let docs = Current.list_map (day10_docs ~config) builds in
  publish_epoch docs
```

This adds complexity but provides OCurrent's monitoring and caching.
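
The promotion step in the polling loop under "Simplest Migration Path" can be made explicit. Whether `ln -sfn` replaces an existing link in a single step depends on the `ln` implementation, so the sketch below builds the new link under a temporary name and renames it over the live one. `promote` is a hypothetical helper, and `mv -T` assumes GNU coreutils.

```bash
#!/usr/bin/env bash
# Sketch only: repoint a "live" symlink with a single rename.
# Assumes GNU coreutils (mv -T); paths are illustrative.

# Usage: promote <target-dir> <live-link>
promote() {
  local target=$1 live=$2 tmp="$2.tmp.$$"
  [ -d "$target" ] || { echo "promote: $target is not a directory" >&2; return 1; }
  ln -s "$target" "$tmp" || return 1   # build the new link under a temporary name
  mv -T "$tmp" "$live"                 # rename replaces the old link in one step
}
```

A call such as `promote /data/epoch-def456/html /data/html-live` would then switch readers from the old tree to the new one without a moment where `html-live` is missing.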

---

## Appendix A: File Structure Comparison

### day10 Output Structure

```
cache_dir/
├── {os_key}/
│   ├── base/fs/
│   ├── build-{hash}/
│   │   ├── fs/
│   │   └── layer.json
│   └── doc-{hash}/
│       ├── fs/
│       │   └── html/
│       │       ├── p/{pkg}/{ver}/
│       │       └── u/{universe}/{pkg}/{ver}/
│       └── layer.json
└── solutions/
    └── {repo-sha}/
        └── {pkg}.json
```

### ocaml-docs-ci Output Structure

```
/data/
├── prep/
│   └── universes/{u}/{pkg}/{ver}/
├── compile/
│   ├── p/{pkg}/{ver}/
│   └── u/{u}/{pkg}/{ver}/
├── linked/
│   ├── p/{pkg}/{ver}/
│   └── u/{u}/{pkg}/{ver}/
├── html-raw/
│   ├── p/{pkg}/{ver}/
│   └── u/{u}/{pkg}/{ver}/
└── epoch-{hash}/
    └── html/
        └── (symlinks to html-raw)
```

---

## Appendix B: Glossary

| Term | Definition |
|------|------------|
| **Epoch** | A versioned collection of documentation artifacts, enabling atomic updates |
| **Blessed** | The canonical/primary documentation version for a package (lives in `p/`) |
| **Universe** | A specific set of package dependencies, identified by hash |
| **Layer** | An overlay2 filesystem layer containing build artifacts |
| **OCluster** | OCaml's distributed build cluster system |
| **OCurrent** | Reactive CI/CD pipeline framework for OCaml |
| **voodoo** | Documentation preparation and generation toolchain |
| **odoc_driver_voodoo** | Unified driver for odoc compilation/linking/generation |

---

## Appendix C: Related Repositories

| Repository | Purpose | URL |
|------------|---------|-----|
| ocaml-docs-ci | Current docs.ocaml.org CI | github.com/ocurrent/ocaml-docs-ci |
| voodoo | Doc preparation tools | github.com/ocaml-doc/voodoo |
| ocluster | Distributed build cluster | github.com/ocurrent/ocluster |
| solver-service | Dependency solving service | github.com/ocurrent/solver-service |
| odoc | Documentation compiler | github.com/ocaml/odoc |
+267
docs/plans/2026-02-03-fresh-docs-design.md
# Fresh Docs with Graceful Degradation

**Date:** 2026-02-03
**Status:** Proposed
**Author:** Brainstorming session

## Overview

This document describes the design for day10's documentation generation strategy, which differs fundamentally from ocaml-docs-ci. The key principle is "always fresh, always safe" - docs are rebuilt against the current opam-repository state, but existing working docs are never destroyed by a failed rebuild.

## Background

### The Problem with ocaml-docs-ci

ocaml-docs-ci computes a solution once per package and caches it forever. This causes:

- **Link rot**: Package A's docs link to dependency B v2.0, but B is now at v5.0
- **Stale cross-references**: Over time, docs reference increasingly outdated dependency versions
- **Append-only constraint**: New builds can never overwrite old builds

### day10's Approach

day10 always solves against the current opam-repository state:

- **Fresh cross-references**: Docs always link to current dependency versions
- **Graceful degradation**: Only replace docs when the new build succeeds
- **Fast recovery**: Layer caching means re-runs after fixing issues are fast

## Design

### Core Principle

Every run:
1. Solve all packages against current opam-repository
2. Build all packages (layer cache makes unchanged builds fast)
3. Generate docs where dependency docs succeeded
4. Atomically swap successful docs into place
5. Preserve existing docs on failure

### Two-Level Update Strategy

#### Level 1: Package Swaps (frequent)

For normal operation - individual packages rebuild as dependencies change.

Each package's docs live in a self-contained directory:
```
html/p/{package}/{version}/
```

Update sequence for successful rebuild:
1. Write new docs to `html/p/{package}/{version}.new/`
2. Swap directories:
   ```
   mv html/p/{package}/{version} html/p/{package}/{version}.old
   mv html/p/{package}/{version}.new html/p/{package}/{version}
   ```
3. Remove `.old` directory

If the build fails, no swap occurs - the original directory remains untouched.

**Recovery from interrupted swap:** If the process dies between renames, the next run detects `.new` or `.old` directories and cleans up before proceeding.

#### Level 2: Epoch Transitions (rare)

For major structural changes:
- New odoc version with different HTML output format
- URL scheme changes
- Full rebuild from scratch

Epoch mechanism:
```
/data/
├── epoch-abc123/          ← currently live
│   └── html/p/...
├── epoch-def456/          ← being built
│   └── html/p/...
└── html-live -> epoch-abc123/html    ← symlink
```

During epoch transition:
1. Old epoch continues serving traffic
2. New epoch builds completely in parallel
3. Atomically switch the `html-live` symlink when ready
4. Keep old epoch briefly for rollback, then garbage collect

### Pipeline Structure

The pipeline has two independent phases with different dependency rules:

| Phase | Depends On | Blocked By |
|-------|------------|------------|
| **Build** | Dependency *builds* | Dependency build failure |
| **Docs** | Package build + dependency *docs* | Build failure OR dependency docs failure |

#### Failure Propagation Example

```
ocaml-base-compiler build: ✓
ocaml-base-compiler docs:  ✗ (odoc bug)
    │
    ├─► astring build: ✓ (proceeds - only needs build artifacts)
    │   astring docs:  ⊘ (skipped - dependency docs missing)
    │       │
    │       └─► yaml build: ✓ (proceeds)
    │           yaml docs:  ⊘ (skipped - transitive docs failure)
    │
    └─► fmt build: ✓
        fmt docs:  ⊘ (skipped)
```

#### Benefits

1. **Fast recovery** - When odoc is fixed, all builds are cache hits; only docs regenerate
2. **Complete build reporting** - Get build status and logs for all packages
3. **Isolated blast radius** - Docs-only problems don't block builds
4. **Better diagnostics** - Clear distinction between "build failed" vs "docs skipped"

#### Status Values

Each package reports one of:
- `build: success, docs: success` - Fully working
- `build: success, docs: failed` - Build ok, docs generation failed
- `build: success, docs: skipped` - Build ok, docs skipped (dependency docs missing)
- `build: failed, docs: skipped` - Build failed, docs not attempted

### Error Handling

#### Principle: Fail Fast, Fail Clearly

Any error within a layer causes the entire layer to fail. No partial successes.
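
The no-partial-success rule combines naturally with the Level 1 package swap: output is staged in `.new`, discarded on any failure, and renamed into place only on success. The following is a minimal sketch under stated assumptions. `recover` and `publish` are hypothetical helpers rather than day10 commands, the generator is any command that takes the staging directory as its last argument, and restoring `.old` after an interrupted swap is one possible cleanup policy (finishing the swap from `.new` would be another).

```bash
#!/usr/bin/env bash
# Sketch only: all-or-nothing publication of one package's docs.
# The generate command stands in for whatever writes the HTML.

# Remove leftovers from an interrupted swap before starting.
recover() {
  local dir=$1
  # A missing live directory with an .old copy means the process died
  # between the two renames: put the old docs back.
  if [ ! -e "$dir" ] && [ -d "$dir.old" ]; then mv "$dir.old" "$dir"; fi
  rm -rf "$dir.new" "$dir.old"
}

# Usage: publish <live-dir> <generate-command> [args...]
publish() {
  local dir=$1; shift
  recover "$dir"
  mkdir -p "$dir.new"
  if ! "$@" "$dir.new"; then      # any failure discards the staged output
    rm -rf "$dir.new"
    return 1
  fi
  if [ -e "$dir" ]; then mv "$dir" "$dir.old"; fi
  mv "$dir.new" "$dir"
  rm -rf "$dir.old"
}
```

With this shape a failed generator leaves the live directory exactly as it was, which is the "preserve existing docs on failure" behaviour described in the core principle.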
#### Retry Within Run

Before marking a layer as failed, retry with exponential backoff:

```
Attempt 1: immediate
Attempt 2: wait 5s
Attempt 3: wait 15s
→ Give up, mark failed
```

This handles transient failures without waiting for the next run.

#### What Counts as Failure

- Non-zero exit code from build/odoc
- Timeout exceeded
- OOM killed
- Any exception during layer creation

### Operational Model

#### Triggering

**Primary: Webhook on opam-repository push**

A lightweight HTTP endpoint receives the GitHub webhook:
```
POST /webhook/opam-repository
  → Validate signature
  → Trigger day10 run (async)
  → Queue if run already in progress
```

**Fallback: Daily cron**
```
0 4 * * * flock -n /var/run/day10.lock day10 batch ...
```

#### Run Sequence

1. Pull latest opam-repository
2. Solve all target packages against current state
3. Build all packages (layer cache = fast for unchanged)
4. Generate docs where dependency docs succeeded
5. Atomically swap in successful docs; preserve old docs on failure

### Notifications

On run completion with failures, post to Zulip:

```
📦 day10 run completed

✓ 3,542 packages built
✓ 3,201 docs generated
✗ 12 build failures
✗ 8 doc failures (23 skipped due to dependencies)

Failed builds:
- some-package.1.2.3: exit code 2
- another-pkg.0.5.0: timeout after 600s

Failed docs:
- broken-docs.1.0.0: odoc error

Full logs: /var/log/day10/runs/2026-02-03-1234/
```

### Log Retention

All logs kept permanently:

```
/var/log/day10/
├── runs/
│   └── 2026-02-03-1234/
│       ├── summary.json
│       ├── build/
│       │   ├── some-package.1.2.3.log
│       │   └── another-pkg.0.5.0.log
│       └── docs/
│           └── broken-docs.1.0.0.log
└── latest -> runs/2026-02-03-1234
```

Logs include stdout, stderr, exit code, timing, and retry attempts.

## Implementation Changes Required

### day10 Core

1. **Staging directory support**
   - Write docs to `{package}/{version}.new/` during generation
   - Only swap on success
   - Clean up `.new` and `.old` artifacts on startup

2. **Failure preservation**
   - If build/docs fail, don't touch existing output
   - Report "kept old docs" vs "updated docs" in output

3. **Epoch awareness**
   - New `--epoch` flag to specify epoch directory
   - New `promote-epoch` command for symlink switch

4. **Build/docs phase separation**
   - Track build success independently from docs success
   - Continue builds even when dependency docs fail
   - Skip docs only when dependency docs are missing

### New Components

1. **Webhook handler** - Small HTTP service to receive GitHub webhooks
2. **Zulip notifier** - Integration with existing Zulip library
3. **Log management** - Structured logging to permanent storage

## Comparison to ocaml-docs-ci

| Aspect | ocaml-docs-ci | day10 (this design) |
|--------|---------------|---------------------|
| Solutions | Cached forever | Fresh every run |
| Cross-references | Drift over time | Always current |
| On doc failure | Blocks dependent builds | Builds continue, only docs skip |
| Update mechanism | Append-only | Atomic swap on success |
| Infrastructure | OCurrent + OCluster | day10 + webhook + cron |
| Recovery | Complex rebuild process | Re-run (layer cache hits) |
| Notifications | OCurrent web UI | Zulip |

## Open Questions

None at this time.

## References

- [Gap Analysis: day10 vs ocaml-docs-ci](/workspace/docs/GAP_ANALYSIS.md)