AethelOS Production Readiness Plan#
This document outlines the implementation plan for addressing TODOs and "In a real OS" comments found in the codebase. These improvements are necessary to transition from a demonstration OS to a production-ready system.
Status Overview#
| Item | Status | Date Completed | Notes |
|---|---|---|---|
| 1. Preemptive Multitasking | 🟡 PLANNED | - | See PREEMPTIVE_MULTITASKING_PLAN.md |
| 2. Interrupt-safe Statistics | ✅ COMPLETE | 2025-10-25 | Uses without_interrupts() wrapper |
| 3. Production Memory Allocator | ✅ COMPLETE | 2025-10-25 | Buddy allocator with 64B-64KB blocks |
| 4. Memory Deallocation | ✅ COMPLETE | 2025-10-25 | Integrated with buddy allocator |
Overview#
The review found four critical areas requiring implementation:
- 🟡 Preemptive Multitasking (PLANNED - detailed plan available)
- ✅ Interrupt-safe Statistics (COMPLETE)
- ✅ Production Memory Allocator (COMPLETE)
- ✅ Memory Deallocation (COMPLETE)
1. Preemptive Multitasking#
Location: heartwood/src/attunement/idt.rs:47
Current State:
- System uses cooperative multitasking (threads explicitly call yield_now())
- Timer interrupt handler does NOT trigger context switches
- Preemptive context switching was disabled due to issues with interrupting critical sections (e.g., Drop implementations)
Required Changes:
Phase 1: Critical Section Protection#
- Implement interrupt-safe spinlocks with interrupt disable/enable
- Add cli/sti wrapper types that disable/enable interrupts in critical sections
- Audit all critical sections (especially in scheduler, allocator, and I/O drivers)
- Replace existing spin::Mutex with interrupt-safe mutexes where needed
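The wrapper-type idea above can be sketched as an RAII guard. This is a minimal user-space sketch, not the AethelOS implementation: the interrupt flag is simulated with an atomic (a kernel would use cli/sti or a helper such as without_interrupts()), and the names InterruptGuard, INTERRUPTS_ENABLED, and interrupts_enabled are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in for the CPU interrupt flag (assumption: simulated so the
/// sketch runs in user space; a kernel would read RFLAGS and use cli/sti).
static INTERRUPTS_ENABLED: AtomicBool = AtomicBool::new(true);

/// Hypothetical RAII guard: disables interrupts on creation and restores
/// the previous state on drop, so nested guards compose correctly.
pub struct InterruptGuard {
    was_enabled: bool,
}

impl InterruptGuard {
    pub fn new() -> Self {
        // swap(false) disables interrupts and returns the old state.
        let was_enabled = INTERRUPTS_ENABLED.swap(false, Ordering::SeqCst);
        InterruptGuard { was_enabled }
    }
}

impl Drop for InterruptGuard {
    fn drop(&mut self) {
        // Re-enable only if interrupts were on when this guard was created.
        if self.was_enabled {
            INTERRUPTS_ENABLED.store(true, Ordering::SeqCst);
        }
    }
}

/// Reports the simulated interrupt flag.
pub fn interrupts_enabled() -> bool {
    INTERRUPTS_ENABLED.load(Ordering::SeqCst)
}
```

Restoring the saved state, rather than unconditionally re-enabling, is what keeps an inner critical section from switching interrupts back on while an outer one is still active.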
Phase 2: Context Switch Safety#
- Design interrupt-safe context switching mechanism
- Ensure stack switching is atomic and safe during interrupts
- Implement proper save/restore of all CPU registers during preemptive switches
- Add interrupt nesting counter to prevent context switches during nested interrupts
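One possible shape for the interrupt nesting counter is sketched below. It is an illustration under assumptions: the counter is a plain atomic in user space, and InterruptScope, INTERRUPT_DEPTH, and may_switch_context are hypothetical names, not existing AethelOS APIs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of interrupt handlers currently active (simulated).
static INTERRUPT_DEPTH: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical marker created at handler entry and dropped at handler exit.
pub struct InterruptScope {
    _private: (),
}

impl InterruptScope {
    /// Call at the top of an interrupt handler.
    pub fn enter() -> Self {
        INTERRUPT_DEPTH.fetch_add(1, Ordering::SeqCst);
        InterruptScope { _private: () }
    }

    /// A context switch is only allowed while exactly one handler is
    /// active, i.e. no nested interrupt is in progress.
    pub fn may_switch_context(&self) -> bool {
        INTERRUPT_DEPTH.load(Ordering::SeqCst) == 1
    }
}

impl Drop for InterruptScope {
    fn drop(&mut self) {
        // Leaving the handler lowers the nesting depth again.
        INTERRUPT_DEPTH.fetch_sub(1, Ordering::SeqCst);
    }
}
```

On a multi-core system the counter would need to be per-CPU; a single global counter is only adequate for the single-core case assumed here.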
Phase 3: Scheduler Integration#
- Modify timer interrupt handler to call scheduler's context switch
- Implement time quantum/slice management (e.g., 10ms per thread)
- Add preemption flags to thread control blocks
- Implement priority-based preemption (higher priority threads can preempt lower ones)
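The quantum and priority rules above might look roughly like this. It is a sketch under stated assumptions: one timer tick is taken to be 1 ms, a higher priority number means a more important thread, and ThreadSlot, QUANTUM_TICKS, and should_preempt are hypothetical names.

```rust
/// Assumed quantum: 10 ticks of 1 ms each.
pub const QUANTUM_TICKS: u32 = 10;

/// Hypothetical subset of a thread control block.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ThreadSlot {
    pub id: u32,
    pub priority: u8, // assumption: higher value = more important
    pub ticks_left: u32,
    pub needs_resched: bool, // the "preemption flag"
}

impl ThreadSlot {
    pub fn new(id: u32, priority: u8) -> Self {
        ThreadSlot { id, priority, ticks_left: QUANTUM_TICKS, needs_resched: false }
    }

    /// Called from the timer interrupt for the currently running thread.
    pub fn on_tick(&mut self) {
        self.ticks_left = self.ticks_left.saturating_sub(1);
        if self.ticks_left == 0 {
            self.needs_resched = true;
        }
    }

    /// Called when the thread is scheduled again.
    pub fn refill(&mut self) {
        self.ticks_left = QUANTUM_TICKS;
        self.needs_resched = false;
    }
}

/// Policy assumed for this sketch: a strictly higher-priority thread always
/// preempts; an equal-priority thread preempts only after the quantum expires.
pub fn should_preempt(running: &ThreadSlot, ready: &ThreadSlot) -> bool {
    ready.priority > running.priority
        || (running.needs_resched && ready.priority == running.priority)
}
```

Keeping the decision in a pure function like should_preempt makes the policy easy to unit test separately from the interrupt path.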
Phase 4: Testing & Validation#
- Test with compute-intensive threads that don't yield
- Verify critical sections are not interrupted mid-operation
- Stress test with many threads competing for CPU
- Validate that Drop implementations complete without interruption
Dependencies:
- Must complete interrupt-safe allocator first (Item #3)
- Requires interrupt-safe locks throughout the codebase
Estimated Complexity: High
Priority: Medium (nice to have, but cooperative multitasking works for now)
2. Interrupt-Safe Statistics#
Location: heartwood/src/loom_of_fate/system_threads.rs:263
Current State:
- Stats display is commented out in welcome message
- Calling stats() locks the scheduler
- If a timer interrupt fires while the lock is held and the handler tries to take the same lock, the handler spins forever and the system deadlocks
Required Changes:
Approach A: Interrupt-Safe Stats Function (Recommended)#
- Create a lock-free or interrupt-safe stats snapshot mechanism
- Use atomic operations to read thread counts and states
- Disable interrupts briefly while copying stats to a local buffer
- Format and display stats from the local buffer (no locks held)
Implementation:
    pub struct StatsSnapshot {
        thread_count: usize,
        running_threads: usize,
        sleeping_threads: usize,
        // ... other stats
    }

    pub fn get_stats_snapshot() -> StatsSnapshot {
        // Disable interrupts temporarily
        let _guard = InterruptDisableGuard::new();
        // Quickly copy stats without complex locking
        let loom = unsafe { &*LOOM.as_ptr() };
        StatsSnapshot {
            thread_count: loom.threads.len(),
            // ... copy other stats
        }
        // Interrupts re-enabled when guard drops
    }
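The "atomic operations" bullet of Approach A could also be realized without touching the scheduler lock at all. The following is a minimal sketch, assuming the scheduler updates a set of counters on every state change; LoomCounters, CounterSnapshot, and the method names are hypothetical and not part of the existing code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical lock-free counters maintained by the scheduler.
pub struct LoomCounters {
    total: AtomicUsize,
    running: AtomicUsize,
    sleeping: AtomicUsize,
}

/// Plain copy of the counters, safe to format without any lock held.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CounterSnapshot {
    pub thread_count: usize,
    pub running_threads: usize,
    pub sleeping_threads: usize,
}

impl LoomCounters {
    pub const fn new() -> Self {
        LoomCounters {
            total: AtomicUsize::new(0),
            running: AtomicUsize::new(0),
            sleeping: AtomicUsize::new(0),
        }
    }

    pub fn thread_spawned(&self) {
        self.total.fetch_add(1, Ordering::Relaxed);
        self.running.fetch_add(1, Ordering::Relaxed);
    }

    pub fn thread_slept(&self) {
        self.running.fetch_sub(1, Ordering::Relaxed);
        self.sleeping.fetch_add(1, Ordering::Relaxed);
    }

    /// Each load is atomic, but the three values are read one after another,
    /// so a snapshot taken during a state change may be slightly out of step.
    pub fn snapshot(&self) -> CounterSnapshot {
        CounterSnapshot {
            thread_count: self.total.load(Ordering::Relaxed),
            running_threads: self.running.load(Ordering::Relaxed),
            sleeping_threads: self.sleeping.load(Ordering::Relaxed),
        }
    }
}
```

The trade-off compared with the guard-based snapshot is consistency: counters never block or deadlock, but they give an approximate view rather than an exact one.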
Approach B: Cache Stats Before Threads Start#
- Calculate and store stats before enabling interrupts
- Display cached stats in welcome message
- Simpler but less dynamic (stats won't update)
Recommended: Approach A for more dynamic and useful stats
Dependencies: None (can be implemented independently)
Estimated Complexity: Low
Priority: High (improves user experience and system visibility)
3. Production Memory Allocator#
Location: heartwood/src/mana_pool/allocator.rs:43
Current State:
- Using simple bump allocator
- Never reclaims memory
- Not thread-safe
- Will exhaust heap quickly under real workloads
Required Changes:
Phase 1: Choose Allocator Strategy#
Option A: Buddy Allocator
- Splits memory into power-of-2 sized blocks
- Fast allocation and deallocation
- Some internal fragmentation
- Good for kernel use
Option B: Slab Allocator
- Pre-allocates objects of common sizes
- Extremely fast for fixed-size allocations
- Reduces fragmentation
- Ideal for kernel objects (thread blocks, file handles, etc.)
Option C: Hybrid (Recommended)
- Use slab allocator for common kernel structures
- Use buddy allocator for general-purpose allocations
- Best of both worlds
Phase 2: Implement Buddy Allocator#
- Data Structures:
  - Free list for each order (e.g., 4KB, 8KB, 16KB, ... 1MB)
  - Bitmap or linked list to track free/allocated blocks
  - Metadata about each block's order
- Core Operations:
  - alloc(): Find smallest suitable block, split if needed
  - dealloc(): Coalesce with buddy blocks when freed
  - split_block(): Split larger blocks into smaller ones
  - coalesce(): Merge adjacent free blocks
- Thread Safety:
  - Wrap allocator in interrupt-safe spinlock
  - Disable interrupts during allocation/deallocation
  - Keep critical sections as short as possible
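The core operations above can be sketched over block offsets instead of raw memory. This is an illustrative model, not the shipped allocator: it uses Vec-backed free lists (a kernel would use intrusive lists inside the free blocks), folds split_block() and coalesce() into alloc() and dealloc(), and omits locking. The name BuddyAllocator and its methods are assumptions for the example.

```rust
/// Illustrative buddy allocator over offsets measured in minimum-size units.
pub struct BuddyAllocator {
    /// free_lists[k] holds offsets of free blocks that are 2^k units long.
    free_lists: Vec<Vec<usize>>,
    max_order: usize,
}

impl BuddyAllocator {
    /// Starts with a single free block covering 2^max_order units.
    pub fn new(max_order: usize) -> Self {
        let mut free_lists = vec![Vec::new(); max_order + 1];
        free_lists[max_order].push(0);
        BuddyAllocator { free_lists, max_order }
    }

    /// Returns the offset of a block of 2^order units, or None if exhausted.
    pub fn alloc(&mut self, order: usize) -> Option<usize> {
        // Find the smallest order that still has a free block.
        let mut k = (order..=self.max_order).find(|&k| !self.free_lists[k].is_empty())?;
        let offset = self.free_lists[k].pop()?;
        // Split repeatedly, keeping the lower half and freeing the upper half.
        while k > order {
            k -= 1;
            self.free_lists[k].push(offset + (1 << k));
        }
        Some(offset)
    }

    /// Frees a block and merges it with its buddy for as long as possible.
    pub fn dealloc(&mut self, mut offset: usize, mut order: usize) {
        while order < self.max_order {
            // The buddy differs from this block only in bit `order`.
            let buddy = offset ^ (1 << order);
            match self.free_lists[order].iter().position(|&o| o == buddy) {
                Some(i) => {
                    self.free_lists[order].swap_remove(i);
                    offset = offset.min(buddy);
                    order += 1;
                }
                None => break,
            }
        }
        self.free_lists[order].push(offset);
    }

    /// Number of free blocks of the given order (useful for tests and stats).
    pub fn free_blocks(&self, order: usize) -> usize {
        self.free_lists[order].len()
    }
}
```

The XOR trick works because a block of order k always starts at a multiple of 2^k, so flipping bit k of the offset yields the other half of the parent block.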
Phase 3: Implement Slab Allocator (Optional)#
- Create object caches for common sizes:
  - Thread control blocks
  - File descriptors
  - Network buffers
  - Page tables
- Each slab contains:
  - Array of objects
  - Free list of available objects
  - Reference to next slab
- Fast path: Pop from free list (O(1))
- Slow path: Allocate new slab from buddy allocator
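A single slab with its O(1) fast path might be modelled as follows. This sketch covers only one fixed-capacity slab: the slow path (requesting a new slab from the buddy allocator) and the link to the next slab are left out, and the type Slab with its methods is a hypothetical illustration.

```rust
/// Illustrative fixed-capacity slab holding objects of one type.
pub struct Slab<T> {
    slots: Vec<Option<T>>,
    /// Indices of empty slots, used as a LIFO free list.
    free_list: Vec<usize>,
}

impl<T> Slab<T> {
    pub fn with_capacity(capacity: usize) -> Self {
        Slab {
            slots: (0..capacity).map(|_| None).collect(),
            // Reversed so that slot 0 is handed out first.
            free_list: (0..capacity).rev().collect(),
        }
    }

    /// Fast path: pop a slot index from the free list in O(1).
    /// Returns None when the slab is full (where the slow path would begin).
    pub fn alloc(&mut self, value: T) -> Option<usize> {
        let idx = self.free_list.pop()?;
        self.slots[idx] = Some(value);
        Some(idx)
    }

    /// Returns the object to the free list. Yields None for an out-of-range
    /// index or a slot that is already empty, which catches double frees.
    pub fn release(&mut self, idx: usize) -> Option<T> {
        let value = self.slots.get_mut(idx)?.take()?;
        self.free_list.push(idx);
        Some(value)
    }

    /// Number of objects currently allocated from this slab.
    pub fn in_use(&self) -> usize {
        self.slots.len() - self.free_list.len()
    }
}
```

Because the most recently freed slot is reused first, hot objects tend to stay in cache, which is one reason slab allocators suit frequently recycled kernel structures.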
Phase 4: Integration#
- Replace BumpAllocator with new allocator
- Update GlobalAlloc implementation
- Add allocation statistics and debugging
- Test with existing code (should be drop-in replacement)
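Allocation statistics can be layered on top of any GlobalAlloc implementation with a thin wrapper. The sketch below uses the standard library's System allocator as a stand-in so that it runs in user space; in the kernel the inner allocator would presumably be the new buddy allocator. CountingAlloc is a hypothetical name.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper that counts allocations made through `inner`.
pub struct CountingAlloc<A> {
    inner: A,
    live_bytes: AtomicUsize,
    total_allocs: AtomicUsize,
}

impl<A> CountingAlloc<A> {
    pub const fn new(inner: A) -> Self {
        CountingAlloc {
            inner,
            live_bytes: AtomicUsize::new(0),
            total_allocs: AtomicUsize::new(0),
        }
    }

    /// Bytes currently allocated and not yet freed.
    pub fn live_bytes(&self) -> usize {
        self.live_bytes.load(Ordering::Relaxed)
    }

    /// Number of successful allocations so far.
    pub fn total_allocs(&self) -> usize {
        self.total_allocs.load(Ordering::Relaxed)
    }
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for CountingAlloc<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { self.inner.alloc(layout) };
        // Only count allocations that actually succeeded.
        if !ptr.is_null() {
            self.live_bytes.fetch_add(layout.size(), Ordering::Relaxed);
            self.total_allocs.fetch_add(1, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { self.inner.dealloc(ptr, layout) };
        self.live_bytes.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

/// Example instance wrapping the system allocator (stand-in for the kernel heap).
pub static DEMO_ALLOC: CountingAlloc<System> = CountingAlloc::new(System);
```

Because the counters are atomics, reading them never takes the allocator lock, which fits the interrupt-safety goals of Item #2.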
Dependencies: None (can be implemented independently)
Estimated Complexity: Medium-High
Priority: High (critical for production use)
4. Memory Deallocation#
Location: heartwood/src/mana_pool/allocator.rs:61
Current State:
- dealloc() is a no-op
- Memory is never reclaimed
- Will leak memory for any temporary allocations
Required Changes:
This is directly addressed by implementing the production allocator (#3 above). The buddy/slab allocators both support proper deallocation.
Implementation Notes:
- Buddy allocator: Mark block as free, attempt to coalesce with buddy
- Slab allocator: Return object to slab's free list
- Must handle double-free detection (debug builds)
- Consider memory poisoning in debug mode to catch use-after-free
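Double-free detection and poisoning can be illustrated on a simulated block. This is a sketch only: a real allocator would keep the allocated flag in block metadata and gate the poisoning behind a debug configuration, whereas here it is unconditional so the behaviour is visible. DebugBlock, FreeError, and the POISON value are assumptions chosen for the example.

```rust
/// Byte pattern written over freed memory (arbitrary choice for this sketch).
pub const POISON: u8 = 0xDE;

#[derive(Debug, PartialEq, Eq)]
pub enum FreeError {
    DoubleFree,
}

/// Hypothetical stand-in for a heap block plus its debug metadata.
pub struct DebugBlock {
    data: Vec<u8>,
    allocated: bool,
}

impl DebugBlock {
    /// Creates a zeroed block that is marked as allocated.
    pub fn new(size: usize) -> Self {
        DebugBlock { data: vec![0; size], allocated: true }
    }

    /// Marks the block free, rejecting a second free of the same block,
    /// and overwrites the contents so stale reads are easy to spot.
    pub fn free(&mut self) -> Result<(), FreeError> {
        if !self.allocated {
            return Err(FreeError::DoubleFree);
        }
        self.allocated = false;
        self.data.fill(POISON);
        Ok(())
    }

    /// True if every byte carries the poison pattern.
    pub fn is_poisoned(&self) -> bool {
        self.data.iter().all(|&b| b == POISON)
    }
}
```

A use-after-free then shows up as reads returning the poison pattern instead of plausible-looking stale data.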
Dependencies: Requires Item #3 (Production Allocator)
Estimated Complexity: Medium (part of allocator implementation)
Priority: High (same as #3)
Implementation Roadmap#
Phase 1: Foundation (High Priority)#
- Interrupt-Safe Stats (Item #2)
  - Low complexity, immediate user benefit
  - No dependencies
  - Estimated time: 2-4 hours
- Production Memory Allocator (Items #3 & #4)
  - Start with buddy allocator
  - Add thread safety with interrupt-safe locks
  - Estimated time: 2-3 days
Phase 2: Advanced Features (Medium Priority)#
- Slab Allocator (Optional enhancement to #3)
  - Build on top of buddy allocator
  - Optimize for common kernel objects
  - Estimated time: 1-2 days
- Preemptive Multitasking (Item #1)
  - Requires allocator and locks to be interrupt-safe first
  - Extensive testing needed
  - Estimated time: 3-5 days
Phase 3: Optimization & Polish#
- Add memory allocation statistics
- Implement memory pressure handling
- Add OOM (out-of-memory) handler
- Performance tuning and profiling
Testing Strategy#
For Each Item:#
- Unit Tests:
  - Test allocator operations (alloc, free, coalesce)
  - Test stats snapshot under various conditions
  - Test context switching edge cases
- Integration Tests:
  - Run existing demos with new allocator
  - Verify threads still work correctly
  - Test under memory pressure
- Stress Tests:
  - Allocate/free memory rapidly
  - Many threads competing for resources
  - Long-running system stability tests
- Regression Tests:
  - Ensure existing functionality still works
  - No new deadlocks or race conditions introduced
Success Criteria#
Item #1: Preemptive Multitasking#
- Compute-intensive thread can be preempted by timer
- No deadlocks in critical sections
- Drop implementations complete without interruption
- System remains stable under heavy load
Item #2: Interrupt-Safe Stats#
- Stats display in welcome message without deadlock
- Stats update correctly even with interrupts enabled
- No performance degradation
Item #3 & #4: Production Allocator#
- Can allocate and free memory correctly
- Memory is reclaimed and reused
- Thread-safe and interrupt-safe
- No memory corruption or leaks
- Performance acceptable (< 1us for typical allocations)
- Works as drop-in replacement for bump allocator
References#
Allocator Resources#
- "The Buddy System" - Knuth, TAOCP Vol 1
- Linux kernel slab allocator (SLUB/SLOB)
- OSDev Wiki: Page Frame Allocation
Preemption Resources#
- "Operating Systems: Three Easy Pieces" - Chapter on Concurrency
- OSDev Wiki: Interrupt Service Routines
- x86_64 interrupt handling best practices
Synchronization Resources#
- "The Art of Multiprocessor Programming" - Herlihy & Shavit
- Interrupt-safe locking patterns
- Spinlock implementation best practices
Notes#
- All implementations should follow AethelOS naming conventions (e.g., "Mana Pool" for memory)
- Maintain the poetic/philosophical tone in documentation
- Consider creating new modules:
  - mana_pool::buddy for buddy allocator
  - mana_pool::slab for slab allocator
  - attunement::preemption for preemptive scheduling
- Add extensive comments explaining design decisions