AethelOS Production Readiness Plan#

This document outlines the implementation plan for addressing TODOs and "In a real OS" comments found in the codebase. These improvements are necessary to transition from a demonstration OS to a production-ready system.

Status Overview#

Item	Status	Date Completed	Notes
1. Preemptive Multitasking	🟡 PLANNED	-	See PREEMPTIVE_MULTITASKING_PLAN.md
2. Interrupt-safe Statistics	✅ COMPLETE	2025-10-25	Uses `without_interrupts()` wrapper
3. Production Memory Allocator	✅ COMPLETE	2025-10-25	Buddy allocator with 64B-64KB blocks
4. Memory Deallocation	✅ COMPLETE	2025-10-25	Integrated with buddy allocator

Overview#

Found 4 critical areas requiring implementation:

Preemptive Multitasking (PLANNED - detailed plan available)
✅ Interrupt-safe Statistics (COMPLETE)
✅ Production Memory Allocator (COMPLETE)
✅ Memory Deallocation (COMPLETE)

1. Preemptive Multitasking#

Location: heartwood/src/attunement/idt.rs:47

Current State:

System uses cooperative multitasking (threads explicitly call yield_now())
Timer interrupt handler does NOT trigger context switches
Preemptive context switching was disabled due to issues with interrupting critical sections (e.g., Drop implementations)

Required Changes:

Phase 1: Critical Section Protection#

Implement interrupt-safe spinlocks with interrupt disable/enable
Add cli/sti wrapper types that disable/enable interrupts in critical sections
Audit all critical sections (especially in scheduler, allocator, and I/O drivers)
Replace existing spin::Mutex with interrupt-safe mutexes where needed

Phase 2: Context Switch Safety#

Design interrupt-safe context switching mechanism
Ensure stack switching is atomic and safe during interrupts
Implement proper save/restore of all CPU registers during preemptive switches
Add interrupt nesting counter to prevent context switches during nested interrupts

Phase 3: Scheduler Integration#

Modify timer interrupt handler to call scheduler's context switch
Implement time quantum/slice management (e.g., 10ms per thread)
Add preemption flags to thread control blocks
Implement priority-based preemption (higher priority threads can preempt lower ones)

Phase 4: Testing & Validation#

Test with compute-intensive threads that don't yield
Verify critical sections are not interrupted mid-operation
Stress test with many threads competing for CPU
Validate that Drop implementations complete without interruption

Dependencies:

Must complete interrupt-safe allocator first (Item #3)
Requires interrupt-safe locks throughout the codebase

Estimated Complexity: High Priority: Medium (nice to have, but cooperative multitasking works for now)

2. Interrupt-Safe Statistics#

Location: heartwood/src/loom_of_fate/system_threads.rs:263

Current State:

Stats display is commented out in welcome message
Calling stats() locks the scheduler
If a timer interrupt fires while the lock is held, deadlock occurs

Required Changes:

Approach A: Interrupt-Safe Stats Function (Recommended)#

Create a lock-free or interrupt-safe stats snapshot mechanism
Use atomic operations to read thread counts and states
Disable interrupts briefly while copying stats to a local buffer
Format and display stats from the local buffer (no locks held)

Implementation:

pub struct StatsSnapshot {
    thread_count: usize,
    running_threads: usize,
    sleeping_threads: usize,
    // ... other stats
}

pub fn get_stats_snapshot() -> StatsSnapshot {
    // Disable interrupts temporarily
    let _guard = InterruptDisableGuard::new();

    // Quickly copy stats without complex locking
    let loom = unsafe { &*LOOM.as_ptr() };
    StatsSnapshot {
        thread_count: loom.threads.len(),
        // ... copy other stats
    }
    // Interrupts re-enabled when guard drops
}

Approach B: Cache Stats Before Threads Start#

Calculate and store stats before enabling interrupts
Display cached stats in welcome message
Simpler but less dynamic (stats won't update)

Recommended: Approach A for more dynamic and useful stats

Dependencies: None (can be implemented independently) Estimated Complexity: Low Priority: High (improves user experience and system visibility)

3. Production Memory Allocator#

Location: heartwood/src/mana_pool/allocator.rs:43

Current State:

Using simple bump allocator
Never reclaims memory
Not thread-safe
Will exhaust heap quickly under real workloads

Required Changes:

Phase 1: Choose Allocator Strategy#

Option A: Buddy Allocator

Splits memory into power-of-2 sized blocks
Fast allocation and deallocation
Some internal fragmentation
Good for kernel use

Option B: Slab Allocator

Pre-allocates objects of common sizes
Extremely fast for fixed-size allocations
Reduces fragmentation
Ideal for kernel objects (thread blocks, file handles, etc.)

Option C: Hybrid (Recommended)

Use slab allocator for common kernel structures
Use buddy allocator for general-purpose allocations
Best of both worlds

Phase 2: Implement Buddy Allocator#

Data Structures:
- Free list for each order (e.g., 4KB, 8KB, 16KB, ... 1MB)
- Bitmap or linked list to track free/allocated blocks
- Metadata about each block's order
Core Operations:
- alloc(): Find smallest suitable block, split if needed
- dealloc(): Coalesce with buddy blocks when freed
- split_block(): Split larger blocks into smaller ones
- coalesce(): Merge adjacent free blocks
Thread Safety:
- Wrap allocator in interrupt-safe spinlock
- Disable interrupts during allocation/deallocation
- Keep critical sections as short as possible

Phase 3: Implement Slab Allocator (Optional)#

Create object caches for common sizes:
- Thread control blocks
- File descriptors
- Network buffers
- Page tables
Each slab contains:
- Array of objects
- Free list of available objects
- Reference to next slab
Fast path: Pop from free list (O(1))
Slow path: Allocate new slab from buddy allocator

Phase 4: Integration#

Replace BumpAllocator with new allocator
Update GlobalAlloc implementation
Add allocation statistics and debugging
Test with existing code (should be drop-in replacement)

Dependencies: None (can be implemented independently) Estimated Complexity: Medium-High Priority: High (critical for production use)

4. Memory Deallocation#

Location: heartwood/src/mana_pool/allocator.rs:61

Current State:

dealloc() is a no-op
Memory is never reclaimed
Will leak memory for any temporary allocations

Required Changes:

This is directly addressed by implementing the production allocator (#3 above). The buddy/slab allocators both support proper deallocation.

Implementation Notes:

Buddy allocator: Mark block as free, attempt to coalesce with buddy
Slab allocator: Return object to slab's free list
Must handle double-free detection (debug builds)
Consider memory poisoning in debug mode to catch use-after-free

Dependencies: Requires Item #3 (Production Allocator) Estimated Complexity: Medium (part of allocator implementation) Priority: High (same as #3)

Implementation Roadmap#

Phase 1: Foundation (High Priority)#

Interrupt-Safe Stats (Item #2)
- Low complexity, immediate user benefit
- No dependencies
- Estimated time: 2-4 hours
Production Memory Allocator (Items #3 & #4)
- Start with buddy allocator
- Add thread safety with interrupt-safe locks
- Estimated time: 2-3 days

Phase 2: Advanced Features (Medium Priority)#

Slab Allocator (Optional enhancement to #3)
- Build on top of buddy allocator
- Optimize for common kernel objects
- Estimated time: 1-2 days
Preemptive Multitasking (Item #1)
- Requires allocator and locks to be interrupt-safe first
- Extensive testing needed
- Estimated time: 3-5 days

Phase 3: Optimization & Polish#

Add memory allocation statistics
Implement memory pressure handling
Add OOM (out-of-memory) handler
Performance tuning and profiling

Testing Strategy#

For Each Item:#

Unit Tests:
- Test allocator operations (alloc, free, coalesce)
- Test stats snapshot under various conditions
- Test context switching edge cases
Integration Tests:
- Run existing demos with new allocator
- Verify threads still work correctly
- Test under memory pressure
Stress Tests:
- Allocate/free memory rapidly
- Many threads competing for resources
- Long-running system stability tests
Regression Tests:
- Ensure existing functionality still works
- No new deadlocks or race conditions introduced

Success Criteria#

Item #1: Preemptive Multitasking#

Compute-intensive thread can be preempted by timer
No deadlocks in critical sections
Drop implementations complete without interruption
System remains stable under heavy load

Item #2: Interrupt-Safe Stats#

Stats display in welcome message without deadlock
Stats update correctly even with interrupts enabled
No performance degradation

Item #3 & #4: Production Allocator#

Can allocate and free memory correctly
Memory is reclaimed and reused
Thread-safe and interrupt-safe
No memory corruption or leaks
Performance acceptable (< 1us for typical allocations)
Works as drop-in replacement for bump allocator

References#

Allocator Resources#

"The Buddy System" - Knuth, TAOCP Vol 1
Linux kernel slab allocator (SLUB/SLOB)
OSDev Wiki: Page Frame Allocation

Preemption Resources#

"Operating Systems: Three Easy Pieces" - Chapter on Concurrency
OSDev Wiki: Interrupt Service Routines
x86_64 interrupt handling best practices

Synchronization Resources#

"The Art of Multiprocessor Programming" - Herlihy & Shavit
Interrupt-safe locking patterns
Spinlock implementation best practices

Notes#

All implementations should follow AethelOS naming conventions (e.g., "Mana Pool" for memory)
Maintain the poetic/philosophical tone in documentation
Consider creating new modules:
- mana_pool::buddy for buddy allocator
- mana_pool::slab for slab allocator
- attunement::preemption for preemptive scheduling
Add extensive comments explaining design decisions