OR-1 dataflow CPU sketch

docs: add cycle-accurate timing design plan

Design for adding env.timeout(1) between pipeline stages in PE, SM,
and network delivery. PE adopts process-per-token model for natural
pipelining. 3 implementation phases covering emu/, tests, and monitor.

Orual 9621b1f3 2fb166a6

+232
docs/design-plans/2026-02-27-cycle-timing.md
# Cycle-Accurate Timing for OR1 Emulator

## Summary

The OR1 emulator uses SimPy for discrete-event simulation, but currently all operations complete at the same simulated time (t=0). This means tokens are matched, instructions fetched, and results emitted in zero simulated time — correct for functional verification but unrealistic for evaluating timing-dependent behaviour. This work adds cycle-accurate timing by inserting `env.timeout(1)` delays between pipeline stages, giving each component a realistic operational latency: 5 cycles for a dyadic token through a Processing Element (PE), 4 cycles for a monadic token, and a minimum of 3 cycles for Structure Memory (SM) operations, with 1 additional cycle of network delivery latency between any two components.

The approach exploits SimPy's process model rather than fighting it. The PE adopts a "process-per-token" design: the dequeue loop spawns a new SimPy coroutine for each token it receives, so multiple tokens can occupy different pipeline stages simultaneously within a single PE without any explicit locking. The SM retains its existing single-process loop, with timeouts inserted between dequeue, process, and respond stages. Network delivery is made explicit by wrapping `store.put()` in a short coroutine that yields a 1-cycle timeout before depositing the token. All three changes are additive — the event callback system, matching store semantics, and monitor step controls remain structurally unchanged; only the simulated timestamps at which events fire are different.

## Definition of Done

1. **PE pipeline timing**: Each PE processes tokens in discrete cycles — 5 cycles for dyadic tokens (dequeue → match → fetch → execute → emit), 4 cycles for monadic (dequeue → fetch → execute → emit). Each stage is separated by `env.timeout(1)`.
2. **SM operation timing**: Each SM processes operations in a minimum of 3 cycles (dequeue → process → respond). EXEC takes 3 + N cycles, where N is the number of tokens being injected. Each stage is separated by `env.timeout(1)`.
3. **Network delivery latency**: Token delivery from one component to another takes 1 cycle (`env.timeout(1)` wrapping `store.put()`).
4. **Parallel execution**: All PEs and SMs advance concurrently — multiple components can be at different pipeline stages simultaneously within the same simulated time.
5. **Pipelined PE execution**: Multiple tokens can be in flight within a single PE simultaneously (one at each pipeline stage), if this falls out naturally from the SimPy process-per-token model.
6. **All existing tests pass** with updated `env.run(until=...)` values where necessary. No semantic test changes — only timing budget increases.
7. **Monitor still works**: `step_tick` advances one cycle at a time. `step_event` still processes one SimPy event. `run_until` still runs to a target time. The web UI and REPL remain functional.

**Out of scope**: FIFO backpressure modelling, memory access latency variation, clock domain crossings.
## Acceptance Criteria

### cycle-timing.AC1: PE processes dyadic tokens in 5 cycles
- **cycle-timing.AC1.1 Success:** Dyadic token dequeue→match→fetch→execute→emit spans exactly 5 sim-time units
- **cycle-timing.AC1.2 Success:** Each stage fires its event callback at the correct sim-time
- **cycle-timing.AC1.3 Edge:** IRAMWriteToken processed in 2 cycles (dequeue + write)

### cycle-timing.AC2: PE processes monadic tokens in 4 cycles
- **cycle-timing.AC2.1 Success:** Monadic token dequeue→fetch→execute→emit spans exactly 4 sim-time units
- **cycle-timing.AC2.2 Success:** DyadToken arriving at a monadic instruction also takes 4 cycles (skip match)

### cycle-timing.AC3: PE pipeline allows multiple tokens in flight
- **cycle-timing.AC3.1 Success:** Two tokens injected 1 cycle apart overlap in the pipeline (token B begins while A is still processing)
- **cycle-timing.AC3.2 Success:** Matching store access is safe — two dyadic tokens at different pipeline stages don't corrupt each other's entries
- **cycle-timing.AC3.3 Edge:** PE dequeues at most 1 token per cycle (serialized intake)

### cycle-timing.AC4: SM processes operations with correct cycle counts
- **cycle-timing.AC4.1 Success:** READ on FULL cell takes 3 cycles (dequeue, process, send result)
- **cycle-timing.AC4.2 Success:** WRITE takes 2 cycles (dequeue, write cell)
- **cycle-timing.AC4.3 Success:** EXEC takes 3 + N cycles (dequeue, process, N token injections)
- **cycle-timing.AC4.4 Success:** Deferred read + later write satisfaction: total time accounts for both operations

### cycle-timing.AC5: Network delivery takes 1 cycle
- **cycle-timing.AC5.1 Success:** Token emitted at time T arrives in destination FIFO at time T+1
- **cycle-timing.AC5.2 Success:** PE→SM and SM→PE paths both have 1-cycle latency
- **cycle-timing.AC5.3 Edge:** System.inject() remains zero-delay (pre-sim setup)

### cycle-timing.AC6: Parallel execution
- **cycle-timing.AC6.1 Success:** Two PEs processing tokens simultaneously advance in lockstep (both at cycle N at the same sim-time)
- **cycle-timing.AC6.2 Success:** PE and SM process concurrently (PE executing while SM handles a different request)

### cycle-timing.AC7: Existing tests pass
- **cycle-timing.AC7.1 Success:** Full test suite passes after `until` value updates
- **cycle-timing.AC7.2 Failure:** No test requires semantic changes (only timing budget increases)

### cycle-timing.AC8: Monitor compatibility
- **cycle-timing.AC8.1 Success:** step_tick advances one cycle and returns events at that time
- **cycle-timing.AC8.2 Success:** step_event processes exactly one SimPy event
- **cycle-timing.AC8.3 Success:** run_until reaches target time correctly
- **cycle-timing.AC8.4 Success:** Web UI and REPL remain functional

## Glossary

- **SimPy**: A Python discrete-event simulation library. Simulation time advances only when a process explicitly yields a timeout or waits on a resource; no real time passes between events.
- **`env.timeout(N)`**: A SimPy primitive that suspends the current coroutine for N units of simulated time, analogous to a clock cycle boundary.
- **`simpy.Store`**: A SimPy resource that acts as an unbounded (or capacity-limited) FIFO queue. Processes yield `store.get()` to wait for an item or `store.put(item)` to deposit one.
- **`env.process()`**: Registers a Python generator as a concurrent SimPy process. Multiple processes advance interleaved at each simulated time step.
- **PE (Processing Element)**: The compute unit in the OR1 dataflow CPU. Each PE holds an IRAM of instructions, a matching store for pairing dyadic token operands, and an input FIFO.
- **SM (Structure Memory)**: The memory subsystem implementing I-structure semantics.
Each SM instance manages a bank of cells with presence tracking and handles read, write, atomic, and EXEC operations.
- **Dyadic token**: A token carrying one operand of a two-operand instruction. Two dyadic tokens (left and right) must be matched before the instruction can execute.
- **Monadic token**: A token that carries the sole operand of a one-operand instruction. No matching is needed — it proceeds directly to fetch and execute.
- **IRAMWriteToken**: A special token that deposits new instructions into a PE's IRAM rather than triggering computation.
- **IRAM**: Instruction RAM — the per-PE array of `ALUInst` and `SMInst` records indexed by offset. Fetched during the pipeline's fetch stage.
- **Matching store**: The 2D array (`[ctx_slots][offsets]`) inside each PE where the first dyadic token of a pair waits until its partner arrives.
- **I-structure semantics**: A single-assignment memory discipline: a cell transitions through EMPTY → RESERVED → FULL. A read on a non-FULL cell defers until a write satisfies it.
- **Deferred read**: When an SM READ targets an EMPTY or RESERVED cell, the SM records the return address and satisfies the read later when a WRITE arrives at that cell.
- **EXEC (MemOp)**: An SM operation that injects tokens stored in T0 shared memory back into the network. Takes 3 + N cycles, where N is the number of tokens injected.
- **T0 / T1 memory tiers**: T1 (addresses below `tier_boundary`) uses per-SM I-structure cells with presence tracking. T0 (at or above `tier_boundary`) is shared raw storage across all SM instances.
- **Process-per-token model**: The PE architecture change introduced in this work. Instead of one long-running SimPy process that handles tokens sequentially, the dequeue loop spawns a new SimPy process for each token, enabling natural pipeline overlap.
- **Pipeline stage**: One discrete step in token processing (dequeue, match, fetch, execute, emit). Each stage is separated by one `env.timeout(1)`, consuming one simulated cycle.
- **Zero-time execution**: The previous emulator behaviour where all pipeline stages completed without consuming any simulated time, so every event occurred at t=0.
- **`step_tick` / `step_event`**: Monitor commands. `step_tick` advances simulation by one cycle (one simulated time unit). `step_event` processes exactly one pending SimPy event.

## Architecture

Add cycle-accurate timing to the SimPy-based emulator by inserting `env.timeout(1)` between pipeline stages in PE, SM, and network delivery paths. The PE adopts a process-per-token model where the dequeue loop spawns a new SimPy process for each token, enabling natural pipelining — multiple tokens in flight at different pipeline stages simultaneously. The SM retains its single-process model with timeouts inserted between stages. Network delivery wraps `store.put()` in a 1-cycle delay process.

### PE Pipeline (process-per-token)

The PE's `_run()` method becomes a dequeue-and-dispatch loop:

1. `yield self.input_store.get()` — wait for a token in the FIFO
2. `yield self.env.timeout(1)` — dequeue takes 1 cycle (serializes intake to 1 token/cycle)
3. `self.env.process(self._process_token(token))` — spawn pipeline process

The new `_process_token(token)` method walks through stages with timeouts:

- **Dyadic path** (5 cycles total): dequeue(1) → match(1) → fetch(1) → execute(1) → emit(1)
- **Monadic path** (4 cycles total): dequeue(1) → fetch(1) → execute(1) → emit(1)
- **IRAMWriteToken**: dequeue(1) → write IRAM(1) — 2 cycles total

Because the dequeue serializes at 1 token per cycle, and each spawned process advances one stage per cycle, pipelining emerges naturally. Two tokens at the same pipeline stage never conflict because they're offset by at least one cycle.
Matching store access is safe: only one token reaches the match stage per cycle (guaranteed by the 1-cycle dequeue serialization). No additional locking or synchronization is needed.

### SM Pipeline (sequential with timeouts)

The SM processes one token at a time. Timeouts are inserted into the existing `_run()` loop:

1. `yield self.input_store.get()` — wait for token
2. `yield self.env.timeout(1)` — dequeue cycle
3. Process operation (1+ cycles depending on op)
4. `yield self.env.timeout(1)` — response cycle (if sending result)

Cycle counts per operation:

- **READ (FULL cell)**: 3 cycles — dequeue, process, send result
- **READ (EMPTY → deferred)**: 2 cycles — dequeue, set WAITING (response deferred)
- **WRITE (normal)**: 2 cycles — dequeue, write cell
- **WRITE (satisfies deferred)**: 3 cycles — dequeue, write + satisfy, send result
- **CLEAR/ALLOC**: 2 cycles — dequeue, modify cell
- **Atomic (RD_INC, RD_DEC, CMP_SW)**: 3 cycles — dequeue, read-modify-write, send result
- **EXEC**: 3 + N cycles — dequeue, process, then N cycles to inject N tokens

The SM stays single-process because the deferred-read register is a single shared resource — concurrent processing would require a fundamentally different design.

### Network Delivery (1-cycle latency)

All token emission paths (`_emit`, `_emit_sm`, SM `_send_result`) wrap the `store.put()` call in a delivery process:

```python
def _deliver(self, store, token):
    yield self.env.timeout(1)  # 1-cycle network latency
    yield store.put(token)
```

This applies to PE→PE, PE→SM, and SM→PE token flows. The emitting process spawns the delivery as a separate SimPy process via `env.process()` so it doesn't block the emitter's pipeline.
`System.inject()` remains unchanged — direct list append for pre-simulation seed injection, no timing.

`System.send()` gains the 1-cycle delivery delay, matching the network model.

### Timing Example

Two-PE dataflow: PE0 executes ADD and sends the result to PE1, which executes MUL.

```
t=0:  PE0 dequeues token A (left operand, dyadic)
t=1:  PE0 matches A (stores in matching store, waits for partner)
...later...
t=5:  PE0 dequeues token B (right operand for same instruction)
t=6:  PE0 matches B (completes pair with A)
t=7:  PE0 fetches ADD instruction
t=8:  PE0 executes ADD
t=9:  PE0 emits result → delivery process spawned
t=10: result arrives in PE1's FIFO (1-cycle network)
t=11: PE1 dequeues result token
...
```

With pipelining, PE0 can dequeue the next token at t=1 (while the first is in the match stage), achieving 1-token-per-cycle throughput at steady state.

## Existing Patterns

The emulator already uses SimPy's process model extensively. PE and SM both have `_run()` generator methods wrapped in `env.process()`. Token delivery uses `simpy.Store` with `get()`/`put()`. The change adds `env.timeout()` calls within existing generator methods and introduces `env.process()` for per-token pipelines — both are standard SimPy patterns already used in the codebase.

The event callback system (`on_event`) is already in place and continues to work unchanged — events just fire at different (now non-zero) sim-times than before.

Tests use `env.run(until=X)` to advance the simulation. The pattern continues; only the `until` values increase to accommodate cycle-accurate timing.

## Implementation Phases

<!-- START_PHASE_1 -->
### Phase 1: Core Timing Changes

**Goal:** Add cycle-accurate timing to PE, SM, and network delivery.
**Components:**
- `emu/pe.py` — refactor `_run()` into a dequeue loop + `_process_token()` with per-stage timeouts; spawn a process per token
- `emu/sm.py` — add `yield env.timeout(1)` between dequeue, processing, and response stages in `_run()` and all handlers
- `emu/network.py` — add a 1-cycle delivery wrapper for `System.send()`; PE and SM emit paths spawn delivery processes
- `tests/test_pe.py`, `tests/test_sm.py`, `tests/test_network.py` — update `env.run(until=...)` values
- `tests/test_pe_events.py`, `tests/test_sm_events.py`, `tests/test_network_events.py` — update timing values
- New `tests/test_cycle_timing.py` — verify PE cycle counts (5 dyadic, 4 monadic), SM cycle counts, network latency, pipelining behaviour

**Dependencies:** None (first phase)

**Done when:** All emulator tests pass with cycle-accurate timing. New timing tests verify correct cycle counts for PE, SM, and network.
<!-- END_PHASE_1 -->

<!-- START_PHASE_2 -->
### Phase 2: Integration and End-to-End Test Migration

**Goal:** Ensure all integration, E2E, and remaining tests pass with the new timing model.

**Components:**
- `tests/test_integration.py` — update timing budgets
- `tests/test_e2e.py` — update timing budgets
- `tests/test_sm_tiers.py` — update timing budgets for T0/T1 tests
- `tests/test_exec_bootstrap.py` — update timing budgets for EXEC tests
- Any other test files discovered during migration

**Dependencies:** Phase 1 (core timing)

**Done when:** Full test suite passes. `python -m pytest tests/ -v` green.
<!-- END_PHASE_2 -->

<!-- START_PHASE_3 -->
### Phase 3: Monitor Adaptation

**Goal:** Verify and update the monitor to work correctly with cycle-accurate timing.
**Components:**
- `monitor/backend.py` — verify `_handle_step_tick` and `_handle_step_event` semantics are still correct
- `tests/test_backend.py` — update timing values
- `tests/test_snapshot.py` — update timing values
- `tests/test_repl.py` — update timing values
- `tests/test_monitor_server.py` — update timing values
- `tests/test_monitor_graph_json.py` — update timing values if needed
- Verify the web UI and REPL function correctly via manual testing

**Dependencies:** Phase 2 (all tests passing)

**Done when:** All monitor tests pass. Manual verification that the web UI and REPL work with stepped simulation.
<!-- END_PHASE_3 -->

## Additional Considerations

**Future pipelining refinement:** The process-per-token model is a foundation. Future work could add pipeline stalls (e.g., when the matching store detects a hazard), variable-latency stages, or pipeline interlocks. These are not needed now, but the architecture supports them.

**Timing scale change:** Tests that previously ran with `env.run(until=10)` may need `until=100` or higher. The simulation runs longer in wall-clock terms, but SimPy is lightweight enough that this is negligible.

**Monitor step_tick semantics:** With zero-time execution, `step_tick` processed all events at t=0 in one go. With cycle timing, `step_tick` advances to the next time step and processes all events there (typically 1-3 events per cycle per component). This is the correct behaviour — each tick now corresponds to one clock cycle.