OR-1 dataflow CPU sketch

docs updates, pulled in datasheets

Orual 3416127c 9aaae175

+653 -229
+2 -2
CLAUDE.md
··· 68 68 - `asm/errors.py` — Structured error types with source context (ErrorCategory includes MACRO and CALL) 69 69 - `asm/opcodes.py` — Opcode mnemonic mapping and arity classification 70 70 - `asm/lower.py` — CST to IRGraph lowering pass 71 - - `asm/expand.py` — Macro expansion and function call wiring pass 72 - - `asm/builtins.py` — Built-in macro library 71 + - `asm/expand.py` — Macro expansion (opcode params, parameterized qualifiers, `@ret` wiring, variadic repetition) and function call wiring pass 72 + - `asm/builtins.py` — Built-in macro library (`#loop_counted`, `#loop_while`, `#permit_inject`, `#reduce_2`/`_3`/`_4`) 73 73 - `asm/resolve.py` — Name resolution pass 74 74 - `asm/place.py` — Placement validation and auto-placement 75 75 - `asm/allocate.py` — IRAM offset and context slot allocation
+5 -3
asm/CLAUDE.md
··· 15 15 ## Pipeline Passes 16 16 17 17 1. **Lower** (`lower.py`): Lark CST -> IRGraph. Creates IRNodes, IREdges, IRRegions (function/location scopes), IRDataDefs, SystemConfig from @system pragma. Qualifies names with function scope (e.g., `$main.&add`). May contain MacroCall nodes and MacroDef regions. 18 - 2. **Expand** (`expand.py`): Macro expansion and function call wiring. Clones macro bodies, substitutes parameters, evaluates const expressions, qualifies expanded names with scope prefixes. Processes function call sites, allocates context slots per call. After expand, IR contains only concrete IRNode/IREdge entries. No ParamRef placeholders, no MacroDef regions, no IRMacroCall entries remain. 18 + 2. **Expand** (`expand.py`): Macro expansion and function call wiring. Clones macro bodies, substitutes parameters (including opcodes via `${op}`, placement via `|${pe}`, ports via `:${port}`, context slots via `[${ctx}]`), evaluates const expressions, expands variadic repetition blocks, rewrites `@ret`/`@ret_name` macro outputs, qualifies expanded names with scope prefixes. Processes function call sites with `@ret` trampolines, `free_ctx` insertion, and cross-context wiring. After expand, IR contains only concrete IRNode/IREdge entries. No ParamRef placeholders, no MacroDef regions, no IRMacroCall entries remain. 19 19 3. **Resolve** (`resolve.py`): Validates all edge endpoints exist. Detects scope violations (cross-function label refs). Generates Levenshtein "did you mean" suggestions. 20 20 4. **Place** (`place.py`): Validates explicit PE placements. Auto-places unplaced nodes via greedy bin-packing with locality heuristic (prefer PE with most connected neighbours). 21 21 5. **Allocate** (`allocate.py`): Assigns IRAM offsets (dyadic first, then monadic). Assigns context slots (one per function scope per PE). Resolves symbolic destinations to `Addr(a, port, pe)`. 
··· 33 33 - `TypeAwareOpToMnemonicDict` and `TypeAwareMonadicOpsSet` in opcodes.py: required because IntEnum subclasses share numeric values across types (e.g., `ArithOp.ADD == 0 == MemOp.READ`), so plain dict/set lookups would collide 34 34 - Errors use `IRGraph.errors` accumulation: all issues are reported rather than stopping at the first error 35 35 - `#` sigil for macro namespace: avoids collision with other sigils ($, &, @) 36 - - `@ret` reserved prefix for return markers: qualifies return label references in function calls 36 + - `@ret` reserved prefix for return markers: in function bodies, creates trampolines with cross-context routing and `free_ctx`; in macro bodies, rewrites edges to call-site destinations (no context management) 37 37 - Per-call-site context slot allocation: each function call site gets its own context slot, managed by CallSite metadata 38 - - Built-in macros prepended to user source: system macro definitions are automatically available in every program 38 + - Opcode parameters (`${op}`) resolved via `MNEMONIC_TO_OP`: enables generic macros like `#reduce_2 add` 39 + - Parameterized qualifiers (`|${pe}`, `:${port}`, `[${ctx}]`) resolved during expansion via `PlacementRef`, `PortRef`, `CtxSlotRef` 40 + - Built-in macros prepended to user source: `#loop_counted`, `#loop_while`, `#permit_inject` (variadic), `#reduce_2`/`_3`/`_4` (parameterized opcode) 39 41 40 42 ## Invariants 41 43
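The resolve pass described in this hunk generates Levenshtein "did you mean" suggestions for unresolved names. A minimal sketch of that mechanism; function names and the distance threshold here are illustrative, not the actual `asm/resolve.py` API:

```python
# Sketch of a "did you mean" suggester: classic DP edit distance plus
# a nearest-known-name lookup. Illustrative only; the real resolve
# pass's API and threshold may differ.

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(unknown: str, known: list[str], max_distance: int = 2):
    """Return the closest known name within max_distance, else None."""
    best = min(known, key=lambda k: levenshtein(unknown, k), default=None)
    if best is not None and levenshtein(unknown, best) <= max_distance:
        return best
    return None
```

Python's stdlib `difflib.get_close_matches` would serve the same purpose; the explicit version makes the distance metric visible.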
+8 -36
asm/builtins.py
··· 45 45 &gate |> @ret_exit:R 46 46 } 47 47 48 - ; --- Permit injection (per-arity variants) --- 49 - ; Each injects const 1 tokens. Outputs wired via @ret. 50 - ; Call with: #permit_inject_1 |> out=&gate_node 51 - #permit_inject_1 |> { 52 - &p0 <| const, 1 53 - &p0 |> @ret_out 54 - } 55 - 56 - ; Call with: #permit_inject_2 |> out0=&gate_a, out1=&gate_b 57 - #permit_inject_2 |> { 58 - &p0 <| const, 1 59 - &p1 <| const, 1 60 - &p0 |> @ret_out0 61 - &p1 |> @ret_out1 62 - } 63 - 64 - ; Call with: #permit_inject_3 |> out0=&g0, out1=&g1, out2=&g2 65 - #permit_inject_3 |> { 66 - &p0 <| const, 1 67 - &p1 <| const, 1 68 - &p2 <| const, 1 69 - &p0 |> @ret_out0 70 - &p1 |> @ret_out1 71 - &p2 |> @ret_out2 72 - } 73 - 74 - ; Call with: #permit_inject_4 |> out0=&g0, out1=&g1, out2=&g2, out3=&g3 75 - #permit_inject_4 |> { 76 - &p0 <| const, 1 77 - &p1 <| const, 1 78 - &p2 <| const, 1 79 - &p3 <| const, 1 80 - &p0 |> @ret_out0 81 - &p1 |> @ret_out1 82 - &p2 |> @ret_out2 83 - &p3 |> @ret_out3 48 + ; --- Permit injection (variadic) --- 49 + ; Injects one const(1) seed token per target. 50 + ; Call with: #permit_inject &gate_a, &gate_b, &gate_c 51 + #permit_inject *targets |> { 52 + $( 53 + &p <| const, 1 54 + &p |> ${targets} 55 + ),* 84 56 } 85 57 86 58 ; --- Binary reduction trees (parameterized opcode) ---
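The variadic rewrite above collapses four per-arity `#permit_inject_N` macros into one `$( ),*` repetition block. A rough Python model of how such a block expands, one clone per variadic argument with body-local labels suffixed by the iteration index; this assumes a simplified line-based substitution, not the real CST-based `asm/expand.py` logic:

```python
# Toy model of variadic repetition expansion. The real expand pass
# renames every body-local label, not just "&p"; this sketch only
# shows the per-argument cloning and ${param}/${_idx} binding.

def expand_repetition(body_lines: list[str], param: str,
                      args: list[str]) -> list[str]:
    out = []
    for idx, arg in enumerate(args):
        for line in body_lines:
            clone = (line.replace("&p", f"&p_{idx}")        # uniquify label
                         .replace("${" + param + "}", arg)  # bind element
                         .replace("${_idx}", str(idx)))     # iteration index
            out.append(clone)
    return out
```

For `#permit_inject &gate_a, &gate_b` this yields two const(1) seed nodes, each wired to its own target.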
+46 -32
design-notes/alu-and-output-design.md
··· 230 230 not-taken side instead of the data value. See the Output Formatter section 231 231 for the not-taken trigger semantics. 232 232 233 - **Reserved opcode space (2 slots):** future candidates include hardware 234 - multiply, predicate store read/write, and debug/trace instructions. 233 + **Reserved opcode space (2 slots):** candidates include `MAP_PAGE` 234 + (write '610 mapping register) and `SET_PAGE` (select active IRAM 235 + bank via page latch — see `pe-design.md` IRAM Bank Switching section), 236 + hardware multiply, predicate store read/write, and debug/trace 237 + instructions. `MAP_PAGE` + `SET_PAGE` together cost 2 opcode slots 238 + but enable banked IRAM with one '610 + one latch per PE. 235 239 236 240 ### SM Instruction Dispatch 237 241 ··· 417 421 result:8/16 computed or passthrough data value 418 422 bool_out:1 boolean signal for SWITCH/GATE 419 423 420 - From IRAM instruction word (latched since stage 3): 424 + From IRAM half 1 (latched during Stage 3, cycle 2): 425 + has_dest2:1 bit 15: single vs dual destination 421 426 dest1_PE:2 target PE 422 - dest1_offset:4-7 instruction address at target (width varies by format) 423 - dest1_ctx:4 context slot at target 427 + dest1_offset:5 instruction address at target 424 428 dest1_port:1 L/R operand 425 - dest1_gen:2 generation counter for target slot 426 - dest1_type:2 output token format selector 427 - dest1_not_taken_op:1 for SWITCH: 0=NOOP, 1=FREE on not-taken trigger 429 + const_ext:7 (single-dest only) extended constant for CONST16 430 + dest2_PE:2 (dual-dest only) second target PE 431 + dest2_offset:5 (dual-dest only) second target offset 428 432 429 - dest2_PE:2 (same fields for second destination) 430 - dest2_offset:4-7 431 - dest2_ctx:4 432 - dest2_port:1 433 - dest2_gen:2 434 - dest2_type:2 435 - dest2_not_taken_op:1 433 + From IRAM half 0 (latched during Stage 3, cycle 1): 434 + ctx_mode:2 00=INHERIT, 01=CTX_OVRD, 10=CHANGE_TAG 435 + 436 + From pipeline latches (ctx_mode 00, INHERIT): 
437 + ctx:4 inherited from executing token 438 + gen:2 inherited from executing token 436 439 437 - has_dest2:1 whether dest2 is active (enables DUAL mode) 440 + From IRAM const field (ctx_mode 01, CTX_OVRD): 441 + ctx:4 const[7:4] 442 + gen:2 const[3:2] 443 + 444 + From left operand bypass latch (ctx_mode 10, CHANGE_TAG): 445 + flit_1:16 entire flit 1 comes from left operand data 438 446 439 447 From decoder EEPROM: 440 448 output_mode:2 DUAL / SINGLE / SUPPRESS / SWITCH 441 449 output_data_sel:1 result = ALU output vs passthrough input 442 450 ``` 451 + 452 + See `iram-and-function-calls.md` for the definitive IRAM half 0/half 1 453 + bit layouts and ctx_mode semantics. 443 454 444 455 ### Output Modes 445 456 ··· 581 592 582 593 ### Source of ctx and gen in Output Tokens 583 594 584 - All destination fields (PE, offset, ctx, port, gen) come from the IRAM 585 - instruction word. They are fully specified at compile/load time. 595 + The ctx and gen fields in output tokens are controlled by the `ctx_mode` 596 + field in IRAM half 0 (2 bits). See `iram-and-function-calls.md` for 597 + full details. Summary: 586 598 587 - - **Same-activation outputs** (result feeds next instruction in same 588 - function fragment, same PE): ctx in the IRAM dest field matches the 589 - input token's ctx. The compiler set this up when it assigned context 590 - slots. 591 - - **Cross-activation outputs** (function calls, returns, cross-PE): ctx 592 - in the IRAM dest field is the pre-reserved context slot for the target 593 - activation. The compiler reserved these slots at compile time. 594 - - **gen for the destination slot**: known at compile time for statically- 595 - scheduled code. For dynamic scenarios (future), gen would come from 596 - the target PE's slot allocator, communicated via a setup token during 597 - function linkage. 599 + - **INHERIT (ctx_mode 00):** ctx and gen are inherited from the executing 600 + token's pipeline latches. 
The output token arrives at the destination 601 + in the same context as the originating computation. This is the common 602 + case for intra-function edges. 603 + - **CTX_OVRD (ctx_mode 01):** ctx and gen are taken from the IRAM const 604 + field (bits [7:4] → ctx, bits [3:2] → gen). Used for function call 605 + wiring where the output token must arrive in a different context slot 606 + than the one it was computed in. The compiler bakes the target context 607 + into the const field at load time. 608 + - **CHANGE_TAG (ctx_mode 10):** the entire flit 1 of the output token 609 + is taken from the left operand data, bypassing normal token formation. 610 + Used for return routing where the return address was carried as data 611 + through the function body. 598 612 599 - v0 assumption: all ctx and gen values are compile-time constants baked 600 - into IRAM instruction words. Dynamic context allocation is a future 601 - extension. 613 + PE and offset always come from IRAM half 1 (except in CHANGE_TAG mode). 614 + Port comes from IRAM half 1 dest1_port (single-dest) or is implied by 615 + the token format. 602 616 603 617 --- 604 618
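The ctx_mode summary in this hunk can be modelled as a small selector. A behavioural sketch, not the hardware: it assumes the bit positions stated in the text (ctx = const[7:4], gen = const[3:2]).

```python
# Behavioural model of ctx/gen source selection per ctx_mode.
# Encodings follow the text: 00=INHERIT, 01=CTX_OVRD, 10=CHANGE_TAG.

INHERIT, CTX_OVRD, CHANGE_TAG = 0b00, 0b01, 0b10

def output_ctx_gen(ctx_mode: int, token_ctx: int, token_gen: int,
                   const_field: int):
    if ctx_mode == INHERIT:
        # Inherited from the executing token's pipeline latches
        return token_ctx, token_gen
    if ctx_mode == CTX_OVRD:
        # Compiler-baked target context from the IRAM const field
        return (const_field >> 4) & 0xF, (const_field >> 2) & 0x3
    if ctx_mode == CHANGE_TAG:
        # Entire flit 1 comes from left operand data; no ctx/gen
        # is formed by this path
        return None
    raise ValueError(f"reserved ctx_mode {ctx_mode:#04b}")
```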
+11 -8
design-notes/architecture-overview.md
··· 51 51 ### Data Width (Tentative) 52 52 - **16-bit** data words within PEs and SM (see `bus-architecture-and-width-decoupling.md`) 53 53 - **16-bit external bus**, with multi-flit token encoding (2 flits standard) 54 - - Three independent width domains: external bus (16-bit), IRAM (32-48 bits, 55 - decoupled), PE pipeline registers (wider, decomposed into parallel data 56 - and control paths) 54 + - Three independent width domains: external bus (16-bit), IRAM (32-bit 55 + effective via two-half read, upgradeable to parallel 32-bit; see 56 + `iram-and-function-calls.md`), PE pipeline registers (wider, decomposed 57 + into parallel data and control paths) 57 58 - Width conversion at FIFO boundaries via serialisers/deserialisers 58 59 59 60 > **⚠ Tentative:** 16-bit is the working assumption for the emulator and ··· 192 193 - **Generation counter only on dyadic tokens**: prevents ABA problem when 193 194 context slots are reused. Monadic and SM tokens don't need it. 194 195 - **Width domains are independent**: bus width (16-bit), token format 195 - (variable flit count), IRAM width (32-48 bits), and PE pipeline width 196 - (wider, decomposed) are each sized for their own constraints. See 197 - `bus-architecture-and-width-decoupling.md` for the full analysis. 196 + (variable flit count), IRAM width (32-bit effective, two-half read), 197 + and PE pipeline width (wider, decomposed) are each sized for their 198 + own constraints. See `bus-architecture-and-width-decoupling.md` and 199 + `iram-and-function-calls.md` for the full analysis. 
198 200 199 201 ## Module Taxonomy 200 202 201 203 ### CM (Control Module) — execution and matching 202 204 - Instruction memory (IM / IRAM): stores dataflow program (function bodies) 203 - - Width decoupled from bus: 32-48 bits, sized for opcode + destination 204 - encoding (see `bus-architecture-and-width-decoupling.md`) 205 + - Width decoupled from bus: 32-bit effective (two 8-bit halves), 206 + sized for opcode + destination encoding. See 207 + `iram-and-function-calls.md` for bit-level format. 205 208 - **Runtime-writable** via IRAM write tokens (prefix `011+01`) 206 209 - Write from network stalls the pipeline (acceptable for config operations) 207 210 - Enables runtime reprogramming and eliminates need for separate config bus
+16 -4
design-notes/assembler-architecture.md
··· 138 138 139 139 1. Collect all `MacroDef` entries from the graph (including built-in macros prepended to every program) 140 140 2. For each `IRMacroCall`, clone the macro's body template 141 - 3. Substitute `ParamRef` placeholders with actual argument values 142 - 4. Evaluate `ConstExpr` arithmetic expressions (supports `+`, `-`, `*` on integers and `_idx`) 141 + 3. Substitute `ParamRef` placeholders with actual argument values in all contexts: 142 + - **Const fields**: literal value substitution 143 + - **Edge endpoints**: node reference substitution 144 + - **Node names**: token pasting with prefix/suffix 145 + - **Opcode position**: resolve mnemonic string via `MNEMONIC_TO_OP` to concrete `ALUOp`/`MemOp` 146 + - **Placement qualifiers**: resolve `"pe0"` → `0` (via `PlacementRef`) 147 + - **Port qualifiers**: resolve `"L"` → `Port.L` (via `PortRef`) 148 + - **Context slot qualifiers**: resolve integer values (via `CtxSlotRef`) 149 + 4. Evaluate `ConstExpr` arithmetic expressions (supports `+`, `-`, `*`, `//` on integers and `_idx`) 143 150 5. Expand `IRRepetitionBlock` entries once per variadic argument, binding `_idx` to the iteration index 144 151 6. Qualify expanded names with scope prefixes: `#macroname_N.&label` for top-level, `$func.#macro_N.&label` inside functions 152 + 153 + **Macro `@ret` wiring:** 154 + 155 + After body expansion, `@ret` / `@ret_name` markers in macro edge destinations are replaced with concrete node references from the call site's output list (`IRMacroCall.output_dests`). This is pure edge rewriting — no trampolines, no cross-context routing, no `free_ctx`. Macros inline into the caller's context. 145 156 146 157 **Function call wiring:** 147 158 ··· 149 160 2. Generate trampoline `PASS` nodes for return routing 150 161 3. Create `IREdge` entries with `ctx_override=True` for cross-context argument passing (becomes `ctx_mode=01` in codegen) 151 162 4. Generate `FREE_CTX` nodes for context teardown on call completion 152 - 5. 
Wire `@ret` / `@ret_name` synthetic nodes for return paths 163 + 5. Wire `@ret` / `@ret_name` synthetic nodes for return paths (function `@ret` is distinct from macro `@ret` — functions create trampolines with context management) 153 164 154 165 **Post-conditions:** 155 166 ··· 289 300 - **Wider placement heuristics**: graph partitioning, min-cut algorithms, or profile-guided placement for larger programs 290 301 - **Incremental reassembly**: modify part of the graph and re-run only affected passes 291 302 - **Hardware encoding pass**: translate ALUInst/SMInst to bit-level instruction words for actual IRAM loading 292 - - **Conditional macro expansion**: the current macro system supports variadic repetition, constant arithmetic, and nested macro invocation (depth limit 32), but not conditionals within macros 303 + - **Conditional macro expansion**: the current macro system supports variadic repetition, constant arithmetic, opcode parameters, parameterized qualifiers, `@ret` output wiring (including positional `@ret` in variadic repetition blocks), and nested macro invocation (depth limit 32), but not conditionals within macros 304 + - **Generic variadic reduction**: a single `#reduce` macro that infers tree depth from variadic argument count (requires conditional expansion to handle non-power-of-2 inputs)
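Step 4's ConstExpr arithmetic (`+`, `-`, `*`, `//` over integers and `_idx`) can be sketched as a restricted AST walker. An illustrative evaluator, not the actual `asm/expand.py` implementation:

```python
# Restricted const-expression evaluator: only integer literals, bound
# names (e.g. _idx, macro params), unary minus, and +, -, *, // are
# accepted; anything else is rejected.
import ast

ALLOWED = {ast.Add: lambda a, b: a + b, ast.Sub: lambda a, b: a - b,
           ast.Mult: lambda a, b: a * b, ast.FloorDiv: lambda a, b: a // b}

def eval_const(expr: str, bindings: dict[str, int]) -> int:
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name) and node.id in bindings:
            return bindings[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED:
            return ALLOWED[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported const expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))
```

For example, `eval_const("base + _idx * 2", {"base": 10, "_idx": 3})` evaluates the indexed-read pattern shown in the primer.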
+20 -24
design-notes/bus-architecture-and-width-decoupling.md
··· 17 17 |--------|-------|-----------| 18 18 | External bus (inter-module) | 16-bit | routing trace count, physical buildability | 19 19 | Token format (logical) | variable-length flits | encoding needs per token type | 20 - | IRAM (instruction memory) | 32-48+ bits | opcode + destination encoding needs | 20 + | IRAM (instruction memory) | 32-bit effective (two 8-bit halves) | opcode + destination encoding; see `iram-and-function-calls.md` | 21 21 | Matching store entries | 8 or 16-bit data + 1 presence bit | data word size, token width mode | 22 22 | PE pipeline registers | wide, decomposed | parallel data path + control path | 23 23 | SM internal datapath | 16-bit | SRAM word size | ··· 351 351 | Dyadic wide | 00 | 2 | 32 | offset + ctx + port + gen + 16-bit data | 352 352 | Monadic normal | 010 | 2 | 32 | offset + ctx + 16-bit data | 353 353 | Dyadic narrow | 011+00 | 2 | 32 | offset + ctx + 8-bit data + port + gen | 354 - | IRAM write | 011+01 | 2-3 | 32-48 | iram_addr + instruction word | 354 + | IRAM write | 011+01 | 2-3 | 32-48 | iram_addr + instruction word (32-bit effective) | 355 355 | Monadic inline | 011+10 | 1 | 16 | offset + ctx only, no data | 356 356 | SM standard | 1 | 2 | 32 | SM_id + op + addr + 16-bit data or ret routing | 357 357 ··· 492 492 | Flags | immediate mode, inline output, etc. | 2-4 | 493 493 | Immediate | small constant for immediate-mode ops | 0-8 | 494 494 495 - This sums to roughly **32-48 bits** depending on address space size and 496 - how aggressively fields are packed. IRAM per PE is 128 entries (7-bit 497 - address), keeping destination address fields compact. 498 - 499 - Note: the IRAM instruction word includes a **width flag** for each output 500 - destination, determining whether the output token is formed as narrow 501 - (8-bit data) or wide (16-bit data). The source instruction controls the 502 - format of the token it produces. 495 + > **Update:** The field table above was a preliminary estimate. 
The 496 + > committed format is a **32-bit effective width** using a two-half read 497 + > (two 8-bit halves across two cycles). See `iram-and-function-calls.md` 498 + > for the definitive bit-level layout including CM compute, SM operations, 499 + > single vs dual destination, ctx_mode, and CONST16 wide-immediate. 503 500 504 501 ### IRAM Physical Organisation 505 502 506 - - 128 entries (7-bit address) per PE 507 - - Dyadic instructions at offsets 0..N-1 (where N = 16 or 32 depending on 508 - width mode of tokens targeting this PE) 509 - - Monadic instructions at any offset 0..127 510 - - Width: 2-3 parallel 8-bit SRAM chips for 16-48 bit instruction words, 511 - all addressed by the same address lines 512 - - Read in one pipeline stage (instruction fetch, after matching completes) 503 + - 128 entries (7-bit address) per PE, 256 bytes total (two 8-bit halves) 504 + - Address format: `[offset:7][half:1]` = 8 bits 505 + - Dyadic instructions at offsets 0-31 (5-bit offset in dyadic token format) 506 + - Monadic instructions at any offset 0-127 (7-bit offset range) 507 + - Two-half read: half 0 read in cycle N feeds decoder/ALU; half 1 read in 508 + cycle N+1 is latched for Stage 5. ALU executes during half 1 read — no 509 + pipeline bubble. 513 510 - Written only during program loading (IRAM write tokens) with valid-bit 514 511 protection 515 - 516 - Because IRAM read is a single pipeline stage using parallel SRAM chips, 517 - the width costs physical chips but does NOT add pipeline latency. A 48-bit 518 - read from three 8-bit SRAMs takes the same time as a 16-bit read from one. 512 + - **Upgrade path:** If two-half sequential read causes timing issues, widen 513 + to true 32-bit parallel read (4x 8-bit SRAMs). Same encoding, same 514 + 128 entries, just reads both halves in one cycle. Cost: 2 more SRAM 515 + chips per PE. 519 516 520 517 ### IRAM Sizing vs PE Count 521 518 ··· 688 685 689 686 ## Open Questions 690 687 691 - 1. **IRAM width** — 32 or 48 bits? 
Depends on destination address field 692 - sizes (now 7-10 bits given smaller IRAM), opcode count, immediate 693 - field. Next design task. 688 + 1. ~~**IRAM width**~~ — **Resolved.** 32-bit effective width via two-half 689 + read. See `iram-and-function-calls.md` for the definitive format. 694 690 2. **Mode B clock ratio** — exactly 2x, or design for arbitrary integer 695 691 ratios? 2x is simplest (toggle flip-flop). 696 692 3. **8-bit vs 16-bit PE configuration** — per-PE config register, or
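The `[offset:7][half:1]` address format and the two-half fetch in this hunk can be sketched as follows. The half width here (16 bits, giving the 32-bit effective word) is an assumption for illustration; helper names are not from the repo:

```python
# Sketch of IRAM addressing and the two-cycle fetch: half 0 is read
# first (feeds decoder/ALU), half 1 on the next cycle. HALF_BITS is
# an assumed per-half read width.

HALF_BITS = 16

def iram_addr(offset: int, half: int) -> int:
    """Pack the [offset:7][half:1] address format: 128 entries x 2 halves."""
    assert 0 <= offset < 128 and half in (0, 1)
    return (offset << 1) | half

def fetch(iram: list[int], offset: int) -> int:
    """Two-cycle fetch combining both halves into the effective word."""
    half0 = iram[iram_addr(offset, 0)]   # cycle N
    half1 = iram[iram_addr(offset, 1)]   # cycle N+1
    return half0 | (half1 << HALF_BITS)
```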
design-notes/datasheets/74ls170.pdf

This is a binary file and will not be displayed.

design-notes/datasheets/ADV7170_7171.pdf

design-notes/datasheets/DSA2IH00211615.pdf

design-notes/datasheets/MOSES07127-1.pdf

design-notes/datasheets/NATLS21982-1.pdf

design-notes/datasheets/SN54170.PDF

design-notes/datasheets/SN54LS171.PDF

design-notes/datasheets/SN74LS610.PDF

design-notes/datasheets/SN74S225.PDF

design-notes/datasheets/sn54ls181.pdf

design-notes/datasheets/sn74ls670.pdf

+1 -1
design-notes/design-alternatives.md
··· 329 329 - 16-bit is the native word size for SM, ALU, and matching store data 330 330 - All standard tokens carry full 16-bit data in flit 2 (no more 14-bit 331 331 dyadic limitation) 332 - - IRAM width decoupled: 32-48 bits, independently sized 332 + - IRAM width decoupled: 32-bit effective (two-half read), independently sized 333 333 - PE pipeline registers are wider (~64-68 bits) but purely internal 334 334 335 335 ### Alternative: 8-bit Data
+90 -17
design-notes/dfasm-primer.md
··· 447 447 448 448 ## Macros 449 449 450 - Macros define reusable template subgraphs that are expanded inline at their call sites. The macro system supports parameterisation, variadic arguments, repetition blocks, constant arithmetic, and token pasting. 450 + Macros define reusable template subgraphs that are expanded inline at their call sites. The macro system supports parameterisation, variadic arguments, repetition blocks, constant arithmetic, token pasting, opcode parameters, parameterized qualifiers, and `@ret` output wiring. 451 451 452 452 ### Macro Definition 453 453 454 454 ```dfasm 455 455 #macro_name param1, param2, *variadic_param |> { 456 456 ; body — instructions and edges using ${param} substitution 457 - &node <| add ${param1} 457 + &node <| add 458 458 ${param1} |> &node:L 459 459 ${param2} |> &node:R 460 460 } ··· 486 486 } 487 487 ``` 488 488 489 + ### Opcode Parameters 490 + 491 + Parameters can appear in the opcode position of instruction definitions. This allows a single macro to work with any ALU or memory operation: 492 + 493 + ```dfasm 494 + #reduce_2 op |> { 495 + &r <| ${op} 496 + &r |> @ret 497 + } 498 + 499 + ; Usage — the opcode is passed as a bare mnemonic: 500 + #reduce_2 add |> &result 501 + #reduce_2 sub |> &result 502 + ``` 503 + 504 + Opcode arguments are passed as bare identifiers (not strings). The expand pass resolves them via `MNEMONIC_TO_OP` during expansion. An invalid mnemonic produces a MACRO error. 
505 + 506 + ### Parameterized Qualifiers 507 + 508 + Parameters can appear in placement (`|pe0`) and port (`:L`) positions within a macro body: 509 + 510 + ```dfasm 511 + ; Parameterized port 512 + #wire_to target, port |> { 513 + &src <| pass 514 + &src |> ${target}:${port} 515 + } 516 + #wire_to &dest, L 517 + 518 + ; Parameterized placement 519 + #placed_const val, pe |> { 520 + &c <| const, ${val} |${pe} 521 + &c |> @ret 522 + } 523 + #placed_const 42, pe0 |> &target 524 + 525 + ; Parameterized context slot 526 + #placed_op op, pe, ctx |> { 527 + &n <| ${op} |${pe}[${ctx}] 528 + &n |> @ret 529 + } 530 + ``` 531 + 532 + The expand pass resolves placement strings (e.g., `"pe0"` → `0`), port strings (`"L"` → `Port.L`), and context slot values to their concrete types. Invalid values produce MACRO errors. 533 + 489 534 ### Repetition Blocks 490 535 491 536 The `$( ),*` syntax expands its body once per element of a variadic parameter. Within a repetition block, `${_idx}` provides the current iteration index (0-based): ··· 501 546 502 547 ### Constant Arithmetic 503 548 504 - Macro const fields support compile-time arithmetic with `+`, `-`, `*` on integer values and parameters: 549 + Macro const fields support compile-time arithmetic with `+`, `-`, `*`, `//` on integer values and parameters: 505 550 506 551 ```dfasm 507 552 #indexed_read base, *cells |> { ··· 511 556 } 512 557 ``` 513 558 514 - ### Macro Invocation 559 + ### Macro Invocation and Output Wiring (@ret) 515 560 516 - Macros are invoked as standalone statements: 561 + Macros are invoked as standalone statements. Arguments can be positional or named: 517 562 518 563 ```dfasm 519 - #loop_counted 520 564 #fan_out &a:L, &b:R, &c:L 521 565 #indexed_read 10, &dest1, &dest2, &dest3 566 + #make_pair name=foo 522 567 ``` 523 568 524 - Arguments can be positional or named: 569 + **Output wiring with `@ret`:** Macro bodies can define output points using `@ret` / `@ret_name` markers. 
At the call site, `|>` wires these outputs to destinations: 525 570 526 571 ```dfasm 527 - #make_pair name=foo 572 + ; Macro body defines outputs via @ret markers 573 + #loop_counted init, limit |> { 574 + &counter <| add 575 + &compare <| brgt 576 + &counter |> &compare:L 577 + &body_fan <| pass 578 + &compare |> &body_fan:L 579 + &inc <| inc 580 + &body_fan |> &inc:L 581 + &inc |> &counter:R 582 + ${init} |> &counter:L 583 + ${limit} |> &compare:R 584 + &body_fan |> @ret_body ; named output: body 585 + &compare |> @ret_exit:R ; named output: exit 586 + } 587 + 588 + ; Call with named output wiring: 589 + #loop_counted &init, &limit |> body=&process, exit=&done 590 + 591 + ; Or positional @ret for single-output macros: 592 + #reduce_2 add |> &result 528 593 ``` 529 594 595 + Unlike function calls, macro `@ret` wiring is purely edge rewriting — the `@ret_name` destination is replaced with the concrete node reference from the call site. No trampolines, no cross-context routing, no `free_ctx` insertion. Macros inline into the caller's context. 596 + 597 + **Bare `@ret`** maps to the first (or only) positional output. **`@ret_name`** maps to the named output `name=&dest` at the call site. Multiple `@ret` edges to different ports on the same output are valid. 
598 + 530 599 ### Scoping 531 600 532 601 Expanded macro names are automatically qualified to prevent collisions between multiple invocations of the same macro: ··· 536 605 537 606 ### Built-in Macros 538 607 539 - The following macros are automatically available in all programs: 608 + The following macros are automatically available in all programs (defined in `asm/builtins.py`): 540 609 541 - | Macro | Purpose | 542 - |-------|---------| 543 - | `#loop_counted` | Counted loop: counter + compare + increment feedback loop | 544 - | `#loop_while` | Condition-tested loop: gate node for predicate-driven iteration | 545 - | `#permit_inject_N` | Inject N const(1) seed tokens (variants for N=1..4) | 546 - | `#reduce_add_N` | Binary reduction tree for addition (variants for N=2..4) | 610 + | Macro | Parameters | Outputs | Purpose | 611 + |-------|------------|---------|---------| 612 + | `#loop_counted` | `init, limit` | `body`, `exit` | Counted loop: counter + compare + increment feedback. Call with `#loop_counted &init, &limit \|> body=&process, exit=&done` | 613 + | `#loop_while` | `test` | `body`, `exit` | Condition-tested loop: gate node. Call with `#loop_while &test_src \|> body=&process, exit=&done` | 614 + | `#permit_inject` | `*targets` | (none — routes directly to targets) | Inject one const(1) seed per target. `#permit_inject &gate_a, &gate_b` | 615 + | `#reduce_2` | `op` | (positional) | Binary reduction: 1 node. `#reduce_2 add \|> &result` | 616 + | `#reduce_3` | `op` | (positional) | Binary reduction tree: 2 nodes. `#reduce_3 sub \|> &result` | 617 + | `#reduce_4` | `op` | (positional) | Binary reduction tree: 3 nodes. `#reduce_4 add \|> &result` | 547 618 548 - Built-in macros expose well-known internal node names (e.g., `&counter`, `&compare`, `&gate`) that the user wires externally after invocation. 619 + All built-in macros use `@ret` output wiring except `#permit_inject`, which routes directly to its variadic target arguments. 
The `#reduce_*` family accepts any opcode as a parameter. 549 620 550 621 ## Function Calls 551 622 ··· 577 648 578 649 ### Return Convention 579 650 580 - The expand pass creates synthetic `@ret` (or `@ret_name` for named outputs) nodes as return markers. The callee's result edges are wired to these markers, which trampoline the results back to the caller's context. 651 + Inside function bodies, `@ret` and `@ret_name` are reserved markers for return points. The expand pass replaces them with return trampolines — synthetic `pass` nodes that route results back to the caller's context via CTX_OVRD, with auto-inserted `free_ctx` nodes for context teardown. Port-qualified returns (`@ret:L`, `@ret:R`) handle dual-output return nodes. Named returns (`@ret_body`, `@ret_exit`) handle multiple independent return paths, wired at the call site via `name=@dest`. 652 + 653 + Note: `@ret` in function bodies creates trampolines with cross-context routing. `@ret` in macro bodies is simpler — pure edge rewriting with no context management, since macros inline into the caller's context. 581 654 582 655 ### Example 583 656
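The `#reduce_2`/`_3`/`_4` node counts (1, 2, 3) follow from a general property: a binary reduction tree over n inputs needs n-1 dyadic nodes. A hypothetical sketch that pairs inputs level by level; an odd leftover carries up to the next level, which is exactly where a generic `#reduce` would need conditional expansion:

```python
# Toy reduction-tree builder: pairs the current level's values into
# fresh dyadic nodes, carrying an odd leftover upward. Node naming
# (&rN) and the edge tuple format are illustrative.

def reduce_tree(inputs: list[str]):
    """Return (node names, edges) for a binary reduction over inputs."""
    nodes, edges, level, counter = [], [], list(inputs), 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            node = f"&r{counter}"
            counter += 1
            nodes.append(node)
            edges.append((level[i], node + ":L"))
            edges.append((level[i + 1], node + ":R"))
            nxt.append(node)
        if len(level) % 2:        # odd element carries to the next level
            nxt.append(level[-1])
        level = nxt
    return nodes, edges
```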
+90 -4
design-notes/io-and-bootstrap.md
··· 92 92 write (cell becomes FULL). Subsequent writes before a READ overwrite — 93 93 the SM cell is a 1-deep buffer. For deeper buffering, the IO hardware 94 94 can use a range of cells as a circular buffer. 95 - - The SM does not spontaneously generate tokens — it only satisfies 95 + - Standard SMs never spontaneously generate tokens — they only satisfy 96 96 pending deferred reads. The IO device's write *triggers* the deferred 97 97 read satisfaction, which is when the result token is emitted. 98 + 99 + ### Spontaneous Token Emission (SM00 Specialisation) 100 + 101 + The deferred-read model works well when the CM knows in advance that 102 + it wants IO data (it issues a READ, the read defers, the IO device 103 + eventually satisfies it). But some IO patterns are genuinely 104 + unsolicited — an interrupt-like event where external hardware needs to 105 + inject a token into the network without a prior READ request. 106 + 107 + SM00 could be specialised with a **dispatch register** (or small 108 + dispatch table) that maps IO events to pre-formed token templates. 109 + When an IO device signals an event and no deferred read is pending: 110 + 111 + 1. IO device asserts an event line (directly wired or via address 112 + decoder) 113 + 2. SM00 reads the dispatch register for that event source 114 + 3. The dispatch register contains a pre-formed token template 115 + (flit 1 routing + flit 2 data source) — similar to the SM return 116 + routing mechanism (see `sm-and-token-format-discussion.md`) 117 + 4. SM00 emits the token onto the network spontaneously 118 + 119 + The dispatch register is loaded at bootstrap (or via SM WRITE to a 120 + reserved address range). It tells SM00 "when UART RX fires and 121 + nobody is waiting, send this token to this PE at this offset." 122 + 123 + This makes SM00 the only SM that can act as a **token source** rather 124 + than purely a token responder. All other SMs remain reactive. 
125 + 126 + **Hardware cost:** one additional register file (or a few reserved SM 127 + cells reinterpreted as dispatch entries) + event detection logic (edge 128 + detect on IO device status lines) + arbitration with normal SM 129 + operations. Estimated: 3-5 TTL chips beyond the base SM. 130 + 131 + **Alternative: always-pending deferred read.** The compiler ensures a 132 + READ is always pending on the IO cell — as soon as one is satisfied, 133 + the handler re-issues a READ immediately (feedback loop in the 134 + dataflow graph). This avoids SM00 specialisation entirely but has a 135 + significant resource cost: the current SM design supports only **one 136 + deferred read at a time per SM instance**. An always-pending IO read 137 + permanently occupies SM00's single deferred read slot, blocking all 138 + other deferred reads on SM00 (including other IO cells and any 139 + I-structure operations). If SM00 also serves as bootstrap SM and T0 140 + shared storage, this is a real constraint. 141 + 142 + The always-pending pattern works for a single IO source on a 143 + dedicated SM, but scales poorly. Multiple IO sources (UART + SPI + 144 + timer) would each need their own SM instance just to have a deferred 145 + read slot, which is wasteful. 146 + 147 + The spontaneous emission model (dispatch registers) avoids this 148 + entirely — no deferred read slot consumed, SM00's normal memory 149 + operations remain unblocked. This tips the balance toward SM00 150 + specialisation for any system with more than trivial IO. 151 + 152 + **Peripheral controller with batch notification.** A smarter 153 + peripheral controller (external hardware on SM00's address bus) 154 + manages its own buffering in a reserved cell range — similar to 155 + DMA/USART on STM32. The controller writes incoming data to a 156 + circular buffer of SM cells, tracks a write pointer internally, 157 + and writes a status/notification cell only at thresholds 158 + (half-complete, complete). 
The dataflow graph keeps one deferred read 159 + pending on the notification cell, processes a batch when it fires, 160 + and re-issues the read. This amortises the single-deferred-read 161 + cost across many IO events and keeps SM00's deferred read slot 162 + occupied only during the inter-batch interval, not per-byte. Still 163 + consumes the slot, but the duty cycle is much lower. 164 + 165 + **Multi-slot deferred reads.** The single-slot constraint that makes 166 + always-pending problematic could also be addressed by expanding the 167 + SM's deferred read storage to 2-4 entries using a small CAM. See 168 + `sm-design.md` "Multi-Slot Deferred Read CAM" section. Even 2 slots 169 + (one for IO, one for normal I-structure) would resolve the resource 170 + conflict without SM00 specialisation. A 4-entry CAM covers multiple 171 + IO sources simultaneously. This is architecturally the cleanest 172 + option — no special cases, no spontaneous emission, just a slightly 173 + larger deferred read store that benefits all SMs uniformly. 174 + 175 + For v0 with a single UART at low baud rates, the always-pending 176 + pattern with a single deferred read slot is sufficient. The multi-slot 177 + CAM, peripheral controller, and/or spontaneous emission models are 178 + refinements for systems with multiple IO sources or higher throughput 179 + requirements. 98 180 99 181 ### Hardware 100 182 ··· 247 329 5. **Program image format** — flat sequence of (flit1, flit2) token pairs 248 330 in ROM. Needs a terminator or length prefix so EXEC knows when to stop. 249 331 Length is the EXEC count parameter. 250 - 6. **SM00 further specialisation** — documented as an option (see 251 - `sm-design.md`). Not committed for v0. Standard SM opcodes are 252 - sufficient for basic IO via I-structure semantics. 332 + 6. 
**SM00 spontaneous token emission** — SM00 could be specialised 333 + with a dispatch register that maps unsolicited IO events to 334 + pre-formed token templates for spontaneous emission (interrupt 335 + equivalent without prior READ). See "Spontaneous Token Emission" 336 + section above. Not committed for v0 — always-pending deferred read 337 + pattern is sufficient for basic IO. The dispatch register mechanism 338 + is a future refinement for lower-latency interrupt response.
+35 -27
design-notes/loop-patterns-and-flow-control.md
··· 808 808 The assembler ships built-in macros (prepended to every program) that 809 809 implement common patterns from this document: 810 810 811 - | Macro | Pattern | 812 - |-------|---------| 813 - | `#loop_counted` | Counted loop (counter + compare + increment feedback) | 814 - | `#loop_while` | Condition-tested loop (gate node) | 815 - | `#permit_inject_N` | Permit injection (N=1..4 const seed tokens) | 816 - | `#reduce_add_N` | Binary reduction tree for addition (N=2..4) | 811 + | Macro | Parameters | Outputs (@ret) | Pattern | 812 + |-------|------------|----------------|---------| 813 + | `#loop_counted` | `init, limit` | `body`, `exit` | Counted loop (counter + compare + increment feedback) | 814 + | `#loop_while` | `test` | `body`, `exit` | Condition-tested loop (gate node) | 815 + | `#permit_inject` | `*targets` | (routes to targets) | Permit injection (variadic, one const(1) per target) | 816 + | `#reduce_2`..`_4` | `op` | (positional) | Binary reduction tree (parameterized opcode, per-arity) | 817 817 818 + All built-in macros use `@ret` output wiring. The `#reduce_*` family 819 + accepts any opcode as a parameter (e.g., `#reduce_4 add |> &result`). 818 820 See `asm/builtins.py` for definitions. 819 821 820 822 #### Example Macro Definitions ··· 900 902 ### Permit Injection — Two Approaches 901 903 902 904 For small K (roughly K <= 4), inline CONST injection. 
The built-in 903 - macros `#permit_inject_1` through `#permit_inject_4` provide this: 905 + `#permit_inject` macro is variadic — pass the target nodes directly: 904 906 905 907 ```dfasm 906 908 ; Built-in definition (from asm/builtins.py): 907 - #permit_inject_2 |> { 908 - &p0 <| const, 1 909 - &p1 <| const, 1 909 + #permit_inject *targets |> { 910 + $( 911 + &p <| const, 1 912 + &p |> ${targets} 913 + ),* 910 914 } 911 915 912 - ; Usage: invoke the built-in, then wire outputs to the gate 913 - #permit_inject_2 914 - #permit_inject_2.&p0 |> &dispatch_gate:L 915 - #permit_inject_2.&p1 |> &dispatch_gate:L 916 + ; Usage: pass gate nodes as targets 917 + #permit_inject &dispatch_gate:L, &dispatch_gate:L 916 918 ``` 917 919 918 920 For large K, use SM EXEC to batch-emit permits: ··· 933 935 934 936 ### Loop Control Macro 935 937 936 - The built-in `#loop_counted` provides the core loop infrastructure: 938 + The built-in `#loop_counted` provides the core loop infrastructure. 939 + It accepts `init` and `limit` as input parameters and exposes `body` 940 + and `exit` as `@ret` outputs: 937 941 938 942 ```dfasm 939 943 ; Built-in definition (from asm/builtins.py): 940 - #loop_counted |> { 944 + #loop_counted init, limit |> { 941 945 &counter <| add 942 946 &compare <| brgt 943 947 &counter |> &compare:L 948 + &body_fan <| pass 949 + &compare |> &body_fan:L 944 950 &inc <| inc 945 - &compare |> &inc:L 951 + &body_fan |> &inc:L 946 952 &inc |> &counter:R 953 + ${init} |> &counter:L 954 + ${limit} |> &compare:R 955 + &body_fan |> @ret_body 956 + &compare |> @ret_exit:R 947 957 } 948 958 949 - ; Usage: wire init, limit, body, and exit externally 950 - #loop_counted 959 + ; Usage: pass init/limit as args, wire body/exit via |> 951 960 &c_init <| const, 0 952 961 &c_limit <| const, 64 953 - &c_init |> #loop_counted.&counter:L ; initial counter value 954 - &c_limit |> #loop_counted.&compare:R ; loop bound 955 - #loop_counted.&compare:L |> &body_entry ; taken → body 956 - 
#loop_counted.&compare:R |> &done ; not-taken → exit 962 + #loop_counted &c_init, &c_limit |> body=&body_entry, exit=&done 957 963 ``` 958 964 959 965 ### Reduction Tree Macro ··· 986 992 ; &acc drains to #ret on completion 987 993 } 988 994 989 - ; Loop control (macro expands to CONST, INC, LT, SWITCH + feedback) 990 - #loop_counted 64, &dispatch, &done 995 + ; Loop control 996 + &c_init <| const, 0 997 + &c_limit <| const, 64 998 + #loop_counted &c_init, &c_limit |> body=&dispatch, exit=&done 991 999 992 - ; Permit injection (pick one strategy) 993 - #permit_inject_inline 4, &gate 1000 + ; Permit injection (4 permits to throttle body launches) 1001 + #permit_inject &gate:L, &gate:L, &gate:L, &gate:L 994 1002 995 1003 ; Gated dispatch — permits throttle body launches 996 1004 &gate <| gate
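As a rough model of what the variadic `$( ... ),*` repetition block does for `#permit_inject`, the expansion can be sketched in Python: one clone of the body per variadic argument, with `${targets}` replaced by the i-th argument. The function name and the `&p__N` renaming scheme are hypothetical, not the actual `asm/expand.py` behaviour (which qualifies names with scope prefixes):

```python
# Illustrative model of variadic repetition in #permit_inject:
# each repetition clones the body, renames &p uniquely, and
# substitutes ${targets} with the corresponding argument.
def expand_permit_inject(targets):
    lines = []
    for i, target in enumerate(targets):
        p = f"&p__{i}"                    # hypothetical per-clone rename
        lines.append(f"{p} <| const, 1")  # one const(1) seed per target
        lines.append(f"{p} |> {target}")  # ${targets} -> i-th argument
    return lines

expanded = expand_permit_inject(["&gate:L", "&gate:L"])
# Two const(1) seeds, each routed to the gate's L port.
```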
design-notes/papers/Mitsubishi M5L2732K.pdf design-notes/datasheets/Mitsubishi M5L2732K.pdf
+243 -45
design-notes/pe-design.md
··· 101 101 102 102 Stage 3: INSTRUCTION FETCH 103 103 - Use local offset to read from PE's instruction SRAM (IRAM) 104 - - IRAM width is decoupled from bus width: 32-48 bits, sized for 105 - opcode + destination encoding (see IRAM section below) 106 - - Multiple SRAM chips addressed in parallel (e.g., 3x 8-bit-wide 107 - for 48-bit read) — width costs chips but NOT latency 104 + - Two-half read: half 0 (opcode + control + const) feeds decoder/ALU 105 + on read cycle 1; half 1 (destinations) read on cycle 2, latched 106 + for Stage 5. ALU executes during the half 1 read — no bubble. 107 + - 2x 8-bit SRAM chips, 32-bit effective width per instruction slot 108 108 - ~200 transistors of logic 109 109 - NOTE: instruction memory is shared between pipeline reads and 110 110 network config writes — see "Instruction Memory" section below ··· 210 210 Instruction memory is PE-local external SRAM. **IRAM width is completely 211 211 independent of bus width** — it is sized for encoding needs, not bus 212 212 constraints. See `bus-architecture-and-width-decoupling.md` for the full 213 - rationale. 213 + rationale, and `iram-and-function-calls.md` for the detailed bit-level 214 + format. 214 215 215 - > **⚠ Preliminary:** IRAM width and field allocation are estimates, not 216 - > committed. The emulator uses Python dataclasses (`ALUInst`, `SMInst`) 217 - > that don't reflect bit-level encoding. Hardware encoding is deferred 218 - > to physical build. 216 + #### IRAM Width: Two-Half Format 219 217 220 - #### IRAM Width 218 + Each IRAM slot is **32 bits effective**, read as two 16-bit halves 219 + across two cycles. Two 8-bit SRAM chips are addressed in parallel 220 + (16 bits per read cycle), with a half-select bit in the address: 221 221 222 - | Field | Purpose | Bits (est.) | 223 - | ----------- | ------------------------------------------- | ----------- | 224 - | Opcode | ALU/control operation | 5-8 | 225 - | Dest 1 addr | first output instruction address | 10-12 | 226 - | Dest 1 port | L/R input on destination | 1 | 227 - | Dest 2 addr | second output instruction address (fan-out) | 10-12 | 228 - | Dest 2 port | L/R input on destination | 1 | 229 - | Dest 2 PE | remote PE flag + PE_id for cross-PE outputs | 0-3 | 230 - | Arity | monadic/dyadic (or encoded in opcode) | 0-1 | 231 - | Flags | immediate mode, structure op, etc. | 2-4 | 232 - | Immediate | small constant for immediate-mode ops | 0-8 | 222 + ``` 223 + IRAM address = [offset:7][half:1] = 8 bits 224 + half 0: opcode + control + const/params (read cycle 1, feeds decoder+ALU) 225 + half 1: destinations or SM supplementary (read cycle 2, latched for Stage 5) 226 + ``` 233 227 234 - This sums to roughly **32-48 bits** depending on address space size and 235 - how aggressively fields are packed. 48 bits (3 x 16-bit SRAM words) is a 236 - comfortable target that avoids bit-packing contortions. 228 + 128 instruction slots per PE. Total SRAM usage: 512 bytes per PE 229 + (256 bytes per chip). 237 229 238 - IRAM is read in one pipeline stage using parallel SRAM chips: e.g., 3x 239 - 8-bit-wide SRAMs for a 48-bit read, all addressed by the same address 240 - lines. Width costs physical chips but does NOT add pipeline latency. 
230 + **Half 0 — CM compute (bit 15 = 0):** 231 + 232 + | Field | Bits | Purpose | 233 + |------------|-------|---------| 234 + | type | 1 | 0 = CM compute | 235 + | opcode | 5 | ALU/routing/control operation | 236 + | ctx_mode | 2 | 00=INHERIT, 01=CTX_OVRD, 10=CHANGE_TAG | 237 + | const | 8 | ALU immediate or ctx/gen override | 238 + 239 + **Half 0 — SM operation (bit 15 = 1):** 240 + 241 + | Field | Bits | Purpose | 242 + |-----------------|-------|---------| 243 + | type | 1 | 1 = SM operation | 244 + | sm_opcode | 5 | SM bus operation code | 245 + | ctx_mode | 2 | Context source for return routing | 246 + | const/ret | 8 | Address, return routing, or parameter | 247 + 248 + **Half 1 — Single destination (bit 15 = 0):** 249 + 250 + | Field | Bits | Purpose | 251 + |--------------|-------|---------| 252 + | has_dest2 | 1 | 0 = single destination | 253 + | dest1_PE | 2 | Target PE ID | 254 + | dest1_offset | 5 | Target instruction offset | 255 + | dest1_port | 1 | L/R port on destination | 256 + | const_ext | 7 | Extended const (CONST16: 15-bit immediate) | 257 + 258 + **Half 1 — Dual destination (bit 15 = 1):** 259 + 260 + | Field | Bits | Purpose | 261 + |--------------|-------|---------| 262 + | has_dest2 | 1 | 1 = dual destination | 263 + | dest1_PE | 2 | Target PE ID | 264 + | dest1_offset | 5 | Target instruction offset | 265 + | dest1_port | 1 | L/R port on destination | 266 + | dest2_PE | 2 | Second target PE ID | 267 + | dest2_offset | 5 | Second target offset | 268 + 269 + The two-cycle read overlaps with ALU execution: half 0 is read in 270 + cycle N and feeds the decoder/ALU immediately; half 1 is read in 271 + cycle N+1 and latched for Stage 5 (output formatter). No pipeline 272 + bubble. 273 + 274 + See `iram-and-function-calls.md` for SM operation half 1 variants 275 + (return routing, SM_id, CAS flit), ctx_mode semantics, and the 276 + CONST16 wide-immediate mechanism. 
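A quick way to sanity-check the field tables is to pack and unpack the halves in Python. The MSB-first field order used here is an assumption (the tables fix only bit 15), and the helper names are illustrative, not the emulator's:

```python
# Bit-level sketch of the two 16-bit IRAM halves, following the field
# tables above (CM compute half 0, single-destination half 1).
def pack_half0_cm(opcode, ctx_mode, const):
    # [type:1=0][opcode:5][ctx_mode:2][const:8], MSB-first (assumed order)
    assert opcode < 32 and ctx_mode < 4 and const < 256
    return (0 << 15) | (opcode << 10) | (ctx_mode << 8) | const

def unpack_half0_cm(word):
    assert word >> 15 == 0, "type bit must be 0 for CM compute"
    return (word >> 10) & 0x1F, (word >> 8) & 0x3, word & 0xFF

def pack_half1_single(dest_pe, dest_offset, dest_port, const_ext=0):
    # [has_dest2:1=0][dest1_PE:2][dest1_offset:5][dest1_port:1][const_ext:7]
    assert dest_pe < 4 and dest_offset < 32 and dest_port < 2 and const_ext < 128
    return (dest_pe << 13) | (dest_offset << 8) | (dest_port << 7) | const_ext

w0 = pack_half0_cm(opcode=0b00011, ctx_mode=0b01, const=0x2A)
assert unpack_half0_cm(w0) == (3, 1, 0x2A)  # round-trips cleanly
```

Each half sums to exactly 16 bits, which is what allows the type/has_dest2 flag to live at bit 15 in both formats.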
277 + 278 + **Upgrade path:** If the two-half sequential read introduces pipeline 279 + bubbles or timing issues in practice, the IRAM can be widened to a 280 + true 32-bit parallel read (2x 16-bit-wide SRAMs or 4x 8-bit-wide). 281 + The instruction encoding is unchanged — only the read path changes. 282 + Both halves would be available in a single cycle, eliminating the 283 + overlap between half 1 read and ALU execution. Cost: 2 additional 284 + SRAM chips per PE. 241 285 242 286 **Instruction words are never serialised onto the external bus** during 243 287 normal execution. They are only written via IRAM write packets during ··· 247 291 248 292 #### Wider Instruction Addresses (Future) 249 293 250 - If the instruction address field proves limiting, options include: 251 - - Steal bits from other fields (fewer opcodes, smaller immediate) 252 - - Move to 48-bit instruction words with wider address fields 253 - - Use bank/page register to extend effective address space without 254 - widening the per-token address field (similar to 8-bit CPUs using 255 - bank switching) 294 + If the 5-bit dest offset (32 entries) or 7-bit IRAM address (128 295 + entries) proves limiting, options include: 296 + - Steal bits from other fields (fewer opcodes, smaller const) 297 + - Widen IRAM to 48 bits (3x 8-bit halves or true parallel) for wider 298 + address fields 299 + - **Bank/page register** (strong candidate — see below) 256 300 - Make extended-address instructions a 3-flit token type, costing one 257 301 extra flit cycle only when needed 258 302 ··· 260 304 token and the address into IRAM are the same field, sized by the token 261 305 format, not the bus. 262 306 307 + #### IRAM Bank Switching (Strong Candidate) 308 + 309 + A small bank register (4 bits) prepended to the 8-bit IRAM address 310 + gives a 12-bit effective address space: 16 banks × 128 instructions = 311 + 2048 instruction slots per PE. 
With standard 8Kx8 SRAM chips (which 312 + are cheap, period-appropriate, and physically small), all 16 banks 313 + fit in the same SRAM chips already present — the bank register just 314 + provides the high address bits. 315 + 316 + ``` 317 + SRAM address = [bank:4][offset:7][half:1] = 12 bits 318 + bank: from PE-local bank register (4-bit latch) 319 + offset: from token (7 bits) 320 + half: from pipeline stage (0 = half 0, 1 = half 1) 321 + ``` 322 + 323 + **Why this is attractive:** 324 + 325 + 1. **All code preloaded at bootstrap.** 2048 instructions is enough 326 + for most programs' full working set. Load everything during init, 327 + then no IRAM write traffic during execution. The instruction 328 + residency problem largely goes away. 329 + 330 + 2. **Bank switch is trivial hardware.** One 4-bit register per PE, 331 + written by a config token or a dedicated instruction. Switching 332 + banks costs one cycle (write the register), then the next 333 + instruction fetch reads from the new bank. No IRAM rewriting, 334 + no drain/flush protocol. 335 + 336 + 3. **Simplifies identity detection.** Instead of fragment ID comparison 337 + against arbitrary tags, the question is "am I in the right bank?" 338 + — a 4-bit compare against the bank register. If the token's expected 339 + bank (derivable from its destination address or carried explicitly) 340 + doesn't match the current bank register, the PE knows immediately. 341 + 342 + 4. **Minimal SRAM cost increase.** An 8Kx8 SRAM is the same physical 343 + chip family as a 2Kx8 — just more address lines connected. The two 344 + existing SRAM chips (read in parallel) simply gain more address lines 345 + from the mapper. The SRAM itself may be a size bump but not a 346 + chip-count increase. 347 + 348 + 5. **Compatible with dynamic scheduling.** A scheduler can preload 349 + multiple function fragments into different banks and switch between 350 + them with a single register write. 
Combined with proactive loading 351 + (filling unused banks while other code executes), this gives a 352 + working set model without the complexity of demand paging. 353 + 354 + **Hardware: 74LS610 Memory Mapper** 355 + 356 + The 74LS610 (TI memory mapper, originally for TMS9900 family) is an 357 + ideal fit for IRAM bank switching. See 358 + `sm-and-token-format-discussion.md` for full chip details. Key 359 + properties: 360 + 361 + - 16 mapping registers, each 12 bits wide 362 + - 4-bit logical address input selects register → 12-bit physical 363 + address output 364 + - **Latch control** (pin 28): outputs can be frozen while register 365 + contents change — safe bank switching with no glitch window 366 + - ~40-50ns propagation delay (LS family), pipelineable with SRAM 367 + access 368 + - One chip per PE. Writes to mapping registers via data bus during 369 + config/bootstrap. 370 + 371 + The '610 maps a 4-bit logical bank selector to a 12-bit physical 372 + SRAM address prefix. The IRAM address becomes: 373 + 374 + ``` 375 + Logical: [bank_select:4][offset:7][half:1] 376 + | 377 + v (74LS610) 378 + Physical: [phys_bank:12][offset:7][half:1] = up to 20-bit SRAM address 379 + ``` 380 + 381 + In practice the physical address width is bounded by available SRAM 382 + chip capacity. With 8Kx8 SRAMs (13-bit address): the '610's 12-bit 383 + output is wider than needed — only 5 bits of physical bank + 7-bit 384 + offset + 1-bit half = 13 bits. This gives 32 physical banks of 128 385 + instructions each (4096 instructions per PE). 386 + 387 + The '610 is already in the design for SM banking. Using it for IRAM 388 + banking is the same chip, same wiring pattern, different address 389 + domain. One '610 per PE for IRAM, one for SM. 390 + 391 + **Switching mechanism: `MAP_PAGE` and `SET_PAGE` instructions** 392 + 393 + Two instructions manage banking. Neither touches the token format. 
394 + 395 + - **`MAP_PAGE`** (monadic): writes a logical→physical mapping into 396 + one of the '610's 16 mapping registers. The register index and 397 + physical bank address come from instruction fields (const or 398 + operand data). Used during bootstrap or runtime to establish which 399 + physical SRAM regions back which logical pages. 400 + 401 + - **`SET_PAGE`** (monadic): writes a 4-bit logical page selector 402 + into a PE-local latch. The latch feeds the '610's MA0-MA3 inputs. 403 + All subsequent IRAM fetches go through the selected logical page's 404 + mapped physical bank. One cycle to switch. 405 + 406 + ``` 407 + Banking workflow: 408 + 1. Bootstrap: MAP_PAGE instructions establish mappings 409 + (logical page 0 → physical region A, page 1 → region B, etc.) 410 + 2. Runtime: SET_PAGE selects the active logical page 411 + 3. Latch → '610 MA0-MA3 → physical SRAM bank selection 412 + 4. All IRAM reads now address the selected bank 413 + ``` 414 + 415 + The compiler inserts `SET_PAGE` at function entry points or code 416 + phase transitions. The '610's latch control pin (pin 28) freezes 417 + outputs during the switch, preventing glitched SRAM addresses. 418 + 419 + Hardware cost: one 74LS175 (quad D flip-flop) as the page latch + 420 + the '610 itself. Two chips total for the entire banking mechanism. 421 + 422 + **Tradeoffs:** 423 + 424 + - Bank switch affects all in-flight tokens targeting this PE at 425 + offsets in the old bank. The compiler (or scheduler) must drain 426 + tokens for the old bank before switching — same throttle-and-drain 427 + protocol as code overwrite, but switching is instantaneous once 428 + drained (write latch, done). 429 + - `SET_PAGE` is sequentially scoped — it affects all subsequent 430 + fetches, not just one activation. The compiler must ensure that 431 + concurrent activations on the same PE agree on the active page, or 432 + use `SET_PAGE` as a barrier between phases. 
433 + - Total capacity per PE is bounded by SRAM chip size, not the '610 434 + (which can address far more than any reasonable IRAM). 435 + - Pages are a pure address-mapping primitive. The compiler decides 436 + what they mean — per-function, per-phase, or any other grouping. 437 + The hardware doesn't enforce or assume any relationship between 438 + pages and function bodies. 439 + 440 + **Recommendation:** Bank switching via '610 is the most 441 + hardware-efficient path to larger code capacity. One chip per PE, 442 + no IRAM rewriting for programs within banked capacity, and the '610 443 + is already proven in this design (SM banking). Consider making this 444 + a v0.5 feature rather than deferring to "future." 445 + 263 446 #### Shared SRAM Arbitration 264 447 265 448 Shared SRAM means arbitration between two users: ··· 504 687 chip count budget per PE. 505 688 3. **Free slot tracking**: bump allocator + bitmap + priority encoder? Or 506 689 free-slot FIFO? 507 - 4. **Instruction encoding**: operation set, format, IRAM width (32 or 48 508 - bits). Decoupled from bus width — driven by opcode + destination 509 - fields. See field table in IRAM section above. 690 + 4. ~~**Instruction encoding**~~: **Resolved.** 32-bit effective width, 691 + two 8-bit halves read across two cycles. See IRAM section above and 692 + `iram-and-function-calls.md` for detailed bit-level format. 510 693 5. **Function splitting heuristics**: how does the compiler decide where 511 694 to split? Minimise cross-PE traffic? Balance slot usage across PEs? 512 695 Hardware constraints (slot count, entry count) drive it. ··· 536 719 537 720 Unlike Manchester, Amamiya, or Monsoon — which either replicated the 538 721 entire program into every PE's instruction memory or used very large 539 - per-PE instruction stores — this design has **small IRAM** (hundreds of 540 - entries, not thousands) with runtime-writable instruction memory. 
Any 541 - program larger than a single PE's IRAM needs code loading at runtime, 542 - even under fully static PE assignment. 722 + per-PE instruction stores — this design has **small IRAM per bank** 723 + (128 entries) with runtime-writable instruction memory. Without bank 724 + switching, any program larger than a single PE's IRAM needs code loading 725 + at runtime, even under fully static PE assignment. 726 + 727 + **With bank switching** (see IRAM Bank Switching section above), each PE 728 + holds up to 2048 instructions across 16 banks using the same SRAM chips. 729 + This substantially reduces the pressure on runtime code loading — most 730 + programs' full working set fits in the preloaded banks, and switching 731 + between function fragments costs a single register write instead of 732 + IRAM rewrite traffic. The code storage hierarchy and loader mechanisms 733 + below remain relevant for programs that exceed the banked capacity, but 734 + bank switching makes that the exception rather than the rule. 543 735 544 - This is a first-class architectural concern, not a deferred future 545 - capability. The reference architectures largely avoid it by throwing 546 - memory at the problem: Amamiya's 8KW/PE replicated instruction memory, 736 + The reference architectures largely avoid the residency problem by 737 + throwing memory at it: Amamiya's 8KW/PE replicated instruction memory, 547 738 Manchester's large instruction store, Monsoon's 64K-instruction frames. 548 - Our small IRAM is a deliberate tradeoff for hardware simplicity, but it 549 - means instruction residency management must be part of the design. 739 + Bank switching gives us a comparable effective capacity (2K instructions) 740 + with much less hardware than full replication. 550 741 551 742 ### Code Storage Hierarchy 552 743 ··· 593 784 systems, the I/O controller could serve this role during early phases. 
594 785 595 786 ### The Identity Problem: Miss Detection 787 + 788 + > **Note:** With bank switching, this problem is substantially reduced. 789 + > If all code is preloaded at bootstrap across banks, there is no 790 + > "wrong code loaded" scenario — only "wrong bank selected." Bank 791 + > identity is a trivial 4-bit compare. The discussion below applies 792 + > primarily to systems that exceed banked capacity and must swap code 793 + > at runtime. 596 794 597 795 If code loading happens at runtime, the question arises: how does a PE 598 796 know the code in its IRAM is the *right* code for an arriving token?
+65 -4
design-notes/sm-design.md
··· 272 272 273 273 in practice at v0 scale (4 CMs, low contention), this should be rare. 274 274 the compiler can also help by ensuring that reads and writes to the same 275 - cell are ordered appropriately in the program graph. if depth-1 proves 276 - too restrictive, expanding to a 4-entry deferred read CAM (match on 277 - cell_addr, store return_routing) is straightforward — same logic, just 278 - replicated with priority encoding. but start with 1. 275 + cell are ordered appropriately in the program graph. 276 + 277 + ### Multi-Slot Deferred Read CAM (Candidate Enhancement) 278 + 279 + If depth-1 proves too restrictive, expanding to a multi-entry deferred 280 + read store using a small CAM (content-addressable memory) is a natural 281 + fit. A CAM does associative lookup in one cycle — present the WRITE 282 + address, the CAM match line fires if any entry holds that address. No 283 + sequential scan, no priority logic changes to the SM pipeline. 284 + 285 + ``` 286 + Deferred Read CAM (e.g., 4 entries): 287 + Entry 0: [valid:1][cell_addr:10][return_routing:16] 288 + Entry 1: [valid:1][cell_addr:10][return_routing:16] 289 + Entry 2: [valid:1][cell_addr:10][return_routing:16] 290 + Entry 3: [valid:1][cell_addr:10][return_routing:16] 291 + 292 + On READ hitting EMPTY cell: 293 + - Find first invalid entry (priority encoder), write cell_addr + 294 + return_routing, set valid 295 + - If all entries valid: stall (same as depth-1 overflow) 296 + 297 + On WRITE: 298 + - Present write_addr to CAM match lines 299 + - If any entry matches: satisfy that deferred read, clear entry 300 + - Normal WRITE proceeds regardless 301 + ``` 302 + 303 + **Why CAM is ideal here:** the deferred read lookup is inherently 304 + associative — "does any pending read match this write address?" A 305 + register file would require sequential comparison against each entry. 306 + A CAM answers in one cycle regardless of entry count. 
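The allocate-on-read / match-on-write cycle above can be sketched behaviourally, assuming a 4-entry CAM. Names are illustrative, and this sketch satisfies every matching entry on a write (the hardware description clears a matching entry; duplicates only arise if two reads defer on the same cell):

```python
# Behavioural sketch of the 4-entry deferred read CAM. Python lists
# stand in for the parallel match lines and priority encoder.
class DeferredReadCAM:
    def __init__(self, entries=4):
        # None = invalid; else (cell_addr, return_routing)
        self.entries = [None] * entries

    def defer_read(self, cell_addr, return_routing):
        """READ hit an EMPTY cell: allocate the first free entry."""
        for i, e in enumerate(self.entries):
            if e is None:  # priority-encoder allocation
                self.entries[i] = (cell_addr, return_routing)
                return True
        return False       # all entries valid: stall, as in depth-1 overflow

    def on_write(self, cell_addr):
        """WRITE: associative match; satisfy and clear matching entries."""
        satisfied = []
        for i, e in enumerate(self.entries):
            if e is not None and e[0] == cell_addr:
                satisfied.append(e[1])  # routing for the reply token
                self.entries[i] = None
        return satisfied

cam = DeferredReadCAM()
cam.defer_read(0x040, return_routing=("PE1", 12))  # IO cell read pending
cam.defer_read(0x100, return_routing=("PE2", 7))   # I-structure read pending
hits = cam.on_write(0x040)  # IO write satisfies only the IO read
```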
Small CAMs 307 + (4-16 entries) existed as discrete TTL/CMOS parts and are also trivial 308 + to build from comparators + registers. 309 + 310 + **Hardware cost:** 4 entries × (10-bit comparator + 27-bit register) + 311 + priority encoder for allocation + match OR for satisfaction detection. 312 + Estimated 8-12 TTL chips for a 4-entry CAM — roughly double the 313 + single-register cost. Alternatively, the National Semiconductor 100142 314 + (4-word × 4-bit, ECL, see `datasheets/NATLS21982-1.pdf`) is a discrete 315 + CAM chip that provides 4-entry associative lookup in a single package. 316 + Three 100142s cascade to 4 words × 12 bits, covering the 10-bit cell 317 + address match width. The return routing storage still requires separate 318 + registers, but the address-match portion — the critical associative 319 + path — shrinks to 2-3 chips instead of a comparator tree. 320 + 321 + **IO motivation:** the strongest argument for multiple deferred read 322 + slots comes from IO on SM00 (see `io-and-bootstrap.md`). The 323 + always-pending deferred read pattern for IO permanently occupies a 324 + slot. With a single slot, SM00 can either service IO *or* do normal 325 + I-structure deferred reads, but not both. Even 2 slots (one for IO, 326 + one for memory operations) resolve this. 4 slots cover multiple IO 327 + sources (UART + SPI + timer) without needing SM00 specialisation or 328 + spontaneous token emission hardware. 329 + 330 + **Uniform vs SM00-only:** giving all SMs multi-slot deferred reads 331 + keeps the architecture uniform (no special cases). The per-SM cost 332 + increase is small. Alternatively, only SM00 gets the CAM and other 333 + SMs keep depth-1 — saves chips but adds a special case. The uniform 334 + approach is preferred unless chip budget is extremely tight. 335 + 336 + **Recommendation:** start with depth-1 for v0. 
If IO requirements or 337 + I-structure usage patterns hit the single-slot limit, upgrade to a 338 + 2-4 entry CAM uniformly across all SMs. The SM pipeline logic doesn't 339 + change — only the deferred read storage and match logic scale. 279 340 280 341 ## Operation Set 281 342
+21 -22
tests/test_builtins.py
··· 81 81 assert _BUILTIN_LINE_COUNT > 0 82 82 83 83 assert "#loop_counted" in BUILTIN_MACROS 84 - assert "#permit_inject_1" in BUILTIN_MACROS 84 + assert "#permit_inject" in BUILTIN_MACROS 85 85 assert "#reduce_2 op" in BUILTIN_MACROS 86 86 87 87 def test_builtins_prepended_to_pipeline(self): ··· 108 108 source = """ 109 109 @system pe=1, sm=0 110 110 &sink <| pass 111 - #permit_inject_1 |> out=&sink 111 + #permit_inject &sink 112 112 """ 113 113 graph = run_pipeline(source) 114 114 assert len(graph.errors) == 0 115 115 116 116 node_names = list(graph.nodes.keys()) 117 - has_p0 = any("&p0" in n for n in node_names) 118 - assert has_p0, f"Expected &p0 node from #permit_inject_1 expansion in {node_names}" 117 + has_p = any("&p" in n and "permit_inject" in n for n in node_names) 118 + assert has_p, f"Expected &p node from #permit_inject expansion in {node_names}" 119 119 120 - p0_node = next(n for n in graph.nodes.values() if "&p0" in n.name) 121 - assert p0_node.const == 1, "permit_inject_1 &p0 should have const=1" 120 + p_node = next(n for n in graph.nodes.values() if "&p" in n.name and "permit_inject" in n.name) 121 + assert p_node.const == 1, "permit_inject &p should have const=1" 122 122 123 123 124 124 class TestAC82_UserMacroShadows: ··· 127 127 def test_user_defined_macro_shadows_builtin(self): 128 128 """User-defined macro with same name shadows built-in. 129 129 130 - Verifies that when a user defines #permit_inject_1 with custom body, 130 + Verifies that when a user defines #permit_inject with custom body, 131 131 their definition is used instead of the built-in version. 
132 132 """ 133 133 source = """ 134 134 @system pe=1, sm=0 135 135 136 - ; User defines #permit_inject_1 with a custom body (no parameters) 137 - #permit_inject_1 |> { 136 + ; User defines #permit_inject with a custom body 137 + #permit_inject *targets |> { 138 138 &custom_node <| const, 99 139 139 } 140 140 141 141 ; Invoke the user-defined macro 142 - #permit_inject_1 142 + &sink <| pass 143 + #permit_inject &sink 143 144 """ 144 145 graph = run_pipeline(source) 145 146 ··· 280 281 all_values.extend([t.data for t in pe_outputs if hasattr(t, 'data')]) 281 282 assert 7 in all_values, f"Expected 3+4=7 in outputs, got {all_values}" 282 283 283 - def test_builtin_permit_inject_1_assembles_and_runs(self): 284 - """#permit_inject_1 assembles and expands to const node.""" 284 + def test_builtin_permit_inject_assembles_and_runs(self): 285 + """#permit_inject assembles and expands to const nodes.""" 285 286 source = """ 286 287 @system pe=1, sm=0 287 288 &sink <| pass 288 - #permit_inject_1 |> out=&sink 289 + #permit_inject &sink 289 290 """ 290 291 result = assemble(source) 291 292 assert result is not None ··· 327 328 tree = parser.parse(BUILTIN_MACROS) 328 329 assert tree is not None 329 330 330 - def test_permit_inject_variants_defined_in_builtins(self): 331 - """All #permit_inject_1 through #permit_inject_4 are defined.""" 331 + def test_permit_inject_defined_in_builtins(self): 332 + """#permit_inject variadic macro is defined.""" 332 333 from asm.builtins import BUILTIN_MACROS 333 334 334 - for i in range(1, 5): 335 - macro_name = f"#permit_inject_{i}" 336 - assert macro_name in BUILTIN_MACROS, \ 337 - f"Expected {macro_name} definition in BUILTIN_MACROS" 335 + assert "#permit_inject *targets" in BUILTIN_MACROS, \ 336 + "Expected variadic #permit_inject definition in BUILTIN_MACROS" 338 337 339 338 def test_reduce_variants_defined_in_builtins(self): 340 339 """All #reduce_2 through #reduce_4 are defined with op parameter.""" ··· 401 400 402 401 &sink <| pass 403 402 
#my_const 404 - #permit_inject_1 |> out=&sink 403 + #permit_inject &sink 405 404 """ 406 405 graph = run_pipeline(source) 407 406 assert len(graph.errors) == 0 408 407 409 408 node_names = list(graph.nodes.keys()) 410 409 has_val = any("&val" in n for n in node_names) 411 - has_p0 = any("&p0" in n for n in node_names) 410 + has_p = any("&p" in n and "permit_inject" in n for n in node_names) 412 411 assert has_val, f"Expected user macro &val in {node_names}" 413 - assert has_p0, f"Expected builtin &p0 in {node_names}" 412 + assert has_p, f"Expected builtin &p in {node_names}" 414 413 415 414 416 415 class TestLineNumberOffset: