OR-1 dataflow CPU sketch

docs: add dfasm macros and function calls design plan

Macro system with IR-level expansion, static function call syntax
with @ret markers and auto-inserted free_ctx, trailing-colon
location directives, dot-notation scope resolution, and built-in
macro library. 8 implementation phases.

Orual 365d3ac3 b79275c1

+2363 -447
+2
.exo/config.toml
··· 1 + default_role = "tl" 2 + zellij_session = "or1-design"
+3
.gitignore
··· 17 17 # dfgraph frontend 18 18 dfgraph/frontend/node_modules/ 19 19 dfgraph/frontend/dist/ 20 + # ExoMonad - track config, ignore runtime artifacts 21 + .exo/* 22 + !.exo/config.toml
+28 -4
design-notes/OR-1 Design.md
··· 14 14 15 15 There's two reasons. One makes me sound like a crank, the other makes me sound like an asshole. I'll start with the latter. I've wanted to build a breadboard CPU for a bit now, but I didn't want to just follow in someone else's footsteps. James Sharman has made perhaps the most polished and usable system, even writing some decent games for his JAM-1, and achieved high subscalar performance. Fabian Schuiki is targeting a superscalar system with out-of-order execution, backporting ideas from the late 90s and 2000s. I think I can do something way more out of distribution than either, and get competitive results. 16 16 17 - The other reason is that I do genuinely think we might have missed something when we dropped dataflow machines because the PC and thus the x86 CPU (and now the ARM CPU) ate the world. There are a bunch of concepts that, even within dataflow designs, got abandoned in the 90s favour of moving closer to a multicore Turing machine, just with hardware message passing and semaphores. 17 + The other reason is that I do genuinely think we might have missed something when we dropped dataflow machines because the PC and thus the x86 CPU (and now the ARM CPU) ate the world. There are a bunch of concepts that, even within dataflow designs, got abandoned in the 90s in favour of moving closer to a multi-core Turing machine, just with hardware message passing and semaphores. 
18 18 19 19 ## Project Goals 20 20 21 - - Dynamic dataflow CPU achievable with discrete logic (74-series TTL + SRAM) 21 + - Dynamic dataflow CPU achievable with discrete logic (74-series TTL or CMOS + SRAM) 22 22 - Multi-PE design targeting superscalar-equivalent IPC 23 23 - "Period-plausible" transistor budget: ~25-35K logic transistors + SRAM chips 24 24 - Comparable to a 68000 or a couple of Z80s in logic complexity ··· 38 38 ## Things the OR-1 does differently 39 39 40 40 - **Very** small instruction and operand storage in the CM (think register file or L1 cache, not RAM) at least relative to other dataflow computers 41 - - SM acts like L2 cache or RAM. 42 41 - This means that instructions must be fetched while running. 43 - - There is still no program counter or similar, loads are explicit. The compiler/assembler inserts loads as best it can. 42 + - There is still no program counter or similar, loads are explicit. The compiler/assembler inserts loads as best it can. 43 + - The 'exec' SM instruction offers a straightforward way to load a coherent block of code into the instruction cache at runtime and optionally trigger its execution. 44 + - SM is a hybrid of owned I-structure-esque memory and a standard shared address space with more typical guarantees. 45 + - ROM and memory-mapped IO devices which do not need I-structure guarantees are generally mapped into the shared address space. 46 + - Stronger guarantees over a block of raw memory space can be obtained in the typical way using synchronization primitives located in I-structure memory 47 + 48 + ### Boot process 49 + 50 + One of the challenges in making a dataflow CPU without a dedicated (and more conventional) control unit is figuring out how to bootstrap the system. Instructions must be loaded into the control memory elements and the first seed tokens must be emitted. The OR-1 solves this by giving one structure memory element responsibility for bootstrapping the system. 
It reuses the SM 'exec' instruction circuitry with a hardwired address to clock tokens stored in ROM onto the bus until it reaches a stop signal. 51 + 52 + 1. Bootstrap SM (SM00) activates on reset 53 + 2. Reset signal latches reset vector address into SM00's address register and triggers 'exec' instruction circuit 54 + 3. SM00 loads contents at reset vector into counter register, adds address 55 + 4. SM00 loads next address, and pushes direct to interconnect, increments address 56 + 5. Repeat until address == counter 57 + 6. Bootstrap SM input FIFO output now enabled 58 + 59 + The contents of the reset vector, after the length, contain the raw tokens required to load each PE's initial instructions and data, plus any seed tokens to begin execution. This can be a simple bootstrap program to enable loading of other code, or it can be the primary program itself. Valid programs must not send commands to the bootstrap SM during this process. The input FIFO's output is disabled during exec instructions, but putting any traffic intended for SM00 onto the bus risks interfering with other bus traffic, as SM00 makes no guarantees about the behaviour of its input FIFO during the boot process. 60 + 61 + ### Dynamic scheduling 62 + 63 + The OR-1 is, for a number of reasons, a mostly *static* dataflow machine. The way instructions are routed is built into the instructions and tokens themselves. There's no dynamic load balancing inherent to the machine, as that would add a nontrivial amount of additional logic. There are of course a few escapes from this. One is `exec`, which reads out memory directly as tokens, and the closely related `iram_write`, which replaces the contents of a cell in instruction memory. Another, not yet implemented, is `mkpkt`, which creates an arbitrary packet from two operands. 64 + 65 + ### Function calls 66 + 67 + While obviously the `exec` instruction can effectively "call" a function, that is a high-overhead operation. 
Optimized code interleaves IRAM writes following `free_ctx` operations.
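The boot sequence enumerated above can be sketched as a tiny emulator-style model. This is an illustration only: the names (`RESET_VECTOR`, `rom`, `push_to_interconnect`) are hypothetical, not the emulator's API, and the exact stop condition at `address == counter` is my reading of step 5.

```python
# Illustrative sketch of the SM00 bootstrap sequence (steps 1-6 above).
# RESET_VECTOR, rom, and push_to_interconnect are hypothetical names.
RESET_VECTOR = 0x00  # hardwired address; placeholder value

def sm00_bootstrap(rom, push_to_interconnect):
    # Step 2: reset latches the reset vector into the address register.
    addr = RESET_VECTOR
    # Step 3: the word at the reset vector is a length; counter = length + addr.
    counter = rom[addr] + addr
    # Steps 4-5: stream successive words straight onto the interconnect,
    # stopping once the address reaches the counter (boundary is my reading).
    while addr < counter:
        addr += 1
        push_to_interconnect(rom[addr])
    # Step 6: the input FIFO's output would be re-enabled here.
```

With a ROM of `[length, token, token, ...]`, the loop pushes exactly `length` raw tokens after the length word, matching the description of the reset vector contents.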
+24 -24
design-notes/alu-and-output-design.md
··· 166 166 > IntEnum ordinal values that do NOT correspond to these bit patterns. 167 167 > Final hardware encoding will be determined during physical build. 168 168 169 - | Opcode | Mnemonic | Arity | Output Mode | Description | 170 - |--------|----------|-------|-------------|-------------| 171 - | 00000 | ADD | dyadic | DUAL or SINGLE | A + B | 172 - | 00001 | SUB | dyadic | DUAL or SINGLE | A - B | 173 - | 00010 | INC | monadic | DUAL or SINGLE | A + 1 | 174 - | 00011 | DEC | monadic | DUAL or SINGLE | A - 1 | 175 - | 00100 | AND | dyadic | DUAL or SINGLE | A & B | 176 - | 00101 | OR | dyadic | DUAL or SINGLE | A \| B | 177 - | 00110 | XOR | dyadic | DUAL or SINGLE | A ^ B | 178 - | 00111 | NOT | monadic | DUAL or SINGLE | ~A | 179 - | 01000 | SHL | monadic | DUAL or SINGLE | A << N (imm) | 180 - | 01001 | SHR | monadic | DUAL or SINGLE | A >> N (imm, logical) | 181 - | 01010 | ASR | monadic | DUAL or SINGLE | A >> N (imm, arithmetic) | 182 - | 01011 | EQ | dyadic | DUAL or SINGLE | A == B → bool | 183 - | 01100 | LT | dyadic | DUAL or SINGLE | A < B signed → bool | 184 - | 01101 | GT | dyadic | DUAL or SINGLE | A > B signed → bool | 185 - | 01110 | ULT | dyadic | DUAL or SINGLE | A < B unsigned → bool | 186 - | 01111 | UGT | dyadic | DUAL or SINGLE | A > B unsigned → bool | 187 - | 10000 | SWITCH | dyadic | SWITCH | route data by bool | 188 - | 10001 | GATE | dyadic | GATE | pass or suppress by bool | 189 - | 10010 | PASS | monadic | DUAL or SINGLE | identity | 190 - | 10011 | CONST | monadic | DUAL or SINGLE | output = immediate | 191 - | 10100 | FREE_CTX | monadic | SUPPRESS | deallocate slot | 192 - | 10101-11111 | — | — | — | reserved for expansion | 169 + | Opcode | Mnemonic | Arity | Output Mode | Description | 170 + | ----------- | -------- | ------- | -------------- | ------------------------ | 171 + | 00000 | ADD | dyadic | DUAL or SINGLE | A + B | 172 + | 00001 | SUB | dyadic | DUAL or SINGLE | A - B | 173 + | 00000 | INC | monadic | DUAL or SINGLE 
| A + 1 (imm const) | 174 + | 00001 | DEC | monadic | DUAL or SINGLE | A - 1 (imm const) | 175 + | 00100 | AND | dyadic | DUAL or SINGLE | A & B | 176 + | 00101 | OR | dyadic | DUAL or SINGLE | A \| B | 177 + | 00110 | XOR | dyadic | DUAL or SINGLE | A ^ B | 178 + | 00111 | NOT | monadic | DUAL or SINGLE | ~A | 179 + | 01000 | SHL | monadic | DUAL or SINGLE | A << N (imm) | 180 + | 01001 | SHR | monadic | DUAL or SINGLE | A >> N (imm, logical) | 181 + | 01010 | ASR | monadic | DUAL or SINGLE | A >> N (imm, arithmetic) | 182 + | 01011 | EQ | dyadic | DUAL or SINGLE | A == B → bool | 183 + | 01100 | LT | dyadic | DUAL or SINGLE | A < B signed → bool | 184 + | 01101 | GT | dyadic | DUAL or SINGLE | A > B signed → bool | 185 + | 01110 | ULT | dyadic | DUAL or SINGLE | A < B unsigned → bool | 186 + | 01111 | UGT | dyadic | DUAL or SINGLE | A > B unsigned → bool | 187 + | 10000 | SWITCH | dyadic | SWITCH | route data by bool | 188 + | 10001 | GATE | dyadic | GATE | pass or suppress by bool | 189 + | 10010 | PASS | monadic | DUAL or SINGLE | identity | 190 + | 10011 | CONST | monadic | DUAL or SINGLE | output = immediate | 191 + | 10100 | FREE_CTX | monadic | SUPPRESS | deallocate slot | 192 + | 10101-11111 | — | — | — | reserved for expansion | 193 193 194 194 The output mode column indicates the default. DUAL vs SINGLE is 195 195 controlled by a flag in the IRAM instruction word (has_dest2), not by
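SWITCH and GATE are the table's only control-routing primitives, so a toy model of their semantics ("route data by bool", "pass or suppress by bool") may help. Which destination corresponds to a true selector is an assumption here, not something the table specifies.

```python
def switch_op(data, sel):
    """SWITCH: route the data token to one of two destinations by a
    boolean selector. The true -> dest_l mapping is an assumption."""
    return ("dest_l", data) if sel else ("dest_r", data)

def gate_op(data, sel):
    """GATE: pass the data token when sel is true, otherwise suppress it
    entirely (no output token, modelled here as None)."""
    return data if sel else None
```

Suppression (GATE with a false selector) emits nothing at all, which is what makes it different from SWITCH, which always emits on exactly one of its two destinations.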
+3 -3
design-notes/architecture-overview.md
··· 138 138 139 139 ``` 140 140 Dyadic wide (prefix 00, 2 flits): 141 - flit 1: [0][0][PE:2][offset:5][ctx:4][port:1][gen:2] = 16 bits 141 + flit 1: [0][0][port:1][PE:2][gen:2][offset:5][ctx:4] = 16 bits 142 142 flit 2: [data:16] = 16 bits 143 143 144 144 Monadic normal (prefix 010, 2 flits): ··· 150 150 flit 2: [data:8][port:1][gen:2][spare:5] = 16 bits 151 151 152 152 IRAM write (prefix 011+01, 2-3 flits): 153 - flit 1: [0][1][1][PE:2][01][iram_addr:7][flags:2] = 16 bits 153 + flit 1: [0][1][1][PE:2][01][flags:2][iram_addr:7] = 16 bits 154 154 flit 2: [instruction_word_low:16] = 16 bits 155 155 (flit 3: [instruction_word_high:8][spare:8] if needed) 156 156 157 157 Monadic inline (prefix 011+10, 1 flit): 158 - flit 1: [0][1][1][PE:2][10][offset:4][ctx:4][spare:1] = 16 bits 158 + flit 1: [0][1][1][PE:2][10][spare:1][offset:4][ctx:4] = 16 bits 159 159 160 160 SM standard (prefix 1, 2 flits): 161 161 flit 1: [1][SM_id:2][op:3-5][addr:8-10] = 16 bits
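To make the revised field order concrete, here is a sketch that packs flit 1 of a dyadic wide token. The field widths and order follow the format above; laying the fields out MSB-first is my assumption.

```python
def pack_dyadic_wide_flit1(port, pe, gen, offset, ctx):
    """Pack [0][0][port:1][PE:2][gen:2][offset:5][ctx:4] into 16 bits,
    MSB-first (field order per the format above; bit order assumed)."""
    assert port < 2 and pe < 4 and gen < 4 and offset < 32 and ctx < 16
    return (0b00 << 14) | (port << 13) | (pe << 11) | (gen << 9) | (offset << 4) | ctx
```

Note that the `00` prefix occupies the top two bits, so every dyadic wide flit 1 packs to a value below `0x4000`.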
+69 -229
design-notes/assembler-architecture.md
··· 1 - # Dynamic Dataflow CPU — Assembler Architecture 1 + # OR-1 CPU - dfasm Assembler Architecture 2 2 3 - Covers the `asm/` package: pipeline structure, IR design, pass 4 - architecture, code generation modes, and key implementation decisions. 3 + Covers the `asm/` package: pipeline structure, IR design, pass architecture, code generation modes, and key implementation decisions. 5 4 6 5 See `architecture-overview.md` for the target hardware model. 7 6 See `dfasm-primer.md` for the language itself. 8 - See `asm/CLAUDE.md` for the contract-level summary. 9 7 10 8 ## Role 11 9 12 - The assembler translates dfasm source into emulator-ready configuration 13 - objects (`PEConfig`, `SMConfig`, seed tokens) or a hardware-faithful 14 - bootstrap token sequence. It bridges the gap between human-authored 15 - dataflow graph programs and the structures the emulator (and eventually 16 - hardware) consumes. 10 + The assembler translates dfasm source into emulator-ready configuration objects (`PEConfig`, `SMConfig`, seed tokens) or a hardware-faithful bootstrap token sequence. It bridges the gap between human-authored dataflow graph programs and the structures the emulator (and eventually hardware) consumes. 17 11 18 - The assembler does NOT optimise. It does not reorder instructions for 19 - performance, fuse operations, or eliminate redundant subgraphs. It is a 20 - faithful translator: the graph you write is the graph you get. Future 21 - optimisation passes may be inserted between resolve and place, but the 22 - current pipeline is intentionally thin — correctness first, cleverness 23 - later. 12 + The assembler does NOT (currently) optimize. It does not reorder instructions for performance, fuse operations, or eliminate redundant sub-graphs. It is a faithful translator: the graph you write is the graph you get. Future optimization passes may be inserted between resolve and place, but the current pipeline is intentionally thin. Correctness first, cleverness later. 
24 13 25 14 ## Pipeline Overview 26 15 27 - Six stages, each a pure function from `IRGraph → IRGraph` (or 28 - `IRGraph → output`). The pipeline is: 16 + Six stages, each a pure function from `IRGraph → IRGraph` (or `IRGraph → output`). 17 + 18 + The pipeline is: 29 19 30 20 ``` 31 21 dfasm source ··· 50 40 or SM init → ROUTE_SET → LOAD_INST → seeds (token mode) 51 41 52 42 53 - Each pass returns a new `IRGraph`. Graphs are never mutated after 54 - construction — each pass produces a fresh copy with the new information 55 - filled in. Errors accumulate in `IRGraph.errors` rather than failing 56 - fast, so the assembler reports all problems in a single pass rather 57 - than forcing the programmer to fix them one at a time. 43 + Each pass returns a new `IRGraph`. Graphs are never mutated after construction; each pass produces a fresh copy with the new information filled in. Errors accumulate in `IRGraph.errors` rather than failing fast, so the assembler reports all problems in a single pass rather than forcing the programmer to fix them one at a time. 58 44 59 - The public API orchestrates the pipeline and raises `ValueError` if any 60 - stage produces errors: 45 + The public API orchestrates the pipeline and raises `ValueError` if any stage produces errors: 61 46 62 47 ```python 63 - assemble(source: str) -> AssemblyResult # direct mode 48 + assemble(source: str) -> AssemblyResult # direct mode 64 49 assemble_to_tokens(source: str) -> list # token stream mode 65 50 round_trip(source: str) -> str # parse → lower → serialize 66 51 serialize_graph(graph: IRGraph) -> str # IRGraph → dfasm at any stage 67 52 ``` 68 53 69 54 ## IR Types (`ir.py`) 70 - 71 - All IR types are frozen dataclasses, following the conventions of 72 - `tokens.py` and `cm_inst.py`. 73 55 74 56 ### Core Types 75 57 ··· 86 68 87 69 Destinations evolve through the pipeline: 88 70 89 - 1. **After lower**: `NameRef(name, port)` — symbolic, unresolved 90 - 2. 
**After allocate**: `ResolvedDest(name, addr)` — concrete 91 - `Addr(a, port, pe)` with IRAM offset, port, and target PE 92 - 93 - This two-stage resolution means early passes can work with symbolic 94 - names while later passes have concrete hardware addresses. 71 + 1. **After lower**: `NameRef(name, port)` - symbolic, unresolved 72 + 2. **After allocate**: `ResolvedDest(name, addr)` - concrete `Addr(a, port, pe)` with IRAM offset, port, and target PE 95 73 74 + This two-stage resolution means early passes can work with symbolic names while later passes have concrete hardware addresses, with the symbolic names retained for debug clarity. 96 75 ### Graph Traversal Utilities 97 76 98 - `IRGraph` provides recursive traversal functions for working with nested 99 - regions: 77 + `IRGraph` provides recursive traversal functions for working with nested regions: 100 78 101 - - `iter_all_subgraphs()` — depth-first traversal of graph + all region 102 - bodies 103 - - `collect_all_nodes()` — flatten nodes from graph and all nested regions 104 - - `collect_all_nodes_and_edges()` — flatten both 105 - - `collect_all_data_defs()` — flatten data definitions 106 - - `update_graph_nodes()` — recursively update nodes while preserving 107 - region structure 108 - 109 - These are necessary because function definitions create nested 110 - `IRRegion` objects with their own `IRGraph` bodies. 79 + - `iter_all_subgraphs()`: depth-first traversal of graph + all region bodies 80 + - `collect_all_nodes()`: flatten nodes from graph and all nested regions 81 + - `collect_all_nodes_and_edges()`: flatten both 82 + - `collect_all_data_defs()`: flatten data definitions 83 + - `update_graph_nodes()`: recursively update nodes while preserving region structure 111 84 112 85 ## Pass Details 113 86 114 87 ### Parse 115 88 116 - Uses the Lark library with the Earley parser algorithm. The grammar is 117 - in `dfasm.lark`.
Earley is required (not LALR) because the grammar has 118 - ambiguities between `location_dir` (a bare qualified reference) and 119 - `weak_edge` (outputs before opcode) that require context-sensitive 120 - resolution. 89 + Uses the Lark library with the Earley parser algorithm. The formal grammar is defined in `dfasm.lark`. Earley is required because the grammar has ambiguities between `location_dir` (a bare qualified reference) and `weak_edge` (outputs before opcode) that require context-sensitive resolution. 121 90 122 - The parser produces a concrete syntax tree (CST) — Lark `Tree` objects 123 - with `Token` terminals. No semantic processing happens here. 124 - 91 + The parser produces a concrete syntax tree (CST): Lark `Tree` objects with `Token` terminals. 125 92 ### Lower (`lower.py`) 126 93 127 - A Lark `Transformer` that walks the CST bottom-up, converting each 128 - grammar rule into IR types. 94 + A Lark `Transformer` that walks the CST bottom-up, converting each grammar rule into IR types. 
129 95 130 96 **Key transformations:** 131 97 132 - - `inst_def` → `IRNode` with opcode, const, PE placement, named args 133 - - `plain_edge` → `IREdge` with source, dest, port qualifiers 134 - - `strong_edge` / `weak_edge` → anonymous `IRNode` + input/output 98 + - `inst_def` -> `IRNode` with opcode, const, PE placement, named args 99 + - `plain_edge` -> `IREdge` with source, dest, port qualifiers 100 + - `strong_edge` / `weak_edge` -> anonymous `IRNode` + input/output 135 101 `IREdge` set (creates a `CompositeResult`) 136 102 - `func_def` → `IRRegion(kind=FUNCTION)` with a nested `IRGraph` body 137 - - `location_dir` → `IRRegion(kind=LOCATION)` — subsequent statements 103 + - `location_dir` -> `IRRegion(kind=LOCATION)`: subsequent statements 138 104 are collected into its body during post-processing 139 - - `data_def` → `IRDataDef` with SM placement and cell address 105 + - `data_def` -> `IRDataDef` with SM placement and cell address 140 106 - `system_pragma` → `SystemConfig` (stored on the transformer, attached 141 107 to the final `IRGraph`) 142 108 143 109 **Name qualification:** 144 110 145 - Labels (`&name`) inside function regions are qualified with the function 146 - scope: `&add` inside `$main` becomes `$main.&add`. This scoping is 147 - transparent to the programmer — dfasm source uses bare `&label` names 148 - within functions, and the assembler qualifies them internally. 149 - 150 - Node references (`@name`) and function references (`$name`) are 151 - top-level and are not qualified. 111 + Labels (`&name`) inside function regions are qualified with the function scope: `&add` inside `$main` becomes `$main.&add`. 152 112 153 113 **Opcode mapping (`opcodes.py`):** 154 114 155 - Mnemonic strings from the grammar are mapped to `ALUOp`, `MemOp`, or 156 - `CfgOp` enum values via `MNEMONIC_TO_OP`. 
A complication: Python 157 - `IntEnum` subclasses can share numeric values across types 158 - (`ArithOp.ADD == 0 == MemOp.READ`), so the reverse mapping and set 159 - membership tests use type-aware collections 160 - (`TypeAwareOpToMnemonicDict`, `TypeAwareMonadicOpsSet`) that key on 161 - `(type, value)` tuples internally. 115 + Mnemonic strings from the grammar are mapped to `ALUOp`, `MemOp`, or `CfgOp` enum values via `MNEMONIC_TO_OP`. A complication: Python `IntEnum` sub-classes can share numeric values across types (`ArithOp.ADD == 0 == MemOp.READ`), so the reverse mapping and set membership tests use type-aware collections (`TypeAwareOpToMnemonicDict`, `TypeAwareMonadicOpsSet`) that key on `(type, value)` tuples internally. 162 116 163 117 ### Resolve (`resolve.py`) 164 118 ··· 166 120 167 121 **Process:** 168 122 169 - 1. Flatten all nodes from the graph and all nested regions into a single 170 - namespace 171 - 2. Build a scope map: qualified name → defining scope 123 + 1. Flatten all nodes from the graph and all nested regions into a single namespace 124 + 2. Build a scope map: qualified name -> defining scope 172 125 3. For each edge, check that both source and dest exist 173 - 4. Detect cross-function label references (a `&label` in `$main` cannot 174 - reference a `&label` in `$other`) and flag as SCOPE errors 175 - 5. For undefined references, compute Levenshtein distance against all 176 - known names and suggest the closest match ("did you mean `&addr`?") 126 + 4. Detect cross-function label references (a `&label` in `$main` cannot reference a `&label` in `$other`) and flag as SCOPE errors 127 + 5. For undefined references, compute Levenshtein distance against all known names and suggest the closest match ("did you mean `&addr`?") 177 128 178 - Resolve does not modify nodes — it only appends errors. 129 + Resolve does not modify nodes; it only appends errors. 
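The `IntEnum` collision and the `(type, value)` keying can be demonstrated with a minimal stand-in (the real `TypeAwareOpToMnemonicDict` in `opcodes.py` may differ in detail):

```python
from enum import IntEnum

class ArithOp(IntEnum):
    ADD = 0

class MemOp(IntEnum):
    READ = 0

# The collision: distinct op types compare (and hash) equal as ints.
assert ArithOp.ADD == MemOp.READ  # both are 0

class TypeAwareOpDict(dict):
    """Minimal stand-in keyed on (type, value) so ops of different
    enum types no longer collide. Illustrative only."""
    def __setitem__(self, op, value):
        super().__setitem__((type(op), op.value), value)
    def __getitem__(self, op):
        return super().__getitem__((type(op), op.value))

ops = TypeAwareOpDict()
ops[ArithOp.ADD] = "add"
ops[MemOp.READ] = "read"  # would clobber "add" in a plain dict
```

A plain `dict` keyed directly on the members would hold a single entry here, which is exactly the information loss the type-aware collections avoid.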
179 130 180 131 ### Place (`place.py`) 181 132 182 133 Assigns PE IDs to nodes that don't have explicit placement. 183 134 184 - **Explicit placements** (from `|pe0` qualifiers in source) are validated 185 - first: reject any `pe >= pe_count`. 135 + **Explicit placements** (from `|pe0` qualifiers in source) are validated first. 136 + Reject any `pe >= pe_count`. 186 137 187 - **Auto-placement algorithm** for unplaced nodes (greedy, insertion 188 - order): 138 + **Auto-placement algorithm** for unplaced nodes (greedy, insertion order): 189 139 190 140 1. For each unplaced node, find its connected neighbours via edges 191 141 2. Count PE occurrences among placed neighbours (locality heuristic) 192 - 3. Sort candidate PEs by: most neighbours (descending), then most 193 - remaining IRAM capacity (tie-break) 142 + 3. Sort candidate PEs by: most neighbours (descending), 143 + then most remaining IRAM capacity (tie-break) 194 144 4. Place on the first PE with room for both IRAM and context slots 195 - 5. If no PE fits, record a placement error with per-PE utilisation 196 - breakdown 145 + 5. If no PE fits, record a placement error with per-PE utilization breakdown 197 146 198 147 **IRAM cost model:** 199 148 ··· 202 151 203 152 **System config inference:** 204 153 205 - If no `@system` pragma is provided, the placer infers `pe_count` from 206 - the highest explicit PE ID and uses defaults for IRAM capacity (64) and 207 - context slots (4). 154 + If no `@system` pragma is provided, the placer infers `pe_count` from the highest explicit PE ID and uses defaults for IRAM capacity (64) and context slots (4). 208 155 209 156 ### Allocate (`allocate.py`) 210 157 211 - Assigns three things per node: IRAM offset, context slot, and resolved 212 - destinations. 158 + Assigns three things per node: IRAM offset, context slot, and resolved destinations. 
213 159 214 160 **IRAM layout (per PE):** 215 161 216 - Dyadic instructions are packed at low offsets (0..D-1), monadic at 217 - higher offsets (D..D+M-1). This matches the hardware contract in 218 - `pe-design.md`: the token's offset field doubles as the matching store 219 - entry for dyadic instructions, so they must occupy the dense low range. 162 + Dyadic instructions are packed at low offsets (0..D-1), monadic at higher offsets (D..D+M-1). This matches the hardware contract in `pe-design.md`: the token's offset field doubles as the matching store entry for dyadic instructions, so they must occupy the dense low range. 220 163 221 164 **Context slot assignment (per PE):** 222 165 223 - Each function scope gets a distinct context slot. The root scope 224 - (top-level) always gets slot 0. Additional scopes get slots 1, 2, ... 225 - in order of first appearance. Overflow beyond `ctx_slots` is a RESOURCE 226 - error. 166 + Each function scope gets a distinct context slot. The root scope (top-level) always gets slot 0. Additional scopes get slots 1, 2, ... in order of first appearance. Overflow beyond `ctx_slots` is a RESOURCE error. 227 167 228 168 **Destination resolution:** 229 169 230 - For each node, outgoing edges are resolved to `ResolvedDest` objects 231 - containing concrete `Addr(a=iram_offset, port=Port.L|R, pe=target_pe)`. 232 - The allocator handles: 170 + For each node, outgoing edges are resolved to `ResolvedDest` objects containing concrete `Addr(a=iram_offset, port=Port.L|R, pe=target_pe)`. 
The allocator handles: 233 171 234 - - Single outgoing edge → `dest_l` 235 - - Two outgoing edges → `dest_l` + `dest_r` (distinguished by 236 - `source_port` qualifier or positional order) 237 - - Port conflicts (duplicate ports, mixed explicit/implicit) → PORT error 172 + - Single outgoing edge -> `dest_l` 173 + - Two outgoing edges -> `dest_l` + `dest_r` (distinguished by `source_port` qualifier or positional order) 174 + - Port conflicts (duplicate ports, mixed explicit/implicit) -> PORT error 238 175 239 176 **SM ID assignment:** 240 177 241 - For `MemOp` nodes, the allocator assigns the target SM ID. Single-SM 242 - systems default to `sm_id=0`. Multi-SM systems with ambiguous targets 243 - produce a RESOURCE error. 178 + For `MemOp` nodes, the allocator assigns the target SM ID. Single-SM systems default to `sm_id=0`. Multi-SM systems with ambiguous targets produce a RESOURCE error. 244 179 245 180 ### Codegen (`codegen.py`) 246 181 ··· 250 185 251 186 Produces immediately usable emulator configuration: 252 187 253 - - `pe_configs`: list of `PEConfig` with populated IRAM (ALUInst/SMInst 254 - by offset), context slot count, and route restrictions 255 - - `sm_configs`: list of `SMConfig` with initial cell values from data 256 - definitions 257 - - `seed_tokens`: `MonadToken` for each `CONST` node with no incoming 258 - edges (these kick off execution) 188 + - `pe_configs`: list of `PEConfig` with populated IRAM (ALUInst/SMInst by offset), context slot count, and route restrictions 189 + - `sm_configs`: list of `SMConfig` with initial cell values from data definitions 190 + - `seed_tokens`: `CMToken` for each strongly-connected `CONST` node with no incoming edges 259 191 260 - Route restrictions are computed by scanning all edges from each PE to 261 - determine which other PEs and SMs it can reach. Self-routes are always 262 - included. 192 + Route restrictions are computed by scanning all edges from each PE to determine which other PEs and SMs it can reach. 
Self-routes are always included. 263 193 264 194 **Token stream mode** (`generate_tokens() → list`): 265 195 266 196 Produces a hardware-faithful bootstrap sequence: 267 197 268 - 1. **SM init tokens** — `SMToken(op=WRITE)` for each data definition 269 - 2. **ROUTE_SET tokens** — `RouteSetToken` per PE with route restrictions 270 - 3. **LOAD_INST tokens** — `LoadInstToken` per PE with full IRAM contents 271 - 4. **Seed tokens** — same `MonadToken` list as direct mode 198 + 1. **SM init tokens**: `SMToken(op=WRITE)` for each data definition 199 + 2. **IRAMWrite tokens**: `IRAMWrite` tokens per PE with instructions 200 + 3. **Seed tokens**: same `CMToken` list as direct mode 272 201 273 - This ordering mirrors what the hardware bootstrap would do: initialise 274 - structure memory, configure routing, load instruction memory, then 275 - inject the initial tokens that start execution. 202 + This ordering mirrors what the hardware bootstrap would do: initialize structure memory, load instruction memory, then inject the initial tokens that start execution. 276 203 277 - Both modes reuse the same internal logic — token stream mode calls 278 - `generate_direct()` internally and then reformats the result. 204 + Both modes reuse the same internal logic. Token stream mode calls `generate_direct()` internally and then re-formats the result. 279 205 280 206 ## Error Handling (`errors.py`) 281 207 282 - Errors are structured with category, source location, message, and 283 - optional suggestions. Categories: 208 + Errors are structured with category, source location, message, and optional suggestions. 284 209 285 210 | Category | Stage | Examples | 286 211 |----------|-------|---------| ··· 305 230 = help: First defined at line 2 306 231 ``` 307 232 308 - ## Key Design Decisions 309 - 310 - ### Immutable Pass Pattern 311 - 312 - Each pass returns a new `IRGraph`. 
This simplifies debugging (you can 313 - inspect intermediate representations), enables future caching, and 314 - follows functional programming discipline. The tradeoff is allocation 315 - overhead from copying, but for assembler-scale programs this is 316 - negligible. 317 - 318 - ### Dyadic-First IRAM Layout 319 - 320 - Dyadic instructions are packed at low IRAM offsets so the token's offset 321 - field doubles as the matching store entry index. This is a hardware 322 - constraint from `pe-design.md` (option b) — no extra bits or lookup 323 - tables needed. The compiler must cooperate, and it does. 324 - 325 - ### Greedy Placement 326 - 327 - The placer is intentionally simple: greedy bin-packing with a locality 328 - heuristic. No simulated annealing, no ILP solver, no graph partitioning 329 - library. For the target scale (4 PEs, tens of instructions), greedy is 330 - adequate. The locality heuristic (prefer the PE where connected 331 - neighbours already live) naturally minimises cross-PE token traffic. 332 - 333 - More sophisticated placement is a future concern — the pass interface 334 - (`IRGraph → IRGraph`) means the placer is swappable without touching 335 - anything else. 336 - 337 - ### Error Accumulation 338 - 339 - All phases append errors to `IRGraph.errors` rather than raising 340 - immediately. This means the programmer sees all problems at once, not 341 - a frustrating one-at-a-time reveal. The pipeline orchestrator 342 - (`__init__.py`) raises `ValueError` after each stage if errors are 343 - present. 344 - 345 - ### Type-Aware Opcode Collections 346 - 347 - Python `IntEnum` values collide across subclasses (`ArithOp.ADD = 0 = 348 - MemOp.READ`). Plain dicts and sets lose type information. The 349 - `TypeAwareOpToMnemonicDict` and `TypeAwareMonadicOpsSet` in 350 - `opcodes.py` key on `(type(op), op.value)` tuples to avoid this. 
351 - 352 - ## Module Dependency Graph 353 - 354 - ``` 355 - dfasm.lark (grammar) 356 - 357 - 358 - lower.py ──→ ir.py (types) ──→ opcodes.py (mnemonic mapping) 359 - │ │ 360 - ▼ ▼ 361 - resolve.py errors.py (error types) 362 - 363 - 364 - place.py 365 - 366 - 367 - allocate.py 368 - 369 - 370 - codegen.py ──→ cm_inst (ALUInst, SMInst, Addr) 371 - │ ──→ tokens (MonadToken, SMToken, CfgToken variants) 372 - │ ──→ emu/types (PEConfig, SMConfig) 373 - │ ──→ sm_mod (Presence) 374 - 375 - __init__.py (pipeline orchestration, public API) 376 - ``` 377 - 378 - **Boundary rule**: `emu/` and root-level modules never import from 379 - `asm/`. The assembler depends on the emulator's types, not the other 380 - way around. 381 - 382 233 ## Serialization and Round-Tripping (`serialize.py`) 383 234 384 - The serializer emits valid dfasm source from an `IRGraph` at any 385 - pipeline stage. This enables: 235 + The serializer emits valid dfasm source from an `IRGraph` at any pipeline stage. This enables: 386 236 387 - - **Round-trip testing**: `source → parse → lower → serialize → source'` 388 - verifies that the parser and lowering pass preserve the program 389 - - **IR inspection**: dump the graph after any pass to see what the 390 - assembler is doing 391 - - **Code generation from IR**: future tools could construct `IRGraph` 392 - objects programmatically and serialize them to dfasm 237 + - **Round-trip testing**: `source → parse → lower → serialize → source'` verifies that the parser and lowering pass preserve the program 238 + - **IR inspection**: dump the graph after any pass to see what the assembler is doing 239 + - **Code generation from IR**: future tools could construct `IRGraph` objects programmatically and serialize them to dfasm 393 240 394 - The serializer unqualifies names inside function regions (strips the 395 - `$func.` prefix), preserves port and placement qualifiers, and formats 396 - values as hex (if > 255) or decimal. 
241 + The serializer unqualifies names inside function regions (strips the `$func.` prefix), preserves port and placement qualifiers, and formats values as hex (if > 255) or decimal. 397 242 398 243 ## Future Work 399 244 400 - - **Optimisation passes** between resolve and place: dead node 401 - elimination, constant folding, subgraph deduplication 402 - - **Macro expansion**: the grammar already supports `#macro` syntax in 403 - data definitions; the expansion pass is not yet implemented 404 - - **Wider placement heuristics**: graph partitioning, min-cut 405 - algorithms, or profile-guided placement for larger programs 406 - - **Incremental reassembly**: modify part of the graph and re-run only 407 - affected passes 408 - - **Hardware encoding pass**: translate ALUInst/SMInst to bit-level 409 - instruction words for actual IRAM loading 245 + - **Optimization passes** between resolve and place: dead node elimination, constant folding, sub-graph deduplication 246 + - **Macro expansion**: the grammar already supports `#macro` syntax; the expansion pass is not yet implemented 247 + - **Wider placement heuristics**: graph partitioning, min-cut algorithms, or profile-guided placement for larger programs 248 + - **Incremental reassembly**: modify part of the graph and re-run only affected passes 249 + - **Hardware encoding pass**: translate ALUInst/SMInst to bit-level instruction words for actual IRAM loading
+123 -175
design-notes/dfasm-primer.md
··· 1 - # dfasm — Dataflow Graph Assembly Language 1 + # dfasm - Dataflow Graph Assembly Language 2 2 3 - A primer on the dfasm dialect: syntax, semantics, naming conventions, 4 - and the mapping from source to executable configuration. 3 + A primer on the assembly dialect used in the OR-1. 5 4 6 5 See `assembler-architecture.md` for the assembler's internal pipeline. 7 6 See `architecture-overview.md` for the hardware model dfasm targets. 8 7 9 8 ## What dfasm Is 10 9 11 - dfasm is a textual representation of dataflow graphs. Each instruction 12 - is a **node** with zero, one, or two inputs and up to two outputs. 13 - Connections between nodes are **edges**. Execution is data-driven: 14 - a node fires when all its required operands have arrived as tokens on 15 - the network. 10 + dfasm is a representation of low-level dataflow program graphs in text form. Each instruction forms a **node** with zero, one, or two inputs, and up to two outputs. Connections between nodes are conceived of as graph edges. Execution is entirely data-driven. A node fires when all its required operands have arrived as tokens. 16 11 17 - dfasm is not sequential assembly. There is no program counter, no 18 - implicit instruction ordering. The order of statements in the source 19 - file has no effect on execution — only the graph topology matters. 20 - 12 + It is *not* conventional assembly with strong implicit sequential behaviour. There is no program counter, and control flow and execution order are driven primarily by the graph topology. Writing things in a sensible order is up to you, and is for your own benefit. 21 13 ## Syntax Overview 22 14 23 15 ### Comments 24 16 25 - Semicolons start line comments (traditional assembler convention): 17 + Semicolons start line comments, as is fairly common for assembly. While the parser is sophisticated, I've chosen to do this to set a specific tone.
dfasm, while it has a number of seemingly sophisticated features, remains an *assembly language*, tightly coupled to the low-level functions of the hardware. 26 18 27 19 ```dfasm 28 20 ; This is a comment ··· 33 25 34 26 dfasm uses three sigil-prefixed naming conventions: 35 27 36 - | Sigil | Scope | Use | 37 - |-------|-------|-----| 38 - | `@name` | Global (top-level) | Node references, data definitions | 39 - | `&name` | Local (within enclosing function) | Labels for instructions | 40 - | `$name` | Global | Function / subgraph definitions | 28 + | Sigil | Scope | Use | 29 + | ------- | --------------------------------- | --------------------------------- | 30 + | `@name` | Global (top-level) | Node references, data definitions | 31 + | `&name` | Local (within enclosing function) | Labels for instructions | 32 + | `$name` | Global | Function / subgraph definitions | 41 33 42 - Names are composed of `[a-zA-Z_][a-zA-Z0-9_]*`. Sigils are part of the 43 - reference syntax, not the name itself. 34 + Names are composed of `[a-zA-Z_][a-zA-Z0-9_]*`. 44 35 45 36 ### Qualifier Chains 46 37 47 - Names can be chained with placement and port qualifiers. No spaces 48 - are allowed within a chain: 38 + Names can be chained with placement and port qualifiers. No spaces are allowed within a chain. Placement indicators are mostly optional. The assembler will attempt to resolve and auto-place instructions, currently using a basic greedy locality heuristic. If an instruction cannot be placed, or the assembler places it badly, you can manually assign its placement. 49 39 50 40 ```dfasm 51 41 &sum|pe0:L ; label "sum", placed on PE 0, left port ··· 59 49 | Port | `:L` or `:R` | Left or right input port (for edges) | 60 50 | Cell address | `:N` | SM cell address (for data definitions) | 61 51 62 - Placement is optional — the assembler auto-places unplaced nodes using 63 - a greedy locality heuristic. 
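For a feel of how little machinery the qualifier chains need, here is a sketch of a parser for them. The real grammar lives in `dfasm.lark`; this regex only mirrors the shapes described above (sigil, name, optional `|peN`/`|smN` placement, optional `:L`/`:R` port or `:N` cell address):

```python
import re

# Illustrative only — not the actual dfasm.lark grammar.
CHAIN = re.compile(
    r"(?P<sigil>[@&$])"
    r"(?P<name>[a-zA-Z_][a-zA-Z0-9_]*)"
    r"(?:\|(?P<place>(?:pe|sm)\d+))?"   # optional placement qualifier
    r"(?::(?P<qual>[LR]|\d+))?"         # optional port or cell address
)

def parse_chain(text: str) -> dict:
    m = CHAIN.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid qualifier chain: {text!r}")
    return m.groupdict()
```

`parse_chain("&sum|pe0:L")` yields the label, its PE placement, and the left port, matching the example above.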
52 + ## Statement Types 53 + 54 + ### Pragma 64 55 65 - ## Statement Types 56 + Pragmas are built-in directives that provide specific information about how the program should be assembled. 66 57 67 - ### System Pragma 58 + ### `@system` 68 59 69 - Declares hardware configuration. Required for programs that need 70 - specific PE/SM counts: 60 + Declares hardware configuration. Required for programs that need specific PE/SM counts: 71 61 72 62 ```dfasm 73 63 @system pe=4, sm=1, iram=128, ctx=4 74 64 ``` 75 65 76 - | Parameter | Required | Default | Meaning | 77 - |-----------|----------|---------|---------| 78 - | `pe` | yes | — | Number of processing elements | 79 - | `sm` | yes | — | Number of structure memory modules | 80 - | `iram` | no | 64 | IRAM capacity per PE (instruction slots) | 81 - | `ctx` | no | 4 | Context slots per PE | 66 + | Parameter | Required | Default | Meaning | 67 + | --------- | -------- | ------- | ---------------------------------------- | 68 + | `pe` | yes | — | Number of processing elements | 69 + | `sm` | yes | — | Number of structure memory modules | 70 + | `iram` | no | 64 | IRAM capacity per PE (instruction slots) | 71 + | `ctx` | no | 4 | Context slots per PE | 82 72 83 73 At most one `@system` pragma per program. 84 74 75 + ### `@rom_data` 76 + 77 + Declares that the following section will be placed in ROM with an optional name and base address. If the address is not specified, it will be placed after the contents of the reset vector. It will *not* be emitted as tokens during bootstrapping. 78 + 79 + ```dfasm 80 + @rom_data [name=, addr=...] 81 + ``` 82 + 83 + Unlike the `@system` pragma, the `@rom_data` pragma can be used more than once. Functions placed in a `@rom_data` section can be loaded via `exec`, as can a named `@rom_data` section. A function with its seed tokens in its scope will be called with those tokens.
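The base-address rule for unaddressed `@rom_data` sections amounts to a cursor walk. A sketch only; the pack-in-source-order policy and the tuple shapes are assumptions, not the assembler's actual API:

```python
def assign_rom_bases(sections, reset_vector_end):
    # sections: list of (name, explicit_addr_or_None, size_in_words).
    # Unaddressed sections are laid out after the reset vector
    # contents, in source order (assumed); explicit addresses win.
    bases, cursor = {}, reset_vector_end
    for name, addr, size in sections:
        if addr is not None:
            bases[name] = addr
        else:
            bases[name] = cursor
            cursor += size
    return bases
```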
85 84 ### Instruction Definition 86 85 87 86 Defines a named node with an opcode and optional arguments: ··· 90 89 &label <| opcode [, arg ...] 91 90 ``` 92 91 93 - The `<|` operator reads as "receives from" — the node receives data 94 - from whatever edges point to it. 92 + The `<|` operator reads as "receives from". 93 + The node receives data from whatever edges point to it. 95 94 96 95 **Examples:** 97 96 ··· 117 116 &source |> &dest1:L, &dest2:R ; fan-out to two destinations 118 117 ``` 119 118 120 - The `|>` operator reads as "flows to" — data flows from source to 121 - destination. Port qualifiers on the destination specify which input 122 - the data arrives on. Port qualifiers on the source specify which 123 - output slot it leaves from (relevant for dual-output nodes like 124 - switch operations). 119 + The `|>` operator reads as "flows to". Data flows from source to destination. Port qualifiers on the destination specify which input the data arrives on. Port qualifiers on the source specify which 120 + output slot it leaves from (relevant for dual-output nodes like switch operations). 121 + 122 + > `const` instructions on the left/source side create 'seed' tokens, injected into the machine after loading, at startup, or when their function enters scope. 125 123 126 - **Default port is L (left)** when no port is specified. 124 + **When no port is specified, the default is L** 127 125 128 126 ### Strong Edge (Inline Anonymous Node) 129 127 ··· 137 135 add &a, &b |> &result:L ; anonymous add of &a and &b → &result 138 136 ``` 139 137 140 - This is shorthand. The assembler creates a hidden node (named 141 - `&__anon_N`) and wires the inputs and outputs. Useful for small, 142 - one-off operations that don't need a label. 138 + This is shorthand. The assembler creates a hidden node (named `&__anon_N`) and wires the inputs and outputs. Useful for small, one-off operations that don't need a label. 
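The hidden-node desugaring is mechanical. A toy version, with tuple-shaped nodes and edges standing in for the assembler's real IR types:

```python
import itertools

_anon_ids = itertools.count()

def desugar_strong_edge(opcode, inputs, dest):
    # add &a, &b |> &result:L  becomes a hidden node plus three edges.
    # Edge = (source, destination, destination port); default port
    # order for the two inputs is L then R.
    anon = f"&__anon_{next(_anon_ids)}"
    edges = [(src, anon, port) for src, port in zip(inputs, ("L", "R"))]
    edges.append((anon, *dest))
    return (anon, opcode), edges
```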
143 139 144 140 ### Weak Edge (Reverse Inline) 145 141 ··· 153 149 &result:L add <| &a, &b ; same as: add &a, &b |> &result:L 154 150 155 151 156 - The distinction between strong and weak edges is purely syntactic — 157 - they produce identical IR. 152 + The distinction between strong and weak edges is currently purely syntactic: they produce identical IR. Future iterations of the OR-1 will execute a series of strong edges as a pseudo-sequential block. 158 153 159 154 ### Function Definition 160 155 ··· 172 167 } 173 168 174 169 175 - Labels (`&name`) inside a function are scoped to that function. You 176 - cannot reference `&sub1` from outside `$fib`. Internally, the assembler 177 - qualifies the name as `$fib.&sub1`. 170 + Labels (`&name`) inside a function are scoped to that function. You cannot reference `&sub1` from outside `$fib`. Internally, the assembler qualifies the name as `$fib.&sub1`. 178 171 179 - Node references (`@name`) are always global — they can be referenced 180 - from anywhere. 172 + > Node references (`@name`) are always global, and can be referenced from anywhere. 181 173 182 174 ### Data Definition 183 175 184 - Initialises a structure memory cell before execution begins: 176 + Initializes a structure memory cell before execution begins: 185 177 186 178 ```dfasm 187 179 @data|sm0:5 = 0x42 ; SM 0, cell 5, value 0x42 ··· 189 181 @msg|sm1:10 = "hello" ; string chars as packed 16-bit words 190 182 192 - Data definitions require SM placement (`|smN`) and a cell address 193 - (`:N`). The assembler translates these into SM write tokens during 194 - bootstrap. 184 + Data definitions require SM placement (`|smN`) and a cell address (`:N`). The assembler translates these into SM write tokens during bootstrap, if placed in RAM, or into a text section of the ROM image if placed in ROM.
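The string form of a data definition packs characters two to a 16-bit word, big-endian, per the packing rules under Literals. A sketch; zero-padding the low byte of an odd trailing character is an assumption:

```python
def pack_chars(chars):
    # Big-endian 16-bit packing: 'h','i' -> 0x6869. An odd trailing
    # char occupies the high byte; low-byte zero padding is assumed.
    words = []
    for i in range(0, len(chars), 2):
        hi = ord(chars[i])
        lo = ord(chars[i + 1]) if i + 1 < len(chars) else 0
        words.append((hi << 8) | lo)
    return words
```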
195 185 196 186 ### Location Directive 197 187 ··· 203 193 &b <| add 204 194 ``` 205 195 206 - Statements following a location directive are collected into that 207 - location's scope until the next function or location directive. 196 + Statements following a location directive are collected into that location's scope until the next function or location directive. 208 197 209 198 ## Opcodes 210 199 211 200 ### Arithmetic (dyadic unless noted) 212 201 213 - | Mnemonic | Arity | Description | 214 - |----------|-------|-------------| 215 - | `add` | dyadic | L + R | 216 - | `sub` | dyadic | L − R | 217 - | `inc` | monadic | data + 1 | 218 - | `dec` | monadic | data − 1 | 219 - | `shiftl` | monadic | shift left by const+1 bits | 220 - | `shiftr` | monadic | logical shift right by const+1 bits | 221 - | `ashiftr` | monadic | arithmetic shift right by const bits | 202 + | Mnemonic | Arity | Description | 203 + | --------- | ------- | ------------------------------- | 204 + | `add` | dyadic | L + R | 205 + | `sub` | dyadic | L − R | 206 + | `inc` | monadic | data + 1 | 207 + | `dec` | monadic | data − 1 | 208 + | `shiftl` | monadic | shift left by 1 bit | 209 + | `shiftr` | monadic | logical shift right by 1 bit | 210 + | `ashiftr` | monadic | arithmetic shift right by 1 bit | 222 211 223 212 ### Logical 224 213 225 - | Mnemonic | Arity | Description | 226 - |----------|-------|-------------| 227 - | `and` | dyadic | bitwise AND | 228 - | `or` | dyadic | bitwise OR | 229 - | `xor` | dyadic | bitwise XOR | 230 - | `not` | monadic | bitwise NOT | 214 + | Mnemonic | Arity | Description | 215 + | -------- | ------- | ----------- | 216 + | `and` | dyadic | bitwise AND | 217 + | `or` | dyadic | bitwise OR | 218 + | `xor` | dyadic | bitwise XOR | 219 + | `not` | monadic | bitwise NOT | 231 221 232 222 ### Comparison (dyadic, produce bool_out) 233 223 234 - | Mnemonic | Description | 235 - |----------|-------------| 236 - | `eq` | L == R | 237 - | `lt` | L < R
(signed) | 238 - | `lte` | L ≤ R (signed) | 239 - | `gt` | L > R (signed) | 240 - | `gte` | L ≥ R (signed) | 224 + | Mnemonic | Description | 225 + | -------- | -------------- | 226 + | `eq` | L == R | 227 + | `lt` | L < R (signed) | 228 + | `lte` | L ≤ R (signed) | 229 + | `gt` | L > R (signed) | 230 + | `gte` | L ≥ R (signed) | 241 231 242 - Comparison results are signed 2's complement interpretation of 16-bit 243 - values. 232 + Comparison results use a signed 2's complement interpretation of 16-bit values. 244 233 245 234 ### Routing / Switching / Branching (dyadic) 246 235 247 - These operations route tokens based on a comparison result. They are 248 - all dyadic — they compare L and R, then route accordingly. 236 + These operations route tokens based on a comparison result. They are all dyadic — they compare L and R, then route accordingly. 249 237 250 - **Branch operations** (`br*`): emit data to `dest_l` (taken) or 251 - `dest_r` (not taken) based on comparison: 238 + **Branch operations** (`br*`): emit data to `dest_l` (taken) or `dest_r` (not taken) based on comparison: 252 239 253 - | Mnemonic | Condition | 254 - |----------|-----------| 255 - | `breq` | L == R | 256 - | `brgt` | L > R | 257 - | `brge` | L ≥ R | 258 - | `brof` | overflow | 259 - | `brty` | type match | 240 + | Mnemonic | Condition | 241 + | -------- | ---------- | 242 + | `breq` | L == R | 243 + | `brgt` | L > R | 244 + | `brge` | L ≥ R | 245 + | `brof` | overflow | 246 + | `brty` | type match | 260 247 261 - NOTE: `br*` ops use predicate register and internal-to-PE loopback route 262 - if that hardware is implemented. 248 + > NOTE: 249 + > `br*` ops use the predicate register and an internal-to-PE loopback route if supported by hardware. 263 250 264 - **Switch operations** (`sw*`): like branch, but when the condition is 265 - true, data goes to `dest_l` and a trigger token (value 0) goes to 266 - `dest_r`.
When false, trigger goes to `dest_l` and data goes to 267 - `dest_r`: 251 + **Switch operations** (`sw*`): like branch, but when the condition is true, data goes to `dest_l` and a trigger token (value 0) goes to `dest_r`. 252 + When false, trigger goes to `dest_l` and data goes to `dest_r`: 268 253 269 254 | Mnemonic | Condition | 270 255 |----------|-----------| ··· 276 261 277 262 **Other routing:** 278 263 279 - | Mnemonic | Arity | Description | 280 - |----------|-------|-------------| 281 - | `gate` | dyadic | pass data through if bool_out is true, suppress if false | 282 - | `sel` | dyadic | select between inputs | 283 - | `merge` | dyadic | merge two inputs | 264 + | Mnemonic | Arity | Description | 265 + | -------- | ------ | -------------------------------------------------------- | 266 + | `gate` | dyadic | pass data through if bool_out is true, suppress if false | 267 + | `sel` | dyadic | select between inputs | 268 + | `merge` | dyadic | merge two inputs | 284 269 285 - ### Data (monadic) 270 + ### Data 286 271 287 - | Mnemonic | Description | 288 - |----------|-------------| 289 - | `pass` | pass data through unchanged | 290 - | `const` | emit constant value (from const field) | 291 - | `free_ctx` | deallocate context slot, no data output | 272 + | Mnemonic | Arity | Description | 273 + | ---------- | ------- | --------------------------------------- | 274 + | `pass` | monadic | pass data through unchanged | 275 + | `const` | monadic | emit constant value (from const field) | 276 + | `free_ctx` | monadic | deallocate context slot, no data output | 277 + | `call` | dyadic | | 292 278 293 279 - `free_ctx` in particular is a special token used to handle function body and loop exits. 
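The switch-op routing convention described above reduces to one line per outcome. A sketch of the pair of values emitted to the two destinations:

```python
def sw_route(cond: bool, data: int):
    # Condition true: data to dest_l, trigger token (0) to dest_r.
    # Condition false: trigger to dest_l, data to dest_r.
    # Returns (dest_l_value, dest_r_value).
    return (data, 0) if cond else (0, data)
```

For the `sweq` example later in the primer (`val == cmp`, both 5), this yields the data on the taken side and the trigger on the other.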
294 280 295 281 ### Structure Memory 296 282 297 - | Mnemonic | Arity | Description | 298 - |----------|-------|-------------| 299 - | `read` | monadic | read from SM cell (const = cell address) | 300 - | `write` | context-dependent | write to SM cell — monadic if const is set (cell addr from const), dyadic if const is None (cell addr from L operand) | 301 - | `clear` | monadic | clear SM cell | 302 - | `alloc` | monadic | allocate SM cell | 303 - | `free` | monadic | free SM cell | 304 - | `rd_inc` | monadic | atomic read-and-increment | 305 - | `rd_dec` | monadic | atomic read-and-decrement | 306 - | `cmp_sw` | monadic | compare-and-swap | 307 - 308 - Note: `free_ctx` (ALU context deallocation) and `free` (SM cell free) 309 - are disambiguated by mnemonic — `free_ctx` maps to `RoutingOp.FREE_CTX` 310 - while `free` maps to `MemOp.FREE`. 311 - 283 + | Mnemonic | Arity | Description | 284 + | -------- | ----------------- | --------------------------------------------------------------------------------------------------------------------- | 285 + | `read` | monadic | read from SM cell (const = cell address) | 286 + | `write` | context-dependent | write to SM cell — monadic if const is set (cell addr from const), dyadic if const is None (cell addr from L operand) | 287 + | `clear` | monadic | clear SM cell | 288 + | `alloc` | monadic | allocate SM cell | 289 + | `free` | monadic | free SM cell | 290 + | `rd_inc` | monadic | atomic read-and-increment | 291 + | `rd_dec` | monadic | atomic read-and-decrement | 292 + | `cmp_sw` | monadic | compare-and-swap | 312 293 ### Configuration / System 313 294 314 295 | Mnemonic | Description | ··· 319 300 | `iow` | I/O write | 320 301 | `iorw` | I/O read-write | 321 302 322 - These are rarely written by hand — `load_inst` and `route_set` are 323 - generated by the assembler's token stream mode during bootstrap. 
303 + These are rarely written by hand — `load_inst` and `route_set` are generated by the assembler's token stream mode during bootstrap. 324 304 325 305 ## Literals 326 306 ··· 336 316 **Escape sequences** (in regular strings and char literals): 337 317 `\n`, `\t`, `\r`, `\0`, `\\`, `\'`, `\"`, `\xHH` 338 318 339 - **Multi-char packing:** when multiple char values appear in a data 340 - definition, they are packed big-endian into 16-bit words: 319 + **Multi-char packing:** when multiple char values appear in a data definition, they are packed big-endian into 16-bit words: 341 320 342 321 ```vhdl 343 322 @data|sm0:0 = 'h', 'i' ; → 0x6869 (h=0x68 in high byte, i=0x69 in low) ··· 347 326 348 327 ## Complete Example 349 328 350 - A simple program that adds two constants and routes the result 351 - across PEs: 329 + A simple program that adds two constants and routes the result across PEs: 352 330 353 331 ```vhdl 354 332 ; Hardware: 2 PEs, no structure memory ··· 368 346 369 347 **What happens at runtime:** 370 348 371 - 1. The assembler emits two seed tokens (for `&c1` and `&c2`) since 372 - they are `CONST` nodes with no incoming edges. 373 - 2. Both tokens arrive at PE 0's matching store. `&result` is a dyadic 374 - instruction — it waits for both operands. 375 - 3. When both arrive, the matching store pairs them. The left operand 376 - (3) and right operand (7) feed the ALU. 377 - 4. The ALU computes `3 + 7 = 10` and emits a token to `&output` on 378 - PE 1. 379 - 5. `&output` is monadic (`pass`) — it bypasses the matching store and 380 - immediately emits the value 10. 349 + 1. The assembler emits two seed tokens (for `&c1` and `&c2`) since they are `CONST` nodes with no incoming edges. 350 + 2. Both tokens arrive at PE 0's matching store. `&result` is a dyadic instruction — it waits for both operands. 351 + 3. When both arrive, the matching store pairs them. The left operand (3) and right operand (7) feed the ALU. 352 + 4. 
The ALU computes `3 + 7 = 10` and emits a token to `&output` on PE 1. 353 + 5. `&output` is monadic (`pass`) — it bypasses the matching store and immediately emits the value 10. 381 354 382 355 ## Structure Memory Example 383 356 ··· 433 406 &not_taken |> &output:R 434 407 ``` 435 408 436 - Since `val == cmp` (both 5), `sweq` evaluates to true: data (5) goes 437 - to `dest_l` (taken) and a trigger token (0) goes to `dest_r` 438 - (not_taken). 409 + Since `val == cmp` (both 5), `sweq` evaluates to true: data (5) goes to `dest_l` (taken) and a trigger token (0) goes to `dest_r` (not_taken). 439 410 440 411 ## Auto-Placement 441 412 442 - Nodes without explicit `|peN` qualifiers are automatically placed by 443 - the assembler: 413 + Nodes without explicit `|peN` qualifiers are automatically placed by the assembler: 444 414 445 415 ```vhdl 446 416 @system pe=3, sm=0 ··· 455 425 &result |> &output:L 456 426 ``` 457 427 458 - The assembler's greedy placer assigns PEs based on connectivity — nodes 459 - connected by edges prefer to share a PE (minimising cross-PE traffic). 460 - The result is functionally identical to explicit placement. 428 + The assembler's greedy placer assigns PEs based on connectivity. Nodes connected by edges prefer to share a PE (minimizing cross-PE traffic). The result is functionally identical to explicit placement. 461 429 462 430 ## From Source to Execution 463 431 464 432 ### Lowering and Resolution 465 433 466 - After parsing, the assembler lowers the CST to an intermediate 467 - representation (`IRGraph`). Names are qualified, scopes are created, 468 - and edges are validated. The resolve pass checks that every edge 469 - endpoint exists and produces suggestions for typos. 434 + After parsing, the assembler lowers the CST to an intermediate representation (`IRGraph`). Names are qualified, scopes are created, and edges are validated. The resolve pass checks that every edge endpoint exists and produces suggestions for typos. 
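The typo suggestions from the resolve pass can be had almost for free from the standard library. A sketch only; the real pass's error types and matching thresholds will differ:

```python
import difflib

def check_endpoint(name, known):
    # Unknown edge endpoints get a "did you mean" hint built from
    # close matches over the known node names.
    if name in known:
        return None
    hits = difflib.get_close_matches(name, known, n=1, cutoff=0.6)
    hint = f" (did you mean {hits[0]!r}?)" if hits else ""
    return f"unknown node {name!r}{hint}"
```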
470 435 471 436 ### Placement and Allocation 472 437 473 - Unplaced nodes get PE assignments. Then the allocator assigns each node 474 - an IRAM offset and context slot. Dyadic instructions are packed at low 475 - IRAM offsets (0..D-1), monadic above (D..D+M-1). This layout matches 476 - the hardware contract: the token's offset field doubles as the matching 477 - store entry for dyadic instructions. 438 + Unplaced nodes get PE assignments. Then the allocator assigns each node an IRAM offset and context slot. Dyadic instructions are packed at low IRAM offsets (0..D-1), monadic above (D..D+M-1). This layout matches the hardware contract: the token's offset field doubles as the matching store entry for dyadic instructions. 478 439 479 - Context slots are assigned per function scope per PE — each function 480 - body sharing a PE gets its own context slot, enabling concurrent 481 - activations to coexist without operand interference. 440 + Context slots are assigned per function scope per PE. Each function body sharing a PE gets its own context slot, enabling concurrent activations to coexist without operand interference. 482 441 483 442 ### Code Generation 484 443 485 - The assembler offers two output modes: 444 + The assembler currently offers two output modes: 486 445 487 - **Direct mode** produces `PEConfig` objects (IRAM contents, route 488 - restrictions, context slot count) and `SMConfig` objects (initial cell 489 - values), plus seed tokens. This is the fast path for the emulator — 490 - configuration is applied directly. 491 - 492 - **Token stream mode** produces a bootstrap sequence: SM initialisation 493 - writes, route configuration tokens, instruction load tokens, then seed 494 - tokens. This mirrors the hardware bootstrap protocol — the same 495 - sequence that the I/O controller would emit over the network to 496 - configure a physical system. 
446 + **Direct mode** produces `PEConfig` objects (IRAM contents, route restrictions, context slot count) and `SMConfig` objects (initial cell values), plus seed tokens. This is the fast path for the emulator. Configuration is applied directly. 497 447 498 - Both modes produce identical execution results. The token stream mode 499 - exists because it validates the end-to-end bootstrap path that real 500 - hardware will use. 448 + **Token stream mode** produces a bootstrap sequence: SM initialization writes, route configuration tokens, instruction load tokens, then seed tokens. This mirrors the bootstrap process, loading the code stored at the reset vector.
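The dyadic-first invariant from Placement and Allocation is easy to sketch: dyadic instructions take offsets 0..D-1 so a token's offset field doubles as its matching store entry index, and monadic instructions follow above. The real allocator also assigns context slots; this toy version only orders offsets:

```python
def assign_offsets(nodes):
    # nodes: list of (name, is_dyadic). Returns {name: iram_offset}.
    dyadic = [n for n, d in nodes if d]
    monadic = [n for n, d in nodes if not d]
    offsets = {n: i for i, n in enumerate(dyadic)}            # 0..D-1
    offsets.update({n: len(dyadic) + i                        # D..D+M-1
                    for i, n in enumerate(monadic)})
    return offsets
```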
+658
design-notes/iram-and-function-calls.md
··· 1 + # IRAM Format, SM Operations, and Function Call Design 2 + 3 + Covers the instruction memory word format, SM operation encoding, flit 1 4 + bit layout, and function call/return primitives. 5 + 6 + See `pe-design.md` for overall PE pipeline and matching store. 7 + See `alu-and-output-design.md` for ALU operation set and output formatter. 8 + See `bus-architecture-and-width-decoupling.md` for bus-level token format. 9 + See `sm-design.md` for SM internals and I-structure semantics. 10 + 11 + --- 12 + 13 + ## Design Context 14 + 15 + The IRAM word encodes everything the PE needs to execute an instruction 16 + and form output tokens: ALU operation, operand source, output routing, 17 + context management, and SM bus commands. The format must accommodate both 18 + CM compute instructions and SM memory operations within a fixed-width 19 + word. 20 + 21 + ### Key Constraints 22 + 23 + - **32-bit effective width.** Two 8-bit SRAM chips read in two cycles 24 + ("half 0" and "half 1"). This keeps per-PE SRAM chip count low (2 25 + chips for IRAM data, vs 4 for 32-bit parallel or 6 for 48-bit). 26 + - **Two-cycle read overlaps with ALU.** Half 0 is read in cycle N and 27 + feeds the decoder/ALU immediately. Half 1 is read in cycle N+1 and 28 + latched for Stage 5 (output formatter). The ALU executes during the 29 + half 1 read, so no pipeline bubble is introduced. 30 + - **SM bus flit must be emittable without opcode translation.** The PE 31 + does not interpret SM bus opcodes semantically. For const-addressed 32 + SM ops, IRAM supplies the address; for ptr-addressed ops, token data 33 + supplies the address. In both cases, the SM bus opcode bits on the 34 + wire come from the decoder EEPROM's positional mapping, not from 35 + runtime interpretation of the SM command. 36 + - **const:8 feeds only the ALU.** The 8-bit immediate constant lives 37 + entirely in half 0 and is available to the ALU on the first read 38 + cycle. It is never split across halves. 
39 + 40 + ### IRAM Addressing 41 + 42 + ``` 43 + IRAM address = [offset:7][half:1] = 8 bits 44 + half 0: opcode + control + const/params 45 + half 1: destinations (CM) or SM bus data / return routing (SM) 46 + ``` 47 + 48 + 128 instruction slots per PE. Each slot occupies 2 consecutive SRAM 49 + addresses (half 0 at even, half 1 at odd, or equivalently low bit 50 + selects half). Total SRAM usage: 256 bytes per chip, 512 bytes per PE 51 + across the two chips. Each fits comfortably in a 32Kx8 SRAM chip with address space to spare. 52 + 53 + --- 54 + 55 + ## Flit 1 Bit Layout (Bus Token Format) 56 + 57 + All CM token types share a field-aligned layout enabling format-agnostic 58 + hardware operations on ctx and PE fields. 59 + 60 + ``` 61 + DYADIC WIDE: [0][0][port:1][PE:2][gen:2][offset:5][ctx:4] 62 + 15 14 13 12-11 10-9 8-4 3-0 63 + 64 + MONADIC NORM: [0][1][0][PE:2][offset:7][ctx:4] 65 + 15 14 13 12-11 10-4 3-0 66 + 67 + DYADIC NARROW: [0][1][1][PE:2][0][0][offset:5][ctx:4] 68 + 15 14 13 12-11 10 9 8-4 3-0 69 + 70 + MONADIC INLINE: [0][1][1][PE:2][1][0][spare:1][offset:4][ctx:4] 71 + 15 14 13 12-11 10 9 8 7-4 3-0 72 + 73 + IRAM WRITE: [0][1][1][PE:2][0][1][flags:2][iram_addr:7] 74 + 15 14 13 12-11 10 9 8-7 6-0 75 + 76 + SM: [1][SM_id:2][op:3-5][addr:8-10] 77 + 15 14-13 varies 78 + ``` 79 + 80 + ### Field Alignment Invariants 81 + 82 + - **ctx** is always bits [3:0] on all standard CM token types. Hardware 83 + that patches ctx (override, CHANGE_TAG) operates on a fixed 4-wire 84 + position regardless of token format. 85 + - **PE** is always bits [12:11] on all CM token types. Routing checks 86 + and PE_id comparison use the same 2-bit position universally. 87 + - **offset** low 5 bits are [8:4] on dyadic wide, dyadic narrow, and 88 + the lower portion of monadic normal's 7-bit offset [10:4]. Monadic 89 + inline uses [7:4] (4-bit offset) due to tighter encoding. 90 + - **Matching store address** for dyadic tokens = bits [8:0] = 91 + [offset:5][ctx:4].
Nine contiguous bits wired directly to matching 92 + store SRAM address pins. No glue logic. 93 + 94 + ### Misc Bucket Sub-Type Decode 95 + 96 + Tokens with prefix [011] (bits [15:13]) are discriminated by bits [10:9]: 97 + 98 + ``` 99 + [10:9] = 00 → dyadic narrow 100 + [10:9] = 01 → IRAM write 101 + [10:9] = 10 → monadic inline 102 + [10:9] = 11 → reserved / spare 103 + ``` 104 + 105 + ### dest_type Derivation 106 + 107 + The output token format (dyadic wide vs monadic normal vs monadic inline) 108 + is derived from context rather than stored per-destination in IRAM: 109 + 110 + - SWITCH not-taken cycle: always monadic inline (hardwired in formatter) 111 + - offset < 32: dyadic wide (offset bit 5 or higher is clear) 112 + - offset >= 32: monadic normal 113 + 114 + This eliminates 2 bits of per-destination IRAM storage. Dyadic narrow 115 + output is deferred to v1; all dyadic targets receive dyadic wide tokens. 116 + 117 + --- 118 + 119 + ## IRAM Word Format 120 + 121 + ### CM Compute (half 0 bit 15 = 0) 122 + 123 + ``` 124 + ════════════════════════════════════════════════════════════════ 125 + HALF 0 — feeds decoder + ALU on read cycle 1 126 + ════════════════════════════════════════════════════════════════ 127 + 128 + [0][opcode:5][ctx_mode:2][const:8] 129 + 15 14-10 9-8 7-0 130 + ``` 131 + 132 + **opcode:5** — 32 slots. Selects ALU function, arity, output behaviour. 133 + Decoded by EEPROM into control signals. See `alu-and-output-design.md` 134 + for the operation set. 135 + 136 + **ctx_mode:2** — controls output token context source: 137 + 138 + ``` 139 + 00 = INHERIT ctx and gen in output tokens come from pipeline latches 140 + (inherited from the executing token's context). 141 + const:8 is an 8-bit ALU immediate. 142 + 01 = CTX_OVRD ctx and gen in output tokens come from const:8, 143 + reinterpreted as [ctx:4][gen:2][spare:2]. 144 + Used for static cross-context calls. 
145 + 10 = CHG_TAG output flit 1 comes entirely from left operand data 146 + (16-bit packed tag value). const:8 ignored. 147 + Used for dynamic function calls and returns. 148 + 11 = RESERVED future use. 149 + ``` 150 + 151 + **const:8** — 8-bit immediate. Interpretation depends on ctx_mode: 152 + 153 + ``` 154 + ctx_mode 00: ALU immediate operand (0-255, or signed -128..+127) 155 + ctx_mode 01: [ctx:4][gen:2][spare:2] — context override for outputs 156 + ctx_mode 10: ignored (routing from data in CHANGE_TAG mode) 157 + ``` 158 + 159 + ``` 160 + ════════════════════════════════════════════════════════════════ 161 + HALF 1 — latched for Stage 5, read overlaps with ALU execution 162 + ════════════════════════════════════════════════════════════════ 163 + 164 + Bit 15 = has_dest2 (single vs dual destination) 165 + 166 + ────────────────────────────────────── 167 + Single destination (has_dest2 = 0): 168 + 169 + [0][dest1_PE:2][dest1_offset:5][dest1_port:1][const_ext:7] 170 + 15 14-13 12-8 7 6-0 171 + 172 + Full 5-bit offset for dest1 (covers dyadic range 0-31). 173 + const_ext:7 extends half 0's const:8 for multi-purpose use: 174 + - CONST16 opcode: const_ext:7 + const:8 = 15-bit immediate. 175 + Sufficient for any CM flit 1 tag (bit 15 is always 0 for CM). 176 + - Future: wider offset (7-bit monadic targets), predicate 177 + store fields, extended flags. 178 + - Interpretation selected by opcode via decoder EEPROM. 179 + 180 + ────────────────────────────────────── 181 + Dual destination (has_dest2 = 1): 182 + 183 + [1][dest1_PE:2][dest1_offset:5][dest1_port:1][dest2_PE:2][dest2_offset:5] 184 + 15 14-13 12-8 7 6-5 4-0 185 + 186 + Both offsets limited to 5 bits (dyadic range 0-31). 
187 + dest2_port derived from convention: 188 + DUAL mode: same as dest1_port 189 + SWITCH mode: opposite of dest1_port (or irrelevant for 190 + monadic inline not-taken trigger) 191 + ``` 192 + 193 + ### SM Operation (half 0 bit 15 = 1) 194 + 195 + ``` 196 + ════════════════════════════════════════════════════════════════ 197 + HALF 0 — SM instruction header 198 + ════════════════════════════════════════════════════════════════ 199 + 200 + [1][sm_opcode:5][ctx_mode:2][const_addr_or_ret:8] 201 + 15 14-10 9-8 7-0 202 + ``` 203 + 204 + **sm_opcode:5** — PE-internal SM operation code. The decoder EEPROM maps 205 + this to: 206 + 207 + - SM bus wire opcode bits (3-bit or 5-bit, depending on the operation) 208 + - Arity signal (monadic vs dyadic, for matching store bypass) 209 + - Flit 2 content signal (return routing vs data vs packed operands) 210 + - Address source signal (token data vs IRAM const) 211 + 212 + The PE does not interpret SM bus opcodes semantically. The EEPROM 213 + performs positional mapping only. 214 + 215 + **ctx_mode:2** — same semantics as CM compute. Controls whether return 216 + routing uses inherited ctx/gen (mode 00) or overridden ctx/gen (mode 01). 217 + Mode 10 (CHANGE_TAG) not applicable to SM ops. 218 + 219 + **const_addr_or_ret:8** — dual interpretation based on sm_opcode: 220 + 221 + ``` 222 + Ptr-addressed, result-returning (SM_READ, SM_RMW): 223 + [ret_PE:2][ret_offset:5][ret_port:1] 224 + Return routing assembled with pipeline ctx/gen. 5-bit ret_offset. 225 + 226 + Ptr-addressed, non-returning (SM_WRITE): 227 + Don't care. 228 + 229 + Const-addressed ops (SM_READ_C, SM_WRITE_C, SM_RMW_C, SM_CAS): 230 + [const_addr:8] 231 + Direct 8-bit address into the target SM's address space. 232 + 233 + CMD ops (SM_EXEC, SM_CMD): 234 + [param:8] 235 + Operation-specific parameter (EXEC count, page value, etc.) 
236 + ``` 237 + 238 + ``` 239 + ════════════════════════════════════════════════════════════════ 240 + HALF 1 — SM supplementary data (interpretation varies) 241 + ════════════════════════════════════════════════════════════════ 242 + 243 + Ptr-addressed ops (SM_READ, SM_WRITE, SM_RMW): 244 + Don't care. Address comes from token data at runtime. 245 + 246 + Const-addressed, result-returning (SM_READ_C, SM_RMW_C): 247 + [ret_PE:2][ret_offset:7][ret_port:1][SM_id:2][spare:4] 248 + 15-14 13-7 6 5-4 3-0 249 + 250 + Full 7-bit return offset (covers entire monadic range). 251 + SM_id specifies which structure memory to target. 252 + 253 + Const-addressed, non-returning (SM_WRITE_C): 254 + [SM_id:2][spare:14] — or don't care if SM_id is elsewhere. 255 + 15-14 13-0 256 + 257 + SM_CAS: 258 + [SM bus flit 1, verbatim: 16 bits] 259 + The hardcoded CAS target. Emitted directly to the bus. 260 + 261 + SM_EXEC / SM_CMD: 262 + [extended_params:16] 263 + Base address, configuration values, or other command parameters. 264 + ``` 265 + 266 + --- 267 + 268 + ## SM Operation Summary 269 + 270 + SM operations come in paired variants: pointer-addressed (address from 271 + token data, suffix-free) and const-addressed (address from IRAM const 272 + field, suffix `_C`). This avoids burning IRAM slots on per-address 273 + variants of common operations. 274 + 275 + ### Addressing Modes 276 + 277 + **Pointer-addressed:** The token's data value is a structure pointer: 278 + 279 + ``` 280 + Structure pointer (16 bits): 281 + [spare:4][SM_id:2][addr:10] 282 + 15-12 11-10 9-0 283 + ``` 284 + 285 + The pointer is a fat pointer embedding both the target SM identity and 286 + the cell address. Stage 5 extracts SM_id and addr from the token data to 287 + assemble the SM bus flit. One IRAM entry covers all addresses — array 288 + traversal, pointer chasing, and computed addressing all use the same 289 + instruction. 
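A behavioural sketch of the fat-pointer split (Python; the function name is illustrative, not part of the design):

```python
def decode_struct_ptr(ptr):
    """Split a 16-bit structure pointer [spare:4][SM_id:2][addr:10]."""
    sm_id = (ptr >> 10) & 0x3   # bits [11:10] select the target SM
    addr = ptr & 0x3FF          # bits [9:0] address a cell within it
    return sm_id, addr

# One IRAM entry covers all addresses: the same decode applies whether
# the pointer was loaded, computed, or incremented during traversal.
```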
290 + 291 + **Const-addressed:** The IRAM const field provides an 8-bit address 292 + (256 cells). SM_id comes from half 1. Used for fixed-location access: 293 + IO registers, lock words, call descriptor tables, configuration cells. 294 + 295 + ### Operation Table 296 + 297 + ``` 298 + sm_opcode mnemonic addr source arity flit 2 content result? 299 + ─────────────────────────────────────────────────────────────────────────────── 300 + 00000 SM_READ token data monadic return routing* yes 301 + 00001 SM_READ_C const monadic return routing** yes 302 + 00010 SM_WRITE token data dyadic right operand (data) no 303 + 00011 SM_WRITE_C const monadic token data (value) no 304 + 00100 SM_RMW token data monadic return routing* yes 305 + 00101 SM_RMW_C const monadic return routing** yes 306 + 00110 SM_CAS const (half1) dyadic return routing* yes 307 + 00111 SM_EXEC varies monadic return routing (opt) optional 308 + 01000 SM_CMD varies monadic varies no 309 + 01001- (reserved) 310 + 11111 311 + ``` 312 + 313 + `*` return routing assembled from half 0 ret field (5-bit offset) + 314 + pipeline ctx/gen. 315 + 316 + `**` return routing assembled from half 1 ret field (7-bit offset) + 317 + pipeline ctx/gen. SM_id also from half 1. 318 + 319 + ### SM Bus Flit Assembly 320 + 321 + Stage 5 assembles SM flit 1 from two possible sources, selected by the 322 + `addr_src` decoder signal: 323 + 324 + ``` 325 + Ptr-addressed: 326 + flit 1 = [1][token_data[11:10]][wire_opcode from EEPROM][token_data[9:0] or [7:0]] 327 + 328 + Const-addressed: 329 + flit 1 = [1][half1[5:4]][wire_opcode from EEPROM][half0[7:0]] 330 + (SM_id from half 1, address from half 0 const field) 331 + 332 + CAS (special): 333 + flit 1 = half 1 verbatim (pre-formed SM bus flit) 334 + ``` 335 + 336 + Hardware: SM_id mux (2-bit, 2:1) + addr mux (8-10-bit, 2:1) + hardwired 337 + `[1]` prefix + wire opcode from EEPROM. ~2 chips total. 
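Assuming the wire-opcode field packs the remaining bits (3-bit opcode with a 10-bit address, 5-bit opcode with an 8-bit address — consistent with the decoder EEPROM note earlier), the two assembly paths look like this in Python (field widths are an assumption, not committed layout):

```python
def sm_flit1_ptr(wire_op3, token_data):
    """Ptr-addressed: [1][SM_id:2][op:3][addr:10], fields from token data."""
    sm_id = (token_data >> 10) & 0x3
    addr = token_data & 0x3FF
    return (1 << 15) | (sm_id << 13) | (wire_op3 << 10) | addr

def sm_flit1_const(wire_op5, half0_const, half1):
    """Const-addressed: [1][SM_id:2][op:5][addr:8], SM_id from half1[5:4]."""
    sm_id = (half1 >> 4) & 0x3
    return (1 << 15) | (sm_id << 13) | (wire_op5 << 8) | (half0_const & 0xFF)
```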
338 + 339 + ### SM_WRITE Arity 340 + 341 + SM_WRITE (ptr-addressed) is dyadic: left operand = structure pointer, 342 + right operand = data value. The matching store synchronises pointer and 343 + data availability before the write fires. 344 + 345 + SM_WRITE_C (const-addressed) is monadic: the token's data value is the 346 + value to write, the address comes from IRAM const. No matching needed. 347 + 348 + ### SM_CAS 349 + 350 + CAS (compare-and-swap) always uses a const address from half 1 (verbatim 351 + SM bus flit). Both operands provide expected and new values. CAS is a 352 + 3-input operation; with dyadic matching limited to 2 inputs, one operand 353 + (the address) must be static. Lock words and atomic counters typically 354 + live at fixed addresses, making this the natural choice. 355 + 356 + ``` 357 + SM_CAS bus output: 358 + flit 1: half 1 verbatim (SM bus flit with hardcoded address) 359 + flit 2: return routing (from half 0 ret field + pipeline ctx/gen) 360 + flit 3: [left_operand[7:0]][right_operand[7:0]] (expected + new, 8-bit each) 361 + ``` 362 + 363 + --- 364 + 365 + ## Output Token Context Source (ctx_mode) 366 + 367 + All output tokens need ctx and gen values. Three sources, selected by 368 + ctx_mode in the IRAM word: 369 + 370 + ``` 371 + ctx_mode 00 (INHERIT): 372 + ctx = pipeline latch (inherited from executing token) 373 + gen = pipeline latch 374 + Default for same-context execution. Zero overhead. 375 + 376 + ctx_mode 01 (CTX_OVRD): 377 + ctx = IRAM const[7:4] 378 + gen = IRAM const[3:2] 379 + For static cross-context sends. Both destinations share 380 + the same overridden ctx/gen. Compile-time constant. 381 + 382 + ctx_mode 10 (CHG_TAG / CHANGE_TAG): 383 + Entire flit 1 = left operand data value (16 bits, verbatim). 384 + The packed tag IS flit 1. No field extraction or assembly. 385 + PE, offset, ctx, port, gen all come from the data value. 386 + Right operand becomes flit 2 (payload data). 
387 + ``` 388 + 389 + ### Pipeline Latches (Baseline Infrastructure) 390 + 391 + ctx (4 bits) and gen (2 bits) must survive from Stage 2 (matching / 392 + decode) through Stage 5 (output formation). This requires 6 bits of 393 + pipeline latches across 3 stage boundaries = ~18 flip-flops. Estimated 394 + 2-3 TTL chips. 395 + 396 + These latches are required for basic machine operation (INHERIT mode), 397 + not just for function calls. All other context modes (CTX_OVRD, 398 + CHANGE_TAG) add muxing on top of this baseline. 399 + 400 + ### Stage 5 Mux Structure 401 + 402 + ``` 403 + Tier 1: ctx/gen source select (ctx_mode 00 vs 01) 404 + Mux on 6 wires (ctx:4 + gen:2). ~1 chip. 405 + Select line from decoder EEPROM. 406 + 407 + Tier 2: flit 1 source select (assembled vs CHANGE_TAG bypass) 408 + Mux on 16 wires. ~2 chips. 409 + Select line from decoder EEPROM (ctx_mode == 10). 410 + ``` 411 + 412 + Tier 2 takes the entire flit 1 from the left operand bypass latch when 413 + CHANGE_TAG is active, bypassing all assembly logic. The packed tag format 414 + matches the flit 1 bit layout by design — no field rearrangement needed. 415 + 416 + ### CHANGE_TAG Hardware 417 + 418 + In addition to the Stage 5 mux (shared infrastructure): 419 + 420 + - **Left operand bypass latch:** 16-bit register that preserves the left 421 + operand value past the ALU (which would otherwise consume it). Loaded 422 + when the decoder signals a CHANGE_TAG-class opcode. ~2 chips. 423 + - **ALU behaviour:** right operand passes through as identity (becomes 424 + flit 2 payload). Left operand is not consumed by ALU. 425 + 426 + Total CHANGE_TAG-specific hardware: ~4 chips per PE (bypass latch + 427 + Stage 5 mux), on top of the ~3 chip baseline for pipeline latches. 428 + 429 + --- 430 + 431 + ## Function Call Design 432 + 433 + ### The Problem 434 + 435 + A function call in this architecture must solve: 436 + 437 + 1. **Code residency** — callee instructions in IRAM on the right PEs. 438 + 2. 
**Context isolation** — fresh context slot for the new activation. 439 + 3. **Argument injection** — N argument values tagged into callee's context. 440 + 4. **Return linkage** — callee knows where to send results. 441 + 5. **Context teardown** — free slot(s) when activation completes. 442 + 443 + ### Static Calls (v0) 444 + 445 + For non-recursive calls with compiler-known call graphs, all context 446 + assignments are compile-time constants. The compiler assigns ctx slots to 447 + function chunks when laying out IRAM. 448 + 449 + **Argument passing:** The caller's instructions producing argument values 450 + have their destination fields set to the callee's (PE, offset, ctx) with 451 + ctx_mode = 01 (CTX_OVRD). Arguments flow across context boundaries as 452 + normal token routing. No special call instructions. 453 + 454 + **Return:** For single-call-site functions, the callee's return 455 + instruction has its destination baked into IRAM, pointing back to the 456 + caller's (PE, offset, ctx) with ctx_mode = 01. No dynamic return address. 457 + 458 + **Multiple call sites:** If a function is called from N sites, the callee 459 + needs N return paths. Options: 460 + 461 + - Per-call-site return trampolines in IRAM (duplicate the return 462 + instruction with different destinations). Burns IRAM slots. 463 + - CHANGE_TAG for returns (see Dynamic Calls below). 464 + - The compiler can often avoid the problem by inlining small functions. 465 + 466 + **Context allocation:** Compile-time. The compiler assigns non-overlapping 467 + ctx slot ranges to functions based on the call graph. The generation 468 + counter provides ABA protection if slots are reused across non-overlapping 469 + lifetimes. 470 + 471 + **Context teardown:** Lazy generation invalidation (no explicit FREE_CTX 472 + required). 
When a slot is reused with an incremented generation counter, 473 + stale presence bits are cleared on first access: 474 + 475 + ``` 476 + Token arrives at matching store cell: 477 + presence == 0: → store operand, set presence (normal) 478 + presence == 1, gen matches: → match found, read partner (normal) 479 + presence == 1, gen mismatch: → stored value is stale, overwrite 480 + with new operand, update stored gen 481 + ``` 482 + 483 + Hardware cost: one gate changing the write-enable condition on gen 484 + mismatch. The gen comparison already exists. 2-bit gen wraps after 4 485 + generations; for v0 workloads this is sufficient. 3-bit gen eliminates 486 + wraparound risk if needed. 487 + 488 + **Total overhead for static calls: zero extra instructions.** Function 489 + calls are just IRAM destination configuration. 490 + 491 + ### Dynamic Calls (v1) 492 + 493 + For recursive calls, indirect calls (function pointers, trait objects), 494 + and functions with multiple call sites that need dynamic return routing. 495 + 496 + **New primitives required:** 497 + 498 + | Primitive | Type | Hardware | Purpose | 499 + |-----------|------|----------|---------| 500 + | CHANGE_TAG | dyadic CM | ~4 chips/PE | Output routing from data operand | 501 + | EXTRACT_TAG | monadic CM | ~2 chips/PE | Capture runtime ctx+gen as data | 502 + | SM_READ_C | monadic SM | shared with SM | Fetch call descriptors | 503 + | SM-based alloc | SM READ_INC | 0 PE chips | Runtime context slot allocation | 504 + 505 + **CHANGE_TAG (ctx_mode = 10):** Left operand is a 16-bit packed tag 506 + (a pre-formed flit 1 value). Right operand is the data payload. Output 507 + token's flit 1 = left operand verbatim. Flit 2 = right operand. Enables 508 + sending a value to any destination computed at runtime. 509 + 510 + **EXTRACT_TAG:** Monadic instruction. Captures the executing token's 511 + context information as a 16-bit data value (a return continuation). 
The 512 + return offset comes from an IRAM immediate field; PE_id from hardware; 513 + ctx and gen from pipeline latches. Output is a packed flit 1 value that 514 + can be passed to CHANGE_TAG by the callee to route results back. 515 + 516 + **Call descriptor tables:** Pre-formed flit 1 values for callee argument 517 + destinations, stored in SM at boot (loaded via EXEC). The caller fetches 518 + descriptors via SM_READ_C (const-addressed, monadic, fast) and feeds 519 + them to CHANGE_TAG. 520 + 521 + **Runtime context allocation:** An SM cell serves as an atomic counter. 522 + The caller issues SM_RMW (READ_INC) on the counter; the returned value 523 + (mod N) is the new context slot ID, used on all PEs the callee touches. 524 + Zero PE hardware. SM round-trip latency is acceptable for function call 525 + setup. 526 + 527 + ### Dynamic Call Sequence 528 + 529 + ``` 530 + Caller (PE0, ctx=3) calls foo(a, b) → result dynamically: 531 + 532 + SM_READ_C(tag_table + 0) → tag_arg0 ; fetch packed flit 1 for arg 0 533 + SM_READ_C(tag_table + 1) → tag_arg1 ; fetch packed flit 1 for arg 1 534 + SM_READ_C(tag_table + 2) → tag_ret_dest ; where callee sends ret_cont 535 + EXTRACT_TAG → ret_cont ; pack (PE0, ctx=3, ret_offset, gen) 536 + CHANGE_TAG(tag_arg0, a) → arg 0 to callee 537 + CHANGE_TAG(tag_arg1, b) → arg 1 to callee 538 + CHANGE_TAG(tag_ret_dest, ret_cont) → return continuation to callee 539 + 540 + Callee (receives args + ret_cont via normal matching): 541 + ; ... compute result ... 542 + CHANGE_TAG(ret_cont, result) → routes result back to caller 543 + ``` 544 + 545 + SM_READ_C operations can be pipelined. CHANGE_TAG operations are 546 + independent and can fire in parallel once their operands arrive. 547 + Effective critical path: SM read latency + one CHANGE_TAG. Comparable 548 + to a conventional function call with register setup + jump. 
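The same sequence as a behavioural model (tags here are plain destination records rather than packed 16-bit flits; all names are illustrative):

```python
def change_tag(tag, value):
    """CHANGE_TAG: output flit 1 comes from the tag operand, flit 2 is data."""
    return {"dest": tag, "data": value}

def caller_emit(tag_arg0, tag_arg1, tag_ret_dest, ret_cont, a, b):
    """Token sequence the caller emits for foo(a, b)."""
    return [change_tag(tag_arg0, a),
            change_tag(tag_arg1, b),
            change_tag(tag_ret_dest, ret_cont)]

def callee_foo(tokens):
    """Toy callee foo(a, b) = a + b; returns via the received continuation."""
    by_dest = {t["dest"]: t["data"] for t in tokens}
    result = by_dest["foo.arg0"] + by_dest["foo.arg1"]
    return change_tag(by_dest["foo.ret_slot"], result)
```

The point the model makes: the callee never knows its caller statically — the return path is just another data value it feeds to CHANGE_TAG.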
549 + 550 + ### Partial Execution 551 + 552 + The dataflow execution model supports Amamiya-style partial function 553 + execution naturally. If the callee's argument entry points are 554 + independent instructions (not a single multi-input "begin" node), 555 + arguments arriving early begin executing the callee's body before all 556 + arguments are present. No special hardware support — the compiler 557 + structures the callee's dataflow graph to expose this parallelism. 558 + 559 + ### Tail Calls 560 + 561 + If the callee reuses the caller's context slot (no allocation, no 562 + generation increment), the call is a tail call. The compiler simply 563 + routes arguments with the inherited ctx. No CHANGE_TAG needed, no 564 + allocation, no teardown. Falls out of ctx_mode = 00 (INHERIT) naturally. 565 + 566 + --- 567 + 568 + ## 15-bit Immediate Constants (CONST16) 569 + 570 + Single-output instructions (has_dest2 = 0) can repurpose half 1's 571 + const_ext:7 field to extend the 8-bit const from half 0: 572 + 573 + ``` 574 + CONST16 output value = [const_ext:7][const:8] = 15 bits 575 + ``` 576 + 577 + 15 bits covers any CM flit 1 tag value (bit 15 is always 0 for CM 578 + tokens). This enables producing packed tag constants for CHANGE_TAG 579 + in a single instruction without SM_READ_C. 580 + 581 + The decoder EEPROM selects the "wide const" interpretation based on the 582 + CONST16 opcode. No additional hardware — the concatenation is wiring. 
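The concatenation, spelled out (Python sketch):

```python
def const16(const_ext7, const8):
    """CONST16 value: half 1's const_ext:7 concatenated over half 0's const:8."""
    assert 0 <= const_ext7 < 128 and 0 <= const8 < 256
    return (const_ext7 << 8) | const8   # 15 bits; bit 15 stays 0 (CM tag range)
```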
583 + 584 + --- 585 + 586 + ## Hardware Cost Summary 587 + 588 + ### Baseline (required for any token output) 589 + 590 + | Component | Chips/PE | Purpose | 591 + |-----------|----------|---------| 592 + | Pipeline latches (ctx:4 + gen:2) | 2-3 | ctx/gen survival through pipeline | 593 + | IRAM SRAM (2x 8-bit) | 2 | Instruction storage | 594 + 595 + ### ctx_mode 01 (CTX_OVRD — static cross-context calls) 596 + 597 + | Component | Chips/PE | Purpose | 598 + |-----------|----------|---------| 599 + | Stage 5 ctx/gen mux | ~1 | Select inherited vs IRAM-specified ctx/gen | 600 + 601 + ### ctx_mode 10 (CHANGE_TAG — dynamic calls) 602 + 603 + | Component | Chips/PE | Purpose | 604 + |-----------|----------|---------| 605 + | Left operand bypass latch | ~2 | Preserve left operand past ALU | 606 + | Stage 5 flit 1 mux | ~2 | Select assembled flit vs raw data | 607 + 608 + ### EXTRACT_TAG (v1) 609 + 610 + | Component | Chips/PE | Purpose | 611 + |-----------|----------|---------| 612 + | Data path mux | ~1-2 | Route pipeline state to ALU output bus | 613 + 614 + Pipeline latches already exist (baseline). EXTRACT_TAG just taps them 615 + into the data output path. 616 + 617 + ### SM flit assembly 618 + 619 + | Component | Chips/PE | Purpose | 620 + |-----------|----------|---------| 621 + | SM_id mux (2-bit 2:1) | ~0.5 | Select SM_id from pointer vs half 1 | 622 + | Address mux (8-10-bit 2:1) | ~1.5 | Select addr from pointer vs const | 623 + 624 + ### Total incremental cost: baseline → full dynamic calls 625 + 626 + ~7-10 chips per PE, layered incrementally. Each capability is 627 + independently useful and testable. 628 + 629 + --- 630 + 631 + ## Open Design Questions 632 + 633 + 1. **dest2_port convention:** Same as dest1 for DUAL, opposite for 634 + SWITCH, or steal a spare bit for explicit control? Current assumption: 635 + derived from opcode. 636 + 637 + 2. 
**const_ext:7 interpretation table:** Which opcodes use it for wide 638 + const vs wider offset vs predicate fields? Needs to be defined as 639 + the instruction set solidifies. 640 + 641 + 3. **SM_id for ptr-addressed ops:** Embedded in the structure pointer 642 + format at bits [11:10]. Does this SM_id assignment need to be 643 + configurable, or is it always a direct hardware ID? 644 + 645 + 4. **EXEC return routing ("done" signal):** SM_EXEC could emit a 646 + completion token to a specified destination when the EXEC sequence 647 + finishes. Return routing from half 0 ret field. Useful for code 648 + loading synchronisation. Not yet committed. 649 + 650 + 5. **Lazy gen invalidation wraparound:** 2-bit gen wraps after 4 651 + generations. Sufficient for v0. Monitor during emulation; bump to 652 + 3-bit if wraparound is observed. 653 + 654 + 6. **5-bit offset limit on dual-dest and ptr-addressed SM return 655 + routing:** Dual-dest instructions and ptr-addressed SM results can 656 + only target offsets 0-31. Monadic targets at higher offsets require 657 + a PASS trampoline or CHANGE_TAG. Acceptable for v0; monitor during 658 + compilation of real programs.
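The lazy generation-invalidation rule from the static-call teardown section (and open question 5), as a small executable model (Python; class name illustrative):

```python
class MatchCell:
    """Matching store cell with lazy generation invalidation."""
    def __init__(self):
        self.presence, self.gen, self.value = 0, 0, None

    def arrive(self, value, gen):
        if self.presence == 0:
            self.value, self.gen, self.presence = value, gen, 1
            return None                      # stored; wait for partner
        if self.gen == gen:
            self.presence = 0                # match found: fire
            return (self.value, value)
        self.value, self.gen = value, gen    # stale operand: overwrite in place
        return None
```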
+1031
design-notes/loop-patterns-and-flow-control.md
··· 1 + # Loop Patterns and Flow Control Idioms 2 + 3 + Execution patterns for loops, reductions, and flow control in the 4 + dataflow architecture. These are software/compiler conventions built 5 + from existing hardware primitives — no dedicated loop hardware exists. 6 + 7 + Most patterns described here are candidates for **assembler macros**: 8 + reusable expansions that emit the underlying instruction sequences. 9 + The programmer writes `LOOP_COUNTED(counter, limit, body_label)` and 10 + the assembler expands it into the token feedback arcs, SWITCH routing, 11 + and permit structures described below. 12 + 13 + See `iram-and-function-calls.md` for IRAM format and ctx_mode details. 14 + See `alu-and-output-design.md` for SWITCH, GATE, and output modes. 15 + See `sm-design.md` for SM operations referenced by some patterns. 16 + 17 + --- 18 + 19 + ## Core Loop Mechanism 20 + 21 + There is no program counter, no branch instruction, and no loop 22 + construct in hardware. Loops are **token feedback arcs**: an 23 + instruction's output token is routed back to an input of an earlier 24 + instruction (or itself) in the dataflow graph. The loop "iterates" 25 + each time a token completes the feedback circuit. 
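Behaviourally, the feedback-arc mechanism reduces to the following (Python sketch; the token network, matching store, and concurrency are all elided):

```python
def counted_loop(limit):
    """Feedback-arc loop: seed CONST(0), then INC, LT, SWITCH recirculate."""
    dispatched = []
    token = 0                        # seed token
    while True:
        token += 1                   # INC fires on the recirculating token
        if token < limit:            # LT feeds the SWITCH control input
            dispatched.append(token)     # taken side: dispatch i+1 to body
        else:
            break                    # not-taken side: loop exits
    return dispatched
```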
26 + 27 + ### Minimal Counted Loop 28 + 29 + ``` 30 + graph: 31 + CONST(0) ─────────────────────────────────┐ 32 + 33 + ┌───────────────────────────────────────────┤ 34 + │ ▼ 35 + │ ┌─────┐ ┌──────────┐ ┌────────────────┐ 36 + └─►│ INC │────►│ LT limit │────►│ SWITCH │ 37 + └─────┘ └──────────┘ │ true → body │ 38 + ▲ i+1 bool │ false → exit │ 39 + │ └────────┬───────┘ 40 + │ data (i+1) to body ◄──────┘ 41 + │ │ 42 + └── i+1 fed back (same token) ──────┘ 43 + ``` 44 + 45 + Instructions (all on same PE, same context): 46 + 47 + ```dfasm 48 + ; Counted loop: increment from 0, dispatch to body, exit when done 49 + 50 + &counter <| const, 0 ; initial counter value (seed token starts loop) 51 + &step <| inc ; increment counter 52 + &cmp <| lt ; compare counter < limit 53 + &route <| sweq ; route by comparison result 54 + 55 + const <limit> |> &cmp:R ; limit value (or SM read if > 255) 56 + 57 + &counter |> &step ; seed → first increment 58 + &step |> &cmp:L ; counter → comparison left 59 + &step |> &route:L ; counter → switch data input (fan-out) 60 + &cmp |> &route:R ; bool → switch control 61 + 62 + &route:L |> &body:L ; taken (true) → dispatch to body 63 + &route:R |> &exit:L ; not-taken (false) → loop done 64 + &route:L |> &step ; feedback arc: counter recirculates 65 + ``` 66 + 67 + The feedback arc is just a destination field in the SWITCH instruction 68 + (or a PASS trampoline if SWITCH can't dual-route to both body dispatch 69 + and the INC feedback simultaneously). The loop "runs" as long as tokens 70 + keep flowing through the feedback arc. 71 + 72 + ### Timing 73 + 74 + Each iteration traverses: INC → LT → SWITCH → bus → INC. With the v0 75 + pipeline (no local bypass, all tokens go through external bus), expect 76 + roughly 6-10 cycles per loop control iteration depending on bus 77 + contention and pipeline depth. 
78 + 79 + The loop body executes concurrently with loop control — the dispatched 80 + body token enters the body subgraph immediately while the counter 81 + continues to the next iteration. If the body takes longer than one 82 + control iteration, multiple body invocations can be in flight 83 + simultaneously (with appropriate flow control — see Permit Tokens). 84 + 85 + --- 86 + 87 + ## Permit-Token Flow Control 88 + 89 + When loop body iterations can execute concurrently, a throttling 90 + mechanism prevents context slot exhaustion. **Permit tokens** are the 91 + standard dataflow idiom for this. 92 + 93 + ### Concept 94 + 95 + K permit tokens circulate through the system. Each dispatch to a body 96 + context consumes one permit. Each body completion produces one permit. 97 + At most K body iterations are in flight simultaneously. If no permits 98 + are available, the dispatch GATE stalls — the loop control token waits 99 + in the matching store until a permit arrives. 100 + 101 + ``` 102 + permits (K tokens, initially injected at boot) 103 + 104 + 105 + ┌────────┐ 106 + │ GATE │◄──── loop control produces (counter, body_data) 107 + │ L: permit 108 + │ R: dispatch_data 109 + └───┬────┘ 110 + │ (fires only when BOTH permit AND data are ready) 111 + 112 + dispatch to body context (CHANGE_TAG or CTX_OVRD) 113 + 114 + 115 + body executes ... body completes 116 + 117 + 118 + emit permit token back to GATE (port L) 119 + ``` 120 + 121 + ### Implementation 122 + 123 + The GATE instruction is dyadic. Left port receives the permit token. 124 + Right port receives the loop's dispatch data (counter value, array 125 + pointer, whatever the body needs). GATE fires only when both are 126 + present — this IS the backpressure mechanism. No special hardware 127 + flow control. 
128 + 129 + ```dfasm 130 + ; Permit-gated dispatch 131 + &gate <| gate ; dyadic: L=permit, R=dispatch data 132 + &loop_output |> &gate:R ; loop control feeds data to gate 133 + 134 + ; Body completion recycles the permit 135 + &body_done |> &gate:L ; body's final instruction returns permit 136 + ``` 137 + 138 + The body's final instruction emits a token to `&gate:L` as one of its 139 + destinations. This token is the recycled permit. Its data value is 140 + irrelevant (just a trigger); what matters is its presence in the 141 + matching store. 142 + 143 + K is chosen by the compiler: 144 + 145 + - K = 1: fully sequential, one body at a time. safe default. 146 + - K = number of reserved body context slots: maximum parallelism. 147 + - K = pipeline depth / body latency: optimal for throughput. 148 + 149 + ### Initial Permit Injection 150 + 151 + At boot (or function entry), K permit tokens must be injected into 152 + the GATE. Options: 153 + 154 + - **CONST chain:** K CONST instructions at sequential offsets, each 155 + with dest targeting the GATE's left port. Triggered by the function 156 + entry token via fan-out. Burns K IRAM slots but is simple. 157 + - **SM EXEC:** pre-load K permit tokens in SM, EXEC emits them. 158 + Uses one IRAM slot for the EXEC trigger. Better for large K. 159 + - **Assembler macro:** `PERMITS(K, gate_offset)` expands to the 160 + appropriate injection sequence. 161 + 162 + ### Assembler Macro Sketch 163 + 164 + ``` 165 + ; PERMIT_LOOP(K, limit, body_label, exit_label) 166 + ; Expands to: 167 + ; - K CONST instructions emitting initial permits 168 + ; - GATE (permit, dispatch_data) guarding body dispatch 169 + ; - INC/LT/SWITCH loop control chain 170 + ; - feedback arc from SWITCH to INC 171 + ; - body return path emitting permit on completion 172 + ``` 173 + 174 + The macro assigns IRAM offsets for the control structure and reserves 175 + K context slots for body iterations. 
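The throttling invariant — at most K bodies in flight — as a behavioural model (Python; completion order is simplified to FIFO):

```python
from collections import deque

def permit_gated_run(work, k):
    """Permit-token flow control: GATE fires only on permit AND data."""
    permits, pending = k, deque(work)    # k permits injected at boot
    in_flight, done, peak = [], [], 0
    while pending or in_flight:
        while permits and pending:       # GATE: both operands present
            permits -= 1
            in_flight.append(pending.popleft())
        peak = max(peak, len(in_flight))
        done.append(in_flight.pop(0))    # a body completes...
        permits += 1                     # ...and recycles its permit
    return done, peak
```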
176 + 177 + --- 178 + 179 + ## Parallel Reduction 180 + 181 + A common pattern following parallel loop iterations: combine K partial 182 + results into a single value. 183 + 184 + ### Binary Reduction Tree 185 + 186 + ``` 187 + K=4 partial sums: s0 s1 s2 s3 188 + \ / \ / 189 + ADD ADD 190 + \ / 191 + ADD 192 + 193 + total 194 + ``` 195 + 196 + Each ADD is a dyadic instruction in its own right. The partial results 197 + arrive as tokens, match in the ADD's matching store entry, fire, and 198 + produce the next level's input. The tree structure is pure dataflow — 199 + no special reduction hardware. 200 + 201 + For K iterations, the tree has log2(K) levels and K-1 ADD instructions. 202 + With K=8, that's 7 ADDs across 3 levels. All ADDs at the same level 203 + can fire in parallel (they're on different matching store entries or 204 + different PEs). 205 + 206 + ### Assembler Macro Sketch 207 + 208 + ``` 209 + ; REDUCE(op, inputs[], output) 210 + ; Expands to: 211 + ; - ceil(log2(N)) levels of binary op instructions 212 + ; - routing from each level's outputs to next level's inputs 213 + ; - final output routed to specified destination 214 + ``` 215 + 216 + --- 217 + 218 + ## Loop-Carried Accumulators (Self-Loop Pattern) 219 + 220 + A value that updates every iteration and feeds back to itself. The 221 + canonical example: `sum += a[i]`. 222 + 223 + ### Matching-Store-as-Register 224 + 225 + A dyadic instruction whose output routes back to its own left port: 226 + 227 + ```dfasm 228 + ; Self-loop accumulator: sum += each incoming value 229 + &acc <| add ; dyadic: L=accumulated sum, R=new element 230 + &acc |> &acc:L ; feedback: result → own left port 231 + 232 + ; Initialise: deposit starting value before first element arrives 233 + &init <| const, 0 234 + &init |> &acc:L ; seed the accumulator with 0 235 + ``` 236 + 237 + The matching store cell at `&acc`'s (ctx, offset, port L) holds the 238 + accumulator value between iterations. 
Each new element arriving 239 + on port R triggers the ADD, which deposits the updated sum back 240 + into port L's cell. 241 + 242 + **Timing:** Each accumulation step is a full round-trip: ALU → 243 + output formatter → bus → input FIFO → matching store → ALU. Roughly 244 + 6-10 cycles at v0. This is the sequential bottleneck — the accumulator 245 + feedback arc is inherently serial. 246 + 247 + **Extracting the final value:** After the last element is accumulated, 248 + the sum sits in the matching store cell at port L. It needs a "drain" 249 + event to extract it. Options: 250 + 251 + - A sentinel token on port R triggers one final ADD (or PASS), and 252 + dest2 routes the result to the downstream consumer. 253 + - A GATE controlled by a "loop done" boolean from the loop control. 254 + When the loop completes, the GATE opens and the accumulated value 255 + flows out. 256 + 257 + ### With Parallel Iterations 258 + 259 + Each body context has its own matching store entries (different ctx → 260 + different SRAM address). K parallel accumulators at the same IRAM 261 + offset but different context slots operate independently. 262 + 263 + ```dfasm 264 + ; All iterations share the same IRAM instruction for &acc. 265 + ; Different context slots → different matching store cells. 266 + ; No interference between iterations. 267 + 268 + ; ctx=1: sum_1 accumulator (self-loop) 269 + ; ctx=2: sum_2 accumulator (self-loop) 270 + ; ctx=3: sum_3 accumulator (self-loop) 271 + ; ... all at the same &acc instruction, different contexts 272 + ``` 273 + 274 + After all iterations complete, a reduction tree combines partial sums. 275 + The permit-token mechanism guarantees that partial sums are ready before 276 + the reduction begins (the permits themselves can be chained to trigger 277 + the reduction). 
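Partial accumulation plus the reduction tree, as an executable sketch (Python; `k` stands in for the number of reserved body context slots):

```python
def parallel_reduce(values, k):
    """k per-context self-loop accumulators, then a binary ADD tree."""
    partials = [0] * k
    for i, v in enumerate(values):
        partials[i % k] += v            # each context accumulates privately
    level, adds = partials, 0
    while len(level) > 1:               # ceil(log2(k)) levels, k-1 ADDs total
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adds += len(nxt)
        if len(level) % 2:
            nxt.append(level[-1])       # odd element passes through unchanged
        level = nxt
    return level[0], adds
```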
278 + 279 + --- 280 + 281 + ## Predicate Register Optimisation (Future, ~1 Chip) 282 + 283 + A single shared 1-bit register (or small multi-bit register) that 284 + stores a comparison result locally, bypassing the token network for 285 + the boolean path. 286 + 287 + ### Benefit 288 + 289 + In the standard loop control pattern, the comparison boolean travels 290 + as a token: LT produces a bool token → bus → matching store → SWITCH 291 + consumes it. The predicate register short-circuits this: 292 + 293 + ``` 294 + without predicate register: 295 + LT → [bool token] → bus → matching → SWITCH 296 + cost: full token round-trip for the boolean 297 + 298 + with predicate register: 299 + LT writes bool to predicate register (side effect, no token) 300 + SWITCH reads predicate register (local wire, no matching needed) 301 + cost: zero additional cycles for the boolean path 302 + ``` 303 + 304 + The counter feedback arc still goes through the bus. But the boolean 305 + path — typically half the loop control overhead — becomes free. 306 + 307 + ### Constraints 308 + 309 + - **Not per-context.** Single shared register. The compiler must 310 + guarantee only one activation uses the predicate at a time. 311 + - **Not suitable for parallel iterations.** Each iteration would need 312 + its own predicate state. Use the token-based boolean path for 313 + parallel loop control. 314 + - **IRAM encoding:** 1-2 bits per instruction (pred_write, pred_read). 315 + Can be folded into opcode space as dedicated variants (LT_P, SWITCH_P) 316 + or drawn from spare bits in half 1. 
317 + 318 + ### Hardware 319 + 320 + ``` 321 + 1-bit predicate register: 1 flip-flop 322 + write path (from comparator): 1 gate (write enable) 323 + read mux (to SWITCH): 1 gate (bool_out source select) 324 + Total: ~1 chip (fraction of a chip, really) 325 + ``` 326 + 327 + --- 328 + 329 + ## Accumulator Register Optimisation (Future, ~3 Chips) 330 + 331 + A single shared 16-bit register writable by the ALU and readable as 332 + an ALU input source. Eliminates the bus round-trip for tight 333 + accumulation loops. 334 + 335 + ### Benefit 336 + 337 + ``` 338 + without accumulator register: 339 + ADD(acc, new) → output → bus → input → matching → ADD 340 + cost: ~6-10 cycles per accumulation 341 + 342 + with accumulator register: 343 + ACC_ADD: reads acc from register, adds new from token, writes result 344 + back to register. monadic (only new element token needed). 345 + cost: 1 pipeline pass per accumulation (~3-5 cycles) 346 + ``` 347 + 348 + ### Constraints 349 + 350 + - **Not per-context.** Same single-activation restriction as predicate 351 + register. 352 + - **Monadic operation.** ACC_ADD/ACC_SUB are monadic — the accumulator 353 + is an implicit operand from the register, the explicit operand comes 354 + from the arriving token. No matching store entry consumed. 355 + - **No matching store write conflict.** The register is a separate 356 + storage element from the matching store. ALU writes to the register 357 + at Stage 4; matching store writes happen at Stage 2. No port 358 + conflict, no stall logic needed. 359 + - **Stepping stone to SC blocks.** The accumulator register is 360 + effectively the first register of a future local register file. 361 + Adding a second register + sequential instruction counter yields a 362 + minimal strongly-connected (SC) block capability. 
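A toy Python model of the two accumulation modes; the cycle numbers are midpoints of the rough ranges above, and the masking follows the 16-bit datapath:

```python
# Monadic ACC_ADD vs dyadic self-loop ADD. ACC_ADD holds the running
# sum in a local register (one pipeline pass per token); the self-loop
# pays a full bus round trip per step.

class PE:
    def __init__(self):
        self.acc_reg = 0   # single shared 16-bit accumulator register
        self.cycles = 0

    def acc_add(self, token):
        # reads acc from the register, adds the token, writes back at
        # Stage 4 -- no matching store entry, no bus traffic
        self.acc_reg = (self.acc_reg + token) & 0xFFFF
        self.cycles += 4   # ~3-5 cycles, one pipeline pass

    def add_self_loop(self, token, cell):
        # dyadic ADD whose result recirculates over the bus
        self.cycles += 8   # ~6-10 cycles per accumulation
        return (cell + token) & 0xFFFF

fast = PE()
for x in (5, 10, 15):
    fast.acc_add(x)

slow = PE()
cell = 0
for x in (5, 10, 15):
    cell = slow.add_self_loop(x, cell)
```

Same sums, roughly half the cycles, which is the whole pitch for spending ~3 chips.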
363 + 364 + ### Hardware 365 + 366 + ``` 367 + 16-bit register (2x 74LS374): 2 chips 368 + ALU source mux (add acc_reg input): 1 chip 369 + write enable gating: ~0 chips (1 gate) 370 + Total: ~3 chips per PE 371 + ``` 372 + 373 + --- 374 + 375 + ## Assembler Macro and Function Call Strategy 376 + 377 + The patterns above are mechanical enough to be assembler macros. 378 + Macros are expected to be simple text/token substitution (C-style 379 + `#define` territory, not Zig comptime or C++ templates). This means: 380 + 381 + - No conditional logic within macros. Different strategies get 382 + different macros, and the programmer picks which to use. 383 + - No offset allocation intelligence. The assembler tracks placement 384 + and validates the expansion (offset collisions, missing labels), 385 + but the macro itself is dumb substitution. 386 + - No type checking or context-slot tracking. The programmer is 387 + responsible for not blowing the slot budget. 388 + 389 + ### Macro Syntax 390 + 391 + A macro call uses `#macro_name` and follows the same syntax as any 392 + other operation (arguments, edge wiring, port qualifiers). Macro 393 + definitions follow a function-block-like structure. 394 + 395 + ### Function Calls as Syntax 396 + 397 + Using a `$func` label as an instruction generates the appropriate 398 + routing for static calls. Named arguments match against the 399 + function's internal labels: 400 + 401 + ```dfasm 402 + ; Function definition: 403 + $add_pair |> { 404 + add &a, &b |> #ret 405 + } 406 + 407 + ; Static call — named args wire to internal labels: 408 + $add_pair a=&x, b=&y |> @output 409 + ``` 410 + 411 + `#ret` is a built-in macro that marks the function's return point. 412 + The assembler resolves `a=&x` to mean "wire `&x`'s output to 413 + `$add_pair.&a`, port L, with ctx_mode=01 and the allocator-assigned 414 + context slot." The `|> @output` wires the return point to `@output`. 
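One way an assembler might lower a static call into routing annotations, sketched in Python; the edge-record field names are hypothetical stand-ins for the destination-field contents described above:

```python
# Hypothetical resolver for `$func a=&x, b=&y |> @output`: each named
# arg becomes a cross-context edge with ctx_mode=01 (override) and the
# allocator-assigned slot; #ret routes back to the call site's dest.

def resolve_static_call(func, args, ret_dest, ctx_slot):
    edges = []
    for name, src in args.items():
        edges.append({
            "src": src,
            "dst": f"${func}.&{name}",
            "port": "L",
            "ctx_mode": 0b01,   # CTX_OVRD: enter the callee's slot
            "ctx": ctx_slot,
        })
    # return edge: the function's return point back to the caller
    # (restoring the caller's ctx is left abstract here)
    edges.append({"src": f"${func}.#ret", "dst": ret_dest,
                  "port": "L", "ctx_mode": 0b01, "ctx": None})
    return edges

edges = resolve_static_call("add_pair", {"a": "&x", "b": "&y"},
                            "@output", 3)
```

No IRAM entries and no CHANGE_TAG appear anywhere in the output: for static calls the whole mechanism is destination fields.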
415 + 416 + For static calls (non-recursive, known call graph), this generates 417 + only routing annotations on existing instructions — no extra IRAM 418 + entries, no CHANGE_TAG, just destination fields set with ctx_mode=01. 419 + 420 + ### Dynamic and Recursive Calls (v1, Manual / Macro-Assisted) 421 + 422 + The assembler does NOT handle recursive or indirect calls 423 + automatically — that's compiler territory. Recursive calls require 424 + runtime context allocation (SM READ_INC), CHANGE_TAG sequences, and 425 + EXTRACT_TAG for return continuations. The assembler provides macros 426 + to reduce boilerplate for the mechanical parts; the programmer 427 + manages descriptor tables, context budgets, and flow control. 428 + 429 + #### The Problem 430 + 431 + A dynamic call to a function with N arguments requires: 432 + 433 + 1. **Allocate context** — SM READ_INC on an allocator cell → new_ctx 434 + 2. **Build return continuation** — EXTRACT_TAG captures caller's 435 + (PE, ctx, offset, gen) as a 16-bit packed tag value 436 + 3. **Fetch N+1 tag templates** — SM_READ_C from a descriptor table 437 + (one per argument destination + one for return destination) 438 + 4. **Patch ctx into each tag** — OR new_ctx into bits [3:0] of 439 + each template (templates are pre-built at boot with ctx=0) 440 + 5. **Send return continuation** — CHANGE_TAG with patched return tag 441 + 6. **Send each argument** — CHANGE_TAG with patched arg tag + value 442 + 443 + Done naively (one IRAM slot per step), this burns 4N+8 IRAM slots 444 + per call site. Two recursive calls = 24+ slots for N=1. With 128 445 + slots per PE, that's unsustainable. 446 + 447 + #### The EXEC-Based Call Stub 448 + 449 + EXEC is a token cannon — it reads a sequence of pre-formed tokens 450 + from SM/ROM and fires them onto the bus. The tokens can be anything: 451 + SM read requests, CM tokens, triggers. One IRAM slot (the EXEC 452 + trigger) replaces an arbitrary number of pre-staged operations. 
453 + 454 + The key insight: steps 3-4 above (fetch tag templates, deliver them 455 + to patching logic) are **identical for every call to the same 456 + function**. The tag templates, their SM addresses, and where to 457 + deliver them are all compile-time constants. Only the allocated 458 + ctx and argument values change per call. 459 + 460 + This splits the call machinery into two parts: 461 + 462 + **Call stub (shared, loaded once per function):** 463 + 464 + IRAM instructions that receive runtime values (ctx, args, return 465 + continuation) and perform the patching + dispatch. These live in 466 + IRAM and are shared across all call sites for the same function. 467 + Different call sites invoke the stub in different context slots, 468 + so their matching store entries don't collide. 469 + 470 + **EXEC sequence (in SM/ROM, per function):** 471 + 472 + Pre-formed tokens that read tag templates from the descriptor table 473 + and deliver them to the stub's OR instructions. Triggered by a 474 + single EXEC instruction. Stored once, fired per call. 
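The split can be modelled as data: the pre-formed tokens sit in SM once, and the per-call trigger just replays them. Everything here (cell addresses, token tuples) is illustrative:

```python
# EXEC as a token cannon: one trigger reads pre-formed tokens from
# consecutive SM cells and fires them onto the bus unchanged.

SM = {}    # shared-memory cells: addr -> pre-formed token
BUS = []   # tokens in flight

def store_exec_sequence(base, tokens):
    # done once at boot / function load
    for i, t in enumerate(tokens):
        SM[base + i] = t

def exec_trigger(base, count):
    # the single per-call-site IRAM slot; replaces `count` staged ops
    for i in range(count):
        BUS.append(SM[base + i])

store_exec_sequence(0x40, [
    ("SM_READ_C", "@fib_desc+0", "&__fib_or_ret:L"),
    ("SM_READ_C", "@fib_desc+1", "&__fib_or_n:L"),
])
exec_trigger(0x40, 2)
```

The sequence never changes between calls, which is exactly why it can live in SM/ROM while only ctx and argument values flow through IRAM.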
475 + 
476 + ```
477 + Per-function (loaded once):
478 + call stub in IRAM: ctx fan-out + OR patches + CHANGE_TAGs
479 + EXEC sequence in SM: SM_READ_C tokens targeting the stub
480 + descriptor table: pre-formed flit 1 templates (ctx=0)
481 + 
482 + Per-call-site (tiny):
483 + 3 IRAM slots: rd_inc (allocate) + exec (trigger) + extract_tag (return)
484 + wiring: feed ctx, return cont, and arg values into the stub
485 + ```
486 + 
487 + #### Call Stub Structure (Example: N=1 Argument)
488 + 
489 + ```dfasm
490 + ; ── call stub for $fib, loaded once, shared across call sites ──
491 + ; runs in caller's allocated ctx (different per call → no collision)
492 + 
493 + ; ctx fan-out: new_ctx needs to reach 2 OR instructions (ret + arg)
494 + &__fib_ctx_fan <| pass
495 + &__fib_ctx_fan |> &__fib_or_ret:R, &__fib_or_n:R
496 + 
497 + ; tag patching: template (from EXEC'd SM_READ_C) OR'd with new_ctx
498 + &__fib_or_ret <| or ; L: ret tag template, R: new_ctx
499 + &__fib_or_n <| or ; L: arg tag template, R: new_ctx
500 + 
501 + ; dispatch: patched tag + data → output token
502 + &__fib_ct_ret <| change_tag ; L: patched ret tag, R: return continuation
503 + &__fib_ct_n <| change_tag ; L: patched arg tag, R: argument value
504 + 
505 + ; internal wiring
506 + &__fib_or_ret |> &__fib_ct_ret:L
507 + &__fib_or_n |> &__fib_ct_n:L
508 + ```
509 + 
510 + Stub cost: 1 (PASS fan-out) + 2 (OR) + 2 (CHANGE_TAG) = 5 IRAM slots
511 + for N=1. For N=2: add 1 more PASS in fan-out chain + 1 OR + 1
512 + CHANGE_TAG = 8 slots. General: 3N + 2 slots (N fan-out PASSes, plus
513 + an OR and CHANGE_TAG pair per argument and one pair for the return).
514 + 
515 + Note: the OR and CHANGE_TAG instructions are dyadic, consuming IRAM
516 + slots in the low-offset range (0-31). The PASS fan-out chain is
517 + monadic and can live in the monadic range (offsets 32+), where IRAM
518 + space is more abundant — monadic instructions don't consume matching
519 + store entries, so the 7-bit offset space is available.
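The worked counts above (5 slots for N=1, 8 for N=2) can be reproduced by counting one OR + CHANGE_TAG pair per destination (N args plus the return) and one chained PASS per argument. This accounting is my reading of the stub structure, not assembler output:

```python
# Stub IRAM accounting: destinations = N args + 1 return, each needing
# a dyadic OR + CHANGE_TAG pair. The fan-out is a chain of PASSes with
# two destinations each (one consumer + the next link; the last PASS
# feeds two consumers), so N args -> N PASSes.

def stub_slots(n_args: int) -> dict:
    consumers = n_args + 1          # one OR target per arg + return
    dyadic = 2 * consumers          # OR + CHANGE_TAG pairs
    monadic = n_args                # PASS fan-out chain
    return {"dyadic": dyadic, "monadic": monadic,
            "total": dyadic + monadic}
```

`stub_slots(1)` gives 4 dyadic + 1 monadic = 5, matching the N=1 stub listing; `stub_slots(2)` gives 8, matching the N=2 count.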
520 + 521 + #### Per-Call-Site Expansion 522 + 523 + ```dfasm 524 + ; ── per call site: 3 IRAM slots + wiring ── 525 + 526 + &__alloc <| rd_inc, @ctx_alloc ; allocate callee context 527 + &__exec <| exec, @fib_call_seq ; fire tag-fetch sequence 528 + &__extag <| extract_tag, <ret_offset> ; capture return continuation 529 + 530 + ; wire runtime values into stub (these are edge declarations, not IRAM) 531 + &__alloc |> &__fib_ctx_fan ; new ctx → stub fan-out 532 + &__extag |> &__fib_ct_ret:R ; return cont → stub 533 + &arg_val |> &__fib_ct_n:R ; argument → stub 534 + ``` 535 + 536 + rd_inc and extract_tag are monadic. exec is monadic. The per-call-site 537 + cost is 3 monadic IRAM slots — they sit in the monadic offset range 538 + and don't consume matching store entries. 539 + 540 + #### EXEC Sequence Contents (in SM/ROM) 541 + 542 + Pre-formed tokens, stored at boot, fired by exec: 543 + 544 + ``` 545 + Token 0: SM_READ_C(@fib_desc + 0) → deliver to &__fib_or_ret:L 546 + Token 1: SM_READ_C(@fib_desc + 1) → deliver to &__fib_or_n:L 547 + ``` 548 + 549 + Each token is a fully-formed 2-flit packet: flit 1 = SM read command, 550 + flit 2 = return routing pointing at the stub's OR instruction. The 551 + EXEC sequencer reads these from consecutive SM cells and emits them 552 + onto the bus. The SM processes each read and returns the tag template 553 + to the specified OR instruction. 
554 + 555 + #### Fibonacci: Two Recursive Calls 556 + 557 + ```dfasm 558 + $fib |> { 559 + ; ── function body ── 560 + &n <| pass 561 + lt &n, 2 |> &test 562 + &test <| sweq 563 + &n |> &test:L 564 + 565 + ; base case 566 + &test:L |> #ret 567 + 568 + ; recursive case 569 + sub &n, 1 |> &n1 570 + sub &n, 2 |> &n2 571 + 572 + ; two calls, each 3 monadic IRAM slots + shared stub 573 + &__alloc1 <| rd_inc, @ctx_alloc 574 + &__exec1 <| exec, @fib_call_seq 575 + &__extag1 <| extract_tag, 20 ; results arrive at offset 20 576 + 577 + &__alloc1 |> &__fib_ctx_fan 578 + &__extag1 |> &__fib_ct_ret:R 579 + &n1 |> &__fib_ct_n:R 580 + 581 + &__alloc2 <| rd_inc, @ctx_alloc 582 + &__exec2 <| exec, @fib_call_seq 583 + &__extag2 <| extract_tag, 21 ; results arrive at offset 21 584 + 585 + &__alloc2 |> &__fib_ctx_fan 586 + &__extag2 |> &__fib_ct_ret:R 587 + &n2 |> &__fib_ct_n:R 588 + 589 + ; reduction 590 + add &r1, &r2 |> #ret ; r1 at offset 20, r2 at offset 21 591 + } 592 + ``` 593 + 594 + **Important:** The two calls share the same stub IRAM instructions 595 + but run in different contexts (ctx allocated by rd_inc). The matching 596 + store entries for `&__fib_or_ret` etc. are indexed by (ctx, offset), 597 + so different ctx values → different cells → no collision. 598 + 599 + The calls ARE sequenced by data dependencies — the second call can't 600 + fire its CHANGE_TAG until its rd_inc and exec complete, which are 601 + independent of the first call. Both calls can be in flight 602 + simultaneously. 603 + 604 + #### IRAM Budget 605 + 606 + ``` 607 + dyadic slots monadic slots 608 + (0-31 range) (32-127 range) 609 + ───────────────────────────────────────────────────────── 610 + function body (fib): ~6 ~4 611 + call stub (shared): 4 1 612 + per call site (×2): 0 6 613 + result reduction: 1 0 614 + ───────────────────────────────────────────────────────── 615 + total: ~11 ~11 616 + ``` 617 + 618 + ~22 IRAM slots total for recursive fibonacci. Well within 128. 
And 619 + if fib is called from external sites, they pay only 3 monadic slots 620 + each — the stub and body are already loaded. 621 + 622 + #### Stub Sharing Across Mutual Recursion 623 + 624 + Two-layer recursion (A calls B calls A) can share ctx allocation and 625 + EXEC infrastructure. If A and B are on the same PE: 626 + 627 + - Each function has its own call stub (different tag templates) 628 + - They share the same `@ctx_alloc` SM cell 629 + - Their EXEC sequences are independent but stored in the same SM 630 + - Both stubs live in IRAM simultaneously at different offsets 631 + 632 + If A and B have the same argument count and shape, a future 633 + optimisation is a *generic* call stub parameterised only by which 634 + EXEC sequence to fire. The tag templates in the descriptor table 635 + encode all the per-function differences. The stub just patches ctx 636 + and dispatches — it doesn't know or care which function it's calling. 637 + This is essentially a vtable dispatch and emerges naturally from the 638 + architecture. 639 + 640 + #### Descriptor Table Layout (in SM, initialised at boot) 641 + 642 + ``` 643 + @fib_desc + 0: return destination tag template (ctx=0) 644 + [0][0][port][PE][gen][offset][0000] 645 + @fib_desc + 1: arg 'n' destination tag template (ctx=0) 646 + [0][0][port][PE][gen][offset][0000] 647 + 648 + ; for N=2 function: 649 + @func_desc + 0: return tag template 650 + @func_desc + 1: arg 0 tag template 651 + @func_desc + 2: arg 1 tag template 652 + ``` 653 + 654 + Templates are full 16-bit flit 1 values with ctx field set to 0. 655 + The stub's OR instruction patches bits [3:0] with the allocated ctx. 656 + Templates are written to SM during bootstrap (via EXEC from ROM or 657 + explicit SM_WRITE_C during init). 658 + 659 + #### SM-Based Argument Passing (Large Functions) 660 + 661 + For functions with many arguments (N > 3), the IRAM cost of the 662 + OR + CHANGE_TAG stub becomes prohibitive — each arg burns 2 dyadic 663 + IRAM slots. 
An alternative: **stage arguments in SM cells and let 664 + the EXEC sequence deliver them.** 665 + 666 + The caller writes argument values to a block of SM "call frame" 667 + cells using SM_WRITE_C (monadic, const-addressed). The EXEC 668 + sequence's tail end includes SM_READ_C tokens that read those cells 669 + back out and deliver them as tokens to the callee's entry points. 670 + The callee is oblivious — it just sees tokens arriving normally. 671 + 672 + ``` 673 + Caller writes args to SM call frame: 674 + SM_WRITE_C(@frame + 0, arg0_value) ; monadic, 1 IRAM slot 675 + SM_WRITE_C(@frame + 1, arg1_value) ; monadic 676 + ... 677 + SM_WRITE_C(@frame + N-1, arg(N-1)_value) ; monadic 678 + SM_WRITE_C(@frame + N, ret_cont) ; return continuation 679 + EXEC @call_seq ; fire it all 680 + 681 + EXEC sequence (in SM/ROM): 682 + SM_READ_C(@frame + 0) → deliver to callee &arg0:L 683 + SM_READ_C(@frame + 1) → deliver to callee &arg1:L 684 + ... 685 + SM_READ_C(@frame + N-1) → deliver to callee &arg(N-1):L 686 + SM_READ_C(@frame + N) → deliver to callee &ret_cont:L 687 + ``` 688 + 689 + **Costs:** 690 + 691 + ``` 692 + stub approach SM call frame 693 + (per-arg OR+CT) (SM staging) 694 + ─────────────────────────────────────────────────────────── 695 + caller IRAM: 3 monadic N+2 monadic (writes + exec + extag) 696 + stub IRAM: 2N+2 dyadic + fan-out 0 697 + EXEC sequence: N+1 tokens N+1 tokens 698 + SM cells used: 0 (runtime) N+1 (call frame) 699 + ─────────────────────────────────────────────────────────── 700 + dyadic IRAM slots: 2N+2 0 701 + monadic IRAM slots: 3 N+2 702 + ``` 703 + 704 + The SM call frame approach uses zero dyadic IRAM for call overhead. 705 + All caller instructions are monadic (SM_WRITE_C, EXEC, EXTRACT_TAG), 706 + living in the abundant 32-127 offset range. The callee's IRAM is 707 + pure function body — no call machinery whatsoever. 708 + 709 + **Tradeoffs:** 710 + 711 + - **Latency:** two SM round-trips per argument (write then read) vs 712 + one CHANGE_TAG.
Adds ~2× SM access latency to call setup. For 713 + large N where the stub approach would serialise through a fan-out 714 + chain anyway, the SM approach may not be worse. 715 + - **SM cell pressure:** N+1 cells per call frame. Concurrent calls 716 + need separate frames (different base addresses). The caller manages 717 + this — either static allocation for known call depth, or an SM 718 + frame pointer bumped via READ_INC. 719 + - **No ctx patching needed:** the EXEC sequence tokens already have 720 + the correct routing baked in (they target the callee's entry points 721 + directly). Context allocation is still needed, but the patching 722 + step (OR new_ctx into tag templates) is eliminated because the 723 + EXEC sequence can be **rebuilt per call** from a template + 724 + allocated ctx. Or, if the callee always runs in a fixed ctx 725 + (static allocation), the EXEC sequence is truly static. 726 + 727 + **When to use which:** 728 + 729 + - N=1-2 args: stub approach. the OR + CHANGE_TAG chain is small, 730 + latency is minimal, no SM cells consumed. 731 + - N=3+ args: SM call frame starts winning on IRAM pressure. 732 + - N=6+ args: SM call frame is clearly better. the stub would need 733 + 12+ dyadic slots just for call overhead. 734 + - one-shot EXEC'd functions (loaded on demand, run once): SM call 735 + frame is natural — the EXEC that loads the code can also deliver 736 + the arguments in one sequence. 737 + 738 + A function intended to be called via EXEC one-shot (loaded from ROM 739 + into IRAM, executed, then discarded) can have its entire call 740 + convention baked into the EXEC block: code loading tokens first, 741 + then SM_READ tokens that deliver arguments from pre-staged cells. 742 + The caller just writes args to SM and fires EXEC. The function 743 + loads, receives its arguments, runs, sends results, done. 744 + 745 + --- 746 + 747 + ### Macro System Requirements 748 + 749 + The call macros above need the following from the macro system. 
750 + All of these are within the capability of rust-style `macro_rules!` 751 + or a purpose-built assembler template system. No conditionals, no 752 + recursion in the macro evaluator, no type system. 753 + 754 + #### 1. Named Variadic Repetition 755 + 756 + ``` 757 + $($arg = $src),* 758 + ``` 759 + 760 + A comma-separated list of named pairs, expanded once per entry. 761 + Rust `macro_rules!` provides this directly. 762 + 763 + #### 2. Token Pasting (Label Synthesis) 764 + 765 + ``` 766 + &__${arg}_tag 767 + ``` 768 + 769 + Concatenate a macro parameter into a label name. Produces unique 770 + labels per repetition entry. C has `##`, rust proc macros have 771 + `Ident::new()`, but `macro_rules!` does NOT have this natively — 772 + would need an assembler-specific extension or a `paste!`-style 773 + helper. 774 + 775 + This is the single most important extension beyond stock 776 + `macro_rules!`. Without it, macros can't generate unique labels 777 + for per-arg instructions. 778 + 779 + #### 3. Implicit Repetition Index 780 + 781 + ``` 782 + ${_idx} 783 + ``` 784 + 785 + An auto-incrementing counter within a `$(...),*` expansion. 786 + Used for descriptor table offset arithmetic (`$desc + ${_idx} + 1`). 787 + Not available in rust `macro_rules!` — another assembler-specific 788 + extension. Alternative: require the programmer to pass explicit 789 + indices, which is ugly but functional. 790 + 791 + #### 4. Constant Arithmetic in Expressions 792 + 793 + ``` 794 + $desc + ${_idx} + 1 795 + ``` 796 + 797 + Compile-time addition on constant/label expressions. The assembler 798 + already evaluates constant expressions for instruction operands, so 799 + the macro expander just needs to emit the expression text and let the 800 + normal evaluator handle it. No new evaluation capability needed. 801 + 802 + #### 5. 
Label Reference Across Macro Boundaries 803 + 804 + ``` 805 + &__alloc |> $func.__stub.ctx_in 806 + ``` 807 + 808 + The per-call-site macro needs to reference labels inside the 809 + per-function stub macro's expansion. This requires either: 810 + 811 + - A naming convention that both macros agree on (fragile but simple) 812 + - The stub macro "exporting" label names via a known pattern 813 + (`$func.__stub.*`) 814 + - The assembler resolving qualified names across scopes 815 + 816 + The naming convention approach is the most macro-friendly: the 817 + `call_stub` macro always emits labels named 818 + `&__${func}_ctx_fan`, `&__${func}_or_ret`, etc., and the 819 + `call_dyn` macro references them by constructing the same names. 820 + Both macros must agree. This is a social contract, not a type system. 821 + 822 + #### What's NOT Needed 823 + 824 + - **Conditional expansion** — different call shapes get different 825 + macros, not `if` inside a macro. 826 + - **Recursive macro expansion** — the fan-out PASS chain has a fixed 827 + structure per argument count. For N=1 it's one PASS with dual dest. 828 + For N=2 it's two PASSes. Rather than recursing, provide 829 + `call_stub_1`, `call_stub_2`, `call_stub_3` for common arities. 830 + Ugly, pragmatic, correct. 831 + - **Type checking** — the assembler validates after expansion (wrong 832 + arity, missing labels, offset overflow). The macro doesn't check. 833 + - **Hygiene** — label collisions between macro expansions ARE a risk. 834 + Mitigated by the `&__${func}_` prefix convention. If two functions 835 + have the same name, you have bigger problems. 
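The social contract amounts to a shared pure function over the function name; a sketch showing both sides computing the same strings:

```python
# The naming convention: both the call_stub macro (which emits labels)
# and the call_dyn macro (which references them) derive names from the
# function name with the same rule. No resolver needed -- just string
# agreement.

def stub_labels(func: str) -> dict:
    # labels the call_stub macro emits
    return {
        "ctx_fan": f"&__{func}_ctx_fan",
        "or_ret":  f"&__{func}_or_ret",
        "ct_ret":  f"&__{func}_ct_ret",
    }

def call_site_targets(func: str) -> list:
    # names the call_dyn macro constructs independently
    return [f"&__{func}_ctx_fan", f"&__{func}_ct_ret"]

# the "contract": every call-site target exists among the stub's labels
assert set(call_site_targets("fib")) <= set(stub_labels("fib").values())
```

If the two rules ever drift apart, the assembler's post-expansion validation (missing labels) is the only safety net, which is the point of calling it a social contract.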
836 + 837 + #### Example Macro Definitions 838 + 839 + ```dfasm 840 + ; ── call stub for a 1-argument function ── 841 + ; emitted once per function, provides shared call infrastructure 842 + .macro call_stub_1 $func, $desc { 843 + ; ctx fan-out (1 arg + 1 return = 2 consumers, one PASS suffices) 844 + &__${func}_ctx_fan <| pass 845 + &__${func}_ctx_fan |> &__${func}_or_ret:R, &__${func}_or_arg0:R 846 + 847 + ; tag patching 848 + &__${func}_or_ret <| or 849 + &__${func}_or_arg0 <| or 850 + 851 + ; dispatch 852 + &__${func}_ct_ret <| change_tag 853 + &__${func}_ct_arg0 <| change_tag 854 + 855 + ; internal wiring 856 + &__${func}_or_ret |> &__${func}_ct_ret:L 857 + &__${func}_or_arg0 |> &__${func}_ct_arg0:L 858 + } 859 + 860 + ; ── call stub for a 2-argument function ── 861 + .macro call_stub_2 $func, $desc { 862 + ; ctx fan-out chain (3 consumers) 863 + &__${func}_ctx_fan0 <| pass 864 + &__${func}_ctx_fan1 <| pass 865 + &__${func}_ctx_fan0 |> &__${func}_or_ret:R, &__${func}_ctx_fan1 866 + &__${func}_ctx_fan1 |> &__${func}_or_arg0:R, &__${func}_or_arg1:R 867 + 868 + ; tag patching 869 + &__${func}_or_ret <| or 870 + &__${func}_or_arg0 <| or 871 + &__${func}_or_arg1 <| or 872 + 873 + ; dispatch 874 + &__${func}_ct_ret <| change_tag 875 + &__${func}_ct_arg0 <| change_tag 876 + &__${func}_ct_arg1 <| change_tag 877 + 878 + ; internal wiring 879 + &__${func}_or_ret |> &__${func}_ct_ret:L 880 + &__${func}_or_arg0 |> &__${func}_ct_arg0:L 881 + &__${func}_or_arg1 |> &__${func}_ct_arg1:L 882 + } 883 + 884 + ; ── per-call-site (works for any arity) ── 885 + .macro call_dyn $func, $alloc, $call_seq, $ret_offset, $($arg = $src),* { 886 + ; allocate + trigger + return continuation 887 + &__call_alloc_${func} <| rd_inc, $alloc 888 + &__call_exec_${func} <| exec, $call_seq 889 + &__call_extag_${func} <| extract_tag, $ret_offset 890 + 891 + ; wire into stub 892 + &__call_alloc_${func} |> &__${func}_ctx_fan 893 + &__call_extag_${func} |> &__${func}_ct_ret:R 894 + $( 895 + $src |> 
&__${func}_ct_${arg}:R 896 + ),* 897 + } 898 + ``` 899 + 900 + Usage: 901 + 902 + ```dfasm 903 + ; one-time setup 904 + #call_stub_1 fib, @fib_desc 905 + 906 + ; at each call site 907 + #call_dyn fib, @ctx_alloc, @fib_call_seq, 20, n = &my_arg 908 + ``` 909 + 910 + The `call_stub_N` per-arity approach is admittedly clunky. A future 911 + macro system with proper counted repetition could unify them. For 912 + now, N=1 through N=3 covers the vast majority of functions, and 913 + anything beyond N=3 can be hand-written — it's the same pattern, 914 + just more of it. 915 + 916 + ### Permit Injection — Two Macros 917 + 918 + For small K (roughly K <= 4), inline CONST injection: 919 + 920 + ```dfasm 921 + ; Macro definition: 922 + $permit_inject_inline K, &gate |> { 923 + ; expands to K const instructions, each targeting &gate:L 924 + ; each const needs its own trigger to fire 925 + } 926 + 927 + ; Usage: inject 3 permits into the gate 928 + #permit_inject_inline 3, &dispatch_gate 929 + ``` 930 + 931 + For large K, use SM EXEC to batch-emit permits: 932 + 933 + ```dfasm 934 + ; Macro definition: 935 + $permit_inject_exec K, &gate, @sm_base |> { 936 + ; expands to a single SM_EXEC reading K pre-formed permit 937 + ; tokens from SM starting at @sm_base, each addressed to &gate:L 938 + } 939 + 940 + ; Usage: inject 8 permits via EXEC 941 + #permit_inject_exec 8, &dispatch_gate, @permit_store 942 + ``` 943 + 944 + Programmer chooses based on K. No magic. 
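A small Python model of the permit discipline, independent of which injection macro seeded the permits: at most K body activations are ever in flight, and completions recycle permits:

```python
from collections import deque

# GATE with K permits: a data token passes only when a permit is
# available; each body completion returns its permit to the gate.

class PermitGate:
    def __init__(self, k):
        self.permits = k          # injected CONST (or EXEC'd) permits
        self.pending = deque()    # data tokens waiting at the gate
        self.in_flight = 0
        self.max_in_flight = 0

    def arrive(self, token):
        self.pending.append(token)
        self._drain()

    def complete(self):
        self.in_flight -= 1
        self.permits += 1         # body completion recycles the permit
        self._drain()

    def _drain(self):
        while self.permits and self.pending:
            self.permits -= 1
            self.pending.popleft()
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)

gate = PermitGate(3)
for i in range(10):
    gate.arrive(i)       # loop control dispatches 10 iterations
for _ in range(10):
    gate.complete()      # bodies finish, permits recirculate
```

K bounds the concurrency (and thus the context-slot pressure) regardless of how fast the loop control dispatches.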
945 + 946 + ### Loop Control Macro 947 + 948 + ```dfasm 949 + $loop_counted &limit, &body, &exit |> { 950 + &counter <| const, 0 951 + &step <| inc 952 + &test <| lt 953 + &route <| sweq 954 + 955 + &counter |> &step 956 + &step |> &test:L, &route:L ; fan-out: counter to both LT and SWITCH 957 + &limit |> &test:R 958 + &test |> &route:R ; bool from comparison → SWITCH control 959 + &route:L |> &body ; taken → body dispatch 960 + &route:R |> &exit ; not-taken → done 961 + &route:L |> &step ; feedback arc: counter recirculates 962 + } 963 + 964 + ; Usage: 965 + #loop_counted 64, &body_entry, &done 966 + ``` 967 + 968 + ### Reduction Tree Macro 969 + 970 + ```dfasm 971 + $reduce_tree &op, &inputs[], &output |> { 972 + ; expands to ceil(log2(N)) levels of binary &op instructions 973 + ; N inferred from length of &inputs[] 974 + ; &output receives the final reduced value 975 + } 976 + 977 + ; Usage: 978 + #reduce_tree add, [&s0, &s1, &s2, &s3], @total 979 + ``` 980 + 981 + ### Parallel Loop (Composition) 982 + 983 + A parallel loop is manual composition of macros and function calls. 984 + No single macro tries to handle the full topology — each handles the 985 + repetitive part it's good at. 
986 + 987 + ```dfasm 988 + @system pe=4, sm=1, ctx=8 989 + 990 + ; The body as a function — self-loop accumulator 991 + $body |> { 992 + &acc <| add 993 + &acc |> &acc:L ; feedback: acc recirculates 994 + ; &i arrives as input, feeds &acc:R 995 + ; &acc drains to #ret on completion 996 + } 997 + 998 + ; Loop control (macro expands to CONST, INC, LT, SWITCH + feedback) 999 + #loop_counted 64, &dispatch, &done 1000 + 1001 + ; Permit injection (pick one strategy) 1002 + #permit_inject_inline 4, &gate 1003 + 1004 + ; Gated dispatch — permits throttle body launches 1005 + &gate <| gate 1006 + &dispatch |> &gate:R ; loop data → gate right port 1007 + ; permits arrive at &gate:L from injection + body completion 1008 + 1009 + ; Body invocations via function call syntax 1010 + $body i=&gate |> &partial 1011 + 1012 + ; Reduction of partial results 1013 + #reduce_tree add, [&p0, &p1, &p2, &p3], @final_sum 1014 + ``` 1015 + 1016 + --- 1017 + 1018 + ## Pattern Cost Summary 1019 + 1020 + | Pattern | HW cost | IRAM slots | Iterations/cycle | Parallel? | 1021 + |---------|---------|------------|-------------------|-----------| 1022 + | Self-loop accumulator | 0 | 1 (the ADD) | ~1/8 (bus RT) | yes (per-ctx) | 1023 + | Permit-token throttle | 0 | K+2 (permits + GATE) | K in flight | yes | 1024 + | Counted loop control | 0 | 4 (CONST+INC+LT+SWITCH) | ~1/8 (bus RT) | no (sequential) | 1025 + | Binary reduction tree | 0 | K-1 (one per ADD) | log2(K) levels | yes | 1026 + | Predicate register | ~1 chip | +1 bit/instr | saves ~4 cycles/iter | no (shared) | 1027 + | Accumulator register | ~3 chips | 1 (ACC_ADD) | ~1/4 (no bus RT) | no (shared) | 1028 + 1029 + All zero-hardware patterns work with v0. Predicate and accumulator 1030 + registers are independent future additions that compose with the 1031 + existing patterns.
+12 -12
design-notes/sm-design.md
··· 324 324 op_base ext bus opcode internal op addr bits name 325 325 ───────────────────────────────────────────────────────────────── 326 326 000 aa 000 0000 10 (1024) READ 327 - 001 aa 001 0001 10 WRITE 328 - 010 aa 010 0010 10 ALLOC 329 - 011 aa 011 0011 10 FREE 330 - 100 aa 100 0100 10 CLEAR 331 - 101 aa 101 0101 10 EXT (3-flit mode) 327 + 001 aa 001 0001 10 WRITE 328 + 010 aa 010 0010 10 ALLOC 329 + 011 aa 011 0011 10 FREE 330 + 100 aa 100 0100 10 EXEC 331 + 101 aa 101 0101 10 EXT (3-flit mode) 332 332 110 00 11000 0110 8 (256) READ_INC 333 - 110 01 11001 0111 8 READ_DEC 334 - 110 10 11010 1000 8 CAS 335 - 110 11 11011 1001 8 RAW_READ 336 - 111 00 11100 1010 8 EXEC 337 - 111 01 11101 1011 8 SET_PAGE 338 - 111 10 11110 1100 8 WRITE_IMM 339 - 111 11 11111 1101 8 (spare) 333 + 110 01 11001 0111 8 READ_DEC 334 + 110 10 11010 1000 8 CAS 335 + 110 11 11011 1001 8 RAW_READ 336 + 111 00 11100 1010 8 CLEAR 337 + 111 01 11101 1011 8 SET_PAGE 338 + 111 10 11110 1100 8 WRITE_IMM 339 + 111 11 11111 1101 8 (spare) 340 340 ``` 341 341 342 342 'aa' = address bits (part of 10-bit address).
+410
docs/design-plans/2026-02-28-dfasm-macros.md
··· 1 + # dfasm Macros, Function Calls, and Syntax Refinements 2 + 3 + ## Summary 4 + 5 + dfasm is the assembly language for the OR1 dataflow CPU. Programs describe computation as a graph of nodes (instructions) connected by edges (token flows), where each node fires when its inputs arrive. Currently the assembler pipeline lowers source text directly to a flat intermediate representation without any abstraction mechanism — every node must be written out explicitly, and function-like reuse requires manual duplication and careful context-slot management. 6 + 7 + This design adds three capabilities on top of the existing pipeline. First, a macro system lets programmers define named graph templates with parameters and invoke them to expand boilerplate in place; a new `expand` pass inserted between the `lower` and `resolve` stages handles template cloning, parameter substitution, and scope qualification. Second, a function call syntax (`$func a=&x |> @output`) provides a structured way to invoke a named subgraph from one context slot into another, with the expander automatically inserting return trampolines and `free_ctx` nodes so context slots are released at runtime rather than held indefinitely. Third, a trailing-colon syntax change to location directives removes an existing grammar ambiguity, which may allow switching the parser from Earley to LALR. A built-in standard library of common graph patterns (loops, reductions, permit injection) ships as bundled dfasm source, loaded through the same pipeline as user code. 8 + 9 + ## Definition of Done 10 + 11 + 1. **Macro system** — a new IR-level expansion pass (between lower and resolve) that parses macro definitions as ordinary dfasm and lowers them into IR templates, expands macro invocations into fully-qualified IR nodes/edges within scoped namespaces, and supports parameter substitution with token pasting and (ideally) variadic repetition. 12 + 13 + 2.
**Macro definition and invocation syntax** — grammar extensions for `#name |> { body }` definitions with parameters and `#name args...` invocations. `#` sigil owns the entire macro namespace. 14 + 15 + 3. **Function call syntax** — `$func a=&x, b=&y |> @output` generates cross-context edges with CTX_OVRD routing for static calls. `@ret` / `@ret_name` / `@ret:port` built-in nodes inside function bodies identify return points. The expand pass auto-inserts `free_ctx` on return paths. Cross-PE calls supported from the start. 16 + 17 + 4. **Dot-notation scope resolution** — `$func.&label` and `#macro.&label` as user-facing syntax for referencing names inside scoped regions. 18 + 19 + 5. **Location directive disambiguation** — trailing colon on region labels (`@region:`) to eliminate the ambiguity between location directives and node references that currently requires Earley parsing. 20 + 21 + 6. **Built-in macro library** — standard macros (loop control, permit injection, reduction trees, call stubs) shipped as bundled dfasm text, loaded through the same pipeline. 22 + 23 + 7. **Tests** — coverage for macro expansion, function call wiring, scope resolution, location directive syntax, and error cases (undefined macros, wrong arity, scope violations). 
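A minimal sketch of the scope-qualification scheme implied by items 1 and 4; zero-based invocation indices and the exact separator handling are assumptions, not decided behaviour:

```python
# Hypothetical expand-pass helper: each macro invocation gets a fresh
# per-macro index, and labels inside the expanded body are rewritten
# into that invocation's scope (#name_N.&label). A macro expanded
# inside a function body is double-scoped ($func.#name_N.&label).

class Expander:
    def __init__(self):
        self.counts = {}   # macro name -> invocations seen so far

    def scope_for(self, macro: str) -> str:
        n = self.counts.get(macro, 0)
        self.counts[macro] = n + 1
        return f"#{macro}_{n}"

    def qualify(self, scope_stack: list, label: str) -> str:
        return ".".join(scope_stack + [label])

ex = Expander()
s0 = ex.scope_for("loop_counted")
name = ex.qualify([s0], "&counter")
# second invocation, this time inside a function body
s1 = ex.scope_for("loop_counted")
inner = ex.qualify(["$main", s1], "&counter")
```

Two invocations of the same macro get distinct scopes, so their internal labels never collide, which is what lets macro scopes avoid consuming context slots (AC5.4).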
24 + 25 + ## Acceptance Criteria 26 + 27 + ### dfasm-macros.AC1: Macro definitions parse and lower to IR 28 + - **dfasm-macros.AC1.1 Success:** `#name params |> { body }` parses as macro_def and lowers to MacroDef region 29 + - **dfasm-macros.AC1.2 Success:** Macro body containing inst_def, plain_edge, strong_edge, weak_edge all lower into template IRGraph 30 + - **dfasm-macros.AC1.3 Success:** ParamRef placeholders appear in template const fields and edge endpoints 31 + - **dfasm-macros.AC1.4 Failure:** Macro definition with duplicate parameter names produces error 32 + - **dfasm-macros.AC1.5 Failure:** Macro definition with reserved name (@ret) produces error 33 + 34 + ### dfasm-macros.AC2: Macro invocations expand correctly 35 + - **dfasm-macros.AC2.1 Success:** `#name args` expands to scope-qualified nodes (#name_N.&label) 36 + - **dfasm-macros.AC2.2 Success:** Literal parameters substitute into const fields 37 + - **dfasm-macros.AC2.3 Success:** Ref parameters substitute into edge endpoints 38 + - **dfasm-macros.AC2.4 Success:** Nested macro calls expand recursively 39 + - **dfasm-macros.AC2.5 Success:** Macro inside function body gets double-scoped ($func.#macro_N.&label) 40 + - **dfasm-macros.AC2.6 Failure:** Undefined macro invocation produces NAME error with suggestions 41 + - **dfasm-macros.AC2.7 Failure:** Wrong arity produces ARITY error listing expected vs actual 42 + - **dfasm-macros.AC2.8 Failure:** Recursive expansion exceeding depth limit produces MACRO error 43 + 44 + ### dfasm-macros.AC3: Token pasting and constant expressions 45 + - **dfasm-macros.AC3.1 Success:** ParamRef with prefix/suffix concatenates into label names 46 + - **dfasm-macros.AC3.2 Success:** Constant arithmetic ($desc + $idx + 1) evaluates at expansion time 47 + - **dfasm-macros.AC3.3 Failure:** Non-numeric value in arithmetic context produces VALUE error 48 + 49 + ### dfasm-macros.AC4: Static function calls wire correctly 50 + - **dfasm-macros.AC4.1 Success:** `$func a=&x |> 
@out` generates cross-context input edges with ctx_override=True 51 + - **dfasm-macros.AC4.2 Success:** @ret inside function body resolves to return trampoline 52 + - **dfasm-macros.AC4.3 Success:** @ret:L and @ret:R handle dual-output return nodes 53 + - **dfasm-macros.AC4.4 Success:** @ret_name handles named returns, wired via name=@dest at call site 54 + - **dfasm-macros.AC4.5 Success:** free_ctx auto-inserted on every return path 55 + - **dfasm-macros.AC4.6 Success:** Multiple call sites get distinct ctx slots and separate trampolines 56 + - **dfasm-macros.AC4.7 Success:** Cross-PE function calls work (caller and callee on different PEs) 57 + - **dfasm-macros.AC4.8 Success:** Assembled program with function calls runs correctly in emulator 58 + - **dfasm-macros.AC4.9 Failure:** Named arg not matching any function body label produces NAME error 59 + - **dfasm-macros.AC4.10 Failure:** Call to undefined function produces NAME error 60 + 61 + ### dfasm-macros.AC5: Allocator handles new model 62 + - **dfasm-macros.AC5.1 Success:** Context slots assigned per call site, not per function 63 + - **dfasm-macros.AC5.2 Success:** CTX_OVRD (ctx_mode=01) emitted on cross-context edges 64 + - **dfasm-macros.AC5.3 Success:** Auto-trampoline inserted when node needs both const and CTX_OVRD 65 + - **dfasm-macros.AC5.4 Success:** Macro scopes don't consume context slots 66 + - **dfasm-macros.AC5.5 Failure:** Context slot overflow produces RESOURCE error with per-PE breakdown 67 + 68 + ### dfasm-macros.AC6: Location directive disambiguation 69 + - **dfasm-macros.AC6.1 Success:** `@region:` parses as location_dir 70 + - **dfasm-macros.AC6.2 Success:** `@node` without colon in edge context parses as node_ref 71 + - **dfasm-macros.AC6.3 Failure:** Location directive without trailing colon produces PARSE error 72 + 73 + ### dfasm-macros.AC7: Dot-notation scope resolution 74 + - **dfasm-macros.AC7.1 Success:** $func.&label resolves to the qualified name inside the function 75 + - 
**dfasm-macros.AC7.2 Success:** #macro.&label resolves into a macro expansion's scope 76 + - **dfasm-macros.AC7.3 Failure:** Dot-ref into non-existent scope produces SCOPE error 77 + 78 + ### dfasm-macros.AC8: Built-in macro library 79 + - **dfasm-macros.AC8.1 Success:** Built-in macros available without explicit import 80 + - **dfasm-macros.AC8.2 Success:** User macro with same name shadows built-in 81 + - **dfasm-macros.AC8.3 Success:** #loop_counted expands to correct counted loop topology 82 + - **dfasm-macros.AC8.4 Success:** Program using built-in macros assembles and runs in emulator 83 + 84 + ## Glossary 85 + 86 + - **dfasm**: The assembly language for the OR1 dataflow CPU. Programs describe computation as a directed graph of instruction nodes connected by token-carrying edges. 87 + - **Token**: A data packet that travels along edges between nodes. A node fires when all required input tokens arrive. Two families: `CMToken` (computation, targeting PEs) and `SMToken` (memory, targeting SMs). 88 + - **PE (Processing Element)**: Hardware unit that matches incoming token pairs, fetches instructions from IRAM, executes via ALU, and emits output tokens. 89 + - **SM (Structure Memory)**: Hardware unit with I-structure (single-assignment) cell semantics, deferred reads, and atomic operations. 90 + - **IRAM**: Per-PE instruction storage indexed by `(ctx, offset)`. The allocate pass assigns these indices. 91 + - **Context slot (ctx)**: 4-bit tag identifying which computation instance a token belongs to. Enables concurrent activations of the same instructions. 16 slots per PE. 92 + - **IRGraph**: The assembler's intermediate representation containing `IRNode`, `IREdge`, `IRRegion`, and related types. 93 + - **MacroDef**: New IR type representing a macro definition — name, parameters, and body `IRGraph` template with `ParamRef` placeholders. 94 + - **ParamRef**: Placeholder within a macro template for a formal parameter. 
Supports prefix/suffix concatenation for token pasting. 95 + - **expand pass**: New pipeline stage (`asm/expand.py`) between `lower` and `resolve`. Collects macro definitions, expands invocations, wires function calls, inserts return trampolines. 96 + - **Token pasting**: Assembling a new identifier by concatenating a parameter value with literal prefix/suffix (e.g., `&__${func}_ctx_fan` → `&__fib_ctx_fan`). 97 + - **CTX_OVRD**: Instruction encoding (`ctx_mode=01`) that substitutes the output token's context slot from the IRAM const field rather than inheriting from the executing token. Used for cross-context function call edges. 98 + - **Return trampoline**: Synthetic `pass` node auto-inserted on function return paths. Routes the return token to the caller's context via CTX_OVRD and triggers `free_ctx`. 99 + - **`free_ctx`**: Instruction that releases a context slot by rotating its generation counter, invalidating stale tokens. 100 + - **`@ret`**: Reserved built-in node name inside function bodies marking return points. The expand pass replaces it with a return trampoline. 101 + - **Dot-notation**: Syntax (`$func.&label`, `#macro.&label`) for referencing names inside scoped regions from outside. 102 + - **Earley / LALR**: Parsing strategies supported by Lark. Earley handles ambiguous grammars (slower); LALR requires unambiguous grammar (faster). 103 + - **Variadic repetition**: Stretch-goal macro feature where a template body repeats once per variadic argument with an implicit `${_idx}` index. Analogous to Rust `$(...)*`. 104 + 105 + ## Architecture 106 + 107 + Three layers compose this design: grammar changes to the dfasm language, a macro system with IR-level expansion, and function call wiring that builds on both. 
108 + 109 + ### Pipeline Integration 110 + 111 + The assembler pipeline gains a new `expand` pass between `lower` and `resolve`: 112 + 113 + ``` 114 + parse → lower → expand → resolve → place → allocate → codegen 115 + ``` 116 + 117 + The expand pass handles two responsibilities: 118 + 1. **Macro expansion** — collect `MacroDef` regions, process `IRMacroCall` entries, clone IR templates with parameter substitution, splice expanded nodes/edges into the graph. 119 + 2. **Function call wiring** — process call-site syntax (`$func args |> outputs`), generate cross-context input edges, resolve `@ret` markers into return trampolines with auto-inserted `free_ctx`. 120 + 121 + After expand completes, the IR contains only concrete `IRNode`/`IREdge` entries. No `ParamRef` placeholders, no `MacroDef` regions, no `IRMacroCall` entries remain. Resolve sees a normal `IRGraph`. 122 + 123 + ### Grammar Changes 124 + 125 + Five modifications to `dfasm.lark`: 126 + 127 + **Location directive disambiguation.** Trailing colon on region labels eliminates the ambiguity that currently requires Earley parsing: 128 + 129 + ``` 130 + ; Before: 131 + location_dir: qualified_ref 132 + 133 + ; After: 134 + location_dir: qualified_ref ":" 135 + ``` 136 + 137 + This may enable switching from Earley to LALR for a parse speed improvement. 138 + 139 + **Macro definition.** New rule paralleling `func_def`: 140 + 141 + ``` 142 + macro_def: "#" IDENT macro_params? FLOW_OUT "{" (_NL* statement)* _NL* "}" 143 + macro_params: IDENT ("," IDENT)* 144 + ``` 145 + 146 + Example: `#loop_counted init, limit, body, exit |> { ... }` 147 + 148 + **Macro invocation as statement.** Extends `macro_call` beyond `data_def` context: 149 + 150 + ``` 151 + macro_call_stmt: "#" IDENT (argument)* 152 + ``` 153 + 154 + Uses `argument` (which includes `named_arg` and `positional_arg`) for both positional and named parameters. 
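Since `argument` admits both positional and named forms, the expander's first job at an invocation is binding actuals to formals. A minimal Python sketch, assuming a simple `(params, positional, named)` shape rather than the real `IRMacroCall` fields (the function name is hypothetical), with the arity and name checks the acceptance criteria call for:

```python
# Illustrative sketch only -- the real binding logic lives in the expand
# pass and works over IRMacroCall; names here are hypothetical.
def bind_arguments(params: list, positional: list, named: dict) -> dict:
    """Map formal parameter names to actual values, raising on wrong
    arity, unknown names, or double binding (cf. AC2.7, AC4.9)."""
    if len(positional) > len(params):
        raise ValueError(
            f"ARITY: expected {len(params)} arguments, got {len(positional)}")
    subst = dict(zip(params, positional))  # positional args fill in order
    for name, value in named.items():
        if name not in params:
            raise ValueError(f"NAME: no such parameter '{name}'")
        if name in subst:
            raise ValueError(f"ARITY: parameter '{name}' bound twice")
        subst[name] = value
    missing = [p for p in params if p not in subst]
    if missing:
        raise ValueError(f"ARITY: missing arguments for {missing}")
    return subst
```

For example, `bind_arguments(["init", "limit"], ["&a"], {"limit": "&b"})` yields `{"init": "&a", "limit": "&b"}`.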
155 + 156 + **Macro references in edges.** New ref type for `#name` in edge contexts: 157 + 158 + ``` 159 + qualified_ref: (node_ref | label_ref | func_ref | macro_ref | scoped_ref) 160 + placement? port? 161 + macro_ref: "#" IDENT 162 + ``` 163 + 164 + Enables `#macro.&label` as an edge endpoint (referencing into a macro expansion's scope). 165 + 166 + **Dot-notation scope resolution.** Extends `qualified_ref` for cross-scope references: 167 + 168 + ``` 169 + scoped_ref: (func_ref | macro_scope_ref) "." (label_ref | node_ref) 170 + macro_scope_ref: "#" IDENT 171 + ``` 172 + 173 + Supports `$func.&label`, `#macro.&label`, `#macro.@node`. 174 + 175 + ### Macro IR Representation 176 + 177 + New IR types in `asm/ir.py`: 178 + 179 + **`MacroParam`** — formal parameter in a macro definition. Has a `name` and optional `default` value. 180 + 181 + **`MacroDef`** — a macro definition consisting of a name, parameter list, and body `IRGraph` containing `ParamRef` placeholders. Stored as `IRRegion(kind=RegionKind.MACRO)`. 182 + 183 + **`ParamRef`** — placeholder for a macro parameter within the template IR. Carries the formal parameter name plus optional `prefix`/`suffix` strings for token pasting. Appears in: 184 + - `IRNode.const` (widens to `Optional[int | ParamRef]`) 185 + - Edge source/dest fields (widens to `str | ParamRef`) 186 + - Node name fragments (for token-pasted label synthesis like `&__${func}_ctx_fan`) 187 + 188 + **`IRMacroCall`** — a macro invocation in the IR. Carries the macro name, positional args, named args, and source location. Stored in a new `IRGraph.macro_calls` field. 189 + 190 + **`RegionKind.MACRO`** — new enum value. Macro definition regions are consumed by the expand pass and removed before resolve. 191 + 192 + ### Macro Expansion 193 + 194 + The expand pass (`asm/expand.py`) processes the IR in this order: 195 + 196 + 1. **Collect definitions.** Walk regions, extract `RegionKind.MACRO` entries into a `macro_table: dict[str, MacroDef]`. 
Remove them from the graph. 197 + 198 + 2. **Process invocations.** For each `IRMacroCall` (in root graph and recursively in function region bodies): 199 + - Look up macro in table. Error if not found. 200 + - Validate arity against `MacroDef.params`. 201 + - Build substitution map: `{formal_name: actual_value}`. 202 + - Deep-clone the template `IRGraph` body. 203 + - Walk the clone, resolving all `ParamRef` instances: literal substitution for const fields, ref substitution for names/edges, string concatenation for token-pasted names. 204 + - Qualify all `&label` names with expansion scope: `#macroname_N.&label` where N is a global expansion counter. 205 + - Splice expanded nodes and edges into the parent graph. 206 + 207 + 3. **Recursive expansion.** If an expansion contains further macro calls, expand those too. Depth limit of 32 prevents infinite recursion. 208 + 209 + Macros inside function bodies get double-scoped: `$fib.#loop_counted_3.&counter`. The macro scope is for name uniqueness only — it does not allocate a context slot. The enclosing function's context is inherited. 210 + 211 + ### Function Call Wiring 212 + 213 + The expand pass processes function call syntax (`$func a=&x, b=&y |> @output`): 214 + 215 + **Input wiring.** Named arguments are matched to labels inside the function body. `a=&x` generates `IREdge(source="&x", dest="$func.&a", port=L)` with `ctx_override=True`. 216 + 217 + **Return wiring via `@ret`.** `@ret` is a reserved built-in node name recognised inside function bodies. The expand pass: 218 + 1. Finds all edges targeting `@ret` (with optional port qualifiers `:L`/`:R`) or named variants (`@ret_name`). 219 + 2. Creates a return trampoline node — a `pass` instruction that routes the return value back to the caller with CTX_OVRD. 220 + 3. Appends a `free_ctx` node triggered off the same return path to release the context slot. 221 + 4. 
Replaces the `@ret` edge destination with the trampoline, and wires the trampoline's output to the call site's specified destination. 222 + 223 + **Port-qualified returns.** `@ret:L` and `@ret:R` handle dual-output return nodes (e.g., a switch at the function boundary). **Named returns.** `@ret_name` handles multiple independent return paths, wired at the call site via `$func args |> name=@dest1, name2=@dest2`. 224 + 225 + **Context slot allocation.** Each call site allocates a fresh context slot on the PE(s) where the function body lives. The function's IRAM instructions are shared across all call sites. The `@ret` trampoline is duplicated per call site (each at a unique monadic IRAM offset) with the caller-specific return destination. 226 + 227 + **`free_ctx` on return paths.** Auto-inserted by the expand pass. The return trampoline fans out to both the caller destination and a `free_ctx` node. This makes context slots a concurrency budget rather than a program-wide limit — slots are reused at runtime via generation counter rotation. 228 + 229 + **Return strategy extensibility.** The trampoline approach works on all hardware. Future CHANGE_TAG-based dynamic returns can be substituted as an alternative strategy without changing the call-site syntax. The expansion logic should be structured so the return wiring strategy is a pluggable decision point (e.g., a strategy parameter on the expander or a system-level config). 230 + 231 + **Multiple call sites.** Multiple call sites to the same function are supported. Each gets its own ctx slot and return trampoline. The cost is 1 monadic IRAM slot per `@ret` per call site. The assembler warns when context utilisation is high and errors on overflow. 232 + 233 + ### Allocator Changes 234 + 235 + **Context slot assignment.** Rule changes from "one ctx per function scope per PE" to: 236 + - Root scope gets ctx=0. 237 + - Each *call site* to a function allocates a fresh ctx slot. 
238 + - Functions with no call sites (only direct edge wiring) retain a ctx slot by the existing scope rule. 239 + 240 + **Trampoline IRAM allocation.** Duplicated `@ret` nodes are monadic pass-through instructions allocated in the monadic offset range (32+). First call site reuses the original offset; subsequent call sites get new slots. 241 + 242 + **CTX_OVRD emission.** Cross-context edges (marked `ctx_override=True`) cause the allocator to set `ctx_mode=01` on the source instruction, packing `[target_ctx:4][target_gen:2][spare:2]` into the const field. Conflict detection: if a node needs both an ALU const and CTX_OVRD, the assembler auto-inserts a pass-through trampoline. 243 + 244 + **Macro scope handling.** `_extract_function_scope()` updated to recognise `#macro_N` scope segments. Macro scopes do not allocate context slots — only `$func` scopes do. 245 + 246 + ### Built-in Macro Library 247 + 248 + Standard macros shipped as a dfasm string constant in `asm/builtins.py`, prepended to user source before parsing. Goes through the same parse → lower pipeline as user code. 249 + 250 + Initial library: 251 + 252 + | Macro | Purpose | Parameters | 253 + |-------|---------|------------| 254 + | `#loop_counted` | Counted loop with feedback arc | init, limit, body, exit | 255 + | `#loop_while` | Condition-tested loop | test_node, body, exit | 256 + | `#permit_inject_1` through `_4` | Inject K permit tokens | gate | 257 + | `#reduce_2` through `_4` | Binary reduction tree | op, inputs, output | 258 + | `#call_stub_1`, `_2` | Dynamic call stub (future) | func, desc | 259 + 260 + Per-arity variants (`_1`, `_2`, etc.) are used until variadic repetition is implemented, at which point they collapse into single generic macros. 261 + 262 + `@ret` is NOT a library macro — it is a built-in keyword recognised by the expand pass. 
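Because built-ins are prepended as ordinary source, they reach the macro table first, and a "last definition wins" rule falls out of plain dict insertion order. A hedged sketch of that load order (data shapes hypothetical; the real collect step walks `RegionKind.MACRO` regions rather than tuples):

```python
# Sketch of the load order that produces built-in shadowing.
def collect_macro_table(definitions):
    """Later definitions overwrite earlier ones ("last wins")."""
    table = {}
    for name, body in definitions:
        table[name] = body  # a user macro overwrites a same-named built-in
    return table

builtin_defs = [("loop_counted", "<builtin body>"), ("reduce_2", "<builtin body>")]
user_defs = [("loop_counted", "<user body>")]  # shadows the built-in
table = collect_macro_table(builtin_defs + user_defs)
```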
263 + 264 + If a user defines a macro with the same name as a built-in, the user's definition shadows the built-in (last definition wins in the macro table). 265 + 266 + ## Existing Patterns 267 + 268 + Investigation of the existing `asm/` pipeline revealed these patterns this design follows: 269 + 270 + **Anonymous node synthesis.** `lower.py` already generates synthetic `&__anon_N` nodes for strong/weak edges via `_wire_anonymous_node()`. Macro expansion follows the same pattern at larger scale — synthetic nodes with qualified names, bundled with their edges as a composite result. 271 + 272 + **Name qualification via `_process_statements`.** The lowering pass qualifies `&label` names with `$func.` prefixes by walking statement results and calling `_qualify_name()`. Macro expansion applies the same qualification with `#macroname_N.` prefixes. 273 + 274 + **`IRGraph.update_graph_nodes()` for recursive updates.** Existing utility for modifying nodes while preserving region structure. The expand pass uses this for in-place resolution of `ParamRef` values. 275 + 276 + **`collect_all_nodes()` flattening.** Resolve already flattens all nodes from nested regions into a single namespace. Expanded macro nodes integrate naturally — as long as names follow the `scope.&label` convention, resolve picks them up. 277 + 278 + **Frozen dataclasses with `replace()`.** All IR types are frozen. Passes produce new instances via `dataclasses.replace()`. The expand pass follows this convention. 279 + 280 + **Divergence: new IR types.** `ParamRef`, `MacroDef`, `IRMacroCall` are new concepts with no existing analogues. `IRNode.const` and edge source/dest field types widen to accommodate `ParamRef`. This is the main structural change to the IR. 
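Combining the frozen-dataclass convention with the widened field types, `ParamRef` resolution might look like the following sketch — field sets are cut down to the minimum and the real `IRNode` carries more, so treat this as illustrative:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ParamRef:
    name: str
    prefix: str = ""   # token-pasting affixes
    suffix: str = ""

@dataclass(frozen=True)
class IRNode:
    name: object          # str | ParamRef, per the widened types
    const: object = None  # Optional[int | ParamRef]

def paste(ref: ParamRef, subst: dict) -> str:
    """Token pasting: concatenate affixes around the actual value."""
    return ref.prefix + str(subst[ref.name]) + ref.suffix

def resolve_node(node: IRNode, subst: dict) -> IRNode:
    """Return a new node with placeholders substituted, using
    dataclasses.replace() per the frozen-IR convention."""
    name = paste(node.name, subst) if isinstance(node.name, ParamRef) else node.name
    const = subst[node.const.name] if isinstance(node.const, ParamRef) else node.const
    return replace(node, name=name, const=const)
```

With `subst = {"func": "fib", "k": 7}`, a node named `ParamRef("func", "&__", "_ctx_fan")` resolves to `&__fib_ctx_fan`, matching the token-pasting example in the glossary.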
281 + 282 + <!-- START_PHASE_1 --> 283 + ### Phase 1: Grammar and Location Directive 284 + 285 + **Goal:** Update the grammar for trailing-colon location directives, macro definition/invocation syntax, macro references, and dot-notation scope resolution. Update the lower pass to handle the new productions. 286 + 287 + **Components:** 288 + - `dfasm.lark` — add `macro_def`, `macro_call_stmt`, `macro_ref`, `scoped_ref` rules; modify `location_dir` to require trailing colon; extend `qualified_ref` 289 + - `asm/lower.py` — add transformer methods for `macro_def`, `macro_call_stmt`, `macro_ref`, `scoped_ref`; update `location_dir` handler for new syntax 290 + - `asm/ir.py` — add `RegionKind.MACRO`, `MacroParam`, `MacroDef`, `IRMacroCall`, `ParamRef` types; add `macro_calls` field to `IRGraph` 291 + - Existing dfasm test fixtures and example programs — update location directives to use trailing colon syntax 292 + 293 + **Dependencies:** None (first phase) 294 + 295 + **Done when:** Grammar parses all new syntax forms. Lower pass produces `MacroDef` regions and `IRMacroCall` entries in the IR. Location directives require trailing colon. Existing tests updated and passing. Earley-to-LALR switch evaluated (attempted if feasible). 296 + <!-- END_PHASE_1 --> 297 + 298 + <!-- START_PHASE_2 --> 299 + ### Phase 2: Macro Expansion Pass — Core 300 + 301 + **Goal:** Implement the expand pass with basic parameter substitution (no token pasting or repetition yet). 302 + 303 + **Components:** 304 + - `asm/expand.py` — new module: `MacroExpander` class with `expand(graph) -> IRGraph`, template cloning, parameter substitution, scope qualification with expansion counter 305 + - `asm/__init__.py` — integrate expand into the pipeline between lower and resolve 306 + - `asm/ir.py` — `IRGraph` utility methods for splicing expanded nodes/edges 307 + 308 + **Dependencies:** Phase 1 (grammar and IR types) 309 + 310 + **Done when:** Macros with literal and ref parameters expand correctly. 
Expanded nodes are scope-qualified (`#macro_N.&label`). Expanded IR passes through resolve, place, allocate, and codegen without errors. Recursive expansion works with depth limit. 311 + <!-- END_PHASE_2 --> 312 + 313 + <!-- START_PHASE_3 --> 314 + ### Phase 3: Token Pasting and Constant Expressions 315 + 316 + **Goal:** Extend macro expansion to support `ParamRef` with prefix/suffix (token pasting) and basic constant arithmetic in macro arguments. 317 + 318 + **Components:** 319 + - `asm/expand.py` — `ParamRef` resolution with prefix/suffix concatenation, constant expression evaluator for `$desc + $idx + 1` style expressions 320 + - `asm/ir.py` — ensure `ParamRef` with prefix/suffix is handled in all contexts (node names, edge endpoints, const fields) 321 + 322 + **Dependencies:** Phase 2 (core expansion) 323 + 324 + **Done when:** Token-pasted labels generate correctly (e.g., `&__${func}_ctx_fan` → `&__fib_ctx_fan`). Constant arithmetic in macro arguments evaluates at expansion time. Error messages trace back to macro definition source locations. 325 + <!-- END_PHASE_3 --> 326 + 327 + <!-- START_PHASE_4 --> 328 + ### Phase 4: Function Call Wiring — Static Calls 329 + 330 + **Goal:** Implement `$func a=&x, b=&y |> @output` syntax with `@ret` resolution, return trampolines, auto-inserted `free_ctx`, and CTX_OVRD edge marking. 331 + 332 + **Components:** 333 + - `asm/expand.py` — function call wiring logic: argument matching, `@ret` resolution, trampoline generation, `free_ctx` insertion, `CallSite` metadata production 334 + - `asm/ir.py` — `CallSite` dataclass, `ctx_override` field on `IREdge` 335 + - `asm/allocate.py` — per-call-site context slot assignment, trampoline IRAM allocation, CTX_OVRD emission (ctx_mode=01 with packed const), auto-trampoline insertion for const+CTX_OVRD conflicts 336 + 337 + **Dependencies:** Phase 2 (expansion pass exists to host the wiring logic) 338 + 339 + **Done when:** Static function calls generate correct cross-context edges. 
Return trampolines are allocated in monadic IRAM range. `free_ctx` is auto-inserted on return paths. Multiple call sites to the same function each get distinct ctx slots and trampolines. CTX_OVRD is correctly emitted in codegen. End-to-end test: a program with function calls assembles and runs correctly in the emulator. 340 + <!-- END_PHASE_4 --> 341 + 342 + <!-- START_PHASE_5 --> 343 + ### Phase 5: Allocator Updates 344 + 345 + **Goal:** Update the allocator for the new context slot model and macro scope handling. 346 + 347 + **Components:** 348 + - `asm/allocate.py` — rewrite `_assign_context_slots()` for call-site-driven allocation; update `_extract_function_scope()` to handle `#macro_N` scope segments; add ctx budget warnings and overflow errors with diagnostic messages suggesting inlining 349 + - `asm/codegen.py` — emit CTX_OVRD on cross-context edges, handle trampoline nodes in both direct and token stream modes 350 + 351 + **Dependencies:** Phase 4 (call site metadata and edge annotations exist) 352 + 353 + **Done when:** Context slots assigned per call site. Macro scopes don't consume ctx slots. Budget warnings emitted when utilisation exceeds 75%. Overflow errors include per-PE breakdown and actionable suggestions. Codegen produces correct IRAM words with ctx_mode=01. 354 + <!-- END_PHASE_5 --> 355 + 356 + <!-- START_PHASE_6 --> 357 + ### Phase 6: Variadic Repetition (Stretch) 358 + 359 + **Goal:** Support `$($arg),*` style variadic repetition in macro bodies, with implicit index `${_idx}`. 360 + 361 + **Components:** 362 + - `dfasm.lark` — repetition syntax within macro bodies (e.g., `$( ... 
),*` delimiters) 363 + - `asm/lower.py` — parse repetition blocks into IR template representation 364 + - `asm/expand.py` — expand repetition blocks by iterating over variadic arguments, incrementing `${_idx}` per iteration 365 + - `asm/ir.py` — IR representation for repetition blocks within `MacroDef` body templates 366 + 367 + **Dependencies:** Phase 3 (token pasting, as repetition bodies commonly use it) 368 + 369 + **Done when:** Variadic macros expand correctly. Per-arity built-in macros (`#permit_inject_1` through `_4`) can be replaced with single generic versions. `${_idx}` produces correct indices. 370 + <!-- END_PHASE_6 --> 371 + 372 + <!-- START_PHASE_7 --> 373 + ### Phase 7: Built-in Macro Library 374 + 375 + **Goal:** Author and ship the standard macro library as bundled dfasm. 376 + 377 + **Components:** 378 + - `asm/builtins.py` — string constant containing built-in macro definitions, loaded and prepended to user source in the pipeline entry points 379 + - `asm/__init__.py` — prepend `BUILTIN_MACROS` in `assemble()`, `assemble_to_tokens()`, `run_pipeline()` 380 + 381 + **Dependencies:** Phase 2 (core expansion), Phase 3 (token pasting for call stubs), Phase 6 (variadic repetition, if available — otherwise ship per-arity variants) 382 + 383 + **Done when:** Built-in macros are available in all programs without explicit import. `#loop_counted`, `#loop_while`, `#permit_inject_N`, `#reduce_N` all expand correctly. User-defined macros shadow built-ins. Integration test: a program using built-in macros assembles and runs in the emulator. 384 + <!-- END_PHASE_7 --> 385 + 386 + <!-- START_PHASE_8 --> 387 + ### Phase 8: Error Quality and Documentation 388 + 389 + **Goal:** Polish error messages for macro-related failures and update documentation. 
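One way to render the expansion-trace annotations: a sketch assuming a simple `(macro_name, line)` stack, rather than whatever shape the real location threading uses (function name and output format are illustrative, not the real `asm/errors.py` API):

```python
def format_macro_error(category: str, message: str, expansion_stack) -> str:
    """Primary diagnostic line plus one 'expanded from' note per
    expansion frame. Sketch only."""
    lines = [f"error[{category}]: {message}"]
    for macro_name, line_no in expansion_stack:
        lines.append(f"  = note: expanded from #{macro_name} at line {line_no}")
    return "\n".join(lines)
```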
390 + 391 + **Components:** 392 + - `asm/errors.py` — new error categories: `MACRO` (undefined macro, arity mismatch, expansion depth exceeded, reserved name collision), `CALL` (undefined function, argument mismatch, ctx overflow) 393 + - `asm/expand.py` — source location threading: error messages reference both the macro call site and the relevant position within the macro definition 394 + - `design-notes/dfasm-primer.md` — update with macro definition/invocation syntax, function call syntax, `@ret`, trailing-colon location directives, dot-notation 395 + - `design-notes/assembler-architecture.md` — update pipeline description to include expand pass 396 + 397 + **Dependencies:** All previous phases 398 + 399 + **Done when:** All error paths produce Rust-style formatted messages with source context. Macro errors include "expanded from #macro at line N" annotations. Documentation reflects all new syntax and pipeline changes. 400 + <!-- END_PHASE_8 --> 401 + 402 + ## Additional Considerations 403 + 404 + **Earley to LALR.** The trailing-colon location directive change may eliminate the grammar ambiguity that requires Earley parsing. This should be evaluated in Phase 1 — if LALR works, switch to it for a meaningful parse speed improvement. If other ambiguities remain, stay on Earley. 405 + 406 + **Return strategy extensibility.** The trampoline return approach is the only strategy implemented in this design. Future CHANGE_TAG-based dynamic returns can be added as an alternative without changing call-site syntax. The expand pass should structure return wiring as a strategy dispatch point to make this straightforward. 407 + 408 + **Context slot pressure.** With 4-bit ctx (16 slots) and `free_ctx` enabling runtime reuse, the practical limit is concurrent activations, not total call sites. The assembler's static analysis cannot generally determine maximum concurrency in a dataflow program. 
The 75% warning threshold is a heuristic — programmers must reason about concurrency themselves. 409 + 410 + **`@ret` reservation.** The `@ret` prefix is reserved globally. Any user-defined node starting with `@ret` produces an error. This is a small namespace restriction in exchange for clean return syntax.
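The per-PE budget heuristic described under context slot pressure can be made concrete with a small sketch (threshold, message wording, and function name are illustrative):

```python
CTX_SLOTS_PER_PE = 16  # 4-bit ctx tag
WARN_THRESHOLD = 0.75  # heuristic warning threshold from this design

def check_ctx_budget(used_per_pe: dict) -> list:
    """Warn above 75% per-PE utilisation; raise on overflow with a
    per-PE breakdown (cf. AC5.5). Sketch only."""
    warnings = []
    for pe, used in sorted(used_per_pe.items()):
        if used > CTX_SLOTS_PER_PE:
            detail = ", ".join(f"PE{p}: {u}/{CTX_SLOTS_PER_PE}"
                               for p, u in sorted(used_per_pe.items()))
            raise ValueError(f"RESOURCE: context slot overflow ({detail})")
        if used / CTX_SLOTS_PER_PE > WARN_THRESHOLD:
            warnings.append(f"PE{pe}: {used}/{CTX_SLOTS_PER_PE} ctx slots in use")
    return warnings
```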