OR-1 dataflow CPU sketch

docs updates, pulled in datasheets

Orual 3416127c 9aaae175

+653 -229
+2 -2
CLAUDE.md
··· 68 68 - `asm/errors.py` — Structured error types with source context (ErrorCategory includes MACRO and CALL) 69 69 - `asm/opcodes.py` — Opcode mnemonic mapping and arity classification 70 70 - `asm/lower.py` — CST to IRGraph lowering pass 71 - - `asm/expand.py` — Macro expansion and function call wiring pass 72 - - `asm/builtins.py` — Built-in macro library 71 + - `asm/expand.py` — Macro expansion (opcode params, parameterized qualifiers, `@ret` wiring, variadic repetition) and function call wiring pass 72 + - `asm/builtins.py` — Built-in macro library (`#loop_counted`, `#loop_while`, `#permit_inject`, `#reduce_2`/`_3`/`_4`) 73 73 - `asm/resolve.py` — Name resolution pass 74 74 - `asm/place.py` — Placement validation and auto-placement 75 75 - `asm/allocate.py` — IRAM offset and context slot allocation
+5 -3
asm/CLAUDE.md
··· 15 15 ## Pipeline Passes 16 16 17 17 1. **Lower** (`lower.py`): Lark CST -> IRGraph. Creates IRNodes, IREdges, IRRegions (function/location scopes), IRDataDefs, SystemConfig from @system pragma. Qualifies names with function scope (e.g., `$main.&add`). May contain MacroCall nodes and MacroDef regions. 18 - 2. **Expand** (`expand.py`): Macro expansion and function call wiring. Clones macro bodies, substitutes parameters, evaluates const expressions, qualifies expanded names with scope prefixes. Processes function call sites, allocates context slots per call. After expand, IR contains only concrete IRNode/IREdge entries. No ParamRef placeholders, no MacroDef regions, no IRMacroCall entries remain. 18 + 2. **Expand** (`expand.py`): Macro expansion and function call wiring. Clones macro bodies, substitutes parameters (including opcodes via `${op}`, placement via `|${pe}`, ports via `:${port}`, context slots via `[${ctx}]`), evaluates const expressions, expands variadic repetition blocks, rewrites `@ret`/`@ret_name` macro outputs, qualifies expanded names with scope prefixes. Processes function call sites with `@ret` trampolines, `free_ctx` insertion, and cross-context wiring. After expand, IR contains only concrete IRNode/IREdge entries. No ParamRef placeholders, no MacroDef regions, no IRMacroCall entries remain. 19 19 3. **Resolve** (`resolve.py`): Validates all edge endpoints exist. Detects scope violations (cross-function label refs). Generates Levenshtein "did you mean" suggestions. 20 20 4. **Place** (`place.py`): Validates explicit PE placements. Auto-places unplaced nodes via greedy bin-packing with locality heuristic (prefer PE with most connected neighbours). 21 21 5. **Allocate** (`allocate.py`): Assigns IRAM offsets (dyadic first, then monadic). Assigns context slots (one per function scope per PE). Resolves symbolic destinations to `Addr(a, port, pe)`. 
··· 33 33 - `TypeAwareOpToMnemonicDict` and `TypeAwareMonadicOpsSet` in opcodes.py: required because IntEnum subclasses share numeric values across types (e.g., `ArithOp.ADD == 0 == MemOp.READ`), so plain dict/set lookups would collide 34 34 - Errors use `IRGraph.errors` accumulation: all issues are reported rather than stopping at the first error 35 35 - `#` sigil for macro namespace: avoids collision with other sigils ($, &, @) 36 - - `@ret` reserved prefix for return markers: qualifies return label references in function calls 36 + - `@ret` reserved prefix for return markers: in function bodies, creates trampolines with cross-context routing and `free_ctx`; in macro bodies, rewrites edges to call-site destinations (no context management) 37 37 - Per-call-site context slot allocation: each function call site gets its own context slot, managed by CallSite metadata 38 - - Built-in macros prepended to user source: system macro definitions are automatically available in every program 38 + - Opcode parameters (`${op}`) resolved via `MNEMONIC_TO_OP`: enables generic macros like `#reduce_2 add` 39 + - Parameterized qualifiers (`|${pe}`, `:${port}`, `[${ctx}]`) resolved during expansion via `PlacementRef`, `PortRef`, `CtxSlotRef` 40 + - Built-in macros prepended to user source: `#loop_counted`, `#loop_while`, `#permit_inject` (variadic), `#reduce_2`/`_3`/`_4` (parameterized opcode) 39 41 40 42 ## Invariants 41 43
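The resolve pass described in this hunk generates Levenshtein "did you mean" suggestions for unresolved names. A minimal sketch of that mechanism; function names and the distance threshold here are illustrative, not the actual `asm/resolve.py` API:

```python
# Sketch of a "did you mean" suggester: classic DP edit distance plus
# a nearest-known-name lookup. Illustrative only; the real resolve
# pass's API and threshold may differ.

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(unknown: str, known: list[str], max_distance: int = 2):
    """Return the closest known name within max_distance, else None."""
    best = min(known, key=lambda k: levenshtein(unknown, k), default=None)
    if best is not None and levenshtein(unknown, best) <= max_distance:
        return best
    return None
```

Python's stdlib `difflib.get_close_matches` would serve the same purpose; the explicit version makes the distance metric visible.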
+8 -36
asm/builtins.py
··· 45 45 &gate |> @ret_exit:R 46 46 } 47 47 48 - ; --- Permit injection (per-arity variants) --- 49 - ; Each injects const 1 tokens. Outputs wired via @ret. 50 - ; Call with: #permit_inject_1 |> out=&gate_node 51 - #permit_inject_1 |> { 52 - &p0 <| const, 1 53 - &p0 |> @ret_out 54 - } 55 - 56 - ; Call with: #permit_inject_2 |> out0=&gate_a, out1=&gate_b 57 - #permit_inject_2 |> { 58 - &p0 <| const, 1 59 - &p1 <| const, 1 60 - &p0 |> @ret_out0 61 - &p1 |> @ret_out1 62 - } 63 - 64 - ; Call with: #permit_inject_3 |> out0=&g0, out1=&g1, out2=&g2 65 - #permit_inject_3 |> { 66 - &p0 <| const, 1 67 - &p1 <| const, 1 68 - &p2 <| const, 1 69 - &p0 |> @ret_out0 70 - &p1 |> @ret_out1 71 - &p2 |> @ret_out2 72 - } 73 - 74 - ; Call with: #permit_inject_4 |> out0=&g0, out1=&g1, out2=&g2, out3=&g3 75 - #permit_inject_4 |> { 76 - &p0 <| const, 1 77 - &p1 <| const, 1 78 - &p2 <| const, 1 79 - &p3 <| const, 1 80 - &p0 |> @ret_out0 81 - &p1 |> @ret_out1 82 - &p2 |> @ret_out2 83 - &p3 |> @ret_out3 48 + ; --- Permit injection (variadic) --- 49 + ; Injects one const(1) seed token per target. 50 + ; Call with: #permit_inject &gate_a, &gate_b, &gate_c 51 + #permit_inject *targets |> { 52 + $( 53 + &p <| const, 1 54 + &p |> ${targets} 55 + ),* 84 56 } 85 57 86 58 ; --- Binary reduction trees (parameterized opcode) ---
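The variadic rewrite above collapses four per-arity `#permit_inject_N` macros into one `$( ),*` repetition block. A rough Python model of how such a block expands, one clone per variadic argument with body-local labels suffixed by the iteration index; this assumes a simplified line-based substitution, not the real CST-based `asm/expand.py` logic:

```python
# Toy model of variadic repetition expansion. The real expand pass
# renames every body-local label, not just "&p"; this sketch only
# shows the per-argument cloning and ${param}/${_idx} binding.

def expand_repetition(body_lines: list[str], param: str,
                      args: list[str]) -> list[str]:
    out = []
    for idx, arg in enumerate(args):
        for line in body_lines:
            clone = (line.replace("&p", f"&p_{idx}")        # uniquify label
                         .replace("${" + param + "}", arg)  # bind element
                         .replace("${_idx}", str(idx)))     # iteration index
            out.append(clone)
    return out
```

For `#permit_inject &gate_a, &gate_b` this yields two const(1) seed nodes, each wired to its own target.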
+46 -32
design-notes/alu-and-output-design.md
··· 230 230 not-taken side instead of the data value. See the Output Formatter section 231 231 for the not-taken trigger semantics. 232 232 233 - **Reserved opcode space (2 slots):** future candidates include hardware 234 - multiply, predicate store read/write, and debug/trace instructions. 233 + **Reserved opcode space (2 slots):** candidates include `MAP_PAGE` 234 + (write '610 mapping register) and `SET_PAGE` (select active IRAM 235 + bank via page latch — see `pe-design.md` IRAM Bank Switching section), 236 + hardware multiply, predicate store read/write, and debug/trace 237 + instructions. `MAP_PAGE` + `SET_PAGE` together cost 2 opcode slots 238 + but enable banked IRAM with one '610 + one latch per PE. 235 239 236 240 ### SM Instruction Dispatch 237 241 ··· 417 421 result:8/16 computed or passthrough data value 418 422 bool_out:1 boolean signal for SWITCH/GATE 419 423 420 - From IRAM instruction word (latched since stage 3): 424 + From IRAM half 1 (latched during Stage 3, cycle 2): 425 + has_dest2:1 bit 15: single vs dual destination 421 426 dest1_PE:2 target PE 422 - dest1_offset:4-7 instruction address at target (width varies by format) 423 - dest1_ctx:4 context slot at target 427 + dest1_offset:5 instruction address at target 424 428 dest1_port:1 L/R operand 425 - dest1_gen:2 generation counter for target slot 426 - dest1_type:2 output token format selector 427 - dest1_not_taken_op:1 for SWITCH: 0=NOOP, 1=FREE on not-taken trigger 429 + const_ext:7 (single-dest only) extended constant for CONST16 430 + dest2_PE:2 (dual-dest only) second target PE 431 + dest2_offset:5 (dual-dest only) second target offset 428 432 429 - dest2_PE:2 (same fields for second destination) 430 - dest2_offset:4-7 431 - dest2_ctx:4 432 - dest2_port:1 433 - dest2_gen:2 434 - dest2_type:2 435 - dest2_not_taken_op:1 433 + From IRAM half 0 (latched during Stage 3, cycle 1): 434 + ctx_mode:2 00=INHERIT, 01=CTX_OVRD, 10=CHANGE_TAG 435 + 436 + From pipeline latches (ctx_mode 00, INHERIT): 
437 + ctx:4 inherited from executing token 438 + gen:2 inherited from executing token 436 439 437 - has_dest2:1 whether dest2 is active (enables DUAL mode) 440 + From IRAM const field (ctx_mode 01, CTX_OVRD): 441 + ctx:4 const[7:4] 442 + gen:2 const[3:2] 443 + 444 + From left operand bypass latch (ctx_mode 10, CHANGE_TAG): 445 + flit_1:16 entire flit 1 comes from left operand data 438 446 439 447 From decoder EEPROM: 440 448 output_mode:2 DUAL / SINGLE / SUPPRESS / SWITCH 441 449 output_data_sel:1 result = ALU output vs passthrough input 442 450 ``` 451 + 452 + See `iram-and-function-calls.md` for the definitive IRAM half 0/half 1 453 + bit layouts and ctx_mode semantics. 443 454 444 455 ### Output Modes 445 456 ··· 581 592 582 593 ### Source of ctx and gen in Output Tokens 583 594 584 - All destination fields (PE, offset, ctx, port, gen) come from the IRAM 585 - instruction word. They are fully specified at compile/load time. 595 + The ctx and gen fields in output tokens are controlled by the `ctx_mode` 596 + field in IRAM half 0 (2 bits). See `iram-and-function-calls.md` for 597 + full details. Summary: 586 598 587 - - **Same-activation outputs** (result feeds next instruction in same 588 - function fragment, same PE): ctx in the IRAM dest field matches the 589 - input token's ctx. The compiler set this up when it assigned context 590 - slots. 591 - - **Cross-activation outputs** (function calls, returns, cross-PE): ctx 592 - in the IRAM dest field is the pre-reserved context slot for the target 593 - activation. The compiler reserved these slots at compile time. 594 - - **gen for the destination slot**: known at compile time for statically- 595 - scheduled code. For dynamic scenarios (future), gen would come from 596 - the target PE's slot allocator, communicated via a setup token during 597 - function linkage. 599 + - **INHERIT (ctx_mode 00):** ctx and gen are inherited from the executing 600 + token's pipeline latches. 
The output token arrives at the destination 601 + in the same context as the originating computation. This is the common 602 + case for intra-function edges. 603 + - **CTX_OVRD (ctx_mode 01):** ctx and gen are taken from the IRAM const 604 + field (bits [7:4] → ctx, bits [3:2] → gen). Used for function call 605 + wiring where the output token must arrive in a different context slot 606 + than the one it was computed in. The compiler bakes the target context 607 + into the const field at load time. 608 + - **CHANGE_TAG (ctx_mode 10):** the entire flit 1 of the output token 609 + is taken from the left operand data, bypassing normal token formation. 610 + Used for return routing where the return address was carried as data 611 + through the function body. 598 612 599 - v0 assumption: all ctx and gen values are compile-time constants baked 600 - into IRAM instruction words. Dynamic context allocation is a future 601 - extension. 613 + PE and offset always come from IRAM half 1 (except in CHANGE_TAG mode). 614 + Port comes from IRAM half 1 dest1_port (single-dest) or is implied by 615 + the token format. 602 616 603 617 --- 604 618
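The ctx_mode summary in this hunk can be modelled as a small selector. A behavioural sketch, not the hardware: it assumes the bit positions stated in the text (ctx = const[7:4], gen = const[3:2]).

```python
# Behavioural model of ctx/gen source selection per ctx_mode.
# Encodings follow the text: 00=INHERIT, 01=CTX_OVRD, 10=CHANGE_TAG.

INHERIT, CTX_OVRD, CHANGE_TAG = 0b00, 0b01, 0b10

def output_ctx_gen(ctx_mode: int, token_ctx: int, token_gen: int,
                   const_field: int):
    if ctx_mode == INHERIT:
        # Inherited from the executing token's pipeline latches
        return token_ctx, token_gen
    if ctx_mode == CTX_OVRD:
        # Compiler-baked target context from the IRAM const field
        return (const_field >> 4) & 0xF, (const_field >> 2) & 0x3
    if ctx_mode == CHANGE_TAG:
        # Entire flit 1 comes from left operand data; no ctx/gen
        # is formed by this path
        return None
    raise ValueError(f"reserved ctx_mode {ctx_mode:#04b}")
```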
+11 -8
design-notes/architecture-overview.md
··· 51 51 ### Data Width (Tentative) 52 52 - **16-bit** data words within PEs and SM (see `bus-architecture-and-width-decoupling.md`) 53 53 - **16-bit external bus**, with multi-flit token encoding (2 flits standard) 54 - - Three independent width domains: external bus (16-bit), IRAM (32-48 bits, 55 - decoupled), PE pipeline registers (wider, decomposed into parallel data 56 - and control paths) 54 + - Three independent width domains: external bus (16-bit), IRAM (32-bit 55 + effective via two-half read, upgradeable to parallel 32-bit; see 56 + `iram-and-function-calls.md`), PE pipeline registers (wider, decomposed 57 + into parallel data and control paths) 57 58 - Width conversion at FIFO boundaries via serialisers/deserialisers 58 59 59 60 > **⚠ Tentative:** 16-bit is the working assumption for the emulator and ··· 192 193 - **Generation counter only on dyadic tokens**: prevents ABA problem when 193 194 context slots are reused. Monadic and SM tokens don't need it. 194 195 - **Width domains are independent**: bus width (16-bit), token format 195 - (variable flit count), IRAM width (32-48 bits), and PE pipeline width 196 - (wider, decomposed) are each sized for their own constraints. See 197 - `bus-architecture-and-width-decoupling.md` for the full analysis. 196 + (variable flit count), IRAM width (32-bit effective, two-half read), 197 + and PE pipeline width (wider, decomposed) are each sized for their 198 + own constraints. See `bus-architecture-and-width-decoupling.md` and 199 + `iram-and-function-calls.md` for the full analysis. 
198 200 199 201 ## Module Taxonomy 200 202 201 203 ### CM (Control Module) — execution and matching 202 204 - Instruction memory (IM / IRAM): stores dataflow program (function bodies) 203 - - Width decoupled from bus: 32-48 bits, sized for opcode + destination 204 - encoding (see `bus-architecture-and-width-decoupling.md`) 205 + - Width decoupled from bus: 32-bit effective (two 8-bit halves), 206 + sized for opcode + destination encoding. See 207 + `iram-and-function-calls.md` for bit-level format. 205 208 - **Runtime-writable** via IRAM write tokens (prefix `011+01`) 206 209 - Write from network stalls the pipeline (acceptable for config operations) 207 210 - Enables runtime reprogramming and eliminates need for separate config bus
+16 -4
design-notes/assembler-architecture.md
··· 138 138 139 139 1. Collect all `MacroDef` entries from the graph (including built-in macros prepended to every program) 140 140 2. For each `IRMacroCall`, clone the macro's body template 141 - 3. Substitute `ParamRef` placeholders with actual argument values 142 - 4. Evaluate `ConstExpr` arithmetic expressions (supports `+`, `-`, `*` on integers and `_idx`) 141 + 3. Substitute `ParamRef` placeholders with actual argument values in all contexts: 142 + - **Const fields**: literal value substitution 143 + - **Edge endpoints**: node reference substitution 144 + - **Node names**: token pasting with prefix/suffix 145 + - **Opcode position**: resolve mnemonic string via `MNEMONIC_TO_OP` to concrete `ALUOp`/`MemOp` 146 + - **Placement qualifiers**: resolve `"pe0"` → `0` (via `PlacementRef`) 147 + - **Port qualifiers**: resolve `"L"` → `Port.L` (via `PortRef`) 148 + - **Context slot qualifiers**: resolve integer values (via `CtxSlotRef`) 149 + 4. Evaluate `ConstExpr` arithmetic expressions (supports `+`, `-`, `*`, `//` on integers and `_idx`) 143 150 5. Expand `IRRepetitionBlock` entries once per variadic argument, binding `_idx` to the iteration index 144 151 6. Qualify expanded names with scope prefixes: `#macroname_N.&label` for top-level, `$func.#macro_N.&label` inside functions 152 + 153 + **Macro `@ret` wiring:** 154 + 155 + After body expansion, `@ret` / `@ret_name` markers in macro edge destinations are replaced with concrete node references from the call site's output list (`IRMacroCall.output_dests`). This is pure edge rewriting — no trampolines, no cross-context routing, no `free_ctx`. Macros inline into the caller's context. 145 156 146 157 **Function call wiring:** 147 158 ··· 149 160 2. Generate trampoline `PASS` nodes for return routing 150 161 3. Create `IREdge` entries with `ctx_override=True` for cross-context argument passing (becomes `ctx_mode=01` in codegen) 151 162 4. Generate `FREE_CTX` nodes for context teardown on call completion 152 - 5. 
Wire `@ret` / `@ret_name` synthetic nodes for return paths 163 + 5. Wire `@ret` / `@ret_name` synthetic nodes for return paths (function `@ret` is distinct from macro `@ret` — functions create trampolines with context management) 153 164 154 165 **Post-conditions:** 155 166 ··· 289 300 - **Wider placement heuristics**: graph partitioning, min-cut algorithms, or profile-guided placement for larger programs 290 301 - **Incremental reassembly**: modify part of the graph and re-run only affected passes 291 302 - **Hardware encoding pass**: translate ALUInst/SMInst to bit-level instruction words for actual IRAM loading 292 - - **Conditional macro expansion**: the current macro system supports variadic repetition, constant arithmetic, and nested macro invocation (depth limit 32), but not conditionals within macros 303 + - **Conditional macro expansion**: the current macro system supports variadic repetition, constant arithmetic, opcode parameters, parameterized qualifiers, `@ret` output wiring (including positional `@ret` in variadic repetition blocks), and nested macro invocation (depth limit 32), but not conditionals within macros 304 + - **Generic variadic reduction**: a single `#reduce` macro that infers tree depth from variadic argument count (requires conditional expansion to handle non-power-of-2 inputs)
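Step 4's ConstExpr arithmetic (`+`, `-`, `*`, `//` over integers and `_idx`) can be sketched as a restricted AST walker. An illustrative evaluator, not the actual `asm/expand.py` implementation:

```python
# Restricted const-expression evaluator: only integer literals, bound
# names (e.g. _idx, macro params), unary minus, and +, -, *, // are
# accepted; anything else is rejected.
import ast

ALLOWED = {ast.Add: lambda a, b: a + b, ast.Sub: lambda a, b: a - b,
           ast.Mult: lambda a, b: a * b, ast.FloorDiv: lambda a, b: a // b}

def eval_const(expr: str, bindings: dict[str, int]) -> int:
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name) and node.id in bindings:
            return bindings[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED:
            return ALLOWED[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported const expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))
```

For example, `eval_const("base + _idx * 2", {"base": 10, "_idx": 3})` evaluates the indexed-read pattern shown in the primer.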
+20 -24
design-notes/bus-architecture-and-width-decoupling.md
··· 17 17 |--------|-------|-----------| 18 18 | External bus (inter-module) | 16-bit | routing trace count, physical buildability | 19 19 | Token format (logical) | variable-length flits | encoding needs per token type | 20 - | IRAM (instruction memory) | 32-48+ bits | opcode + destination encoding needs | 20 + | IRAM (instruction memory) | 32-bit effective (two 8-bit halves) | opcode + destination encoding; see `iram-and-function-calls.md` | 21 21 | Matching store entries | 8 or 16-bit data + 1 presence bit | data word size, token width mode | 22 22 | PE pipeline registers | wide, decomposed | parallel data path + control path | 23 23 | SM internal datapath | 16-bit | SRAM word size | ··· 351 351 | Dyadic wide | 00 | 2 | 32 | offset + ctx + port + gen + 16-bit data | 352 352 | Monadic normal | 010 | 2 | 32 | offset + ctx + 16-bit data | 353 353 | Dyadic narrow | 011+00 | 2 | 32 | offset + ctx + 8-bit data + port + gen | 354 - | IRAM write | 011+01 | 2-3 | 32-48 | iram_addr + instruction word | 354 + | IRAM write | 011+01 | 2-3 | 32-48 | iram_addr + instruction word (32-bit effective) | 355 355 | Monadic inline | 011+10 | 1 | 16 | offset + ctx only, no data | 356 356 | SM standard | 1 | 2 | 32 | SM_id + op + addr + 16-bit data or ret routing | 357 357 ··· 492 492 | Flags | immediate mode, inline output, etc. | 2-4 | 493 493 | Immediate | small constant for immediate-mode ops | 0-8 | 494 494 495 - This sums to roughly **32-48 bits** depending on address space size and 496 - how aggressively fields are packed. IRAM per PE is 128 entries (7-bit 497 - address), keeping destination address fields compact. 498 - 499 - Note: the IRAM instruction word includes a **width flag** for each output 500 - destination, determining whether the output token is formed as narrow 501 - (8-bit data) or wide (16-bit data). The source instruction controls the 502 - format of the token it produces. 495 + > **Update:** The field table above was a preliminary estimate. 
The 496 + > committed format is a **32-bit effective width** using a two-half read 497 + > (two 8-bit halves across two cycles). See `iram-and-function-calls.md` 498 + > for the definitive bit-level layout including CM compute, SM operations, 499 + > single vs dual destination, ctx_mode, and CONST16 wide-immediate. 503 500 504 501 ### IRAM Physical Organisation 505 502 506 - - 128 entries (7-bit address) per PE 507 - - Dyadic instructions at offsets 0..N-1 (where N = 16 or 32 depending on 508 - width mode of tokens targeting this PE) 509 - - Monadic instructions at any offset 0..127 510 - - Width: 2-3 parallel 8-bit SRAM chips for 16-48 bit instruction words, 511 - all addressed by the same address lines 512 - - Read in one pipeline stage (instruction fetch, after matching completes) 503 + - 128 entries (7-bit address) per PE, 256 bytes total (two 8-bit halves) 504 + - Address format: `[offset:7][half:1]` = 8 bits 505 + - Dyadic instructions at offsets 0-31 (5-bit offset in dyadic token format) 506 + - Monadic instructions at any offset 0-127 (7-bit offset range) 507 + - Two-half read: half 0 read in cycle N feeds decoder/ALU; half 1 read in 508 + cycle N+1 is latched for Stage 5. ALU executes during half 1 read — no 509 + pipeline bubble. 513 510 - Written only during program loading (IRAM write tokens) with valid-bit 514 511 protection 515 - 516 - Because IRAM read is a single pipeline stage using parallel SRAM chips, 517 - the width costs physical chips but does NOT add pipeline latency. A 48-bit 518 - read from three 8-bit SRAMs takes the same time as a 16-bit read from one. 512 + - **Upgrade path:** If two-half sequential read causes timing issues, widen 513 + to true 32-bit parallel read (4x 8-bit SRAMs). Same encoding, same 514 + 128 entries, just reads both halves in one cycle. Cost: 2 more SRAM 515 + chips per PE. 519 516 520 517 ### IRAM Sizing vs PE Count 521 518 ··· 688 685 689 686 ## Open Questions 690 687 691 - 1. **IRAM width** — 32 or 48 bits? 
Depends on destination address field 692 - sizes (now 7-10 bits given smaller IRAM), opcode count, immediate 693 - field. Next design task. 688 + 1. ~~**IRAM width**~~ — **Resolved.** 32-bit effective width via two-half 689 + read. See `iram-and-function-calls.md` for the definitive format. 694 690 2. **Mode B clock ratio** — exactly 2x, or design for arbitrary integer 695 691 ratios? 2x is simplest (toggle flip-flop). 696 692 3. **8-bit vs 16-bit PE configuration** — per-PE config register, or
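The `[offset:7][half:1]` address format and the two-half fetch in this hunk can be sketched as follows. The half width here (16 bits, giving the 32-bit effective word) is an assumption for illustration; helper names are not from the repo:

```python
# Sketch of IRAM addressing and the two-cycle fetch: half 0 is read
# first (feeds decoder/ALU), half 1 on the next cycle. HALF_BITS is
# an assumed per-half read width.

HALF_BITS = 16

def iram_addr(offset: int, half: int) -> int:
    """Pack the [offset:7][half:1] address format: 128 entries x 2 halves."""
    assert 0 <= offset < 128 and half in (0, 1)
    return (offset << 1) | half

def fetch(iram: list[int], offset: int) -> int:
    """Two-cycle fetch combining both halves into the effective word."""
    half0 = iram[iram_addr(offset, 0)]   # cycle N
    half1 = iram[iram_addr(offset, 1)]   # cycle N+1
    return half0 | (half1 << HALF_BITS)
```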
design-notes/datasheets/74ls170.pdf

This is a binary file and will not be displayed.

design-notes/datasheets/ADV7170_7171.pdf

design-notes/datasheets/DSA2IH00211615.pdf

design-notes/datasheets/MOSES07127-1.pdf

design-notes/datasheets/NATLS21982-1.pdf

design-notes/datasheets/SN54170.PDF

design-notes/datasheets/SN54LS171.PDF

design-notes/datasheets/SN74LS610.PDF

design-notes/datasheets/SN74S225.PDF

design-notes/datasheets/sn54ls181.pdf

design-notes/datasheets/sn74ls670.pdf

+1 -1
design-notes/design-alternatives.md
··· 329 329 - 16-bit is the native word size for SM, ALU, and matching store data 330 330 - All standard tokens carry full 16-bit data in flit 2 (no more 14-bit 331 331 dyadic limitation) 332 - - IRAM width decoupled: 32-48 bits, independently sized 332 + - IRAM width decoupled: 32-bit effective (two-half read), independently sized 333 333 - PE pipeline registers are wider (~64-68 bits) but purely internal 334 334 335 335 ### Alternative: 8-bit Data
+90 -17
design-notes/dfasm-primer.md
··· 447 447 448 448 ## Macros 449 449 450 - Macros define reusable template subgraphs that are expanded inline at their call sites. The macro system supports parameterisation, variadic arguments, repetition blocks, constant arithmetic, and token pasting. 450 + Macros define reusable template subgraphs that are expanded inline at their call sites. The macro system supports parameterisation, variadic arguments, repetition blocks, constant arithmetic, token pasting, opcode parameters, parameterized qualifiers, and `@ret` output wiring. 451 451 452 452 ### Macro Definition 453 453 454 454 ```dfasm 455 455 #macro_name param1, param2, *variadic_param |> { 456 456 ; body — instructions and edges using ${param} substitution 457 - &node <| add ${param1} 457 + &node <| add 458 458 ${param1} |> &node:L 459 459 ${param2} |> &node:R 460 460 } ··· 486 486 } 487 487 ``` 488 488 489 + ### Opcode Parameters 490 + 491 + Parameters can appear in the opcode position of instruction definitions. This allows a single macro to work with any ALU or memory operation: 492 + 493 + ```dfasm 494 + #reduce_2 op |> { 495 + &r <| ${op} 496 + &r |> @ret 497 + } 498 + 499 + ; Usage — the opcode is passed as a bare mnemonic: 500 + #reduce_2 add |> &result 501 + #reduce_2 sub |> &result 502 + ``` 503 + 504 + Opcode arguments are passed as bare identifiers (not strings). The expand pass resolves them via `MNEMONIC_TO_OP` during expansion. An invalid mnemonic produces a MACRO error. 
505 + 506 + ### Parameterized Qualifiers 507 + 508 + Parameters can appear in placement (`|pe0`) and port (`:L`) positions within a macro body: 509 + 510 + ```dfasm 511 + ; Parameterized port 512 + #wire_to target, port |> { 513 + &src <| pass 514 + &src |> ${target}:${port} 515 + } 516 + #wire_to &dest, L 517 + 518 + ; Parameterized placement 519 + #placed_const val, pe |> { 520 + &c <| const, ${val} |${pe} 521 + &c |> @ret 522 + } 523 + #placed_const 42, pe0 |> &target 524 + 525 + ; Parameterized context slot 526 + #placed_op op, pe, ctx |> { 527 + &n <| ${op} |${pe}[${ctx}] 528 + &n |> @ret 529 + } 530 + ``` 531 + 532 + The expand pass resolves placement strings (e.g., `"pe0"` → `0`), port strings (`"L"` → `Port.L`), and context slot values to their concrete types. Invalid values produce MACRO errors. 533 + 489 534 ### Repetition Blocks 490 535 491 536 The `$( ),*` syntax expands its body once per element of a variadic parameter. Within a repetition block, `${_idx}` provides the current iteration index (0-based): ··· 501 546 502 547 ### Constant Arithmetic 503 548 504 - Macro const fields support compile-time arithmetic with `+`, `-`, `*` on integer values and parameters: 549 + Macro const fields support compile-time arithmetic with `+`, `-`, `*`, `//` on integer values and parameters: 505 550 506 551 ```dfasm 507 552 #indexed_read base, *cells |> { ··· 511 556 } 512 557 ``` 513 558 514 - ### Macro Invocation 559 + ### Macro Invocation and Output Wiring (@ret) 515 560 516 - Macros are invoked as standalone statements: 561 + Macros are invoked as standalone statements. Arguments can be positional or named: 517 562 518 563 ```dfasm 519 - #loop_counted 520 564 #fan_out &a:L, &b:R, &c:L 521 565 #indexed_read 10, &dest1, &dest2, &dest3 566 + #make_pair name=foo 522 567 ``` 523 568 524 - Arguments can be positional or named: 569 + **Output wiring with `@ret`:** Macro bodies can define output points using `@ret` / `@ret_name` markers. 
At the call site, `|>` wires these outputs to destinations: 525 570 526 571 ```dfasm 527 - #make_pair name=foo 572 + ; Macro body defines outputs via @ret markers 573 + #loop_counted init, limit |> { 574 + &counter <| add 575 + &compare <| brgt 576 + &counter |> &compare:L 577 + &body_fan <| pass 578 + &compare |> &body_fan:L 579 + &inc <| inc 580 + &body_fan |> &inc:L 581 + &inc |> &counter:R 582 + ${init} |> &counter:L 583 + ${limit} |> &compare:R 584 + &body_fan |> @ret_body ; named output: body 585 + &compare |> @ret_exit:R ; named output: exit 586 + } 587 + 588 + ; Call with named output wiring: 589 + #loop_counted &init, &limit |> body=&process, exit=&done 590 + 591 + ; Or positional @ret for single-output macros: 592 + #reduce_2 add |> &result 528 593 ``` 529 594 595 + Unlike function calls, macro `@ret` wiring is purely edge rewriting — the `@ret_name` destination is replaced with the concrete node reference from the call site. No trampolines, no cross-context routing, no `free_ctx` insertion. Macros inline into the caller's context. 596 + 597 + **Bare `@ret`** maps to the first (or only) positional output. **`@ret_name`** maps to the named output `name=&dest` at the call site. Multiple `@ret` edges to different ports on the same output are valid. 
598 + 530 599 ### Scoping 531 600 532 601 Expanded macro names are automatically qualified to prevent collisions between multiple invocations of the same macro: ··· 536 605 537 606 ### Built-in Macros 538 607 539 - The following macros are automatically available in all programs: 608 + The following macros are automatically available in all programs (defined in `asm/builtins.py`): 540 609 541 - | Macro | Purpose | 542 - |-------|---------| 543 - | `#loop_counted` | Counted loop: counter + compare + increment feedback loop | 544 - | `#loop_while` | Condition-tested loop: gate node for predicate-driven iteration | 545 - | `#permit_inject_N` | Inject N const(1) seed tokens (variants for N=1..4) | 546 - | `#reduce_add_N` | Binary reduction tree for addition (variants for N=2..4) | 610 + | Macro | Parameters | Outputs | Purpose | 611 + |-------|------------|---------|---------| 612 + | `#loop_counted` | `init, limit` | `body`, `exit` | Counted loop: counter + compare + increment feedback. Call with `#loop_counted &init, &limit \|> body=&process, exit=&done` | 613 + | `#loop_while` | `test` | `body`, `exit` | Condition-tested loop: gate node. Call with `#loop_while &test_src \|> body=&process, exit=&done` | 614 + | `#permit_inject` | `*targets` | (none — routes directly to targets) | Inject one const(1) seed per target. `#permit_inject &gate_a, &gate_b` | 615 + | `#reduce_2` | `op` | (positional) | Binary reduction: 1 node. `#reduce_2 add \|> &result` | 616 + | `#reduce_3` | `op` | (positional) | Binary reduction tree: 2 nodes. `#reduce_3 sub \|> &result` | 617 + | `#reduce_4` | `op` | (positional) | Binary reduction tree: 3 nodes. `#reduce_4 add \|> &result` | 547 618 548 - Built-in macros expose well-known internal node names (e.g., `&counter`, `&compare`, `&gate`) that the user wires externally after invocation. 619 + All built-in macros use `@ret` output wiring except `#permit_inject`, which routes directly to its variadic target arguments. 
The `#reduce_*` family accepts any opcode as a parameter. 549 620 550 621 ## Function Calls 551 622 ··· 577 648 578 649 ### Return Convention 579 650 580 - The expand pass creates synthetic `@ret` (or `@ret_name` for named outputs) nodes as return markers. The callee's result edges are wired to these markers, which trampoline the results back to the caller's context. 651 + Inside function bodies, `@ret` and `@ret_name` are reserved markers for return points. The expand pass replaces them with return trampolines — synthetic `pass` nodes that route results back to the caller's context via CTX_OVRD, with auto-inserted `free_ctx` nodes for context teardown. Port-qualified returns (`@ret:L`, `@ret:R`) handle dual-output return nodes. Named returns (`@ret_body`, `@ret_exit`) handle multiple independent return paths, wired at the call site via `name=@dest`. 652 + 653 + Note: `@ret` in function bodies creates trampolines with cross-context routing. `@ret` in macro bodies is simpler — pure edge rewriting with no context management, since macros inline into the caller's context. 581 654 582 655 ### Example 583 656
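The `#reduce_2`/`_3`/`_4` node counts (1, 2, 3) follow from a general property: a binary reduction tree over n inputs needs n-1 dyadic nodes. A hypothetical sketch that pairs inputs level by level; an odd leftover carries up to the next level, which is exactly where a generic `#reduce` would need conditional expansion:

```python
# Toy reduction-tree builder: pairs the current level's values into
# fresh dyadic nodes, carrying an odd leftover upward. Node naming
# (&rN) and the edge tuple format are illustrative.

def reduce_tree(inputs: list[str]):
    """Return (node names, edges) for a binary reduction over inputs."""
    nodes, edges, level, counter = [], [], list(inputs), 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            node = f"&r{counter}"
            counter += 1
            nodes.append(node)
            edges.append((level[i], node + ":L"))
            edges.append((level[i + 1], node + ":R"))
            nxt.append(node)
        if len(level) % 2:        # odd element carries to the next level
            nxt.append(level[-1])
        level = nxt
    return nodes, edges
```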
+90 -4
design-notes/io-and-bootstrap.md
··· 92 92 write (cell becomes FULL). Subsequent writes before a READ overwrite — 93 93 the SM cell is a 1-deep buffer. For deeper buffering, the IO hardware 94 94 can use a range of cells as a circular buffer. 95 - - The SM does not spontaneously generate tokens — it only satisfies 95 + - Standard SMs never spontaneously generate tokens — they only satisfy 96 96 pending deferred reads. The IO device's write *triggers* the deferred 97 97 read satisfaction, which is when the result token is emitted. 98 + 99 + ### Spontaneous Token Emission (SM00 Specialisation) 100 + 101 + The deferred-read model works well when the CM knows in advance that 102 + it wants IO data (it issues a READ, the read defers, the IO device 103 + eventually satisfies it). But some IO patterns are genuinely 104 + unsolicited — an interrupt-like event where external hardware needs to 105 + inject a token into the network without a prior READ request. 106 + 107 + SM00 could be specialised with a **dispatch register** (or small 108 + dispatch table) that maps IO events to pre-formed token templates. 109 + When an IO device signals an event and no deferred read is pending: 110 + 111 + 1. IO device asserts an event line (directly wired or via address 112 + decoder) 113 + 2. SM00 reads the dispatch register for that event source 114 + 3. The dispatch register contains a pre-formed token template 115 + (flit 1 routing + flit 2 data source) — similar to the SM return 116 + routing mechanism (see `sm-and-token-format-discussion.md`) 117 + 4. SM00 emits the token onto the network spontaneously 118 + 119 + The dispatch register is loaded at bootstrap (or via SM WRITE to a 120 + reserved address range). It tells SM00 "when UART RX fires and 121 + nobody is waiting, send this token to this PE at this offset." 122 + 123 + This makes SM00 the only SM that can act as a **token source** rather 124 + than purely a token responder. All other SMs remain reactive. 
125 + 126 + **Hardware cost:** one additional register file (or a few reserved SM 127 + cells reinterpreted as dispatch entries) + event detection logic (edge 128 + detect on IO device status lines) + arbitration with normal SM 129 + operations. Estimated: 3-5 TTL chips beyond the base SM. 130 + 131 + **Alternative: always-pending deferred read.** The compiler ensures a 132 + READ is always pending on the IO cell — as soon as one is satisfied, 133 + the handler re-issues a READ immediately (feedback loop in the 134 + dataflow graph). This avoids SM00 specialisation entirely but has a 135 + significant resource cost: the current SM design supports only **one 136 + deferred read at a time per SM instance**. An always-pending IO read 137 + permanently occupies SM00's single deferred read slot, blocking all 138 + other deferred reads on SM00 (including other IO cells and any 139 + I-structure operations). If SM00 also serves as bootstrap SM and T0 140 + shared storage, this is a real constraint. 141 + 142 + The always-pending pattern works for a single IO source on a 143 + dedicated SM, but scales poorly. Multiple IO sources (UART + SPI + 144 + timer) would each need their own SM instance just to have a deferred 145 + read slot, which is wasteful. 146 + 147 + The spontaneous emission model (dispatch registers) avoids this 148 + entirely — no deferred read slot consumed, SM00's normal memory 149 + operations remain unblocked. This tips the balance toward SM00 150 + specialisation for any system with more than trivial IO. 151 + 152 + **Peripheral controller with batch notification.** A smarter 153 + peripheral controller (external hardware on SM00's address bus) 154 + manages its own buffering in a reserved cell range — similar to 155 + DMA/USART on STM32. The controller writes incoming data to a 156 + circular buffer of SM cells, tracks a write pointer internally, 157 + and writes a status/notification cell only at thresholds 158 + (half-complete, complete). 
The dataflow graph keeps one deferred read 159 + pending on the notification cell, processes a batch when it fires, 160 + and re-issues the read. This amortises the single-deferred-read 161 + cost across many IO events and keeps SM00's deferred read slot 162 + occupied only during the inter-batch interval, not per-byte. Still 163 + consumes the slot, but the duty cycle is much lower. 164 + 165 + **Multi-slot deferred reads.** The single-slot constraint that makes 166 + always-pending problematic could also be addressed by expanding the 167 + SM's deferred read storage to 2-4 entries using a small CAM. See 168 + `sm-design.md` "Multi-Slot Deferred Read CAM" section. Even 2 slots 169 + (one for IO, one for normal I-structure) would resolve the resource 170 + conflict without SM00 specialisation. A 4-entry CAM covers multiple 171 + IO sources simultaneously. This is architecturally the cleanest 172 + option — no special cases, no spontaneous emission, just a slightly 173 + larger deferred read store that benefits all SMs uniformly. 174 + 175 + For v0 with a single UART at low baud rates, the always-pending 176 + pattern with a single deferred read slot is sufficient. The multi-slot 177 + CAM, peripheral controller, and/or spontaneous emission models are 178 + refinements for systems with multiple IO sources or higher throughput 179 + requirements. 98 180 99 181 ### Hardware 100 182 ··· 247 329 5. **Program image format** — flat sequence of (flit1, flit2) token pairs 248 330 in ROM. Needs a terminator or length prefix so EXEC knows when to stop. 249 331 Length is the EXEC count parameter. 250 - 6. **SM00 further specialisation** — documented as an option (see 251 - `sm-design.md`). Not committed for v0. Standard SM opcodes are 252 - sufficient for basic IO via I-structure semantics. 332 + 6. 
**SM00 spontaneous token emission** — SM00 could be specialised 333 + with a dispatch register that maps unsolicited IO events to 334 + pre-formed token templates for spontaneous emission (interrupt 335 + equivalent without prior READ). See "Spontaneous Token Emission" 336 + section above. Not committed for v0 — always-pending deferred read 337 + pattern is sufficient for basic IO. The dispatch register mechanism 338 + is a future refinement for lower-latency interrupt response.
+35 -27
design-notes/loop-patterns-and-flow-control.md
··· 808 808 The assembler ships built-in macros (prepended to every program) that 809 809 implement common patterns from this document: 810 810 811 - | Macro | Pattern | 812 - |-------|---------| 813 - | `#loop_counted` | Counted loop (counter + compare + increment feedback) | 814 - | `#loop_while` | Condition-tested loop (gate node) | 815 - | `#permit_inject_N` | Permit injection (N=1..4 const seed tokens) | 816 - | `#reduce_add_N` | Binary reduction tree for addition (N=2..4) | 811 + | Macro | Parameters | Outputs (@ret) | Pattern | 812 + |-------|------------|----------------|---------| 813 + | `#loop_counted` | `init, limit` | `body`, `exit` | Counted loop (counter + compare + increment feedback) | 814 + | `#loop_while` | `test` | `body`, `exit` | Condition-tested loop (gate node) | 815 + | `#permit_inject` | `*targets` | (routes to targets) | Permit injection (variadic, one const(1) per target) | 816 + | `#reduce_2`..`_4` | `op` | (positional) | Binary reduction tree (parameterized opcode, per-arity) | 817 817 818 + All built-in macros use `@ret` output wiring. The `#reduce_*` family 819 + accepts any opcode as a parameter (e.g., `#reduce_4 add |> &result`). 818 820 See `asm/builtins.py` for definitions. 819 821 820 822 #### Example Macro Definitions ··· 900 902 ### Permit Injection — Two Approaches 901 903 902 904 For small K (roughly K <= 4), inline CONST injection. 
The built-in 903 - macros `#permit_inject_1` through `#permit_inject_4` provide this: 905 + `#permit_inject` macro is variadic — pass the target nodes directly: 904 906 905 907 ```dfasm 906 908 ; Built-in definition (from asm/builtins.py): 907 - #permit_inject_2 |> { 908 - &p0 <| const, 1 909 - &p1 <| const, 1 909 + #permit_inject *targets |> { 910 + $( 911 + &p <| const, 1 912 + &p |> ${targets} 913 + ),* 910 914 } 911 915 912 - ; Usage: invoke the built-in, then wire outputs to the gate 913 - #permit_inject_2 914 - #permit_inject_2.&p0 |> &dispatch_gate:L 915 - #permit_inject_2.&p1 |> &dispatch_gate:L 916 + ; Usage: pass gate nodes as targets 917 + #permit_inject &dispatch_gate:L, &dispatch_gate:L 916 918 ``` 917 919 918 920 For large K, use SM EXEC to batch-emit permits: ··· 933 935 934 936 ### Loop Control Macro 935 937 936 - The built-in `#loop_counted` provides the core loop infrastructure: 938 + The built-in `#loop_counted` provides the core loop infrastructure. 939 + It accepts `init` and `limit` as input parameters and exposes `body` 940 + and `exit` as `@ret` outputs: 937 941 938 942 ```dfasm 939 943 ; Built-in definition (from asm/builtins.py): 940 - #loop_counted |> { 944 + #loop_counted init, limit |> { 941 945 &counter <| add 942 946 &compare <| brgt 943 947 &counter |> &compare:L 948 + &body_fan <| pass 949 + &compare |> &body_fan:L 944 950 &inc <| inc 945 - &compare |> &inc:L 951 + &body_fan |> &inc:L 946 952 &inc |> &counter:R 953 + ${init} |> &counter:L 954 + ${limit} |> &compare:R 955 + &body_fan |> @ret_body 956 + &compare |> @ret_exit:R 947 957 } 948 958 949 - ; Usage: wire init, limit, body, and exit externally 950 - #loop_counted 959 + ; Usage: pass init/limit as args, wire body/exit via |> 951 960 &c_init <| const, 0 952 961 &c_limit <| const, 64 953 - &c_init |> #loop_counted.&counter:L ; initial counter value 954 - &c_limit |> #loop_counted.&compare:R ; loop bound 955 - #loop_counted.&compare:L |> &body_entry ; taken → body 956 - 
#loop_counted.&compare:R |> &done ; not-taken → exit 962 + #loop_counted &c_init, &c_limit |> body=&body_entry, exit=&done 957 963 ``` 958 964 959 965 ### Reduction Tree Macro ··· 986 992 ; &acc drains to #ret on completion 987 993 } 988 994 989 - ; Loop control (macro expands to CONST, INC, LT, SWITCH + feedback) 990 - #loop_counted 64, &dispatch, &done 995 + ; Loop control 996 + &c_init <| const, 0 997 + &c_limit <| const, 64 998 + #loop_counted &c_init, &c_limit |> body=&dispatch, exit=&done 991 999 992 - ; Permit injection (pick one strategy) 993 - #permit_inject_inline 4, &gate 1000 + ; Permit injection (4 permits to throttle body launches) 1001 + #permit_inject &gate:L, &gate:L, &gate:L, &gate:L 994 1002 995 1003 ; Gated dispatch — permits throttle body launches 996 1004 &gate <| gate
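As a rough model of what the variadic `$( ... ),*` repetition block does for `#permit_inject`, the expansion can be sketched in Python: one clone of the body per variadic argument, with `${targets}` replaced by the i-th argument. The function name and the `&p__N` renaming scheme are hypothetical, not the actual `asm/expand.py` behaviour (which qualifies names with scope prefixes):

```python
# Illustrative model of variadic repetition in #permit_inject:
# each repetition clones the body, renames &p uniquely, and
# substitutes ${targets} with the corresponding argument.
def expand_permit_inject(targets):
    lines = []
    for i, target in enumerate(targets):
        p = f"&p__{i}"                    # hypothetical per-clone rename
        lines.append(f"{p} <| const, 1")  # one const(1) seed per target
        lines.append(f"{p} |> {target}")  # ${targets} -> i-th argument
    return lines

expanded = expand_permit_inject(["&gate:L", "&gate:L"])
# Two const(1) seeds, each routed to the gate's L port.
```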
design-notes/papers/Mitsubishi M5L2732K.pdf design-notes/datasheets/Mitsubishi M5L2732K.pdf
+243 -45
design-notes/pe-design.md
··· 101 101 102 102 Stage 3: INSTRUCTION FETCH 103 103 - Use local offset to read from PE's instruction SRAM (IRAM) 104 - - IRAM width is decoupled from bus width: 32-48 bits, sized for 105 - opcode + destination encoding (see IRAM section below) 106 - - Multiple SRAM chips addressed in parallel (e.g., 3x 8-bit-wide 107 - for 48-bit read) — width costs chips but NOT latency 104 + - Two-half read: half 0 (opcode + control + const) feeds decoder/ALU 105 + on read cycle 1; half 1 (destinations) read on cycle 2, latched 106 + for Stage 5. ALU executes during the half 1 read — no bubble. 107 + - 2x 8-bit SRAM chips, 32-bit effective width per instruction slot 108 108 - ~200 transistors of logic 109 109 - NOTE: instruction memory is shared between pipeline reads and 110 110 network config writes — see "Instruction Memory" section below ··· 210 210 Instruction memory is PE-local external SRAM. **IRAM width is completely 211 211 independent of bus width** — it is sized for encoding needs, not bus 212 212 constraints. See `bus-architecture-and-width-decoupling.md` for the full 213 - rationale. 213 + rationale, and `iram-and-function-calls.md` for the detailed bit-level 214 + format. 214 215 215 - > **⚠ Preliminary:** IRAM width and field allocation are estimates, not 216 - > committed. The emulator uses Python dataclasses (`ALUInst`, `SMInst`) 217 - > that don't reflect bit-level encoding. Hardware encoding is deferred 218 - > to physical build. 216 + #### IRAM Width: Two-Half Format 219 217 220 - #### IRAM Width 218 + Each IRAM slot is **32 bits effective**, read as two 16-bit halves 219 + across two cycles. Two 8-bit SRAM chips are addressed in parallel 220 + (16 bits per read cycle), with a half-select bit in the address: 221 221 222 - | Field | Purpose | Bits (est.) | 223 - | ----------- | ------------------------------------------- | ----------- | 224 - | Opcode | ALU/control operation | 5-8 | 225 - | Dest 1 addr | first output instruction address | 10-12 | 226 - | Dest 1 port | L/R input on destination | 1 | 227 - | Dest 2 addr | second output instruction address (fan-out) | 10-12 | 228 - | Dest 2 port | L/R input on destination | 1 | 229 - | Dest 2 PE | remote PE flag + PE_id for cross-PE outputs | 0-3 | 230 - | Arity | monadic/dyadic (or encoded in opcode) | 0-1 | 231 - | Flags | immediate mode, structure op, etc. | 2-4 | 232 - | Immediate | small constant for immediate-mode ops | 0-8 | 222 + ``` 223 + IRAM address = [offset:7][half:1] = 8 bits 224 + half 0: opcode + control + const/params (read cycle 1, feeds decoder+ALU) 225 + half 1: destinations or SM supplementary (read cycle 2, latched for Stage 5) 226 + ``` 233 227 234 - This sums to roughly **32-48 bits** depending on address space size and 235 - how aggressively fields are packed. 48 bits (3 x 16-bit SRAM words) is a 236 - comfortable target that avoids bit-packing contortions. 228 + 128 instruction slots per PE. Total SRAM usage: 512 bytes per PE 229 + (256 bytes per chip). 237 229 238 - IRAM is read in one pipeline stage using parallel SRAM chips: e.g., 3x 239 - 8-bit-wide SRAMs for a 48-bit read, all addressed by the same address 240 - lines. Width costs physical chips but does NOT add pipeline latency. 
230 + **Half 0 — CM compute (bit 15 = 0):** 231 + 232 + | Field | Bits | Purpose | 233 + |------------|-------|---------| 234 + | type | 1 | 0 = CM compute | 235 + | opcode | 5 | ALU/routing/control operation | 236 + | ctx_mode | 2 | 00=INHERIT, 01=CTX_OVRD, 10=CHANGE_TAG | 237 + | const | 8 | ALU immediate or ctx/gen override | 238 + 239 + **Half 0 — SM operation (bit 15 = 1):** 240 + 241 + | Field | Bits | Purpose | 242 + |-----------------|-------|---------| 243 + | type | 1 | 1 = SM operation | 244 + | sm_opcode | 5 | SM bus operation code | 245 + | ctx_mode | 2 | Context source for return routing | 246 + | const/ret | 8 | Address, return routing, or parameter | 247 + 248 + **Half 1 — Single destination (bit 15 = 0):** 249 + 250 + | Field | Bits | Purpose | 251 + |--------------|-------|---------| 252 + | has_dest2 | 1 | 0 = single destination | 253 + | dest1_PE | 2 | Target PE ID | 254 + | dest1_offset | 5 | Target instruction offset | 255 + | dest1_port | 1 | L/R port on destination | 256 + | const_ext | 7 | Extended const (CONST16: 15-bit immediate) | 257 + 258 + **Half 1 — Dual destination (bit 15 = 1):** 259 + 260 + | Field | Bits | Purpose | 261 + |--------------|-------|---------| 262 + | has_dest2 | 1 | 1 = dual destination | 263 + | dest1_PE | 2 | Target PE ID | 264 + | dest1_offset | 5 | Target instruction offset | 265 + | dest1_port | 1 | L/R port on destination | 266 + | dest2_PE | 2 | Second target PE ID | 267 + | dest2_offset | 5 | Second target offset | 268 + 269 + The two-cycle read overlaps with ALU execution: half 0 is read in 270 + cycle N and feeds the decoder/ALU immediately; half 1 is read in 271 + cycle N+1 and latched for Stage 5 (output formatter). No pipeline 272 + bubble. 273 + 274 + See `iram-and-function-calls.md` for SM operation half 1 variants 275 + (return routing, SM_id, CAS flit), ctx_mode semantics, and the 276 + CONST16 wide-immediate mechanism. 
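A quick way to sanity-check the field tables is to pack and unpack the halves in Python. The MSB-first field order used here is an assumption (the tables fix only bit 15), and the helper names are illustrative, not the emulator's:

```python
# Bit-level sketch of the two 16-bit IRAM halves, following the field
# tables above (CM compute half 0, single-destination half 1).
def pack_half0_cm(opcode, ctx_mode, const):
    # [type:1=0][opcode:5][ctx_mode:2][const:8], MSB-first (assumed order)
    assert opcode < 32 and ctx_mode < 4 and const < 256
    return (0 << 15) | (opcode << 10) | (ctx_mode << 8) | const

def unpack_half0_cm(word):
    assert word >> 15 == 0, "type bit must be 0 for CM compute"
    return (word >> 10) & 0x1F, (word >> 8) & 0x3, word & 0xFF

def pack_half1_single(dest_pe, dest_offset, dest_port, const_ext=0):
    # [has_dest2:1=0][dest1_PE:2][dest1_offset:5][dest1_port:1][const_ext:7]
    assert dest_pe < 4 and dest_offset < 32 and dest_port < 2 and const_ext < 128
    return (dest_pe << 13) | (dest_offset << 8) | (dest_port << 7) | const_ext

w0 = pack_half0_cm(opcode=0b00011, ctx_mode=0b01, const=0x2A)
assert unpack_half0_cm(w0) == (3, 1, 0x2A)  # round-trips cleanly
```

Each half sums to exactly 16 bits, which is what allows the type/has_dest2 flag to live at bit 15 in both formats.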
277 + 278 + **Upgrade path:** If the two-half sequential read introduces pipeline 279 + bubbles or timing issues in practice, the IRAM can be widened to a 280 + true 32-bit parallel read (2x 16-bit-wide SRAMs or 4x 8-bit-wide). 281 + The instruction encoding is unchanged — only the read path changes. 282 + Both halves would be available in a single cycle, eliminating the 283 + overlap between half 1 read and ALU execution. Cost: 2 additional 284 + SRAM chips per PE. 241 285 242 286 **Instruction words are never serialised onto the external bus** during 243 287 normal execution. They are only written via IRAM write packets during ··· 247 291 248 292 #### Wider Instruction Addresses (Future) 249 293 250 - If the instruction address field proves limiting, options include: 251 - - Steal bits from other fields (fewer opcodes, smaller immediate) 252 - - Move to 48-bit instruction words with wider address fields 253 - - Use bank/page register to extend effective address space without 254 - widening the per-token address field (similar to 8-bit CPUs using 255 - bank switching) 294 + If the 5-bit dest offset (32 entries) or 7-bit IRAM address (128 295 + entries) proves limiting, options include: 296 + - Steal bits from other fields (fewer opcodes, smaller const) 297 + - Widen IRAM to 48 bits (3x 8-bit halves or true parallel) for wider 298 + address fields 299 + - **Bank/page register** (strong candidate — see below) 256 300 - Make extended-address instructions a 3-flit token type, costing one 257 301 extra flit cycle only when needed 258 302 ··· 260 304 token and the address into IRAM are the same field, sized by the token 261 305 format, not the bus. 262 306 307 + #### IRAM Bank Switching (Strong Candidate) 308 + 309 + A small bank register (4 bits) prepended to the 8-bit IRAM address 310 + gives a 12-bit effective address space: 16 banks × 128 instructions = 311 + 2048 instruction slots per PE. 
With standard 8Kx8 SRAM chips (which 312 + are cheap, period-appropriate, and physically small), all 16 banks 313 + fit in the same SRAM chips already present — the bank register just 314 + provides the high address bits. 315 + 316 + ``` 317 + SRAM address = [bank:4][offset:7][half:1] = 12 bits 318 + bank: from PE-local bank register (4-bit latch) 319 + offset: from token (7 bits) 320 + half: from pipeline stage (0 = half 0, 1 = half 1) 321 + ``` 322 + 323 + **Why this is attractive:** 324 + 325 + 1. **All code preloaded at bootstrap.** 2048 instructions is enough 326 + for most programs' full working set. Load everything during init, 327 + then no IRAM write traffic during execution. The instruction 328 + residency problem largely goes away. 329 + 330 + 2. **Bank switch is trivial hardware.** One 4-bit register per PE, 331 + written by a config token or a dedicated instruction. Switching 332 + banks costs one cycle (write the register), then the next 333 + instruction fetch reads from the new bank. No IRAM rewriting, 334 + no drain/flush protocol. 335 + 336 + 3. **Simplifies identity detection.** Instead of fragment ID comparison 337 + against arbitrary tags, the question is "am I in the right bank?" 338 + — a 4-bit compare against the bank register. If the token's expected 339 + bank (derivable from its destination address or carried explicitly) 340 + doesn't match the current bank register, the PE knows immediately. 341 + 342 + 4. **Minimal SRAM cost increase.** An 8Kx8 SRAM is the same physical 343 + chip family as a 2Kx8 — just more address lines connected. The two 344 + existing SRAM chips (read in parallel) simply gain more address lines 345 + from the mapper. The SRAM itself may be a size bump but not a 346 + chip-count increase. 347 + 348 + 5. **Compatible with dynamic scheduling.** A scheduler can preload 349 + multiple function fragments into different banks and switch between 350 + them with a single register write. 
Combined with proactive loading 351 + (filling unused banks while other code executes), this gives a 352 + working set model without the complexity of demand paging. 353 + 354 + **Hardware: 74LS610 Memory Mapper** 355 + 356 + The 74LS610 (TI memory mapper, originally for TMS9900 family) is an 357 + ideal fit for IRAM bank switching. See 358 + `sm-and-token-format-discussion.md` for full chip details. Key 359 + properties: 360 + 361 + - 16 mapping registers, each 12 bits wide 362 + - 4-bit logical address input selects register → 12-bit physical 363 + address output 364 + - **Latch control** (pin 28): outputs can be frozen while register 365 + contents change — safe bank switching with no glitch window 366 + - ~40-50ns propagation delay (LS family), pipelineable with SRAM 367 + access 368 + - One chip per PE. Writes to mapping registers via data bus during 369 + config/bootstrap. 370 + 371 + The '610 maps a 4-bit logical bank selector to a 12-bit physical 372 + SRAM address prefix. The IRAM address becomes: 373 + 374 + ``` 375 + Logical: [bank_select:4][offset:7][half:1] 376 + | 377 + v (74LS610) 378 + Physical: [phys_bank:12][offset:7][half:1] = up to 20-bit SRAM address 379 + ``` 380 + 381 + In practice the physical address width is bounded by available SRAM 382 + chip capacity. With 8Kx8 SRAMs (13-bit address): the '610's 12-bit 383 + output is wider than needed — only 5 bits of physical bank + 7-bit 384 + offset + 1-bit half = 13 bits. This gives 32 physical banks of 128 385 + instructions each (4096 instructions per PE). 386 + 387 + The '610 is already in the design for SM banking. Using it for IRAM 388 + banking is the same chip, same wiring pattern, different address 389 + domain. One '610 per PE for IRAM, one for SM. 390 + 391 + **Switching mechanism: `MAP_PAGE` and `SET_PAGE` instructions** 392 + 393 + Two instructions manage banking. Neither touches the token format. 
394 + 395 + - **`MAP_PAGE`** (monadic): writes a logical→physical mapping into 396 + one of the '610's 16 mapping registers. The register index and 397 + physical bank address come from instruction fields (const or 398 + operand data). Used during bootstrap or runtime to establish which 399 + physical SRAM regions back which logical pages. 400 + 401 + - **`SET_PAGE`** (monadic): writes a 4-bit logical page selector 402 + into a PE-local latch. The latch feeds the '610's MA0-MA3 inputs. 403 + All subsequent IRAM fetches go through the selected logical page's 404 + mapped physical bank. One cycle to switch. 405 + 406 + ``` 407 + Banking workflow: 408 + 1. Bootstrap: MAP_PAGE instructions establish mappings 409 + (logical page 0 → physical region A, page 1 → region B, etc.) 410 + 2. Runtime: SET_PAGE selects the active logical page 411 + 3. Latch → '610 MA0-MA3 → physical SRAM bank selection 412 + 4. All IRAM reads now address the selected bank 413 + ``` 414 + 415 + The compiler inserts `SET_PAGE` at function entry points or code 416 + phase transitions. The '610's latch control pin (pin 28) freezes 417 + outputs during the switch, preventing glitched SRAM addresses. 418 + 419 + Hardware cost: one 74LS175 (quad D flip-flop) as the page latch + 420 + the '610 itself. Two chips total for the entire banking mechanism. 421 + 422 + **Tradeoffs:** 423 + 424 + - Bank switch affects all in-flight tokens targeting this PE at 425 + offsets in the old bank. The compiler (or scheduler) must drain 426 + tokens for the old bank before switching — same throttle-and-drain 427 + protocol as code overwrite, but switching is instantaneous once 428 + drained (write latch, done). 429 + - `SET_PAGE` is sequentially scoped — it affects all subsequent 430 + fetches, not just one activation. The compiler must ensure that 431 + concurrent activations on the same PE agree on the active page, or 432 + use `SET_PAGE` as a barrier between phases. 
433 + - Total capacity per PE is bounded by SRAM chip size, not the '610 434 + (which can address far more than any reasonable IRAM). 435 + - Pages are a pure address-mapping primitive. The compiler decides 436 + what they mean — per-function, per-phase, or any other grouping. 437 + The hardware doesn't enforce or assume any relationship between 438 + pages and function bodies. 439 + 440 + **Recommendation:** Bank switching via '610 is the most 441 + hardware-efficient path to larger code capacity. One chip per PE, 442 + no IRAM rewriting for programs within banked capacity, and the '610 443 + is already proven in this design (SM banking). Consider making this 444 + a v0.5 feature rather than deferring to "future." 445 + 263 446 #### Shared SRAM Arbitration 264 447 265 448 Shared SRAM means arbitration between two users: ··· 504 687 chip count budget per PE. 505 688 3. **Free slot tracking**: bump allocator + bitmap + priority encoder? Or 506 689 free-slot FIFO? 507 - 4. **Instruction encoding**: operation set, format, IRAM width (32 or 48 508 - bits). Decoupled from bus width — driven by opcode + destination 509 - fields. See field table in IRAM section above. 690 + 4. ~~**Instruction encoding**~~: **Resolved.** 32-bit effective width, 691 + two 8-bit halves read across two cycles. See IRAM section above and 692 + `iram-and-function-calls.md` for detailed bit-level format. 510 693 5. **Function splitting heuristics**: how does the compiler decide where 511 694 to split? Minimise cross-PE traffic? Balance slot usage across PEs? 512 695 Hardware constraints (slot count, entry count) drive it. ··· 536 719 537 720 Unlike Manchester, Amamiya, or Monsoon — which either replicated the 538 721 entire program into every PE's instruction memory or used very large 539 - per-PE instruction stores — this design has **small IRAM** (hundreds of 540 - entries, not thousands) with runtime-writable instruction memory. 
Any 541 - program larger than a single PE's IRAM needs code loading at runtime, 542 - even under fully static PE assignment. 722 + per-PE instruction stores — this design has **small IRAM per bank** 723 + (128 entries) with runtime-writable instruction memory. Without bank 724 + switching, any program larger than a single PE's IRAM needs code loading 725 + at runtime, even under fully static PE assignment. 726 + 727 + **With bank switching** (see IRAM Bank Switching section above), each PE 728 + holds up to 2048 instructions across 16 banks using the same SRAM chips. 729 + This substantially reduces the pressure on runtime code loading — most 730 + programs' full working set fits in the preloaded banks, and switching 731 + between function fragments costs a single register write instead of 732 + IRAM rewrite traffic. The code storage hierarchy and loader mechanisms 733 + below remain relevant for programs that exceed the banked capacity, but 734 + bank switching makes that the exception rather than the rule. 543 735 544 - This is a first-class architectural concern, not a deferred future 545 - capability. The reference architectures largely avoid it by throwing 546 - memory at the problem: Amamiya's 8KW/PE replicated instruction memory, 736 + The reference architectures largely avoid the residency problem by 737 + throwing memory at it: Amamiya's 8KW/PE replicated instruction memory, 547 738 Manchester's large instruction store, Monsoon's 64K-instruction frames. 548 - Our small IRAM is a deliberate tradeoff for hardware simplicity, but it 549 - means instruction residency management must be part of the design. 739 + Bank switching gives us a comparable effective capacity (2K instructions) 740 + with much less hardware than full replication. 550 741 551 742 ### Code Storage Hierarchy 552 743 ··· 593 784 systems, the I/O controller could serve this role during early phases. 
594 785 595 786 ### The Identity Problem: Miss Detection 787 + 788 + > **Note:** With bank switching, this problem is substantially reduced. 789 + > If all code is preloaded at bootstrap across banks, there is no 790 + > "wrong code loaded" scenario — only "wrong bank selected." Bank 791 + > identity is a trivial 4-bit compare. The discussion below applies 792 + > primarily to systems that exceed banked capacity and must swap code 793 + > at runtime. 596 794 597 795 If code loading happens at runtime, the question arises: how does a PE 598 796 know the code in its IRAM is the *right* code for an arriving token?
+65 -4
design-notes/sm-design.md
··· 272 272 273 273 in practice at v0 scale (4 CMs, low contention), this should be rare. 274 274 the compiler can also help by ensuring that reads and writes to the same 275 - cell are ordered appropriately in the program graph. if depth-1 proves 276 - too restrictive, expanding to a 4-entry deferred read CAM (match on 277 - cell_addr, store return_routing) is straightforward — same logic, just 278 - replicated with priority encoding. but start with 1. 275 + cell are ordered appropriately in the program graph. 276 + 277 + ### Multi-Slot Deferred Read CAM (Candidate Enhancement) 278 + 279 + If depth-1 proves too restrictive, expanding to a multi-entry deferred 280 + read store using a small CAM (content-addressable memory) is a natural 281 + fit. A CAM does associative lookup in one cycle — present the WRITE 282 + address, the CAM match line fires if any entry holds that address. No 283 + sequential scan, no priority logic changes to the SM pipeline. 284 + 285 + ``` 286 + Deferred Read CAM (e.g., 4 entries): 287 + Entry 0: [valid:1][cell_addr:10][return_routing:16] 288 + Entry 1: [valid:1][cell_addr:10][return_routing:16] 289 + Entry 2: [valid:1][cell_addr:10][return_routing:16] 290 + Entry 3: [valid:1][cell_addr:10][return_routing:16] 291 + 292 + On READ hitting EMPTY cell: 293 + - Find first invalid entry (priority encoder), write cell_addr + 294 + return_routing, set valid 295 + - If all entries valid: stall (same as depth-1 overflow) 296 + 297 + On WRITE: 298 + - Present write_addr to CAM match lines 299 + - If any entry matches: satisfy that deferred read, clear entry 300 + - Normal WRITE proceeds regardless 301 + ``` 302 + 303 + **Why CAM is ideal here:** the deferred read lookup is inherently 304 + associative — "does any pending read match this write address?" A 305 + register file would require sequential comparison against each entry. 306 + A CAM answers in one cycle regardless of entry count. 
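The allocate-on-read / match-on-write cycle above can be sketched behaviourally, assuming a 4-entry CAM. Names are illustrative, and this sketch satisfies every matching entry on a write (the hardware description clears a matching entry; duplicates only arise if two reads defer on the same cell):

```python
# Behavioural sketch of the 4-entry deferred read CAM. Python lists
# stand in for the parallel match lines and priority encoder.
class DeferredReadCAM:
    def __init__(self, entries=4):
        # None = invalid; else (cell_addr, return_routing)
        self.entries = [None] * entries

    def defer_read(self, cell_addr, return_routing):
        """READ hit an EMPTY cell: allocate the first free entry."""
        for i, e in enumerate(self.entries):
            if e is None:  # priority-encoder allocation
                self.entries[i] = (cell_addr, return_routing)
                return True
        return False       # all entries valid: stall, as in depth-1 overflow

    def on_write(self, cell_addr):
        """WRITE: associative match; satisfy and clear matching entries."""
        satisfied = []
        for i, e in enumerate(self.entries):
            if e is not None and e[0] == cell_addr:
                satisfied.append(e[1])  # routing for the reply token
                self.entries[i] = None
        return satisfied

cam = DeferredReadCAM()
cam.defer_read(0x040, return_routing=("PE1", 12))  # IO cell read pending
cam.defer_read(0x100, return_routing=("PE2", 7))   # I-structure read pending
hits = cam.on_write(0x040)  # IO write satisfies only the IO read
```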
Small CAMs 307 + (4-16 entries) existed as discrete TTL/CMOS parts and are also trivial 308 + to build from comparators + registers. 309 + 310 + **Hardware cost:** 4 entries × (10-bit comparator + 27-bit register) + 311 + priority encoder for allocation + match OR for satisfaction detection. 312 + Estimated 8-12 TTL chips for a 4-entry CAM — roughly double the 313 + single-register cost. Alternatively, the National Semiconductor 100142 314 + (4-word × 4-bit, ECL, see `datasheets/NATLS21982-1.pdf`) is a discrete 315 + CAM chip that provides 4-entry associative lookup in a single package. 316 + Three 100142s cascade to 4 words × 12 bits, covering the 10-bit cell 317 + address match width. The return routing storage still requires separate 318 + registers, but the address-match portion — the critical associative 319 + path — shrinks to 2-3 chips instead of a comparator tree. 320 + 321 + **IO motivation:** the strongest argument for multiple deferred read 322 + slots comes from IO on SM00 (see `io-and-bootstrap.md`). The 323 + always-pending deferred read pattern for IO permanently occupies a 324 + slot. With a single slot, SM00 can either service IO *or* do normal 325 + I-structure deferred reads, but not both. Even 2 slots (one for IO, 326 + one for memory operations) resolve this. 4 slots cover multiple IO 327 + sources (UART + SPI + timer) without needing SM00 specialisation or 328 + spontaneous token emission hardware. 329 + 330 + **Uniform vs SM00-only:** giving all SMs multi-slot deferred reads 331 + keeps the architecture uniform (no special cases). The per-SM cost 332 + increase is small. Alternatively, only SM00 gets the CAM and other 333 + SMs keep depth-1 — saves chips but adds a special case. The uniform 334 + approach is preferred unless chip budget is extremely tight. 335 + 336 + **Recommendation:** start with depth-1 for v0. 
If IO requirements or 337 + I-structure usage patterns hit the single-slot limit, upgrade to a 338 + 2-4 entry CAM uniformly across all SMs. The SM pipeline logic doesn't 339 + change — only the deferred read storage and match logic scale. 279 340 280 341 ## Operation Set 281 342
+21 -22
tests/test_builtins.py
··· 81 81 assert _BUILTIN_LINE_COUNT > 0 82 82 83 83 assert "#loop_counted" in BUILTIN_MACROS 84 - assert "#permit_inject_1" in BUILTIN_MACROS 84 + assert "#permit_inject" in BUILTIN_MACROS 85 85 assert "#reduce_2 op" in BUILTIN_MACROS 86 86 87 87 def test_builtins_prepended_to_pipeline(self): ··· 108 108 source = """ 109 109 @system pe=1, sm=0 110 110 &sink <| pass 111 - #permit_inject_1 |> out=&sink 111 + #permit_inject &sink 112 112 """ 113 113 graph = run_pipeline(source) 114 114 assert len(graph.errors) == 0 115 115 116 116 node_names = list(graph.nodes.keys()) 117 - has_p0 = any("&p0" in n for n in node_names) 118 - assert has_p0, f"Expected &p0 node from #permit_inject_1 expansion in {node_names}" 117 + has_p = any("&p" in n and "permit_inject" in n for n in node_names) 118 + assert has_p, f"Expected &p node from #permit_inject expansion in {node_names}" 119 119 120 - p0_node = next(n for n in graph.nodes.values() if "&p0" in n.name) 121 - assert p0_node.const == 1, "permit_inject_1 &p0 should have const=1" 120 + p_node = next(n for n in graph.nodes.values() if "&p" in n.name and "permit_inject" in n.name) 121 + assert p_node.const == 1, "permit_inject &p should have const=1" 122 122 123 123 124 124 class TestAC82_UserMacroShadows: ··· 127 127 def test_user_defined_macro_shadows_builtin(self): 128 128 """User-defined macro with same name shadows built-in. 129 129 130 - Verifies that when a user defines #permit_inject_1 with custom body, 130 + Verifies that when a user defines #permit_inject with custom body, 131 131 their definition is used instead of the built-in version. 
132 132 """ 133 133 source = """ 134 134 @system pe=1, sm=0 135 135 136 - ; User defines #permit_inject_1 with a custom body (no parameters) 137 - #permit_inject_1 |> { 136 + ; User defines #permit_inject with a custom body 137 + #permit_inject *targets |> { 138 138 &custom_node <| const, 99 139 139 } 140 140 141 141 ; Invoke the user-defined macro 142 - #permit_inject_1 142 + &sink <| pass 143 + #permit_inject &sink 143 144 """ 144 145 graph = run_pipeline(source) 145 146 ··· 280 281 all_values.extend([t.data for t in pe_outputs if hasattr(t, 'data')]) 281 282 assert 7 in all_values, f"Expected 3+4=7 in outputs, got {all_values}" 282 283 283 - def test_builtin_permit_inject_1_assembles_and_runs(self): 284 - """#permit_inject_1 assembles and expands to const node.""" 284 + def test_builtin_permit_inject_assembles_and_runs(self): 285 + """#permit_inject assembles and expands to const nodes.""" 285 286 source = """ 286 287 @system pe=1, sm=0 287 288 &sink <| pass 288 - #permit_inject_1 |> out=&sink 289 + #permit_inject &sink 289 290 """ 290 291 result = assemble(source) 291 292 assert result is not None ··· 327 328 tree = parser.parse(BUILTIN_MACROS) 328 329 assert tree is not None 329 330 330 - def test_permit_inject_variants_defined_in_builtins(self): 331 - """All #permit_inject_1 through #permit_inject_4 are defined.""" 331 + def test_permit_inject_defined_in_builtins(self): 332 + """#permit_inject variadic macro is defined.""" 332 333 from asm.builtins import BUILTIN_MACROS 333 334 334 - for i in range(1, 5): 335 - macro_name = f"#permit_inject_{i}" 336 - assert macro_name in BUILTIN_MACROS, \ 337 - f"Expected {macro_name} definition in BUILTIN_MACROS" 335 + assert "#permit_inject *targets" in BUILTIN_MACROS, \ 336 + "Expected variadic #permit_inject definition in BUILTIN_MACROS" 338 337 339 338 def test_reduce_variants_defined_in_builtins(self): 340 339 """All #reduce_2 through #reduce_4 are defined with op parameter.""" ··· 401 400 402 401 &sink <| pass 403 402 
#my_const 404 - #permit_inject_1 |> out=&sink 403 + #permit_inject &sink 405 404 """ 406 405 graph = run_pipeline(source) 407 406 assert len(graph.errors) == 0 408 407 409 408 node_names = list(graph.nodes.keys()) 410 409 has_val = any("&val" in n for n in node_names) 411 - has_p0 = any("&p0" in n for n in node_names) 410 + has_p = any("&p" in n and "permit_inject" in n for n in node_names) 412 411 assert has_val, f"Expected user macro &val in {node_names}" 413 - assert has_p0, f"Expected builtin &p0 in {node_names}" 412 + assert has_p, f"Expected builtin &p in {node_names}" 414 413 415 414 416 415 class TestLineNumberOffset: