···
 
 ## Project Structure
 
-- `cm_inst.py` — Instruction set definitions (ALUOp hierarchy, ALUInst, SMInst, Addr)
-- `tokens.py` — Token type hierarchy (Token -> CMToken -> DyadToken/MonadToken; SMToken, CfgToken -> LoadInstToken/RouteSetToken, IOToken)
+- `cm_inst.py` — Instruction set definitions (Port, MemOp, CfgOp, ALUOp hierarchy, ALUInst, SMInst, Addr)
+- `tokens.py` — Token type hierarchy (Token -> CMToken -> DyadToken/MonadToken; SMToken, CfgToken -> LoadInstToken/RouteSetToken, IOToken). Imports ISA enums from cm_inst.
 - `sm_mod.py` — Structure Memory cell model (Presence enum, SMCell dataclass, StructureMem resource)
 - `dfasm.lark` — Lark grammar for dfasm graph assembly language
 - `emu/` — Behavioural emulator package (SimPy-based discrete event simulation)
···
 - `SMToken(Token)` -- `addr: int`, `op: MemOp`, `flags`, `data`, `ret: Optional[CMToken]`
 - `SysToken(Token)` -- `addr: Optional[int]`
   - `CfgToken(SysToken)` -- `op: CfgOp` (base class, no payload)
-    - `LoadInstToken(CfgToken)` -- `instructions: tuple[ALUInst | SMInst, ...]`
-    - `RouteSetToken(CfgToken)` -- `pe_routes: tuple[int, ...]`, `sm_routes: tuple[int, ...]`
+    - `LoadInstToken(CfgToken)` -- `instructions: tuple[ALUInst | SMInst, ...]` (contiguous block from `addr`)
+    - `RouteSetToken(CfgToken)` -- `pe_routes: frozenset[int]`, `sm_routes: frozenset[int]`
   - `IOToken(SysToken)` -- `data: Optional[List[int]]`
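As a reading aid, the `SysToken` branch of this hunk can be sketched with dataclasses. A minimal sketch only: the enum members and default values here are placeholders, not the real `cm_inst` definitions, and `CMToken` is stubbed out.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class MemOp(Enum):          # placeholder members; the real enum lives in cm_inst
    READ = auto()
    WRITE = auto()
    FREE = auto()


class CfgOp(Enum):          # placeholder members
    LOAD_INST = auto()
    ROUTE_SET = auto()


@dataclass(frozen=True)
class Token:
    pass


@dataclass(frozen=True)
class SMToken(Token):
    addr: int
    op: MemOp
    flags: int = 0
    data: Optional[int] = None
    ret: Optional[Token] = None     # CMToken in the real hierarchy


@dataclass(frozen=True)
class SysToken(Token):
    addr: Optional[int] = None


@dataclass(frozen=True)
class CfgToken(SysToken):
    op: CfgOp = CfgOp.LOAD_INST     # base class, no payload


@dataclass(frozen=True)
class IOToken(SysToken):
    data: Optional[List[int]] = None


assert isinstance(SMToken(addr=3, op=MemOp.READ), Token)
assert IOToken().data is None
```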
 
 ### Instruction Set (cm_inst.py)
···
 - Generation counter (`gen_counters[ctx]`): stale tokens (gen mismatch) are discarded
 
 **Output routing modes** (determined by `_output_mode()`):
-- `SUPPRESS` -- FREE op, or GATE with `bool_out=False`, or no dest_l
+- `SUPPRESS` -- FREE_CTX op, or GATE with `bool_out=False`, or no dest_l
 - `SINGLE` -- dest_l only (no dest_r)
 - `DUAL` -- both dest_l and dest_r (non-switch)
 - `SWITCH` -- SW* routing ops: `bool_out=True` sends data to dest_l, trigger to dest_r; vice versa
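The four modes can be mirrored in a small decision function. This is hypothetical: the real `_output_mode()` inspects the decoded instruction on the PE model, and the string op names here exist purely for illustration.

```python
from enum import Enum, auto
from typing import Optional


class OutputMode(Enum):
    SUPPRESS = auto()
    SINGLE = auto()
    DUAL = auto()
    SWITCH = auto()


# Hypothetical decision logic mirroring the four bullets above.
def output_mode(op_name: str, bool_out: bool,
                dest_l: Optional[int], dest_r: Optional[int]) -> OutputMode:
    if op_name.startswith("SW"):                 # SW* routing ops
        return OutputMode.SWITCH
    if op_name == "FREE_CTX" or dest_l is None:  # nothing to emit
        return OutputMode.SUPPRESS
    if op_name == "GATE" and not bool_out:       # gate closed
        return OutputMode.SUPPRESS
    if dest_r is None:
        return OutputMode.SINGLE
    return OutputMode.DUAL


assert output_mode("ADD", False, 4, None) is OutputMode.SINGLE
assert output_mode("FREE_CTX", False, 4, 5) is OutputMode.SUPPRESS
assert output_mode("SWL", True, 4, 5) is OutputMode.SWITCH
```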
···
 
 ### Module Dependency Graph
 
-Root-level modules (`cm_inst.py`, `tokens.py`, `sm_mod.py`) define the ISA and token types. The `emu/` package imports from root-level modules but root-level modules never import from `emu/`. The `asm/` package imports from both root-level modules and `emu/types.py` (for PEConfig/SMConfig), but neither root-level modules nor `emu/` import from `asm/`.
+`cm_inst.py` defines ISA enums and instruction types (no dependencies). `tokens.py` imports from `cm_inst.py` and defines the token hierarchy. `sm_mod.py` is independent. The `emu/` package imports from root-level modules but root-level modules never import from `emu/`. The `asm/` package imports from both root-level modules and `emu/types.py` (for PEConfig/SMConfig), but neither root-level modules nor `emu/` import from `asm/`.
 
 ```
-tokens.py <-- cm_inst.py <-- emu/types.py
-    ^              |              |
-    |              v              v
-sm_mod.py     emu/alu.py     emu/pe.py <--> emu/sm.py
-    |                 \          /
-    |                emu/network.py
-    |                      ^
-    v                      |
+cm_inst.py <-- tokens.py <-- emu/types.py
+     |              |              |
+     v              v              v
+ emu/alu.py    sm_mod.py     emu/pe.py <--> emu/sm.py
+                      \          /
+                     emu/network.py
+                           ^
+                           |
 asm/ir.py <-- asm/opcodes.py    asm/codegen.py
     |               |                 |
     v               v                 v
asm/CLAUDE.md (+2 -2)

···
 
 ## Dependencies
 
-- **Uses**: `cm_inst` (ALUOp, ALUInst, SMInst, Addr, MemOp), `tokens` (Port, MonadToken, SMToken, CfgToken, CfgOp, MemOp), `sm_mod` (Presence), `emu/types` (PEConfig, SMConfig), `lark` (parser)
+- **Uses**: `cm_inst` (Port, MemOp, CfgOp, ALUOp, ALUInst, SMInst, Addr), `tokens` (MonadToken, SMToken, CfgToken, LoadInstToken, RouteSetToken), `sm_mod` (Presence), `emu/types` (PEConfig, SMConfig), `lark` (parser)
 - **Used by**: Test suite, user programs
 - **Boundary**: `emu/` and root-level modules must NEVER import from `asm/`
 
···
 ## Gotchas
 
 - `MemOp.WRITE` arity depends on const: monadic when const is set (cell_addr from const), dyadic when const is None (cell_addr from left operand)
-- `RoutingOp.FREE` (ALU free) and `MemOp.FREE` (SM free) share the name "free" -- assembler uses `free` for ALU and `free_sm` for SM to disambiguate
+- `RoutingOp.FREE_CTX` (ALU context deallocation) and `MemOp.FREE` (SM free) are disambiguated by mnemonic: assembler uses `free_ctx` for ALU and `free` for SM
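The `MemOp.WRITE` arity gotcha can be restated as a tiny sketch. The helper `write_operands` is hypothetical and does not exist in the codebase; it only encodes the rule from the bullet above.

```python
from typing import Optional, Tuple


# Hypothetical helper restating the gotcha: the same WRITE opcode is
# monadic or dyadic depending on whether its const field is populated.
def write_operands(const: Optional[int], left: int,
                   data: int) -> Tuple[int, int, int]:
    """Return (arity, cell_addr, value) for a MemOp.WRITE."""
    if const is not None:
        return (1, const, data)   # monadic: cell_addr comes from const
    return (2, left, data)        # dyadic: cell_addr from left operand


assert write_operands(const=7, left=99, data=42) == (1, 7, 42)
assert write_operands(const=None, left=99, data=42) == (2, 99, 42)
```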
 
 <!-- freshness: 2026-02-23 -->
···
-# Dynamic Dataflow CPU — Architectural Positioning & Research Notes
-
-Working notes capturing design philosophy decisions, research insights,
-and prior art observations. Not a design spec — a reference for "why we
-chose this direction" and "what we learned from the literature."
-
-## Companion Documents
-
-- `architecture-overview.md` — master architecture reference
-- `pe-design.md` — PE pipeline, matching store, context slots
-- `design-alternatives.md` — rejected/deferred approaches
-- `Prior_Art_Reference_Guide_for_a_Discrete-Logic_Dynamic_Dataflow_CPU.md`
-  — comprehensive bibliography
-
----
-
-## 1. Core Architectural Commitment: Pure Dataflow, Not Hybrid
-
-This project is a **dynamic dataflow machine**, not a multithreaded RISC
-core with dataflow-style synchronisation primitives.
-
-The MIT lineage went: Manchester → TTDA → Monsoon → *T → Sparcle, with
-each step making the PE more like a conventional CPU and reducing the
-dataflow aspects to synchronisation mechanisms (presence bits on memory,
-fast context switching, message passing). By *T (1992), the "dataflow"
-part is essentially hardware semaphores bolted onto a modified SPARC.
-
-**We are not going down that road.** The point of this project is the
-different execution model — where synchronisation is implicit in the
-data flow, not explicit in the program. A PE that needs a program counter,
-register file, bypass network, and branch prediction is solving a
-different problem than we're solving.
-
-Specific non-goals:
-- Sequential instruction streams within a PE (*T, Sparcle, EARTH)
-- Register files as primary operand storage
-- Program counter / sequential fetch logic
-- Branch prediction hardware
-- Hardware semaphores / presence-bit memory traps (*T model)
-- "Make each PE a full CPU" — this blows the transistor budget from
-  4 PEs down to 1-2, losing the parallelism that's the whole point
-
-### Where We Sit on the Spectrum
-
-```
-Pure dataflow ←————————————————————————→ Pure von Neumann
-Manchester       Monsoon        *T/Sparcle     OoO superscalar
-    |               |                |                |
-    |         [this project]        |                |
-    |               |                |                |
-hash matching   ETS/direct       RISC+sync     register renaming
-no PC           no PC            has PC        has PC
-no registers    no regs          register file register file
-```
-
-We're roughly at the Monsoon point on this spectrum: direct-indexed
-matching with presence bits (independently derived, see §2), token-driven
-execution, no program counter. But with a smaller/simpler PE than Monsoon
-(fewer pipeline stages, smaller frames, generation counters for ABA
-protection instead of Monsoon's tighter deallocation control).
-
-### What IS Worth Mining from the Hybrid Work
-
-The Papadopoulos & Traub 1991 paper ("Multithreading: A Revisionist View
-of Dataflow Architectures") contains one microarchitectural optimisation
-that's relevant without changing the architecture:
-
-**Sequential scheduling of monadic chains.** If A feeds B feeds C and all
-are monadic (single-input), the tokens cycle through the full pipeline
-for each hop: token-in → match-bypass → fetch → execute → token-out →
-back to token-in. If the PE could recognise this pattern and keep the
-result in-pipeline for the next instruction (skipping token formatting
-and input FIFO), that's a significant latency win on sequential chains.
-
-This is a **microarchitectural shortcut**, not an architectural change.
-The token semantics don't change. The compiler doesn't need to know. It's
-just an optimisation where the PE notices "output goes to me, monadic,
-next instruction" and short-circuits the pipeline. Worth considering
-post-v0 if sequential throughput is a problem.
-
----
-
-## 2. Convergence with Monsoon's Explicit Token Store
-
-The matching store design in `pe-design.md` — direct-indexed context slots
-with occupied bits, compiler-assigned slot IDs, bump allocator — was
-derived independently from first principles:
-
-1. Manchester's hash table has terrible utilisation (<20%) and enormous
-   hardware cost (16 SRAM banks + comparators per PE)
-2. Amamiya's semi-CAM is better but CAM chips are tiny at discrete scale
-3. If the compiler assigns context IDs statically, you can use them as
-   direct SRAM addresses → no associative lookup at all
-4. Occupied bit = 1-bit presence flag per matching entry
-5. Generation counter handles ABA on slot reuse
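Steps 3-5 can be sketched in a few lines of Python. Illustrative only: the real matching store also carries the bump allocator, throttle, and overflow path, and the slot/gen widths here are the documented 16-slot / 2-bit values, not a hardware model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Slot:
    occupied: bool = False       # 1-bit presence flag per matching entry
    gen: int = 0                 # 2-bit generation counter (ABA protection)
    operand: Optional[int] = None


class MatchStore:
    """Direct-indexed matching: compiler-assigned slot IDs are SRAM addresses."""

    def __init__(self, n_slots: int = 16):
        self.slots: List[Slot] = [Slot() for _ in range(n_slots)]

    def arrive(self, slot_id: int, gen: int, value: int) -> Optional[Tuple[int, int]]:
        """First operand parks; second operand fires. Stale gens are dropped."""
        s = self.slots[slot_id]             # direct index, no associative lookup
        if gen != s.gen:
            return None                     # stale token from a reused slot
        if not s.occupied:
            s.occupied, s.operand = True, value
            return None
        pair = (s.operand, value)
        s.occupied, s.operand = False, None
        return pair

    def free(self, slot_id: int) -> None:
        s = self.slots[slot_id]
        s.occupied = False
        s.gen = (s.gen + 1) % 4             # bump the 2-bit counter on reuse


ms = MatchStore()
assert ms.arrive(3, 0, 10) is None          # first operand waits
assert ms.arrive(3, 0, 20) == (10, 20)      # match fires
ms.free(3)
assert ms.arrive(3, 0, 5) is None           # gen is now 1: old-gen token dropped
```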
-
-This turns out to be essentially the same thing Papadopoulos and Culler
-arrived at with the Explicit Token Store (ETS) for Monsoon (1990), via
-a similar line of reasoning from the TTDA's frame-based matching (Arvind
-& Nikhil 1987/1990).
-
-**Key differences from Monsoon ETS:**
-
-| Aspect | Monsoon ETS | This design |
-|--------|-------------|-------------|
-| Frame size | 128 words (fixed) | 16-32 entries (configurable) |
-| Allocation | Shared free-list of frame pointers | Bump allocator + bitmap/FIFO |
-| ABA protection | Tight dealloc control, no reuse until drained | 2-bit generation counter per slot |
-| Pipeline depth | 8 stages | 5 stages (target) |
-| Matching entry ID | Compiler-assigned slot offset in frame | Compiler-assigned match_entry in context slot |
-| Overflow | Not handled (compiler must fit) | Stall + optional future CAM buffer |
-
-The generation counter is more defensive than Monsoon's approach, which
-is appropriate for a first build where catching bugs matters more than
-saving 2 bits per slot. Monsoon's free-list is cleaner in theory but the
-bump allocator is simpler hardware (counter vs. FIFO management).
-
-**Actionable insight from this convergence:** the ETS papers (especially
-Papadopoulos's 1988 PhD thesis) contain detailed pipeline timing, hazard
-analysis, and state-bit logic that's directly applicable to our matching
-store, even though the designs were arrived at independently. Worth
-reading for implementation details, not just architecture.
-
----
-
-## 3. Clock Efficiency as Primary PE Constraint
-
-In discrete logic, clock speed is the hard constraint. Individual gate
-delays are ~10-30ns per stage depending on technology, and pipeline depth
-directly multiplies total latency. At realistic clock speeds (5-20 MHz
-for well-designed discrete logic), every wasted cycle is expensive.
-
-This means:
-- **Single-cycle matching is non-negotiable** (achieved via direct indexing)
-- **Pipeline depth should be minimised** — each stage adds a clock period
-  of latency. Monsoon's 8 stages would give 400-1600ns per token at
-  discrete-logic speeds. Our target of 5 stages is aggressive but
-  achievable.
-- **Monadic bypass matters** — monadic tokens skipping the matching stage
-  saves a full cycle per monadic operation. At 50% monadic ops (typical
-  for many programs), this is significant.
-- **Network latency is the enemy** — every hop between PEs adds pipeline
-  latency. Compiler-assigned locality (keeping communicating nodes on
-  the same PE) is critical. This is why we care about static PE
-  assignment even though matching is dynamic.
-- **The sequential scheduling shortcut (§1) becomes more valuable** the
-  slower the clock is — if a monadic chain of 5 ops takes 25 cycles
-  through the full pipeline but could take 5 cycles with short-circuit
-  execution, that's 100-500ns saved at discrete speeds.
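The chain arithmetic in the last bullet can be made concrete with a toy model. A sketch under the stated assumptions (each hop pays the full pipeline depth; short-circuit execution pays one cycle per op), not a timing simulation.

```python
# Toy model: a k-op monadic chain costs roughly k * depth cycles through
# the full pipeline, versus ~k cycles if the PE short-circuits results
# in-pipeline (assumed model from the notes above).
def chain_cycles(ops: int, depth: int, short_circuit: bool) -> int:
    return ops if short_circuit else ops * depth


assert chain_cycles(5, 5, short_circuit=False) == 25
assert chain_cycles(5, 5, short_circuit=True) == 5
```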
-
-### Implications for PE Design
-
-Every pipeline stage must justify its existence in terms of critical path.
-If a stage can be merged with an adjacent stage without extending the
-critical path beyond the target clock period, merge it.
-
-Stages that are "free" (can overlap with SRAM access time): address
-generation, mux selection, comparator setup.
-
-Stages that set the clock period: SRAM read (15-25ns for fast async SRAM),
-ALU operation (depends on width — 8-bit add ~15ns in 74HC, 16-bit ~25ns
-with carry lookahead).
-
----
-
-## 4. Technology Notes
-
-### Not Strictly TTL
-
-The project is described as "74-series TTL + SRAM" but the actual target
-technology is more nuanced:
-
-- **74HC / 74HCT CMOS** is the likely primary logic family, not original
-  74-series TTL. HC/HCT gives lower power, better noise margins, and
-  similar or better speed at the gate level. HCT is input-compatible
-  with TTL levels.
-- **74AC / 74ACT** (Advanced CMOS) for critical-path stages where the
-  extra speed matters. ~5ns propagation vs ~10ns for HC.
-- **74F** (FAST TTL) is an option for specific high-speed paths but draws
-  more power and is less available.
-- **Async SRAM** (IS61C256, AS6C4008, etc.) for all bulk storage:
-  instruction memory, matching store, token FIFOs, structure memory.
-  15-25ns access times are the pipeline clock floor.
-- **EEPROMs** (AT28C256 or similar) for instruction memory where runtime
-  reprogramming via type-11 is acceptable with higher write latency.
-  Or SRAM with battery backup / external loading.
-
-The key constraint is **no large-scale integration beyond commodity SRAM
-and EEPROM**. No FPGAs in the final build (though FPGA prototyping is
-encouraged). No custom ASICs. No microcontrollers in the datapath (though
-a microcontroller as external test fixture / bootstrap host is fine for
-development).
-
-The "period-plausible" framing refers to the transistor budget being
-comparable to processors from the late 1970s / early 1980s (68000-class),
-not to the specific technology used. Modern CMOS 74-series parts are
-faster and lower power than original TTL but the logic complexity and
-design methodology are the same.
-
----
-
-## 5. Priority Reading List (from Prior Art Survey)
-
-Based on the reference guide and current design state, prioritised for
-maximum impact on near-term design decisions:
-
-### Must-Read (directly affects current design choices)
-
-1. **Papadopoulos & Culler, "Monsoon: An Explicit Token-Store
-   Architecture" (ISCA 1990)** — ETS mechanics, 8-stage pipeline,
-   frame memory organisation. Closest prior art to our matching store.
-
-2. **Culler & Papadopoulos, "The Explicit Token Store" (JPDC 1990)** —
-   Extended journal version with more detail on state-bit mechanism
-   and pipeline stages.
-
-3. **Papadopoulos PhD thesis (MIT, 1988)** — THE most detailed Monsoon
-   hardware source. Board-level design, chip selection, pipeline timing.
-   Hard to find but worth the effort.
-
-4. **Sakai et al., "An Architecture of a Dataflow Single Chip Processor"
-   (ISCA 1989)** — EM-4 core paper. 50K-gate PE, circular pipeline,
-   direct matching, strongly connected arc model. Most sophisticated
-   pipeline design in the literature.
-
-5. **da Silva & Watson, "A Pseudo-Associative Matching Store with Hardware
-   Hashing" (IEE 1983)** — Even though we're not using hash matching,
-   understanding WHY Manchester went this way and what the tradeoffs
-   were informs our design.
-
-6. **Culler, "Resource Management for the Tagged Token Dataflow
-   Architecture" (MIT TR-332, 1985)** — Token store overflow, deadlock,
-   frame-space management. Essential for understanding throttling and
-   resource constraints.
-
-### Should-Read (informs broader design context)
-
-7. **Dennis, "Building Blocks for Data Flow Prototypes" (ISCA 1980)** —
-   Modular hardware building blocks for discrete-logic dataflow. May
-   directly influence our board-level module decomposition.
-
-8. **Sakai et al., EM-4 network paper (Parallel Computing 1993)** —
-   Circular omega network design and deadlock prevention. Relevant
-   when we scale past shared bus.
-
-9. **Arvind & Nikhil, "Executing a Program on the MIT TTDA" (IEEE TC
-   1990)** — TTDA PE organisation, tag format, I-structure memory.
-   Foundational context for understanding Monsoon.
-
-10. **Papadopoulos & Traub, "Multithreading: A Revisionist View" (ISCA
-    1991)** — Sequential scheduling optimisation. Not for the
-    architecture, but for the microarchitectural shortcut idea.
-
-### Background (fills in the picture)
-
-11. **Lee & Hurson, "Dataflow Architectures and Multithreading" (IEEE
-    Computer 1994)** — Survey bridging pure dataflow to multithreading era.
-
-12. **Arvind & Culler, "Dataflow Architectures" (Annual Review 1986)** —
-    MIT perspective, good overview of the design space.
-
-13. **Grafe et al., "The Epsilon Dataflow Processor" (ISCA 1989)** —
-    Hybrid approach (Sandia), interesting as a contrast to show where
-    we DON'T want to go.
-
-14. **Watson & Gurd, "A Practical Data Flow Computer" (IEEE Computer
-    1982)** — Board-level Manchester hardware aimed at hardware engineers.
-
----
-
-## 6. Key Open Questions Informed by Research
-
-Things the prior art survey surfaced that we should think about:
-
-1. **Monsoon's 128-word frames vs our 16-32 entry slots**: are we too
-   small? Monsoon's larger frames reduce allocation frequency but waste
-   space on small activations. Our smaller slots are more efficient but
-   may cause more allocation churn. Need to compile some test programs
-   and measure.
-
-2. **EM-4's circular pipeline**: their PE reuses pipeline stages for
-   different phases of token processing, reducing total hardware per PE.
-   Worth investigating whether our 5-stage pipeline could benefit from
-   a similar trick.
-
-3. **EM-4's strongly connected arc model**: a different take on monadic
-   chains where consecutive operations within a thread execute without
-   re-entering the network. Related to the sequential scheduling idea
-   but architecturally distinct. Need to read the papers to understand
-   the hardware implications.
-
-4. **I-structure memory (Arvind)**: presence bits on structure memory
-   words for synchronisation. Our SM doesn't currently have this — SM
-   operations are simple read/write/RMW. I-structures enable deferred
-   reads (read of empty word blocks until write arrives). This is a
-   significant capability for certain parallel patterns. Worth
-   evaluating whether SM should support it.
-
-5. **Dennis's "Building Blocks" approach**: modular, composable hardware
-   units for dataflow. May suggest a different physical decomposition
-   than our current CM/SM/IO split. Need to read the paper.
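The deferred-read behaviour in point 4 can be modelled in a few lines. A toy sketch only: the current SM has no presence bits, and the callback-style `resume` interface is an assumption for illustration.

```python
from collections import defaultdict, deque

# Toy I-structure: a read of an empty word is deferred and satisfied
# when the matching write arrives (presence-bit semantics).
class IStructure:
    def __init__(self):
        self.words = {}                         # addr -> value (present)
        self.deferred = defaultdict(deque)      # addr -> waiting readers

    def read(self, addr, resume):
        if addr in self.words:
            resume(self.words[addr])            # presence bit set: answer now
        else:
            self.deferred[addr].append(resume)  # block until written

    def write(self, addr, value):
        self.words[addr] = value
        while self.deferred[addr]:
            self.deferred[addr].popleft()(value)  # wake deferred readers


out = []
ist = IStructure()
ist.read(0, out.append)      # read-before-write: deferred
ist.write(0, 7)              # write wakes the reader
ist.read(0, out.append)      # subsequent read answers immediately
assert out == [7, 7]
```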
design-notes/versions/architecture-overview.md (-290)
···
-# Dynamic Dataflow CPU — Architecture Overview
-
-Master reference document. For detailed design of individual subsystems, see
-companion documents. For rejected/deferred approaches and decision rationale,
-see `design-alternatives.md`.
-
-## Companion Documents
-
-- `pe-design.md` — PE pipeline, matching store, instruction memory, context slots
-- `sm-design.md` — structure memory interface, operations, banking, address space
-- `network-and-communication.md` — interconnect, routing, clocking, handshaking
-- `io-and-bootstrap.md` — I/O subsystem, bootstrap sequence, type-11 protocol
-- `design-alternatives.md` — rejected/deferred approaches with rationale
-
-## Project Goals
-
-- Dynamic dataflow CPU achievable with discrete logic (74-series TTL + SRAM)
-- Multi-PE design targeting superscalar-equivalent IPC
-- "Period-plausible" transistor budget: ~25-35K logic transistors + SRAM chips
-  - Comparable to a 68000 or a couple of Z80s in logic complexity
-  - Reference builds for physical scale: Fabian Schuiki's superscalar CPU,
-    James Sharman's pipelined CPU
-- Must be able to load and execute a binary over serial without a substantial
-  conventional control core
-- Incremental build plan: single PE first, expand to multi-PE
-- Architecture must not rule out future evolution: specifically, must preserve
-  design space for asynchronous operation, network topology changes, and
-  runtime reprogramming
-
-## Key Architectural Decisions
-
-### Execution Model
-- **Dynamic dataflow** (tagged-token), not static like the Electron E1
-- Compiler performs static PE assignment and routing configuration (E1-like)
-- Matching store operates dynamically within each PE for concurrent activations
-- This is a hybrid: static routing topology, dynamic operand matching
-
-### Influences / Reference Architectures
-- **Manchester Dataflow Machine** (Gurd 1985): pipeline structure, matching
-  unit design, overflow handling
-- **DFM / Amamiya 1982**: semi-CAM concept, computational locality,
-  function-instance-based addressing, CM/SM split, TTL prototype
-- **Pao et al. (IP lookup)**: subtree bit-vector parallel search via bitwise
-  AND — useful for collision resolution or routing
-- **Electron E1**: compile-time spatial mapping, tile-based PEs, control core
-  for bootstrap
-- **Yang et al. (DDR SDRAM IP lookup)**: hash + small CAM for collision overflow
-
-### Data Width
-- 8 or 16-bit data words within PEs (TBD, likely 16-bit for practicality)
-- Internal token packets are wider (~24-32 bits for local, multi-flit for remote)
-- Instruction words will be "chunkier" due to tags/destinations
-
-## Token Packet Format (type-tagged, 32-bit)
-
-The 2-bit type field is the primary routing discriminator. It determines both
-the physical destination (which class of module receives the packet) and the
-interpretation of the remaining 30 bits.
-
-### Type Field Semantics
-
-```
-Type 00 — DYADIC: destination is a CM. token carries operand for a dyadic
-          (two-input) instruction. requires matching store lookup.
-Type 01 — MONADIC: destination is a CM. token carries operand for a monadic
-          (single-input) instruction. bypasses matching store.
-Type 10 — STRUCTURE: destination is an SM bank. carries a memory operation
-          request (read, write, atomic RMW, etc.).
-Type 11 — SYSTEM: destination is the I/O subsystem, OR carries an extended/
-          config operation. subtype field discriminates:
-          11 + 00: I/O operation (routed to I/O controller)
-          11 + 01: extended address / config write (e.g., remote instruction
-                   memory write, routing table config)
-          11 + 10: reserved (future: debug/trace, DMA)
-          11 + 11: reserved
-```
-
-Types 00/01 hit CMs only. Type 10 hits SM banks only. Type 11 can hit the
-I/O controller, target PEs (for config writes), or future system infrastructure
-depending on subtype.
-
-### Packet Formats
-
-```
-Type 00 — DYADIC (needs matching, carries generation counter):
-[type:2][PE_id:2][ctx_slot:4][gen:2][offset:7][port:1][data:14]
-
-Type 01 — MONADIC (bypass matching, no gen needed):
-[type:2][PE_id:2][offset:8][data:20]
-
-Type 10 — STRUCTURE (memory access to SM):
-[type:2][SM_id:2][operation:3][address:9][data:16]
-
-Type 11 — SYSTEM (I/O, extended addressing, config):
-[type:2][subtype:2][...28 bits interpreted per subtype...]
-  Subtype 00 (I/O): [device:N][register:N][R/W:1][data:...]
-  Subtype 01 (config): [target_PE:2][target_addr:...][data:...]
-  (exact bit allocation TBD per subtype — 28 bits of payload is generous)
-  Multi-flit when 28 bits isn't enough (config writes carrying full
-  instruction words).
-```
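The 32-bit DYADIC layout (2+2+4+2+7+1+14 bits) can be exercised with a small packing sketch. MSB-first field order is an assumption here; the documents leave bit ordering open.

```python
# Sketch of packing/unpacking the DYADIC format, MSB-first (assumed order).
def pack_dyadic(pe_id: int, ctx_slot: int, gen: int,
                offset: int, port: int, data: int) -> int:
    word = 0b00                      # type 00 = DYADIC in the top 2 bits
    for value, width in ((pe_id, 2), (ctx_slot, 4), (gen, 2),
                         (offset, 7), (port, 1), (data, 14)):
        assert 0 <= value < (1 << width), "field out of range"
        word = (word << width) | value
    return word


def unpack_data(word: int) -> int:
    return word & 0x3FFF             # low 14 bits carry the operand


w = pack_dyadic(pe_id=1, ctx_slot=5, gen=2, offset=33, port=0, data=1000)
assert w < (1 << 32)                 # fits the 32-bit bus
assert unpack_data(w) == 1000
assert (w >> 30) == 0b00             # type field
```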
-
-### Key Design Rationale
-
-- Different token types have different overhead requirements — no point paying
-  generation counter + context slot tax on monadic ops or memory accesses
-- Dyadic tokens carry 14-bit data (sufficient for most intermediates; full
-  16-bit literals can be loaded via monadic "load immediate" feeding into
-  dyadic node)
-- Monadic tokens get full 20-bit data payload on same 32-bit bus
-- Structure tokens carry full 16-bit data + 9-bit address for SM operations
-- Type 11 is the system management channel: I/O, config, and future
-  debug/trace infrastructure all live here with subtype discrimination
-- Generation counter (2-bit) ONLY on dyadic tokens — prevents ABA problem
-  when context slots are reused after deallocation
-- 32-bit bus width works with 8-bit-wide SRAM (4 bytes per token)
-- If 14-bit dyadic data is too tight, bump to 36-bit bus (9 nibbles, works
-  with 4-bit-wide SRAM). Decision deferred.
-
-## Module Taxonomy
-
-### CM (Control Module) — execution and matching
-- Instruction memory (IM): stores dataflow program (function bodies)
-  - **Runtime-writable** via type-11 config packets from the network
-  - Write from network stalls the pipeline (acceptable for config operations)
-  - Enables runtime reprogramming and eliminates need for separate config bus
-- Operand memory (OM) / matching store: buffers arriving operands, performs
-  matching
-- Receives tokens from CN (types 00/01) and DN (SM results repackaged as
-  type 00/01), produces tokens to CN and AN
-- Contains the bump allocator, throttle, and generation counter logic
-- Each PE has a unique ID, set via EEPROM (instruction decoder doubles as
-  ID store) or DIP switches during prototyping
-- See `pe-design.md` for pipeline details
-
-### SM (Structure Memory) — data storage and structure operations
-- Banked data memory (cells) for arrays, lists, heap data
-- Embedded functional units for structure operations (read, write, atomic
-  RMW, etc.)
-- Receives operation requests via AN (type 10), returns results via DN
-  (repackaged as type 00/01 tokens)
-- Operates asynchronously from CMs — split-phase memory access
-- Pure data storage — no I/O mapping (I/O lives in the type-11 subsystem)
-- See `sm-design.md` for interface and banking details
-
-### I/O Controller — peripheral interface
-- Fixed-function device on the network, NOT a full PE
-- Receives type-11 subtype-00 packets, interprets as I/O commands
-- Returns results as type 00/01 tokens to the requesting CM
-- Can spontaneously emit tokens (unsolicited I/O: UART RX, interrupt
-  equivalent) — the only network participant that generates tokens
-  without receiving one first
-- Also handles type-11 subtype-01 during bootstrap (reading from UART/flash,
-  formatting config writes to load programs into PEs)
-- See `io-and-bootstrap.md` for design details
-
-### Three Logical Interconnects (shared physical bus for v0)
-
-```
-CN (Communication Network): CM <-> CM, types 00/01
-AN (Arbitration Network):   CM -> SM, type 10
-DN (Distribution Network):  SM -> CM, type 10 results repackaged as 00/01
-System channel:             any <-> I/O controller, type 11
-```
-
-For v0 (4 PEs + 1-2 SMs + I/O controller), all traffic shares a single
-physical bus with type-based routing. Routing nodes inspect the type field
-and forward to the appropriate destination. Multiple packets can be in
-flight simultaneously if the bus is pipelined with latches at each stage.
-
-The AN/DN can be split onto separate physical paths later if SM access
-contention becomes a bottleneck. The type-field-based routing means this
-is a topology change, not a protocol change — no module interfaces need
-to change.
-
-See `network-and-communication.md` for routing, clocking, and scaling details.
-
-## Transistor Budget Estimate (4-PE system)
-
-| Component | Transistors |
-|-----------|------------|
-| 4x PE logic | 20-32K |
-| Routing network (4 PEs) | 2-3K |
-| I/O controller | ~1-2K |
-| **Total logic** | **~25-35K** |
-| SRAM chips (instruction mem, matching stores, token queues) | 8-16 chips |
-
-Note: bootstrap microsequencer removed from budget — bootstrap is handled
-by the I/O controller + type-11 config writes, or by an external
-microcontroller during early prototyping. No dedicated bootstrap hardware
-in the final architecture.
-
-## IPC / Performance Expectations
-
-- "Superscalar" is the wrong term for dataflow — there's no single
-  instruction stream
-- With 4 PEs and single-cycle matching (common case), peak is 4 ops/clock
-- Realistic sustained throughput depends on:
-  - Network crossing frequency (adds routing latency)
-  - Hash path hits vs direct index (matching latency)
-  - Available parallelism in the program
-  - Network contention (shared bus at v0 scale)
-- Parallel workloads (matrix multiply, FFT): near peak
-- Sequential/pointer-chasing code: ~0.5-1 ops/clock (still competitive
-  with 6502)
-- Key insight: matching store performance is the primary bottleneck, as
-  Manchester discovered
-
-## Build Order
-
-### Phase 0: SM (Structure Memory) — BUILD FIRST
-- Self-contained module, testable in isolation
-- Drive with microcontroller (Arduino/RP2040) for testing
-- Defined interface: receive operation request, process, return result
-- Key deliverables:
-  - Banked SRAM with address decoding
-  - Simple operation unit (read/write at minimum, cons/car/cdr stretch goals)
-  - Input interface (receive request packets)
-  - Output interface (send result packets)
-  - Test harness: microcontroller sends requests, validates responses
-
-### Phase 1: CM (Control Module) — single PE
-- Instruction memory (SRAM)
-- Matching store with direct-indexed context slots
-- Bump allocator + throttle + generation counters
-- 8/16-bit ALU
-- Token FIFO (input)
-- Token output formatting
-- Test with microcontroller injecting tokens, verify matching + execution
-
-### Phase 2: CM + SM pair
-- Connect via shared bus with type routing
-- Load a program using microcontroller (external, via type-11 config writes
-  or direct SRAM programming)
-- Execute a dataflow graph that uses structure memory
-- First real program: fibonacci, small FFT, or similar
-
-### Phase 3: Multi-module
-- Second CM, routing network
-- Prove cross-PE token routing works
-- Demonstrate actual parallel execution speedup
-
-### Phase 4: System
-- Expand to 4 CMs + 1-2 SMs
-- I/O controller (type-11 subsystem) with UART
-- Bootstrap via I/O controller reading from flash/serial
-- ISR support (compiler-assigned PE with interrupt token injection from
-  I/O controller)
-- Performance benchmarking vs period-equivalent CPUs
249249-250250-## Open Questions / Next Steps
251251-252252-1. **SM internal design** — CURRENT FOCUS: banking scheme, operation set,
253253- interface protocol
254254-2. **Matching store SRAM addressing** — detailed direct-index + hash fallback
255255- scheme
256256-3. **Context slot count per CM** — 4 bits = 16 slots (12KB SRAM each) vs wider
257257-4. **Data width decision** — 14-bit dyadic payload okay, or bump bus to 36 bits?
258258-5. **Instruction encoding** — operation set, format, how wide
259259-6. **Type-11 packet format** — exact bit allocation for I/O and config subtypes
260260-7. **I/O controller internal design** — state machine, UART bridge, unsolicited
261261- token generation
262262-8. **Compiler / assembler** — hand-written dataflow asm for v0, assembler that
263263- packs token fields
264264-9. **Monadic/dyadic optimisation** — deferred, revisit after v0 matching store
265265- works
266266-267267-## Key Papers in Project
268268-269269-- `gurd1985.pdf` — Manchester Dataflow Machine (matching unit details,
270270- overflow, pipeline)
271271-- `Dataflow_Machine_Architecture.pdf` — Veen survey (comprehensive overview,
272272- matching space analysis)
273273-- `amamiya1982.pdf` — DFM architecture (semi-CAM, structure memory, TTL
274274- prototype)
275275-- `17407_17358.pdf` — DFM evaluation (implementation details, benchmarks,
276276- VLSI projection)
277277-- `efficienthardwarearchitectureforfastipaddresslookup.pdf` — Pao et al.
278278- (binary-trie partitioning, bit-vector parallel search, SRAM pipeline)
279279-- `mclaughlin2005.pdf` — IP lookup survey (comparison of trie vs hash
280280- approaches in hardware)
281281-- `HighperformanceIPlookupcircuitusingDDRSDRAM.pdf` — Yang et al. (hash +
282282- CAM overflow, DDR burst for multi-bank)
283283-- `NonStrict_Execution_in_Parallel_and_Distributed_C.pdf` — non-strict
284284- execution, split-phase memory
285285-- `NATLS219821.pdf` — National Semiconductor 100142 CAM chip (4x4-bit,
286286- reference for discrete CAM scale)
287287-- `MOSES071271.pdf` — Motorola MCM69C233 CAM (32-bit match width, reference
288288- for CAM interface design)
289289-- `yuba1983.pdf` — Yuba et al. (PE pipeline sections, pseudo-result handling,
290290- packet formats)
# Dynamic Dataflow CPU — Architecture Notes

## Project Goals

- Dynamic dataflow CPU achievable with discrete logic (74-series TTL + SRAM)
- Multi-PE design targeting superscalar-equivalent IPC
- "Period-plausible" transistor budget: ~25-35K logic transistors + SRAM chips
  - Comparable to a 68000 or a couple of Z80s in logic complexity
  - Reference builds for physical scale: Fabian Schuiki's superscalar CPU, James Sharman's pipelined CPU
- Must be able to load and execute a binary over serial without a substantial conventional control core
- Incremental build plan: single PE first, expand to multi-PE

## Key Architectural Decisions

### Execution Model

- **Dynamic dataflow** (tagged-token), not static like the Electron E1
- Compiler performs static PE assignment and routing configuration (E1-like)
- Matching store operates dynamically within each PE for concurrent activations
- This is a hybrid: static routing topology, dynamic operand matching

### Influences / Reference Architectures

- **Manchester Dataflow Machine** (Gurd 1985): pipeline structure, matching unit design, overflow handling
- **DFM / Amamiya 1982**: semi-CAM concept, computational locality, function-instance-based addressing
- **Pao et al. (IP lookup)**: subtree bit-vector parallel search via bitwise AND — useful for collision resolution or routing
- **Electron E1**: compile-time spatial mapping, tile-based PEs, control core for bootstrap
- **Yang et al. (DDR SDRAM IP lookup)**: hash + small CAM for collision overflow

### Data Width

- 8 or 16-bit data words within PEs (TBD, likely 16-bit for practicality)
- Internal token packets are wider (~24-32 bits for local, multi-flit for remote)
- Instruction words will be "chunkier" due to tags/destinations

### Token Packet Format (type-tagged, 32-bit)

Four token types, distinguished by 2-bit type field:

```
Type 00 — DYADIC (needs matching, carries generation counter):
[type:2][PE_id:2][ctx_slot:4][gen:2][offset:7][port:1][data:14]

Type 01 — MONADIC (bypass matching, no gen needed):
[type:2][PE_id:2][offset:8][data:20]

Type 10 — STRUCTURE (memory access to SM):
[type:2][SM_id:2][operation:3][address:9][data:16]

Type 11 — EXTENDED (multi-flit, remote/interrupt/control):
[type:2][subtype:4][...first 26 bits of extended payload...]
[...second flit: remaining 32 bits of payload...]
```

Key design rationale:

- Different token types have different overhead requirements — no point paying
  generation counter + context slot tax on monadic ops or memory accesses
- Dyadic tokens carry 14-bit data (sufficient for most intermediates; full 16-bit
  literals can be loaded via monadic "load immediate" feeding into dyadic node)
- Monadic tokens get full 20-bit data payload on same 32-bit bus
- Structure tokens carry full 16-bit data + 9-bit address for SM operations
- Extended type is the escape hatch: remote routing, interrupts, control signals
- Generation counter (2-bit) ONLY on dyadic tokens — prevents ABA problem
  when context slots are reused after deallocation
- 32-bit bus width works with 8-bit-wide SRAM (4 bytes per token)
- If 14-bit dyadic data is too tight, bump to 36-bit bus (9 nibbles, works with
  4-bit-wide SRAM). Decision deferred.
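
A quick way to sanity-check the field widths is to round-trip a Type 00 token in software. A minimal Python sketch (function names are illustrative, not part of any spec):

```python
# Pack/unpack the Type 00 (DYADIC) layout, MSB-first:
# [type:2][PE_id:2][ctx_slot:4][gen:2][offset:7][port:1][data:14] = 32 bits

FIELDS = [("type", 2), ("pe_id", 2), ("ctx_slot", 4), ("gen", 2),
          ("offset", 7), ("port", 1), ("data", 14)]

def pack_dyadic(pe_id, ctx_slot, gen, offset, port, data):
    values = {"type": 0b00, "pe_id": pe_id, "ctx_slot": ctx_slot,
              "gen": gen, "offset": offset, "port": port, "data": data}
    word = 0
    for name, width in FIELDS:
        assert 0 <= values[name] < (1 << width), name  # field overflow check
        word = (word << width) | values[name]
    return word

def unpack_dyadic(word):
    out = {}
    for name, width in reversed(FIELDS):   # pop fields from the LSB end
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out
```

Round-tripping a token through these two functions confirms the widths sum to exactly 32 bits.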

### Context Slot Lifecycle

- **Allocation**: bump allocator (counter + register) per PE, assigns slot ID on
  function activation. Trivial hardware: counter, adder, gate.
- **Deallocation**: compiler inserts explicit "free" instruction on every exit path
  of a function body. Multiple frees are harmless (idempotent).
- **ABA protection**: 2-bit generation counter per slot, incremented on each
  reallocation. Tokens carry the generation they were created under. Mismatch =
  stale token, discarded. 4 generations before wraparound; stale tokens drain
  in 2-5 cycles, so wraparound collision is effectively impossible.
- **Throttle**: saturating counter tracks active slots per PE. When full, stalls
  new allocations until a free occurs. Hardware cost: counter + comparator + gate
  (~10 TTL chips).
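
The whole lifecycle is small enough to model in a few lines. A behavioural sketch, assuming 16 slots and the 2-bit generation scheme above (class and method names are illustrative):

```python
SLOTS = 16  # assumption: 4-bit ctx_slot field

class ContextAllocator:
    def __init__(self):
        self.next_slot = 0            # bump allocator state
        self.gen = [0] * SLOTS        # 2-bit generation per slot
        self.live = [False] * SLOTS
        self.active = 0               # throttle: count of live slots

    def alloc(self):
        if self.active == SLOTS:
            return None               # throttle full: stall the allocation
        while self.live[self.next_slot]:
            self.next_slot = (self.next_slot + 1) % SLOTS
        slot = self.next_slot
        self.live[slot] = True
        self.active += 1
        self.gen[slot] = (self.gen[slot] + 1) & 0b11  # bump gen on reallocation
        return slot, self.gen[slot]

    def free(self, slot):
        if self.live[slot]:           # repeated frees are idempotent
            self.live[slot] = False
            self.active -= 1

    def accept(self, slot, token_gen):
        # generation mismatch = stale token, discard
        return self.live[slot] and token_gen == self.gen[slot]
```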

### Token Routing Network

- **Hierarchical prefix-based routing**, NOT Manchester-style omega network
- Omega networks have fixed latency regardless of distance (bad — "DRAM from the moon")
- Prefix routing gives variable latency: local = 1 hop, cross-cluster = 2-3 hops
- Average latency depends on program locality, which the compiler can optimise
- Each routing node has a small prefix lookup table, configured at program load time
- Top bits of PE_id select cluster, lower bits select within cluster
- Pao's bitwise AND trick potentially useful for routing decisions or small associative lookups at routing nodes
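
A toy model of the latency asymmetry, assuming a 4-bit PE_id whose top 2 bits select the cluster (field widths and hop counts here are illustrative, not settled):

```python
# Hierarchical prefix routing: compare address prefixes, not full addresses.
# Local delivery is cheap; crossing clusters pays extra hops.

CLUSTER_BITS = 2   # assumption: low 2 bits select the PE within a cluster

def route_hops(src_pe: int, dst_pe: int) -> int:
    if src_pe == dst_pe:
        return 0                                        # self-delivery
    if (src_pe >> CLUSTER_BITS) == (dst_pe >> CLUSTER_BITS):
        return 1                                        # one hop inside cluster
    return 3                                            # up, across, back down
```

Contrast with an omega network, where every pair of PEs pays the same full crossing latency.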

### Module Architecture: CM/SM Split (Amamiya-inspired)

Two module types with distinct roles, connected by potentially separate buses:

**CM (Control Module)** — execution and matching:
- Instruction memory (IM): stores dataflow program (function bodies)
- Operand memory (OM) / matching store: buffers arriving operands, performs matching
- Receives tokens from CN and DN, produces tokens to CN and AN
- Contains the bump allocator, throttle, and generation counter logic

**SM (Structure Memory)** — data storage and structure operations:
- Banked data memory (cells) for arrays, lists, heap data
- Embedded functional units for structure operations (read, write, cons, car, cdr, etc.)
- Receives operation requests via AN, returns results via DN
- Operates asynchronously from CMs — split-phase memory access

**Three interconnects (can share physical bus with type-based routing, or be separate):**
- **CN** (Communication Network): CM-to-CM, carries dyadic/monadic tokens (types 00, 01)
- **AN** (Arbitration Network): CM-to-SM, carries structure operation requests (type 10)
- **DN** (Distribution Network): SM-to-CM, carries structure operation results

Rationale for the split:

- Different traffic types have different width requirements — no need to force
  them all onto one fat bus
- SM can handle memory operations concurrently while CMs continue matching/executing
- SM has its own functional units, so memory operations don't consume CM ALU cycles
- SM banking allows parallel access from multiple CMs, reducing contention
- Aligns with Amamiya 1982 DFM architecture (prototype built in TTL)
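
When the three networks share one physical bus, the 2-bit type field alone steers each packet. A dispatch sketch — note the type 11 handling and the fact that DN traffic is distinguished by direction rather than type are assumptions here:

```python
# Route a 32-bit bus word to a logical network by its type field alone.

def classify(word: int) -> str:
    token_type = (word >> 30) & 0b11      # type field sits in the top 2 bits
    return {0b00: "CN",                   # dyadic token, CM-to-CM
            0b01: "CN",                   # monadic token, CM-to-CM
            0b10: "AN",                   # structure request, CM-to-SM
            0b11: "EXT"}[token_type]      # extended: remote/interrupt/control
```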

### Matching Store Design (highest-risk component)

- **Primary path: direct-indexed context slots** (Amamiya-style semi-CAM)
  - Bump allocator (counter + register) assigns context slot IDs to function activations
  - Context slot ID directly addresses a bank of SRAM
  - Instruction offset within function body used as direct address within that bank
  - Single-cycle matching for the common case — no hashing, no search
- **Fallback path: hash-based matching** for dynamic/overflow cases
  - Multiplicative hashing: `(a * K) >> (w - m)` — simple to implement in hardware
  - Multi-bank (4-8 banks) checked in parallel for collision tolerance (Manchester-style set-associative)
  - Overflow to linked list or dedicated overflow buffer for worst case
- **Compiler-assisted tag assignment**:
  - Static-lifetime values get contiguous, dense tags — sequential readout, no hashing
  - Dynamic activations get allocated tags via bump allocator
  - Potential for hybrid: half of matching store uses precalculated tags, half uses runtime hash
- **Deallocation / reuse**:
  - Explicit "free" instruction on every function exit path (compiler-inserted)
  - Multiple frees are idempotent / harmless
  - Generation counter (2-bit) prevents ABA problem on slot reuse
  - Throttle (saturating counter) prevents matching store overflow
- **Monadic/dyadic optimisation (optional)**:
  - Compiler assigns matching store indices only to dyadic nodes
  - Monadic nodes bypass matching, don't consume matching store cells
  - Requires indirection: matching store cell includes instruction address pointer
  - Cell width increases (~8 bits for instr_addr) but cell count decreases (~60% fewer)
  - local_offset in token = matching store index, NOT instruction address
  - Deferred for v0: simpler to have local_offset = instruction address = matching store address
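
The fallback path is easy to prototype in software before committing to hardware. A sketch with illustrative constants (16-bit tags, 64 sets, Knuth's 16-bit golden-ratio multiplier; the four banks are probed in parallel in hardware, sequentially here):

```python
W, M = 16, 6          # w-bit tags hashed down to m-bit set indices
K = 40503             # odd multiplier, ~2**16 / golden ratio (illustrative)
BANKS = 4

def hash_index(tag: int) -> int:
    """Multiplicative hash: (a * K) >> (w - m), truncated to w bits."""
    return ((tag * K) & 0xFFFF) >> (W - M)

class MatchStore:
    def __init__(self):
        self.banks = [dict() for _ in range(BANKS)]   # index -> (tag, operand)

    def match(self, tag, operand):
        """Return the parked partner operand on a match, else park this one."""
        idx = hash_index(tag)
        for bank in self.banks:                       # parallel in hardware
            if bank.get(idx, (None,))[0] == tag:
                return bank.pop(idx)[1]               # matched: pair fires
        for bank in self.banks:
            if idx not in bank:
                bank[idx] = (tag, operand)            # park first operand
                return None
        raise OverflowError("all banks collide: overflow buffer needed")
```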

### PE Pipeline (5-stage sketch)

```
Stage 1: TOKEN INPUT
 - Receive token from network
 - Buffer in small FIFO (8-deep, 32-bit)
 - ~1K transistors (flip-flops) or use small SRAM

Stage 2: MATCH / BYPASS
 - Direct-index into context slot array (common case, single cycle)
 - Hash path for dynamic/overflow (multi-cycle)
 - Monadic instructions bypass matching entirely
 - Estimated: 2-3K transistors + SRAM

Stage 3: INSTRUCTION FETCH
 - Use local offset to read from PE's instruction SRAM
 - External SRAM chip, so just address generation logic
 - ~200 transistors of logic

Stage 4: EXECUTE
 - 8/16-bit ALU
 - ~500-2000 transistors depending on width and features

Stage 5: TOKEN OUTPUT
 - Form result token with routing prefix
 - Inject into network
 - ~300 transistors
```

- Pipeline registers between stages: ~500 transistors
- Control logic (state machine, handshaking): ~500-1000 transistors

**Per-PE total: ~5-8K transistors of logic + SRAM chips**

### Transistor Budget Estimate (4-PE system)

| Component | Transistors |
|-----------|-------------|
| 4x PE logic | 20-32K |
| Routing network (4 PEs) | 2-3K |
| Bootstrap/loader microsequencer | 1-2K |
| **Total logic** | **~25-35K** |
| SRAM chips (instruction mem, matching stores, token queues) | 8-16 chips |

### Bootstrap / Program Loading

- Hardwired microsequencer (NOT a full CPU)
- Receives serial data, writes to instruction memory and routing tables via dedicated configuration bus
- Config bus is separate from the token network
- ROM-based state machine + UART receiver + bus master interface
- ~20-30 TTL chips estimated
- Issues "start" signal to release token flow
- Alternative: a PE hardwired to run a built-in "loader" program from ROM

### Interrupt Handling

- ISRs are subgraphs in the dataflow program, compiled and mapped to specific PEs like any other code
- Compiler designates which PE(s) handle which interrupts
- Hardware cost: edge detector on I/O pin, gated into token input FIFO of the assigned PE
- Interrupt token injected into FIFO — PE doesn't need special hardware, just sees a token arrive
- Priority: interrupt tokens can jump the FIFO queue (~3 extra chips)
- ISR runs *concurrently* with main program on its reserved PEs — no context switch
- Main program has nodes waiting for "interrupt result" tokens
- Trade-off: reserved ISR PEs sit idle when no interrupts pending
- Scalable: compile-time assignment means you can have multiple ISR PEs for different interrupt sources
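
The queue-jumping behaviour is just a two-class input queue. A minimal model (depth and API are illustrative):

```python
from collections import deque

class TokenFIFO:
    def __init__(self, depth=8):
        self.depth = depth
        self.irq = deque()        # interrupt tokens drain first
        self.normal = deque()

    def push(self, token, interrupt=False):
        if len(self.irq) + len(self.normal) == self.depth:
            return False          # full: back-pressure the network
        (self.irq if interrupt else self.normal).append(token)
        return True

    def pop(self):
        if self.irq:
            return self.irq.popleft()     # interrupts jump the queue
        return self.normal.popleft() if self.normal else None
```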

### IPC / Performance Expectations

- "Superscalar" is the wrong term for dataflow — there's no single instruction stream
- With 4 PEs and single-cycle matching (common case), peak is 4 ops/clock
- Realistic sustained throughput depends on:
  - Network crossing frequency (adds routing latency)
  - Hash path hits vs direct index (matching latency)
  - Available parallelism in the program
  - Network contention (shared bus at v0 scale)
- Parallel workloads (matrix multiply, FFT): near peak
- Sequential/pointer-chasing code: ~0.5-1 ops/clock (still competitive with 6502)
- Key insight: matching store performance is the primary bottleneck, as Manchester discovered

## Build Order

### Phase 0: SM (Structure Memory) — BUILD FIRST

- Self-contained module, testable in isolation
- Drive with microcontroller (Arduino/RP2040) for testing
- Defined interface: receive operation request, process, return result
- Key deliverables:
  - Banked SRAM with address decoding
  - Simple operation unit (read/write at minimum, cons/car/cdr stretch goals)
  - AN input interface (receive request packets)
  - DN output interface (send result packets)
  - Test harness: microcontroller sends requests, validates responses

### Phase 1: CM (Control Module) — single PE

- Instruction memory (SRAM)
- Matching store with direct-indexed context slots
- Bump allocator + throttle + generation counters
- 8/16-bit ALU
- Token FIFO (input)
- Token output formatting
- Test with microcontroller injecting tokens, verify matching + execution

### Phase 2: CM + SM pair

- Connect via shared bus with type routing
- Load a program using microcontroller (external, via type-11 config writes
  or direct SRAM programming)
- Execute a dataflow graph that uses structure memory
- First real program: Fibonacci, small FFT, or similar

### Phase 3: Multi-module

- Second CM, routing network
- Prove cross-PE token routing works
- Demonstrate actual parallel execution speedup

### Phase 4: System

- Expand to 4 CMs + 1-2 SMs
- I/O controller (type-11 subsystem) with UART
- Bootstrap via I/O controller reading from flash/serial
- ISR support (compiler-assigned PE with interrupt token injection from
  I/O controller)
- Performance benchmarking vs period-equivalent CPUs

## Open Questions / Next Steps

1. **SM internal design** — CURRENT FOCUS: banking scheme, operation set,
   interface protocol
2. **Matching store SRAM addressing** — detailed direct-index + hash fallback
   scheme
3. **Context slot count per CM** — 4 bits = 16 slots (12KB SRAM each) vs wider
4. **Data width decision** — 14-bit dyadic payload okay, or bump bus to 36 bits?
5. **Instruction encoding** — operation set, format, how wide
6. **Type-11 packet format** — exact bit allocation for I/O and config subtypes
7. **I/O controller internal design** — state machine, UART bridge, unsolicited
   token generation
8. **Compiler / assembler** — hand-written dataflow asm for v0, assembler that
   packs token fields
9. **Monadic/dyadic optimisation** — deferred, revisit after v0 matching store
   works

## Key Papers in Project

- `gurd1985.pdf` — Manchester Dataflow Machine (matching unit details, overflow, pipeline)
- `Dataflow_Machine_Architecture.pdf` — Veen survey (comprehensive overview, matching space analysis)
- `amamiya1982.pdf` — DFM architecture (semi-CAM, structure memory, TTL prototype)
- `17407_17358.pdf` — DFM evaluation (implementation details, benchmarks, VLSI projection)
- `efficienthardwarearchitectureforfastipaddresslookup.pdf` — Pao et al. (binary-trie partitioning, bit-vector parallel search, SRAM pipeline)
- `mclaughlin2005.pdf` — IP lookup survey (comparison of trie vs hash approaches in hardware)
- `HighperformanceIPlookupcircuitusingDDRSDRAM.pdf` — Yang et al. (hash + CAM overflow, DDR burst for multi-bank)
- `NonStrict_Execution_in_Parallel_and_Distributed_C.pdf` — non-strict execution, split-phase memory
- `NATLS219821.pdf` — National Semiconductor 100142 CAM chip (4x4-bit, reference for discrete CAM scale)
- `MOSES071271.pdf` — Motorola MCM69C233 CAM (32-bit match width, reference for CAM interface design)
- `yuba1983.pdf` — Yuba et al. (PE pipeline sections, pseudo-result handling, packet formats)

## SM (Structure Memory) — Detailed Design

### Interface Protocol

Stateless request handling: the request token carries its own return routing info
in the bits that are unused by that operation type. SM never maintains pending-request
state — result packets are self-addressed.

```
READ request (data field repurposed for return routing):
[type:2][SM_id:2][op:3][address:9][ret_CM:2][ret_offset:8][ret_ctx:4][ret_port:1][pad:1]

WRITE request (data field carries write data, no response needed):
[type:2][SM_id:2][op:3][address:9][data:16]

READ_INC / READ_DEC (same as READ format — return routing in data field):
[type:2][SM_id:2][op:3][address:9][ret_CM:2][ret_offset:8][ret_ctx:4][ret_port:1][pad:1]

CAS — compare-and-swap (two-flit operation):
Flit 1: [type:2][SM_id:2][op:3][address:9][expected_value:16]
Flit 2: [new_value:16][ret_CM:2][ret_offset:8][ret_ctx:4][ret_port:1][pad:1]
```

Result packet on DN (SM -> CM): repackaged as a dyadic or monadic token destined
for [ret_CM, ret_offset, ret_ctx, ret_port] with the fetched data as payload.
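
The READ request packing is mechanical once the field order is fixed. A sketch (opcode 000 for READ per the operation set below; the function name is illustrative):

```python
# Pack a READ request, MSB-first per the format above:
# [type:2][SM_id:2][op:3][address:9][ret_CM:2][ret_offset:8][ret_ctx:4][ret_port:1][pad:1]

def pack_read(sm_id, address, ret_cm, ret_offset, ret_ctx, ret_port):
    word = 0b10                            # type 10: STRUCTURE
    for value, width in ((sm_id, 2), (0b000, 3), (address, 9),
                         (ret_cm, 2), (ret_offset, 8), (ret_ctx, 4),
                         (ret_port, 1), (0, 1)):   # final bit is padding
        assert 0 <= value < (1 << width)
        word = (word << width) | value
    return word
```

The return-route fields occupy exactly the 16 bits a WRITE uses for data, which is what lets the SM stay stateless.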

### Operation Set (3-bit opcode, 8 slots)

```
000: READ     — read address, return data via DN
001: WRITE    — write data to address (no DN response)
010: READ_INC — atomic fetch-and-add(+1), return old value (= atomic ptr increment)
011: READ_DEC — atomic fetch-and-add(-1), return old value (= refcount decrement)
100: CAS      — compare-and-swap (two-flit), return old value + success bit
101: ALLOC    — (future) allocate N cells, return base address
110: FREE     — (future) mark cells as available
111: RESERVED
```

READ_INC / READ_DEC are fetch-and-add primitives — they give atomic pointer
operations and reference counting without dedicated refcount hardware. CM checks
returned value for zero (refcount exhausted) using its normal ALU.
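
A behavioural sketch of the read-modify-write semantics — every op returns the *old* value, with 16-bit wraparound (ALLOC/FREE omitted; class name illustrative):

```python
class SMBank:
    def __init__(self, cells=512):
        self.mem = [0] * cells            # 9-bit address space, 16-bit cells

    def read_inc(self, addr):
        old = self.mem[addr]
        self.mem[addr] = (old + 1) & 0xFFFF   # fetch-and-add(+1)
        return old

    def read_dec(self, addr):
        old = self.mem[addr]
        self.mem[addr] = (old - 1) & 0xFFFF   # fetch-and-add(-1)
        return old

    def cas(self, addr, expected, new):
        old = self.mem[addr]
        if old == expected:
            self.mem[addr] = new & 0xFFFF
        return old, old == expected           # old value + success bit
```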

### Hardware Architecture

```
AN Input Interface                         DN Output Interface
 (receive request)                           (send result)
        |                                         ^
        v                                         |
  [Request FIFO]                            [Result FIFO]
        |                                         ^
        v                                         |
  [Op Decoder]----+                       [Result Formatter]
        |         |                               ^
        v         v                               |
  [Addr Decode] [ALU for inc/dec/cas]      [Bank Read Data]
        |         |                               ^
        v         v                               |
  [SRAM Bank 0] [SRAM Bank 1]  ...         [SRAM Bank N]
```

- Banking: start with 2 banks (1 address bit selects bank) for v0
- 9-bit address = 512 cells per SM = 1KB at 16-bit data width
- Each bank is one SRAM chip with room to spare
- ALU is minimal: increment, decrement, compare. Not a full ALU.
- Op decoder determines: is this read/write/RMW? one-flit or two-flit?
  does it need a DN response? how to pack the result?
- Result formatter extracts return routing from request, constructs DN token

### V0 Test Plan

- Drive AN input with microcontroller (RP2040 / Arduino)
- Microcontroller formats 32-bit request packets, clocks into request FIFO
- Read 32-bit result packets from DN output FIFO
- Test suite: sequential read/write, random access, read_inc sequences,
  bank contention (same bank back-to-back), boundary conditions

### 4x4-bit CAM Chips (National Semiconductor 100142)

- Available in DIP, period-appropriate
- 4 words x 4 bits each — very small but potentially useful for:
  - Small routing tables at network nodes (4-8 entries)
  - Context slot allocation lookup (which slots are free)
  - NOT practical for bulk matching store (too few entries per chip)
- Datasheet scan in project: NATLS219821.pdf
292292-- Trade-off: reserved ISR PEs sit idle when no interrupts pending
293293-- Scalable: compile-time assignment means you can have multiple ISR PEs for different interrupt sources
294294-295295-### IPC / Performance Expectations
296296-- "Superscalar" is the wrong term for dataflow — there's no single instruction stream
297297-- With 4 PEs and single-cycle matching (common case), peak is 4 ops/clock
298298-- Realistic sustained throughput depends on:
299299- - Network crossing frequency (adds routing latency)
300300- - Hash path hits vs direct index (matching latency)
301301- - Available parallelism in the program
302302-- Parallel workloads (matrix multiply, FFT): near peak
303303-- Sequential/pointer-chasing code: ~0.5-1 ops/clock (still competitive with 6502)
304304-- Key insight: matching store performance is the primary bottleneck, as Manchester discovered
305305-306306-## Build Order
307307-308308-### Phase 0: SM (Structure Memory) — BUILD FIRST
309309-- Self-contained module, testable in isolation
310310-- Drive with microcontroller (Arduino/RP2040) for testing
311311-- Defined interface: receive operation request, process, return result
312312-- Key deliverables:
313313- - Banked SRAM with address decoding
314314- - Simple operation unit (read/write at minimum, cons/car/cdr stretch goals)
315315- - AN input interface (receive request packets)
316316- - DN output interface (send result packets)
317317- - Test harness: microcontroller sends requests, validates responses
318318-319319-### Phase 1: CM (Control Module) — single PE
320320-- Instruction memory (SRAM)
321321-- Matching store with direct-indexed context slots
322322-- Bump allocator + throttle + generation counters
323323-- 8/16-bit ALU
324324-- Token FIFO (input)
325325-- Token output formatting
326326-- Test with microcontroller injecting tokens, verify matching + execution
327327-328328-### Phase 2: CM + SM pair
329329-- Connect via AN/DN (or shared bus with type routing)
330330-- Load a program over serial via bootstrap microsequencer
331331-- Execute a dataflow graph that uses structure memory
332332-- First real program: fibonacci, small FFT, or similar
333333-334334-### Phase 3: Multi-module
335335-- Second CM, routing network
336336-- Prove cross-PE token routing works
337337-- Demonstrate actual parallel execution speedup
338338-339339-### Phase 4: System
340340-- Expand to 4 CMs + 1-2 SMs
341341-- Full bootstrap/loader microsequencer (serial load, configure routing, start)
342342-- ISR support (compiler-assigned PE with interrupt token injection)
343343-- Performance benchmarking vs period-equivalent CPUs
344344-345345-## Open Questions / Next Steps
346346-347347-1. **SM internal design** — CURRENT FOCUS: banking scheme, operation set, interface protocol
348348-2. **Matching store SRAM addressing** — detailed direct-index + hash fallback scheme
349349-3. **Context slot count per CM** — 4 bits = 16 slots (12KB SRAM each) vs wider
350350-4. **Data width decision** — 14-bit dyadic payload okay, or bump bus to 36 bits?
351351-5. **Instruction encoding** — operation set, format, how wide
352352-6. **Routing network topology** — exact interconnect for multi-CM/SM
353353-7. **Compiler / assembler** — hand-written dataflow asm for v0, assembler that packs token fields
354354-8. **Monadic/dyadic optimisation** — deferred, revisit after v0 matching store works
355355-356356-## Key Papers in Project
357357-358358-- `gurd1985.pdf` — Manchester Dataflow Machine (matching unit details, overflow, pipeline)
359359-- `Dataflow_Machine_Architecture.pdf` — Veen survey (comprehensive overview, matching space analysis)
360360-- `amamiya1982.pdf` — DFM architecture (semi-CAM, structure memory, TTL prototype)
361361-- `17407_17358.pdf` — DFM evaluation (implementation details, benchmarks, VLSI projection)
362362-- `efficienthardwarearchitectureforfastipaddresslookup.pdf` — Pao et al. (binary-trie partitioning, bit-vector parallel search, SRAM pipeline)
363363-- `mclaughlin2005.pdf` — IP lookup survey (comparison of trie vs hash approaches in hardware)
364364-- `HighperformanceIPlookupcircuitusingDDRSDRAM.pdf` — Yang et al. (hash + CAM overflow, DDR burst for multi-bank)
365365-- `NonStrict_Execution_in_Parallel_and_Distributed_C.pdf` — non-strict execution, split-phase memory
366366-- `NATLS219821.pdf` — National Semiconductor 100142 CAM chip (4x4-bit, reference for discrete CAM scale)
367367-- `MOSES071271.pdf` — (in project, not yet examined)
368368-- `yuba1983.pdf` — (in project, not yet examined)
···11-# Dynamic Dataflow CPU — Architecture Notes
22-33-## Project Goals
44-55-- Dynamic dataflow CPU achievable with discrete logic (74-series TTL + SRAM)
66-- Multi-PE design targeting superscalar-equivalent IPC
77-- "Period-plausible" transistor budget: ~25-35K logic transistors + SRAM chips
88- - Comparable to a 68000 or a couple of Z80s in logic complexity
99- - Reference builds for physical scale: Fabian Schuiki's superscalar CPU, James Sharman's pipelined CPU
1010-- Must be able to load and execute a binary over serial without a substantial conventional control core
1111-- Incremental build plan: single PE first, expand to multi-PE
1212-1313-## Key Architectural Decisions
1414-1515-### Execution Model
1616-- **Dynamic dataflow** (tagged-token), not static like the Electron E1
1717-- Compiler performs static PE assignment and routing configuration (E1-like)
1818-- Matching store operates dynamically within each PE for concurrent activations
1919-- This is a hybrid: static routing topology, dynamic operand matching
2020-2121-### Influences / Reference Architectures
2222-- **Manchester Dataflow Machine** (Gurd 1985): pipeline structure, matching unit design, overflow handling
2323-- **DFM / Amamiya 1982**: semi-CAM concept, computational locality, function-instance-based addressing
2424-- **Pao et al. (IP lookup)**: subtree bit-vector parallel search via bitwise AND — useful for collision resolution or routing
2525-- **Electron E1**: compile-time spatial mapping, tile-based PEs, control core for bootstrap
2626-- **Yang et al. (DDR SDRAM IP lookup)**: hash + small CAM for collision overflow
2727-2828-### Data Width
2929-- 8 or 16-bit data words within PEs (TBD, likely 16-bit for practicality)
3030-- Internal token packets are wider (~24-32 bits for local, multi-flit for remote)
3131-- Instruction words will be "chunkier" due to tags/destinations
3232-3333-### Token Packet Format (working sketch)
3434-3535-```
3636-LOCAL TOKEN (single flit):
3737-[PE_id: 3b][local_offset: 8b][context_slot: 4b][port: 1b][data: 8-16b]
3838- = 24-32 bits total
3939-4040-EXTENDED TOKEN (two flits, for cross-cluster / off-machine):
4141-Flit 1: [RESERVED_PE_id: 3b (all 1s)][extended header bits]
4242-Flit 2: [full remote destination + data]
4343-```
4444-4545-- PE_id field with reserved value (all-1s) triggers extended addressing mode
4646-- Remote tokens travel as two flits on the network — no bus locking needed
4747-- Routing nodes optimised for the single-flit common case
4848-- Composable: third flit could carry inter-machine routing
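The single-flit layout above packs into a plain integer. A sketch with 16-bit data (field order, with PE_id in the top bits, is an assumption for illustration; widths follow the sketch, 3 + 8 + 4 + 1 + 16 = 32):

```python
# Pack/unpack for the 32-bit local-token sketch. Field placement is
# illustrative; only the widths come from the notes.
def pack_local(pe_id, offset, ctx, port, data):
    assert pe_id < 8 and offset < 256 and ctx < 16 and port < 2 and data < 65536
    return (pe_id << 29) | (offset << 21) | (ctx << 17) | (port << 16) | data

def unpack_local(word):
    return ((word >> 29) & 0x7,    # PE_id
            (word >> 21) & 0xFF,   # local_offset
            (word >> 17) & 0xF,    # context_slot
            (word >> 16) & 0x1,    # port
            word & 0xFFFF)         # data

EXTENDED_PE = 0b111  # all-1s PE_id: switch to two-flit extended addressing
```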
4949-5050-### Token Routing Network
5151-- **Hierarchical prefix-based routing**, NOT Manchester-style omega network
5252-- Omega networks have fixed latency regardless of distance (bad — "DRAM from the moon")
5353-- Prefix routing gives variable latency: local = 1 hop, cross-cluster = 2-3 hops
5454-- Average latency depends on program locality, which the compiler can optimise
5555-- Each routing node has a small prefix lookup table, configured at program load time
5656-- Top bits of PE_id select cluster, lower bits select within cluster
5757-- Pao's bitwise AND trick potentially useful for routing decisions or small associative lookups at routing nodes
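A minimal model of the per-hop decision, assuming a 3-bit PE_id whose top bit selects the cluster (the exact field split is still open; the port names are invented for the sketch):

```python
# Prefix-routed hop decision: deliver locally if the destination's
# cluster prefix matches this node, otherwise forward toward the uplink.
LOCAL_PORT = "local"

def route(node_cluster, dest_pe_id, uplink="up"):
    cluster = dest_pe_id >> 2              # top bit of PE_id = cluster
    if cluster == node_cluster:
        return (LOCAL_PORT, dest_pe_id & 0b11)  # low bits select PE in cluster
    return (uplink, None)                  # cross-cluster: one more hop
```

This is what gives the variable latency described above: a match on the prefix terminates routing immediately, so local traffic never pays the cross-cluster cost.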
5858-5959-### Matching Store Design (highest-risk component)
6060-- **Primary path: direct-indexed context slots** (Amamiya-style semi-CAM)
6161- - Bump allocator (counter + register) assigns context slot IDs to function activations
6262- - Context slot ID directly addresses a bank of SRAM
6363- - Instruction offset within function body used as direct address within that bank
6464- - Single-cycle matching for the common case — no hashing, no search
6565-- **Fallback path: hash-based matching** for dynamic/overflow cases
6666- - Multiplicative hashing: `(a * K) >> (w - m)` — simple to implement in hardware
6767- - Multi-bank (4-8 banks) checked in parallel for collision tolerance (Manchester-style set-associative)
6868- - Overflow to linked list or dedicated overflow buffer for worst case
6969-- **Compiler-assisted tag assignment**:
7070- - Static-lifetime values get contiguous, dense tags — sequential readout, no hashing
7171- - Dynamic activations get allocated tags via bump allocator
7272- - Potential for hybrid: half of matching store uses precalculated tags, half uses runtime hash
7373-- **Deallocation / reuse**:
7474- - Bump allocator handles allocation trivially
7575- - Deallocation is the hard part
7676- - Throttle mechanism (limit concurrent activations) enables context slot reuse
7777- - For statically-verifiable lifetimes, compiler manages reuse directly
7878- - For dynamic lifetimes, track via secondary lookup (TBD)
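The fallback-path hash, `(a * K) >> (w - m)`, is cheap to model. The constants below are illustrative assumptions (16-bit tags, 64 sets, a golden-ratio-style odd multiplier), not decided values:

```python
# Multiplicative hashing for the matching-store fallback path.
W = 16             # tag width in bits (assumed)
M = 6              # index width: 2**6 = 64 sets (assumed)
K = 40503          # odd multiplier, roughly 2**16 / golden ratio

def ms_index(tag):
    # Keep the low W bits of the product, then take the top M of those,
    # exactly as a W-bit hardware multiplier would.
    return ((tag * K) & ((1 << W) - 1)) >> (W - M)
```

In hardware this is one multiply and a wire selection of the top M product bits; the multi-bank parallel check then probes the 4-8 banks at this index simultaneously.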
7979-8080-### PE Pipeline (5-stage sketch)
8181-8282-```
8383-Stage 1: TOKEN INPUT
8484- - Receive token from network
8585- - Buffer in small FIFO (8-deep, 32-bit)
8686- - ~1K transistors (flip-flops) or use small SRAM
8787-8888-Stage 2: MATCH / BYPASS
8989- - Direct-index into context slot array (common case, single cycle)
9090- - Hash path for dynamic/overflow (multi-cycle)
9191- - Monadic instructions bypass matching entirely
9292- - Estimated: 2-3K transistors + SRAM
9393-9494-Stage 3: INSTRUCTION FETCH
9595- - Use local offset to read from PE's instruction SRAM
9696- - External SRAM chip, so just address generation logic
9797- - ~200 transistors of logic
9898-9999-Stage 4: EXECUTE
100100- - 8/16-bit ALU
101101- - ~500-2000 transistors depending on width and features
102102-103103-Stage 5: TOKEN OUTPUT
104104- - Form result token with routing prefix
105105- - Inject into network
106106- - ~300 transistors
107107-```
108108-109109-Pipeline registers between stages: ~500 transistors
110110-Control logic (state machine, handshaking): ~500-1000 transistors
111111-112112-**Per-PE total: ~5-8K transistors of logic + SRAM chips**
113113-114114-### Transistor Budget Estimate (4-PE system)
115115-116116-| Component | Transistors |
117117-|-----------|------------|
118118-| 4x PE logic | 20-32K |
119119-| Routing network (4 PEs) | 2-3K |
120120-| Bootstrap/loader microsequencer | 1-2K |
121121-| **Total logic** | **~25-35K** |
122122-| SRAM chips (instruction mem, matching stores, token queues) | 8-16 chips |
123123-124124-### Bootstrap / Program Loading
125125-- Hardwired microsequencer (NOT a full CPU)
126126-- Receives serial data, writes to instruction memory and routing tables via dedicated configuration bus
127127-- Config bus is separate from the token network
128128-- ROM-based state machine + UART receiver + bus master interface
129129-- ~20-30 TTL chips estimated
130130-- Issues "start" signal to release token flow
131131-- Alternative: a PE hardwired to run a built-in "loader" program from ROM
132132-133133-### Interrupt Handling
134134-- ISRs are subgraphs in the dataflow program, compiled and mapped to specific PEs like any other code
135135-- Compiler designates which PE(s) handle which interrupts
136136-- Hardware cost: edge detector on I/O pin, gated into token input FIFO of the assigned PE
137137-- Interrupt token injected into FIFO — PE doesn't need special hardware, just sees a token arrive
138138-- Priority: interrupt tokens can jump the FIFO queue (~3 extra chips)
139139-- ISR runs *concurrently* with main program on its reserved PEs — no context switch
140140-- Main program has nodes waiting for "interrupt result" tokens
141141-- Trade-off: reserved ISR PEs sit idle when no interrupts pending
142142-- Scalable: compile-time assignment means you can have multiple ISR PEs for different interrupt sources
143143-144144-### IPC / Performance Expectations
145145-- "Superscalar" is the wrong term for dataflow — there's no single instruction stream
146146-- With 4 PEs and single-cycle matching (common case), peak is 4 ops/clock
147147-- Realistic sustained throughput depends on:
148148- - Network crossing frequency (adds routing latency)
149149- - Hash path hits vs direct index (matching latency)
150150- - Available parallelism in the program
151151-- Parallel workloads (matrix multiply, FFT): near peak
152152-- Sequential/pointer-chasing code: ~0.5-1 ops/clock (still competitive with 6502)
153153-- Key insight: matching store performance is the primary bottleneck, as Manchester discovered
154154-155155-## Open Questions / Next Steps
156156-157157-1. **Matching store SRAM addressing scheme** — detailed design of direct-indexed + hash fallback, including bump allocator hardware
158158-2. **Context slot sizing** — how many concurrent contexts per PE? determines SRAM requirements
159159-3. **Instruction encoding** — what operations, what format, how wide
160160-4. **Routing network topology** — exact interconnect for 4-8 PEs
161161-5. **Compiler / assembler** — even a basic assembler for hand-written dataflow assembly
162162-6. **Throttle mechanism** — how to limit concurrent activations to prevent matching store overflow
163163-7. **Deallocation** — hardware mechanism for freeing context slots when activations complete
164164-8. **v0 milestone** — single PE + loader, load and execute fibonacci or similar over serial
165165-166166-## Key Papers in Project
167167-168168-- `gurd1985.pdf` — Manchester Dataflow Machine (matching unit details, overflow, pipeline)
169169-- `Dataflow_Machine_Architecture.pdf` — Veen survey (comprehensive overview, matching space analysis)
170170-- `amamiya1982.pdf` — DFM architecture (semi-CAM, structure memory, TTL prototype)
171171-- `17407_17358.pdf` — DFM evaluation (implementation details, benchmarks, VLSI projection)
172172-- `efficienthardwarearchitectureforfastipaddresslookup.pdf` — Pao et al. (binary-trie partitioning, bit-vector parallel search, SRAM pipeline)
173173-- `mclaughlin2005.pdf` — IP lookup survey (comparison of trie vs hash approaches in hardware)
174174-- `HighperformanceIPlookupcircuitusingDDRSDRAM.pdf` — Yang et al. (hash + CAM overflow, DDR burst for multi-bank)
175175-- `NonStrict_Execution_in_Parallel_and_Distributed_C.pdf` — non-strict execution, split-phase memory
176176-- `NATLS219821.pdf` — National Semiconductor 100142 CAM chip (4x4-bit, reference for discrete CAM scale)
177177-- `MOSES071271.pdf` — (in project, not yet examined)
178178-- `yuba1983.pdf` — (in project, not yet examined)
design-notes/versions/design-alternatives.md
···11-# Dynamic Dataflow CPU — Design Alternatives & Roads Not Travelled
22-33-Companion document to the architecture docs. Captures rejected, deferred,
44-and superseded approaches, their advantages, disadvantages, and why we went
55-the way we did.
66-77-Updated to reflect decisions from ongoing design discussions.
88-99----
1010-1111-## 1. Routing Network Topology
1212-1313-### Chosen: Hierarchical Prefix-Based Routing (target architecture)
1414-### v0 Implementation: Shared Bus with Type-Based Routing
1515-1616-For 4 PEs + 1-2 SMs + I/O controller, a shared pipelined bus with latches
1717-is sufficient. The type field in the packet header is the primary routing
1818-discriminator. Prefix routing is the target for scaling but doesn't need
1919-to be built until Phase 3+.
2020-2121-### Alternative A: Manchester-Style Omega / Sorting Network
2222-- **How it works**: log2(n) stages of 2x2 routing elements. Every token
2323- traverses all stages. Destination bits are consumed one per stage.
2424-- **Advantages**:
2525- - Maximally general: any-to-any routing in fixed time
2626- - No routing tables to configure — topology IS the routing algorithm
2727- - Well-understood, proven in Manchester hardware
2828-- **Disadvantages**:
2929- - Fixed latency regardless of distance (the "DRAM from the moon" problem)
3030- - Latency grows with PE count even for local traffic
3131- - All tokens pay full traversal cost — devastating for locality-heavy programs
3232- - Hardware grows as n * log2(n) routing elements
3333-- **Why rejected**: our design explicitly exploits compiler-assigned locality.
3434- Paying full network traversal for a token going to the PE next door is
3535- wasteful. Hierarchical routing makes the common case fast.
3636-3737-### Alternative B: Crossbar
3838-- **How it works**: full n*n switch. Any source to any destination in one cycle.
3939-- **Advantages**:
4040- - Minimum latency: everything is 1 hop
4141- - Simple conceptually
4242-- **Disadvantages**:
4343- - Hardware grows as n^2. 4 PEs = 16 crosspoints, fine. 8 PEs = 64.
4444- - Each crosspoint needs a mux + arbiter; it gets expensive fast.
4545- - Contention handling needs buffering or stalling
4646-- **Why rejected**: doesn't scale. Fine for 4 PEs, but we want the architecture
4747- to extend beyond that. Could revisit as a LOCAL interconnect within a
4848- cluster, with hierarchical routing between clusters.
4949-5050-### Alternative C: Ring Bus
5151-- **How it works**: tokens travel around a ring, each node inspects and either
5252- consumes or passes through.
5353-- **Advantages**:
5454- - Dead simple hardware: each node is a register + comparator + mux
5555- - Trivially extensible: add a node, extend the ring
5656-- **Disadvantages**:
5757- - Worst-case latency is n-1 hops
5858- - Average latency grows linearly with PE count
5959- - Bandwidth shared: total ring bandwidth is fixed regardless of PE count
6060-- **Status**: not rejected, just unnecessary at v0 scale. **worth
6161- reconsidering** as an intermediate step between shared bus and full
6262- prefix routing if the system grows to 8-16 PEs.
6363-6464-### Alternative D: Shared Bus (chosen for v0)
6565-- **Advantages**:
6666- - Absolute minimum hardware
6767- - Trivially simple
6868- - With pipelined latches, multiple packets in flight
6969-- **Disadvantages**:
7070- - Bandwidth limited
7171- - Doesn't scale past ~4-8 nodes
7272-- **Status**: v0 physical implementation. The token format is designed for the
7373- prefix-routed future, but the physical wires are a shared bus. The type field
7474- provides a natural decomposition path — CN/AN/DN can be split onto
7575- separate physical paths when contention shows up.
7676-7777----
7878-7979-## 2. Token Format
8080-8181-### Chosen: Type-Tagged 32-bit Tokens (4 types)
8282-- 2-bit type field as primary routing discriminator
8383-- Type 11 subdivided by 2-bit subtype for I/O, config, and future system
8484- management traffic
8585-- See `architecture-overview.md` for full format specification
8686-8787-No changes to alternatives from previous version. The type-11 subtype
8888-scheme was added to handle I/O and config writes without consuming PE
8989-or SM address space.
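The dispatch the routing hardware performs can be sketched as below. Bit positions (type in the top two bits of a 32-bit token, subtype in the next two) are assumptions for illustration; see `architecture-overview.md` for the real layout:

```python
# Classify a token by its 2-bit type field; type 11 is subdivided by a
# 2-bit subtype. Only subtype 01 = config is fixed by the notes; other
# labels here are placeholders.
def classify(token):
    t = (token >> 30) & 0b11
    if t in (0b00, 0b01):
        return "CN"                    # PE-bound compute tokens
    if t == 0b10:
        return "SM"                    # AN/DN structure-memory traffic
    sub = (token >> 28) & 0b11         # type 11: system channel
    return {0b01: "config"}.get(sub, "system")
```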
9090-9191-### Alternative A: Fixed-Field Flat Token
9292-- **Why rejected**: wastes bits on monadic and structure tokens. The type-tagged
9393- approach reclaims 6+ bits. Decoding cost is trivial.
9494-9595-### Alternative B: 36-bit Bus
9696-- **Status**: still the escape hatch if 14-bit dyadic data is too limiting.
9797- Decision deferred.
9898-9999-### Alternative C: Variable-Width Tokens
100100-- **Why rejected**: complexity cost too high, slows the common case. Multi-flit
101101- is used only for type-11 extended operations (rare).
102102-103103----
104104-105105-## 3. Matching Store Architecture
106106-107107-### Chosen: Direct-Indexed Context Slots (Amamiya semi-CAM) + Hash Fallback
108108-No changes. See `pe-design.md` for the detailed design.
109109-110110-### Alternative A: Pure Hashing (Manchester-Style)
111111-- **Why rejected**: <20% memory utilisation, 16 parallel banks per PE,
112112- overflow subsystem. Too much hardware for the benefit. The semi-CAM
113113- approach gives single-cycle matching for the common case.
114114-115115-### Alternative B: Full CAM
116116-- **Why rejected**: discrete CAM chips are tiny (4x4 bits) or expensive.
117117- Can't practically build a matching store out of them at the needed scale.
118118-119119-### Alternative C: Software Matching (in the PE pipeline)
120120-- **Why rejected**: turns every dyadic operation into a multi-cycle search.
121121- Destroys throughput. The whole point is hardware matching.
122122-123123-### FPGA Prototyping (recommended)
124124-Before committing to a TTL matching store, prototype in a small FPGA
125125-(iCE40, etc.). Validate the addressing scheme, test with real token
126126-streams, and measure collision rates. This doesn't compromise the "discrete logic"
127127-goal — it's a prototyping step. **Strongly recommended** before building
128128-boards.
129129-130130----
131131-132132-## 4. I/O Architecture
133133-134134-### Chosen: I/O Controller on Type-11 System Channel
136136-**UPDATED**: the previous design considered SM-mapped I/O (I/O registers at
137137-specific SM addresses). This has been superseded by a dedicated I/O
138138-controller that lives on the type-11 system channel.
139139-140140-See `io-and-bootstrap.md` for full design.
141141-142142-### Alternative A: I/O as SM Bank (superseded)
143143-- **How it works**: an SM bank that isn't memory — it's a UART/SPI/etc.
144144- behind the same AN/DN interface. CM issues type-10 READ/WRITE to
145145- specific SM addresses and gets I/O responses.
146146-- **Advantages**:
147147- - Zero new architecture. CMs already talk to SM banks.
148148- - Simple mental model (memory-mapped I/O, just like everyone else)
149149-- **Disadvantages**:
150150- - Burns an SM_id slot (2-bit field, only 4 banks). real cost.
151151- - Semantic mismatch: SM operations are stateless request/response,
152152- but I/O often needs unsolicited events (UART RX)
153153- - An SM bank can't spontaneously generate tokens — it can only respond
154154- to requests. This means no interrupt equivalent, only polling.
155155- - Shoehorning I/O config (baud rate, etc.) into SM address space is
156156- awkward
157157-- **Why superseded**: the type-11 I/O controller approach gives a free
158158- packet format (not constrained by SM token layout), supports
159159- unsolicited token generation (dataflow-native interrupts), and doesn't
160160- consume SM address space. Strictly better.
161161-162162-### Alternative B: I/O as Dedicated PE
163163-- **How it works**: a "PE" that isn't a real PE — it's an I/O controller
164164- with a CN network interface. receives type-00/01 tokens, interprets
165165- them as I/O commands.
166166-- **Advantages**:
167167- - I/O operations are function calls in the dataflow graph
168168- - Can generate unsolicited tokens (like the type-11 approach)
169169-- **Disadvantages**:
170170- - Consumes a PE address slot (2-bit PE_id, only 4 slots)
171171- - Token format is constrained by CM token layout (context, offset,
172172- port fields repurposed for I/O semantics — awkward fit)
173173-- **Why superseded**: type-11 approach gives the same benefits (network
174174- participant, unsolicited token generation) without consuming a PE
175175- address slot, and with a completely free packet format for I/O.
176176-177177-### Alternative C: Polling via SM
178178-- **How it works**: a PE periodically issues SM reads to check I/O status.
179179- no hardware interrupt mechanism at all.
180180-- **Advantages**:
181181- - Zero additional hardware
182182- - Entirely in the dataflow paradigm (it's just a program)
183183- - Deterministic timing
184184-- **Disadvantages**:
185185- - Latency = polling interval (potentially very high)
186186- - Wastes PE cycles on polling when no event pending
187187-- **Status**: rejected as general-purpose I/O. but **could coexist** with
188188- the type-11 I/O controller for very low-priority status checks.
189189-190190----
191191-192192-## 5. Separate Communication Networks (CN/AN/DN)
193193-194194-### Chosen: Shared Physical Bus for v0, Logically Separate
195195-196196-**UPDATED**: the Amamiya architecture has physically separate CN, AN, and
197197-DN. We're sharing a physical bus for v0 but maintaining the logical
198198-separation via the type field.
199199-200200-The type field provides a clean decomposition path:
201201-- When SM access contention becomes measurable, split type-10 traffic
202202- onto a dedicated AN/DN bus
203203-- CN (types 00/01) and system (type 11) stay on the original bus
204204-- Further splits as needed
206206-This is a topology change, not a protocol change. No module interfaces
207207-change when the bus is split.
208208-209209-### Alternative: Physically Separate from Day One
210210-- **Advantages**:
211211- - No contention between traffic classes
212212- - Closer to Amamiya's proven architecture
213213-- **Disadvantages**:
214214- - 3x the bus wiring, routing logic, and board area for v0
215215- - At 4 PEs, contention is unlikely to be the bottleneck
216216- - Premature optimisation
217217-- **Why deferred**: build it when the measurements say you need it.
218218-219219----
220220-221221-## 6. Interrupt Handling
222222-223223-### Chosen: Unsolicited Token Injection from I/O Controller
225225-**UPDATED**: the previous design had interrupt tokens injected directly into
226226-PE input FIFOs via hardware edge detectors on I/O pins. This is superseded
227227-by the I/O controller model where the controller generates and injects
228228-tokens onto the network.
229229-230230-Advantages over the previous approach:
231231-- No per-PE interrupt hardware needed
232232-- I/O controller centralises all external event handling
233233-- Destination PE is configurable, not hardwired
234234-- Same mechanism works for all I/O devices
235235-236236-See `io-and-bootstrap.md` for the unsolicited token generation model.
237237-238238-### Previous Approach: Direct Interrupt Token Injection
239239-- Edge detector on I/O pin, gated into specific PE's input FIFO
240240-- Compiler designates which PE handles which interrupts
241241-- Hardware cost: edge detector + FIFO priority injection per interrupt
242242- source per PE
243243-- **Why superseded**: works, but ties interrupt handling to specific
244244- physical PE pins. The I/O controller model is more flexible and
245245- doesn't require any per-PE interrupt hardware.
246246-247247-### Alternative: Conventional Control Core for ISR
248248-- **Why rejected**: explicitly a non-goal to depend on a conventional
249249- control core for runtime operations.
250250-251251-### Alternative: Priority Routing for Interrupt Tokens
252252-- **Why rejected for v0**: over-engineered. the I/O controller approach
253253- provides adequate interrupt latency. priority can be added later
254254- (a priority bit + FIFO bypass is ~3 chips per routing node).
255255-256256----
257257-258258-## 7. Bootstrap / Program Loading
259259-260260-### Chosen: Layered Approach (microcontroller -> I/O controller)
262262-**UPDATED**: the previous design specified a dedicated hardwired microsequencer
263263-with a separate config bus. This has been superseded by a layered approach:
264264-265265-- **Phase 0-2**: external microcontroller as test fixture and bootstrap source
266266-- **Phase 4+**: I/O controller handles bootstrap via type-11 config writes
268268-No separate config bus. Bootstrap traffic travels the normal network as
269269-type-11 subtype-01 packets. This eliminates a dedicated bus and means the
270270-bootstrap path is also the runtime reprogramming path.
271271-272272-See `io-and-bootstrap.md` for the full bootstrap sequence.
273273-274274-### Previous Approach: Hardwired Microsequencer + Config Bus (superseded)
275275-- ROM state machine + UART + dedicated config bus
276276-- ~20-30 TTL chips
277277-- **Why superseded**: the type-11 config write mechanism eliminates the
278278- need for a separate config bus. The I/O controller (or external
279279- microcontroller during development) injects config writes onto the
280280- normal network. Simpler architecture, fewer buses, and the same
281281- mechanism enables runtime reprogramming.
282282-283283-### Alternative A: Bootstrap PE (hardwired to run loader from ROM)
284284-- **Status**: deferred but not rejected. The I/O controller bootstrap
285285- model is essentially a simplified version of this — a fixed-function
286286- device that reads from storage and emits config writes. Evolving the
287287- I/O controller toward a full PE with boot ROM is a natural future step.
288288- The architecture doesn't prevent it.
289289-290290-### Alternative B: External Host (6502, Z80, RP2040, etc.)
291291-- **Status**: the RP2040/Arduino IS the external host during Phase 0-2.
292292- It's a development tool, not part of the architecture. The long-term
293293- goal remains self-hosted bootstrap via the I/O controller.
294294-295295----
296296-297297-## 8. Data Width
298298-299299-No changes from previous version. 16-bit data, 32-bit bus, 36-bit escape
300300-hatch if needed.
301301-302302----
303303-304304-## 9. Clocking
305305-306306-### Chosen: Globally Synchronous with Gated Clocks (v0), Async Design Space Preserved
307307-308308-**NEW SECTION**: see `network-and-communication.md` for full details.
309309-310310-Three options were considered:
311311-312312-**Option A (chosen for v0): Globally synchronous, locally gated.**
313313-One master clock, stages stall independently via gated clocks. The simplest
314314-TTL implementation.
316316-**Option B: Mesochronous.** Same frequency, no phase alignment. Dual-clock
317317-FIFOs at boundaries. More complex, and not needed at v0 scale.
318318-319319-**Option C: Fully asynchronous.** No global clock, request/acknowledge
320320-handshaking everywhere. Theoretically ideal for dataflow (fast paths go
321321-fast, slow paths don't hold things up), but designing and debugging async
322322-TTL is painful.
323323-324324-The architecture preserves Option C by mandating ready/valid handshaking
325325-at every inter-module boundary and FIFOs at every domain crossing. under
326326-Option A, FIFOs are simple circular buffers sharing a clock. under Option
327327-C, FIFO internals change to async circuits but the interface signals are
328328-identical. no module redesign required.
329329-330330-The inter-PE network is the highest-value target for early async adoption,
331331-even while PEs themselves stay synchronous.

---

## 10. Miscellaneous Ideas Not Yet Integrated

### SM as Coprocessor for Complex Operations
- SM could handle matrix multiply, FFT butterfly, etc. by embedding
  specialised functional units alongside data.
- Very Amamiya-inspired (he embedded list operators in structure memory).
- Deferred: v0 SM has only read/write/fetch-and-add/CAS.

### 4x4 CAM Chips (100142) for Small Associative Lookups
- Too tiny for matching store (4 words x 4 bits per chip)
- Potentially useful for routing table entries at network nodes (4-8 entries)
- Keep in the parts bin. don't design around them.

### Compile-Time Token Route Scheduling
- Partially static routing: compiler pre-computes common routes, configures
  routing tables. network still handles dynamic routing for runtime-generated
  tokens.
- This is essentially what we've landed on. noted for clarity.

### Instruction Memory as Write-Back Cache
- Future idea: if instruction memory is writable at runtime, could it
  function as a write-back cache for a larger backing store? PE fetches
  function bodies on demand from SM or flash, caches them in local
  instruction SRAM. evicts on capacity pressure.
- Very speculative. would require significant additional hardware (tag
  memory, eviction logic, demand-fetch state machine). probably not
  worth it for v0-v4. but the writable instruction memory path means
  the hardware foundation exists.

---
`design-notes/versions/io-and-bootstrap.md`
# Dynamic Dataflow CPU — I/O & Bootstrap

Covers the type-11 subsystem: I/O controller design, peripheral interface,
bootstrap sequence, and the path from microcontroller-assisted bring-up to
self-hosted boot.

See `architecture-overview.md` for type-11 packet semantics.
See `network-and-communication.md` for how the I/O controller connects to
the bus.

## Type 11 Subtypes

Type 11 is the "system management" channel. the 2-bit subtype field
immediately after the type field discriminates traffic classes:

```
11 + 00: I/O operation — routed to the I/O controller
11 + 01: Extended address / config write — target PE instruction memory,
         routing table, or other config registers
11 + 10: Reserved (future: debug/trace, DMA, performance counters)
11 + 11: Reserved
```

All type-11 traffic is low frequency relative to types 00/01/10. it is
acceptable for decode and handling to take extra cycles.

## I/O Controller

### What It Is

A fixed-function device on the network. NOT a PE — no matching store, no
instruction memory, no ALU. it receives type-11 subtype-00 packets,
interprets them as I/O commands, and responds.

it is also the only network participant that can **spontaneously generate
tokens** without first receiving one. this is how external events (UART
RX, sensor interrupts, timer ticks) enter the dataflow graph.

### Token Format for I/O (subtype 00)

28 bits of payload after [type:2][subtype:2]. suggested allocation:

```
I/O Request (CM -> I/O controller):
[type:2=11][subtype:2=00][device:3][register:4][R/W:1][data:16][pad:4]

I/O Response (I/O controller -> CM):
Repackaged as a normal type 00 or 01 token addressed to the requesting CM.
Return routing must be provided somewhere — either in the request's
padding/data field (for reads), or preconfigured in the I/O controller.
```

**Open question**: return routing for I/O reads. options:
- (a) I/O read requests carry return routing in the data field (same
  pattern as SM READ requests — data field is unused on reads)
- (b) I/O controller has a preconfigured "return to" address per device
  (simpler requests, but less flexible)
- (c) I/O controller always returns to a fixed "I/O result handler" node
  in a designated PE (simplest, but rigid)

option (a) is most consistent with how SM works. likely the right call.
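The suggested layout can be sanity-checked with a small pack/unpack sketch. the field names and widths are the provisional ones above; the helper functions themselves are hypothetical, not part of the design:

```python
# Provisional 32-bit I/O request layout (MSB-first), per the note above:
# [type:2=11][subtype:2=00][device:3][register:4][R/W:1][data:16][pad:4]

IO_TYPE = 0b11      # type field: system traffic
IO_SUBTYPE = 0b00   # subtype 00: I/O operation

def pack_io_request(device: int, register: int, write: bool, data: int) -> int:
    """Pack an I/O request into a 32-bit word, fields MSB-first."""
    assert 0 <= device < 8 and 0 <= register < 16 and 0 <= data < 1 << 16
    word = IO_TYPE
    word = (word << 2) | IO_SUBTYPE
    word = (word << 3) | device
    word = (word << 4) | register
    word = (word << 1) | int(write)
    word = (word << 16) | data
    word = (word << 4) | 0          # pad:4, reserved
    return word

def unpack_io_request(word: int) -> dict:
    """Inverse of pack_io_request: recover the named fields."""
    return {
        "type":     (word >> 30) & 0b11,
        "subtype":  (word >> 28) & 0b11,
        "device":   (word >> 25) & 0b111,
        "register": (word >> 21) & 0b1111,
        "write":    bool((word >> 20) & 1),
        "data":     (word >> 4) & 0xFFFF,
    }
```

a round-trip through these two functions is a quick check that the field widths actually sum to 32 bits.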

### Hardware

```
Network Interface
  (receive type-11 subtype-00 packets)
        |
        v
  [Input FIFO]
        |
        v
  [Subtype Check] -- not subtype 00? --> discard or forward
        |
        v
  [Device/Register Decode] --- EEPROM or small logic
        |
        +---> [UART chip (6850/16550/etc.)]
        |       - TX data register
        |       - RX data register
        |       - Status register
        |       - Baud/config registers
        |
        +---> [future: SPI, GPIO, timer, etc.]
        |
        v
  [Result Formatter] --- constructs type 00/01 return token
        |
        v
  [Output FIFO]
        |
        v
  Network (token injected as type 00/01)
```

Estimated hardware: ~15-25 TTL chips + UART chip. comparable to the
microsequencer it replaces, but architecturally integrated.

### Unsolicited Token Generation (Interrupt Equivalent)

When an external event occurs (e.g., UART receives a byte), the I/O
controller generates a token and injects it onto the network. from the
receiving CM's perspective, data just arrived — exactly like any other
token. no interrupt hardware needed on the CM side.

The destination for unsolicited tokens is preconfigured: either hardcoded
in the I/O controller's EEPROM, or set at bootstrap via a type-11 config
write to the I/O controller itself. "when UART RX fires, send the byte
to PE 2, offset 0x10, context slot 3, port 0."

This is the **dataflow-native interrupt model**: external events are
token sources. they feed into the dataflow graph at designated entry
points. the receiving PE doesn't need to do anything special — it just
sees a token arrive and processes it like any other.

Implications:
- the I/O controller is a **source node** in the dataflow graph
- it breaks the invariant that "tokens are only produced in response to
  other tokens" — external reality leaks in here
- the network must accept tokens from the I/O controller even when no
  request was sent (the I/O controller's output FIFO can fill independently)
- if the destination PE's input FIFO is full, backpressure propagates
  to the I/O controller. UART RX bytes could be lost if the system can't
  keep up. the I/O controller should have a small internal buffer
  (or the UART chip's built-in FIFO handles this).

## Config Writes (subtype 01)

### Purpose

Type-11 subtype-01 packets write to PE instruction memory, routing tables,
or other configuration state. they are the mechanism for:

1. Bootstrap program loading
2. Runtime reprogramming (future)
3. Routing table configuration

### Packet Format

```
Config Write:
[type:2=11][subtype:2=01][target_PE:2][target_addr:10][data:16]

For wider addresses or multi-word writes, use two flits:
Flit 1: [type:2=11][subtype:2=01][target_PE:2][flags:2][addr_hi:8][addr_lo:8][pad:8]
Flit 2: [data:16][...additional fields as needed...]
```
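The single-flit form packs cleanly into 32 bits. a hypothetical packer, using the provisional field widths above (nothing here is a frozen format):

```python
# Provisional single-flit config write layout, per the note above:
# [type:2=11][subtype:2=01][target_PE:2][target_addr:10][data:16]

def pack_config_write(target_pe: int, target_addr: int, data: int) -> int:
    """Pack a config write into a 32-bit word; 2+2+2+10+16 = 32 bits."""
    assert 0 <= target_pe < 4, "target_PE:2 allows 4 PEs at v0"
    assert 0 <= target_addr < 1024 and 0 <= data < 1 << 16
    return (0b11 << 30) | (0b01 << 28) | (target_pe << 26) \
         | (target_addr << 16) | data
```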

**Open question**: exact bit allocation depends on how wide instruction
memory addresses need to be and how wide instruction words are. if
instruction words are wider than 16 bits, config writes are necessarily
multi-flit.

### Routing

Config writes are addressed to a specific PE by the target_PE field. the
routing network delivers them like any other token — type 11 is inspected
by routing nodes only to the extent of "this is not type 00/01/10, forward
toward the target." the target PE recognises the subtype-01 packet and
routes it to the instruction memory write port (see `pe-design.md`).

Routing tables themselves can be written via config writes. the target is
a routing node, not a PE. routing nodes need a small amount of config
write handling: recognise "this config write is for me" (based on node ID)
and update the local prefix table. during bootstrap, routing nodes are in
default mode (fixed-address routing), so config writes reach them reliably
without needing configured routing.

## Bootstrap Sequence

### Development / Early Prototyping

For Phase 0-2, an external microcontroller (RP2040, Arduino) acts as the
bootstrap source. it is NOT part of the architecture — it's a test fixture.

The microcontroller:
1. Formats type-11 subtype-01 packets (config writes)
2. Injects them into the network (via a dedicated injection port or by
   bit-banging the bus interface)
3. Writes instruction words to each PE's instruction memory
4. Optionally writes routing table entries to routing nodes
5. Optionally writes initial SM contents via type-10 packets
6. Injects seed token(s) — type 00/01 packets that kick off execution
7. Releases the bus (goes high-impedance or disconnects)

This lets PE and SM hardware be tested without any of the I/O controller
or bootstrap logic existing. the microcontroller is the bootstrap, the
debug interface, and the test harness all in one.

### Self-Hosted Bootstrap (Phase 4+)

The I/O controller replaces the microcontroller as the bootstrap source:

1. On reset, the I/O controller enters bootstrap mode
2. It reads program data from a connected flash/EEPROM (via SPI or
   parallel interface) or receives it over UART from an external host
3. It formats config write packets (type-11 subtype-01) and injects them
   onto the network
4. Each PE receives config writes and loads its instruction memory
5. Routing tables are configured via config writes to routing nodes
6. I/O controller injects seed token(s) to start execution
7. I/O controller transitions to normal mode (handling I/O requests)

The I/O controller's bootstrap logic is a state machine, likely driven
by a small ROM or EEPROM. it doesn't need to be a general-purpose
processor — it just sequences reads from storage and formats them as
config writes.

### Chicken-and-Egg: Routing During Bootstrap

During bootstrap, routing tables are not yet configured. the network uses
fixed-address default routing (see `network-and-communication.md`):

- Each PE has a unique ID (EEPROM / DIP switches)
- Routing nodes forward by PE_id without consulting tables
- At v0 scale (shared bus), this is trivially true — everything sees
  everything
- At larger scale, default routing must be sufficient to reach all PEs
  from the bootstrap source. this constrains the physical topology
  (bootstrap source must be topologically reachable from all PEs via
  default forwarding).

The I/O controller's own ID and the default routing path to it are
hardwired or EEPROM-configured. it doesn't depend on routing tables.

### Seed Token Injection

After program loading, the I/O controller (or microcontroller) injects
one or more seed tokens to start execution. these are normal type 00/01
tokens addressed to the entry point(s) of the loaded program.

For a simple program with one entry point: one seed token to one PE.
For a program with multiple independent entry points (e.g., main program
+ I/O handler): multiple seed tokens to different PEs.

The seed tokens are part of the program image — the compiler specifies
"to start this program, inject these tokens." the bootstrap loader reads
them from the program image and injects them after loading is complete.

## Layering Summary

The I/O and bootstrap design is explicitly layered for incremental
development:

| Phase | Bootstrap Source | I/O | Network Config |
|-------|------------------|-----|----------------|
| 0-1 | Microcontroller (external) | None | Direct SRAM programming |
| 2 | Microcontroller via type-11 | None | Config writes on bus |
| 3 | Microcontroller via type-11 | Polling via SM (optional) | Config writes |
| 4 | I/O controller (self-hosted) | I/O controller (type 11) | Config writes from I/O controller |

Each phase adds capability without redesigning previous work. the key
enabler is that config writes (type-11 subtype-01) work the same whether
they come from a microcontroller or the I/O controller. the network
doesn't know or care about the source.

## Open Design Questions

1. **I/O return routing** — option (a), (b), or (c) from above?
2. **Unsolicited token destination config** — hardcoded or runtime-
   configurable? if configurable, via what mechanism? (probably a
   type-11 config write to the I/O controller itself)
3. **I/O controller bootstrap ROM** — how big? what's in it? just a
   state machine for "read flash, emit config writes" or something more?
4. **Flash/EEPROM interface** — SPI? parallel? what storage device?
5. **Program image format** — what does the compiler output? a stream of
   (target_PE, address, instruction_word) tuples? plus seed tokens at
   the end?
6. **Multiple I/O devices** — how does the device field in the I/O token
   scale? 3 bits = 8 devices. enough?
7. **I/O controller as bootstrap PE** — at what point (if ever) does it
   make sense to make the I/O controller a full PE with a boot ROM
   instead of fixed-function? probably not for v0-v4, but worth keeping
   in mind architecturally.

---

# Dynamic Dataflow CPU — Network & Communication

Covers the interconnect between CMs, SMs, and the I/O subsystem. Routing
topology, clocking discipline, handshaking protocols, and the scaling path
from shared bus to split networks.

See `architecture-overview.md` for the token format and type field semantics
that drive routing decisions.

## Logical Networks

Three logical dataflow networks (CN, AN, DN) plus the system channel,
distinguished by the type field in the 32-bit token packet:

| Network | Direction | Token Types | Traffic |
|---------|-----------|-------------|---------|
| CN | CM <-> CM | 00 (dyadic), 01 (monadic) | operand tokens between PEs |
| AN | CM -> SM | 10 (structure) | memory operation requests |
| DN | SM -> CM | (results repackaged as 00/01) | memory operation results |
| System | any <-> I/O, any -> any (config) | 11 (system) | I/O, config writes, future debug |

DN traffic is interesting: SM produces results that are repackaged as
type 00 or 01 tokens destined for the requesting CM. so from a routing
perspective, DN results look like CN traffic once they leave the SM. the
SM result formatter handles this — it extracts return routing from the
original request and constructs a properly-typed token.

## Physical Implementation: v0

### Shared Bus with Pipelined Latches

For 4 PEs + 1-2 SMs + I/O controller (~6-7 nodes), all traffic shares a
single physical 32-bit bus. routing nodes inspect the type field and
forward to the appropriate destination:

- Types 00/01: route by PE_id field to destination CM
- Type 10: route by SM_id field to destination SM bank
- Type 11: route to I/O controller (subtype 00) or to target PE's config
  input (subtype 01)
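The per-packet decision a v0 routing node makes reduces to a few comparisons. a rough software model, assuming the fields a node would extract from the 32-bit token (names like `pe_id`/`sm_id` are illustrative, not a fixed header layout):

```python
# Toy model of the v0 routing decision described above. The packet is a
# dict of already-decoded fields; in hardware this is a comparator + mux.

def route(packet: dict) -> str:
    t = packet["type"]
    if t in (0b00, 0b01):                    # operand tokens: route by PE id
        return f"CM{packet['pe_id']}"
    if t == 0b10:                            # structure memory request
        return f"SM{packet['sm_id']}"
    # type 11: subtype 00 -> I/O controller, subtype 01 -> target PE's
    # config input; other subtypes are reserved
    if packet["subtype"] == 0b00:
        return "IO"
    if packet["subtype"] == 0b01:
        return f"CM{packet['pe_id']}.config"
    return "reserved"
```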

Multiple packets can be in flight simultaneously. each hop through a
routing node takes one cycle, and latches at each stage hold the packet.
with 2-3 hop maximum paths, 2-3 packets can be in transit concurrently.

Bus arbitration: if multiple sources want to inject a packet in the same
cycle, priority logic or round-robin selects one and the others assert
backpressure (their output latch stays full, which stalls their pipeline
via the ready/valid protocol).

Hardware estimate: ~2-3K transistors for the routing network at this
scale. each routing node is essentially a comparator on the type/destination
fields + a mux + a latch.

### Scaling Path: Split Networks

When (if) contention becomes measurable, the first split is to separate
the AN/DN from the CN:

- CN carries types 00, 01, and 11 (CM-to-CM + system traffic)
- AN/DN carry type 10 (SM traffic) on a dedicated path

This is a topology change, not a protocol change. no module interfaces
change. routing nodes on each network simply stop seeing the traffic types
that moved to the other network.

Further splits (dedicated type-11 system bus, per-SM AN/DN paths) follow
the same pattern. the type field makes this incrementally decomposable.

### Fixed-Address Bootstrap Routing

During bootstrap, before routing tables are configured, all PEs are
reachable via fixed physical addresses. each PE has a unique ID
(set via EEPROM or DIP switches — see `pe-design.md`). routing nodes
use a hardwired default mode where they route purely by PE ID without
consulting lookup tables.

Two approaches, not mutually exclusive:

**Default routing mode**: each routing node has a "configured" bit.
when unset (power-on default), the node routes by simple PE_id matching —
if the destination is on this node's port, deliver; otherwise, forward.
once routing tables are loaded via type-11 config writes, the
"configured" bit is set and the node switches to table-based routing.
hardware cost: one bit + one mux per routing node.

**Flat addressing**: at 4 PEs on a shared bus, every node can see every
packet anyway. destination PE_id in the packet header is sufficient.
hierarchical prefix routing is irrelevant until the network topology has
multiple levels. for v0, "flat addressing" is just how shared buses work.

## Routing Topology (Multi-PE)

### Hierarchical Prefix-Based Routing

NOT Manchester-style omega network. prefix routing gives variable latency:
local = 1 hop, cross-cluster = 2-3 hops. average latency depends on
program locality, which the compiler can optimise.

- Top bits of PE_id select cluster, lower bits select within cluster
- Each routing node has a small prefix lookup table, configured at
  program load time (via type-11 config writes)
- Pao's bitwise AND trick potentially useful for routing decisions or
  small associative lookups at routing nodes
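A toy model of the prefix lookup at one cluster-level node, assuming a 4-bit PE_id whose top two bits name the cluster. the widths, port names, and table contents are illustrative only — in the real design the table would be loaded via type-11 config writes:

```python
# Sketch of prefix routing at a single cluster node. Local PEs are
# delivered directly by the low bits; cross-cluster traffic is forwarded
# out a port chosen by the cluster prefix.

LOCAL_CLUSTER = 0b01          # this node's own cluster prefix (example)
PREFIX_TABLE = {0b00: "port_west", 0b10: "port_east", 0b11: "port_east"}

def next_hop(pe_id: int) -> str:
    cluster = (pe_id >> 2) & 0b11          # top bits select cluster
    if cluster == LOCAL_CLUSTER:
        return f"local_pe_{pe_id & 0b11}"  # low bits select within cluster
    return PREFIX_TABLE[cluster]           # cross-cluster: forward by prefix
```

the variable latency falls out directly: a local hit is one table-free hop, a cross-cluster packet takes however many forwards the prefix tables chain together.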

This topology doesn't need to be built until Phase 3+. the token format
supports it already. see `design-alternatives.md` for comparison with
omega, crossbar, ring, and shared bus approaches.

## Clocking Discipline

### Design Principle

**Every inter-module boundary communicates via ready/valid handshaking.
no module assumes anything about the timing of the module on the other
side of a FIFO.**

This is the single most important architectural constraint for preserving
future design space. it enables starting with globally synchronous clocking
and evolving toward partially or fully asynchronous operation without
changing any module interfaces.

### v0: Globally Synchronous, Locally Gated (Option A)

One master clock. each pipeline stage only advances when its input FIFO
has data AND its output FIFO has space. the clock to each stage is gated:

```
stage_clock = master_clock AND input_not_empty AND output_not_full
```

Each stage is flip-flop based, everything is referenced to the same edge,
but stages can stall independently. this is the simplest TTL implementation
and sufficient for initial bring-up and testing.

### Future: Fully Asynchronous (Option C)

No global clock. each stage signals "data ready" to the next, which signals
"accepted" back (4-phase or 2-phase handshake). fast paths go fast, slow
paths (SM access, cross-PE routing) take longer without holding anything
else up.

Designing fully async in TTL is painful (hazard-prone, harder to debug),
but the architecture does NOT rule it out, provided the ready/valid
discipline is maintained from the start.

### Where Async Pays Off Most: The Inter-PE Network

Even under Option A (synchronous PEs), the routing network between PEs
benefits from asynchronous handshaking. routing latency is variable
(depends on path length and contention), which is awkward to handle in a
synchronous pipeline. with async handshaking on routing nodes, tokens
propagate at wire speed + gate delay and land in the destination PE's
input FIFO, which synchronises to the local clock.

This is a small, contained piece of async design that buys a lot. it means
the inter-PE network doesn't constrain the PE clock frequency, and adding
routing hops doesn't require slowing the whole system down.

### Concrete Requirements to Preserve Option C

1. **FIFO interfaces are defined as**:
   - Input side: `data_in`, `write_enable`, `full`
   - Output side: `data_out`, `read_enable`, `empty`
   - No clock crossing assumptions in the protocol

2. **Under Option A** (synchronous): both sides share a clock.
   `write_enable` / `read_enable` are gated clock enables. FIFO is a
   simple circular buffer.

3. **Under Option C** (async): FIFO internals become async (gray-code
   pointers or 4-phase handshake). interface signals are identical.
   nothing on either side of the FIFO changes.

4. **Never design a path where module A asserts a signal and module B is
   assumed to see it on the next clock edge without a FIFO or latch in
   between.** If this discipline is maintained, swapping to async later
   is a FIFO-internal change, not an architectural one.

5. **Arbitration interfaces use request/grant, not clock-phase
   assumptions.** This matters for instruction memory arbitration (pipeline
   vs network write — see `pe-design.md`) and shared bus access. a
   synchronous arbiter uses request/grant resolved on a clock edge. an
   async arbiter (Seitz mutual exclusion element) uses request/grant
   resolved by circuit dynamics. the interface is the same.
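The interface contract in requirement 1 can be modelled in a few lines — a plain circular buffer (the Option A internals) whose only externally visible state is `full`/`empty` plus the two enables. depth and data type are arbitrary here; this is a behavioural sketch, not the hardware:

```python
# Minimal synchronous model of the FIFO interface contract above.
# Swapping the internals for an async implementation would change nothing
# a producer or consumer written against this interface can observe.

class Fifo:
    def __init__(self, depth: int):
        self.buf = [None] * depth
        self.depth = depth
        self.rd = self.wr = self.count = 0

    @property
    def full(self) -> bool:
        return self.count == self.depth

    @property
    def empty(self) -> bool:
        return self.count == 0

    def write(self, data) -> bool:
        """write_enable: accepted only when not full (False = backpressure)."""
        if self.full:
            return False
        self.buf[self.wr] = data
        self.wr = (self.wr + 1) % self.depth
        self.count += 1
        return True

    def read(self):
        """read_enable: returns None when empty, else the oldest word."""
        if self.empty:
            return None
        data = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return data
```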

### SM Clock Independence

SM bank access time may differ from the PE pipeline clock. the split-phase
nature of SM access (request on AN, result on DN whenever ready) already
accommodates this. SM can run on its own clock, or at its own speed in an
async design. FIFOs at the AN input and DN output handle the domain
crossing.

## Backpressure

All flow control is via FIFO fullness. when a FIFO is full, the upstream
module stalls:

- PE output FIFO full -> PE pipeline stalls at token output stage
- Routing node output latch full -> upstream routing node holds packet
- SM request FIFO full -> AN stops accepting from CMs
- PE input FIFO full -> network stops delivering to that PE

This propagates backpressure naturally without deadlock, **provided there
are no circular dependencies in the flow graph that require simultaneous
forward progress on multiple paths.** in practice, dataflow graphs are
DAGs (or have cycles broken by the matching store / context slot mechanism),
so this is generally safe. worth verifying per-program, though.

The one risk is the shared bus at v0: if all PE input FIFOs are full and
all PEs are trying to send, you get global stall. this is a capacity
problem (FIFOs too small or too much parallelism for the bus bandwidth),
not a protocol problem. increasing FIFO depth or splitting the bus
resolves it.

## Open Design Questions

1. Exact routing node logic — comparator + mux + latch, or something
   more sophisticated?
2. Bus arbitration policy — round-robin vs priority? priority for type-11
   config traffic during bootstrap?
3. FIFO depth at each boundary — 8-deep at PE input is the current sketch,
   what about routing node latches and SM FIFOs?
4. Async routing node prototype — worth building one async routing node
   early to validate the handshake protocol?

---
`design-notes/versions/pe-design(1).md`
# Dynamic Dataflow CPU — PE (Processing Element) Design

Covers the CM (Control Module) pipeline, matching store, instruction memory,
context slot management, and per-PE identity.

See `architecture-overview.md` for token format and module taxonomy.
See `network-and-communication.md` for how tokens enter/leave the PE.

## Design Philosophy: Static Assignment, Compiler-Driven Sizing

This design diverges significantly from both Manchester and Amamiya in how
PEs are used. Understanding the difference is critical to understanding why
the matching store can be so much smaller here.

**Amamiya DFM (1982/17407 papers):** every PE has ALL function bodies
pre-loaded in instruction memory (8KW, 58 bits/word per PE, identical
contents across all PEs). Function *instances* are dynamically assigned to
PEs at runtime by a CCU (Cluster Control Unit) that picks the least-loaded
PE. The OM (operand matching memory) needs 1024 CAM blocks per PE because
any function can run anywhere, and deep Lisp recursion means many
simultaneous activations. The "semi-CAM" was their solution to making this
affordable — instance name directly addresses a block, then 4-way
set-associative lookup within the block on instruction identifier.

**Manchester (Gurd 1985):** similar story but with hashing instead of
semi-CAM. 16 parallel 64K-token memory banks per PE for set-associative
hash lookup. 1M token capacity matching store. Plus an overflow unit
(initially emulated on the host). The matching unit alone was 16 memory
boards per PE.

Both machines sized their matching stores for worst-case dynamic scheduling
of arbitrary programs. The whole program lives in every PE (or in a single
PE's matching unit), and any activation can land anywhere. That's why
those matching stores are enormous.

**This design:** the compiler statically assigns function bodies (or chunks
of them) to specific PEs. Different PEs have different instruction memory
contents. The compiler knows at compile time which functions run where,
and can calculate maximum concurrent activations per PE. This means:

- Instruction memory is NOT replicated — each PE only holds its assigned
  function bodies. IM can be much smaller.
- The matching store only needs enough context slots for the maximum
  concurrent activations the compiler predicts for that specific PE.
  Not 1024. Probably 16-32.
- No CCU needed for dynamic PE allocation. Scheduling decisions are
  made at compile time.
- The tradeoff is scheduling flexibility — you can't dynamically
  rebalance load at runtime. The compiler must get it roughly right.

### Function Splitting Across PEs

A "function" in the source language does NOT need to map 1:1 to a
contiguous block on one PE. The compiler can split a function body at
any data-dependency boundary. The token network doesn't know or care
whether two instructions are "in the same function" — it just sees tokens
with destinations.

A 40-instruction function body could be split into three chunks of ~13
instructions across three PEs, each chunk fitting in a smaller context
slot. The "function" as the architecture sees it is really "a set of
instructions that share a context slot ID on this PE." The compiler
defines what that grouping means.

This is a powerful lever for keeping context slots small: if a function
body is too big for the slot size, the compiler splits it. The split
introduces inter-PE token traffic (extra network hops), but keeps
per-PE hardware simple. The compiler can optimise the split points to
minimise cross-PE traffic.

**Implication for context slot semantics:** a context slot doesn't mean
"one function activation." It means "one chunk of work sharing a local
operand namespace on this PE." Multiple context slots on different PEs
might collectively represent one function activation. The token's ctx_slot
field scopes operand matching to a local context, nothing more.

**Implication for the compiler:** this architecture actively wants either
small functions or functions distributed across PEs. The compiler is free
to treat any subgraph of the dataflow graph as a "chunk" and assign it to
a PE, regardless of source-level function boundaries. Loop bodies, branch
arms, pipeline stages — all valid chunk boundaries. The grain of
scheduling is the subgraph, not the function.

## PE Identity

Each PE has a unique ID used for routing. Two mechanisms, not mutually
exclusive:

**EEPROM-based**: the instruction decoder EEPROM already contains
per-PE truth tables. The PE ID can be encoded as additional input bits
to the EEPROM, meaning the EEPROM contents are unique per PE but the
circuit board is identical. The instruction decoder "knows" which PE
it is because its EEPROM was burned with that ID.

**DIP switches**: 3-4 switches give 8-16 PE addresses. Better for early
prototyping — reconfigurable without reflashing. Can coexist with the
EEPROM approach (switches provide ID bits that feed into the EEPROM
address lines).

The PE ID is needed in two places:
1. Input token filtering: "is this token addressed to me?"
2. Output token formatting: "set the source PE field" (if result tokens
   carry source info for return routing)
104104-105105-## PE Pipeline (5-stage sketch)
106106-107107-```
108108-Stage 1: TOKEN INPUT
109109- - Receive token from network
110110- - Classify: type 00/01 (normal), type 11 subtype 01 (config write)
111111- - Normal tokens -> pipeline FIFO
112112- - Config writes -> instruction memory write port (stalls pipeline)
113113- - Buffer in small FIFO (8-deep, 32-bit)
114114- - ~1K transistors (flip-flops) or use small SRAM
115115-116116-Stage 2: MATCH / BYPASS
117117- - Type 00 (dyadic): direct-index into context slot array
118118- - Check generation counter: mismatch = stale, discard
119119- - First operand: store in slot, advance to wait state
120120- - Second operand: read partner from slot, both proceed
121121- - Type 01 (monadic): bypass matching entirely, proceed directly
122122- - Single cycle for all cases (no hash path, no CAM search —
123123- direct indexing only, see matching store section below)
124124- - Estimated: ~200-300 transistors + SRAM
125125-126126-Stage 3: INSTRUCTION FETCH
127127- - Use local offset to read from PE's instruction SRAM
128128- - External SRAM chip, so just address generation logic
129129- - ~200 transistors of logic
130130- - NOTE: instruction memory is shared between pipeline reads and
131131- network config writes — see "Instruction Memory" section below
132132-133133-Stage 4: EXECUTE
134134- - 8/16-bit ALU
135135- - ~500-2000 transistors depending on width and features
136136-137137-Stage 5: TOKEN OUTPUT
138138- - Form result token with routing prefix (type, destination PE/SM,
139139- offset, context, etc.)
140140- - Inject into network via output FIFO
141141- - ~300 transistors
142142-```
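Stage 1's classify/route decision reduces to a few bit tests. A minimal Python sketch, assuming the type field sits in the top two bits and the subtype in the next two (field positions are illustrative, not the final token format):

```python
# Stage 1 token classification sketch. Field positions are ASSUMED:
# top two bits = type, next two = subtype.

def classify(token: int, width: int = 32) -> str:
    """Route an incoming token: 'pipeline' for normal tokens,
    'config' for type-11 subtype-01 instruction-memory writes."""
    ttype = (token >> (width - 2)) & 0b11
    subtype = (token >> (width - 4)) & 0b11
    if ttype in (0b00, 0b01):          # dyadic / monadic -> pipeline FIFO
        return "pipeline"
    if ttype == 0b11 and subtype == 0b01:
        return "config"                # IM write port, stalls pipeline
    return "drop"                      # unrecognised: discard (v0 policy)
```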
143143-144144-Pipeline registers between stages: ~500 transistors
145145-Control logic (state machine, handshaking): ~500-1000 transistors
146146-147147-**Per-PE total: ~3-5K transistors of logic + SRAM chips**
148148-149149-(Revised down from earlier 5-8K estimate. The matching stage is dramatically
150150-simpler than originally sketched now that hash fallback is removed from the
151151-primary pipeline. See matching store section.)
152152-153153-## Instruction Memory
154154-155155-### Static Assignment, Per-PE Contents
156156-157157-Unlike Amamiya where every PE has identical IM contents (full program),
158158-each PE here holds only the function bodies (or function chunks) assigned
159159-to it by the compiler. This means:
160160-161161-- IM is smaller per PE (only assigned code, not the whole program)
162162-- Different PEs have different IM contents (loaded at bootstrap)
163163-- The compiler emits a per-PE instruction image as part of the program
164164-165165-### Runtime Writability
166166-167167-Instruction memory is **not** read-only. It is writable from the network
168168-via type-11 subtype-01 (config/extended address) packets. This serves
169169-two purposes:
170170-171171-1. **Bootstrap**: loading programs before execution starts
172172-2. **Runtime reprogramming**: loading new function bodies while other PEs
173173- continue executing (future capability, not needed for v0)
174174-175175-Runtime writability also means instruction memory size is not a hard
176176-architectural limit — if a program needs more code than fits in one PE's
177177-IM, the runtime (or a management PE) could swap function bodies in and
178178-out. Very speculative, but the hardware path exists.
179179-180180-### Implementation
181181-182182-Instruction memory is external SRAM. The PE pipeline reads from it during
183183-Stage 3 (instruction fetch). The network can write to it via config
184184-packets received at Stage 1.
185185-186186-Shared SRAM means arbitration between two users:
187187-- Pipeline reads (instruction fetch): high frequency, performance-critical
188188-- Network writes (config): low frequency, can tolerate delay
189189-190190-**Arbitration approach**: network writes get priority when they arrive
191191-(they're rare and bursty during bootstrap). When a config write is in
192192-progress, the pipeline stalls for one cycle at Stage 3. Hardware cost:
193193-mux on SRAM address/data buses + write-enable gating + stall signal to
194194-pipeline. Roughly 5-8 TTL chips.
195195-196196-**Async-compatible arbitration**: defined as request/grant interface.
197197-Synchronous implementation: priority mux resolved on clock edge. Async
198198-implementation: mutual exclusion element (Seitz arbiter). Interface is
199199-the same in both cases. See `network-and-communication.md` for clocking
200200-discipline.
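A behavioural sketch of the request/grant interface described above — the synchronous priority-mux version, with config writes winning and the pipeline stalling for that cycle (names are illustrative):

```python
# One-cycle arbiter sketch: network config writes get priority,
# pipeline instruction fetches stall when a write is in progress.

def arbitrate(pipeline_req: bool, config_req: bool):
    """Return (grant_pipeline, grant_config, stall_pipeline)."""
    if config_req:                     # rare, bursty during bootstrap
        return (False, True, pipeline_req)
    return (pipeline_req, False, False)
```

The async implementation would replace this function with a Seitz mutual-exclusion element, but the three-signal interface stays the same.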
201201-202202-### EEPROM-Based Instruction Decoding
203203-204204-The instruction decoder can be implemented as an EEPROM acting like a PLD.
205205-Input bits = instruction opcode fields + PE ID bits. Output bits = control
206206-signals for the ALU, matching store, token output formatter, etc.
207207-208208-This gives significant flexibility:
209209-- Instruction set can be changed by reflashing the EEPROM (no board changes)
210210-- Per-PE customisation (different PEs could theoretically have different
211211- instruction subsets, though unlikely for v0)
212212-- The PE ID is "free" — it's just more EEPROM address bits
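The EEPROM-as-decoder idea reduces to a lookup table whose address concatenates PE ID and opcode bits. A sketch with assumed field widths and a hypothetical control word (both placeholders, not a specified encoding):

```python
# EEPROM decoder sketch. PE_ID_BITS, OPCODE_BITS, and the control-word
# value below are illustrative ASSUMPTIONS, not the final format.

PE_ID_BITS, OPCODE_BITS = 4, 6

def eeprom_address(pe_id: int, opcode: int) -> int:
    """The PE ID is 'free': it is simply more EEPROM address bits."""
    assert pe_id < (1 << PE_ID_BITS) and opcode < (1 << OPCODE_BITS)
    return (pe_id << OPCODE_BITS) | opcode

# Burning a per-PE image: identical board, unique ROM contents.
rom = {eeprom_address(pe_id=3, opcode=0x21): 0b1010_0001}  # hypothetical control word
```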
213213-214214-## Matching Store Design
215215-216216-### Why It Can Be Small
217217-218218-The matching store is the highest-risk component in any dataflow machine.
219219-Manchester needed 16 memory boards per PE. Amamiya needed 1024 CAM blocks
220220-(32KW at 43 bits/word) per PE. Both were sized for worst-case dynamic
221221-scheduling of arbitrary programs.
222222-223223-This design avoids that because:
224224-225225-1. **Static PE assignment**: the compiler knows which functions run on
226226- which PE and can calculate maximum concurrent activations per PE.
227227-2. **Function splitting**: the compiler can split large function bodies
228228- across PEs so no single PE needs a huge context slot.
229229-3. **Compiler-controlled slot allocation**: the compiler assigns context
230230- slot IDs at compile time for statically-known activations. Only
231231- genuinely dynamic activations (runtime-determined recursion depth)
232232- need runtime allocation.
233233-234234-The matching store size is therefore a *compiler parameter*, not an
235235-architectural constant. The hardware provides N context slots of M entries
236236-each. The compiler must generate code that fits within those limits,
237237-splitting and scheduling accordingly.
238238-239239-### Architecture: Pure Direct-Indexed Context Slots
240240-241241-No hash fallback. No CAM search. No set-associative lookup.
242242-243243-The matching operation is:
244244-245245-```
246246-SRAM_address = [ctx_slot (from token) : match_entry (from token or instr)]
247247-read SRAM at that address
248248-check generation counter: mismatch = stale, discard token
249249-250250-if occupied bit set:
251251- -> match found, read stored operand, proceed to instruction fetch
252252- -> clear occupied bit
253253-else:
254254- -> no match, write incoming operand, set occupied bit
255255- -> token consumed, advance to next input token
256256-```
257257-258258-Single cycle. Always. The only comparison is the generation counter check
259259-(2-bit XOR, trivial). There is no "miss" path that requires multi-cycle
260260-recovery, because the address is deterministic — the compiler guaranteed
261261-that ctx_slot + match_entry uniquely identifies this matching location.
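The matching operation above can be modelled directly. A behavioural sketch (not the emulator's actual code) using Config B sizes from the next section:

```python
# Direct-indexed match sketch: single-cycle in hardware, no miss path.
# Config B sizes assumed: 16 slots x 32 entries.

SLOTS, ENTRIES = 16, 32
values   = [0] * (SLOTS * ENTRIES)       # main SRAM: 16-bit operands
occupied = [False] * (SLOTS * ENTRIES)   # 1-bit-per-cell bitmap
gen      = [0] * SLOTS                   # 2-bit generation counter per slot

def match(ctx_slot: int, entry: int, token_gen: int, operand: int):
    """Returns ('fire', partner) on a match, ('wait', None) after storing
    the first operand, ('stale', None) when the generation check fails."""
    addr = ctx_slot * ENTRIES + entry    # [ctx_slot : match_entry]
    if token_gen != gen[ctx_slot]:       # 2-bit compare: stale, discard
        return ("stale", None)
    if occupied[addr]:                   # partner waiting: match found
        occupied[addr] = False
        return ("fire", values[addr])
    values[addr], occupied[addr] = operand, True
    return ("wait", None)                # token consumed
```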
262262-263263-Hardware cost:
264264-- One SRAM chip (matching store data)
265265-- Small register file or SRAM (occupied bitmap + port flags)
266266-- Address generation: concatenate ctx_slot bits and match_entry bits (wires)
267267-- Read/write control: occupied bit check + generation compare (~3-4 chips)
268268-- Generation counter storage: 2 bits per context slot (~1 chip for 32 slots)
269269-270270-Total matching logic per PE: **~200-300 transistors + one SRAM chip + bitmap.**
271271-Order of magnitude less than the 2-3K transistors estimated when the
272272-design included a hash fallback path.
273273-274274-### Context Slot Sizing
275275-276276-The slot count (N) vs entry count (M) tradeoff maps to:
277277-- **N (slots)**: how many concurrent activations can this PE handle
278278-- **M (entries per slot)**: how many dyadic instructions per function chunk
279279-280280-Both are compiler-controllable. More slots = more parallelism headroom.
281281-More entries per slot = bigger function bodies without splitting. The
282282-compiler balances these.
283283-284284-**Candidate configurations (targeting clean SRAM utilisation):**
285285-286286-```
287287-Config A: 32 slots x 16 entries = 512 cells
288288- - 9-bit SRAM address (5 ctx + 4 offset)
289289- - 16-bit values: 1KB exactly in one 8Kbit SRAM chip
290290- - Good concurrency headroom, smaller function chunks
291291-292292-Config B: 16 slots x 32 entries = 512 cells
293293- - 9-bit SRAM address (4 ctx + 5 offset)
294294- - 16-bit values: 1KB exactly
295295- - Matches current 4-bit ctx_slot token format
296296- - Fewer concurrent activations, bigger function chunks
297297-298298-Config C: 32 slots x 32 entries = 1024 cells
299299- - 10-bit SRAM address (5 ctx + 5 offset)
300300- - 16-bit values: 2KB, fits in one 16Kbit SRAM chip
301301- - Most headroom in both dimensions
302302- - Probably the sweet spot for SRAM utilisation vs headroom
303303-304304-Config D: 64 slots x 16 entries = 1024 cells
305305- - 10-bit SRAM address (6 ctx + 4 offset)
306306- - 16-bit values: 2KB
307307- - Favours concurrency over function chunk size
308308- - 64 concurrent activations likely overkill for v0 but future-proof
309309-```
310310-311311-**SRAM layout:** store 16-bit operand values in the main SRAM chip
312312-(standard 8-bit-wide chips, 2 bytes per entry, sequential access or
313313-use 16-bit-wide SRAM if available). Occupied flags + port indicators
314314-stored separately:
315315-316316-- **Occupied bitmap**: 1 bit per cell. 512 cells = 64 bytes (trivially
317317- small — a few flip-flops or one tiny SRAM). 1024 cells = 128 bytes.
318318-- **Port indicator**: 1 bit per cell (left/right operand). Same size as
319319- occupied bitmap. Can share the same storage.
320320-- **Generation counters**: 2 bits per *slot* (not per cell). 32 slots =
321321- 8 bytes. Trivial — a small register file or a handful of flip-flops.
322322-323323-This separation means the main SRAM stores only 16-bit values at clean
324324-power-of-two addresses. No awkward 18-bit word widths. The metadata
325325-(occupied, port, gen) is tiny and stored in dedicated fast-access
326326-registers alongside the SRAM.
327327-328328-**Recommendation for v0**: start with Config B (16 slots x 32 entries =
329329-1KB) to match the current 4-bit ctx_slot token field. Upgrade to Config C
330330-(32 x 32 = 2KB, needs 5-bit ctx_slot) if 16 concurrent activations
331331-proves too tight. The physical SRAM chip doesn't change between these
332332-configs — just the address generation logic.
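The config tradeoffs reduce to arithmetic. A small helper (illustrative, not part of the emulator) derives the numbers quoted above:

```python
# Derive address width and main-SRAM size for a (slots x entries) config.
# Assumes cell count is a power of two and 16-bit operand values.

def config(slots: int, entries: int, value_bits: int = 16):
    cells = slots * entries
    addr_bits = cells.bit_length() - 1          # power-of-two cell count
    return {"cells": cells, "addr_bits": addr_bits,
            "bytes": cells * value_bits // 8}
```

Config B: `config(16, 32)` gives 512 cells, 9 address bits, 1KB; Config C: `config(32, 32)` gives 1024 cells, 10 address bits, 2KB — matching the table.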
333333-334334-### Instruction Address vs Matching Store Address
335335-336336-These are NOT the same thing, and this distinction matters:
337337-338338-- **Instruction address** (used in Stage 3): indexes into instruction
339339- memory SRAM. 7-8 bits (128-256 instructions per PE). Used by ALL
340340- token types. This is the "offset" field in the token.
341341-- **Matching store address** (used in Stage 2): indexes into matching
342342- store SRAM. Composed of [ctx_slot : match_entry]. Only used by
343343- dyadic tokens.
344344-345345-The compiler maintains the mapping. For dyadic instructions, the
346346-instruction word in IM includes a "match_entry" field that tells the
347347-hardware which matching store entry corresponds to this instruction.
348348-349349-This means the matching store is dense with respect to dyadic instructions
350350-— no gaps for monadic instructions. A function chunk with 20 instructions,
351351-8 of which are dyadic, uses 8 matching store entries, not 20.
352352-353353-**Simplest v0 approach:** the token carries the instruction memory offset
354354-(for Stage 3). It is tempting to let the instruction word fetched in
355355-Stage 3 supply the match_entry index, but Stage 2 happens BEFORE
356356-Stage 3 — the match must complete before the fetch — so the match_entry
357357-must come from the token (or be derivable from it), not from the
358358-instruction word. This means either:
362362-363363-(a) The token carries both an instruction offset AND a match entry index.
364364- Costs token bits. May require the offset field to be split or the
365365- match_entry to be packed into unused bits.
366366-367367-(b) The match_entry IS the instruction offset, and the instruction memory
368368- is laid out so that the offset of a dyadic instruction is also its
369369- matching store entry within the slot. This works if the compiler
370370- assigns offsets such that dyadic instruction offsets are dense (0, 1,
371371- 2, ...) and monadic instruction offsets are in a separate range.
372372-373373-(c) A small lookup ROM/SRAM alongside the matching store maps instruction
374374- offset -> match_entry. This adds a read before the match SRAM access
375375- (serial, adds latency) or requires a second SRAM port (parallel, adds
376376- hardware).
377377-378378-Option (b) is the simplest if the compiler can make it work. The
379379-instruction memory layout would be: dyadic instructions at offsets 0..M-1,
380380-monadic instructions at offsets M..N-1. The token's offset field directly
381381-indexes both the matching store (for dyadic) and the instruction memory
382382-(for everything). The matching store just doesn't get accessed for offsets
383383->= M (monadic range).
384384-385385-This constrains instruction memory layout — dyadic instructions must be
386386-packed at the low end. But the compiler controls the layout, so this is
387387-achievable.
388388-389389-**Recommendation for v0:** option (b). Dyadic instructions packed at
390390-offsets 0..M-1 in instruction memory, monadic at M..N-1. Token offset
391391-directly serves as both instruction address and (for offsets < M)
392392-matching store entry within the context slot. Clean, no extra bits,
393393-no extra lookup, single cycle. Constraint on the compiler, not on the
394394-hardware.
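Option (b)'s dual-use offset comes down to a single comparison. A sketch under the Config B assumption M = 32:

```python
# Option (b) sketch: dyadic instructions packed at offsets 0..M-1, so the
# token's offset doubles as the matching store entry. M = 32 assumed.

M = 32  # dyadic entries per context slot (Config B)

def needs_match(offset: int) -> bool:
    """Offsets below M are dyadic (Stage 2 matching); offsets >= M are
    monadic and bypass the matching store entirely."""
    return offset < M

def match_entry(offset: int) -> int:
    assert needs_match(offset), "monadic offsets never index the matching store"
    return offset       # identity mapping: no extra token bits, no lookup ROM
```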
395395-396396-### What About Overflow?
397397-398398-If the matching store is full (all slots occupied) or a function body
399399-exceeds M dyadic instructions:
400400-401401-**Compile-time prevention (primary strategy):**
402402-- The compiler knows the slot count and entry count
403403-- It splits functions and schedules activations to fit
404404-- If a program genuinely can't fit (unbounded recursion deeper than N
405405- slots), the compiler inserts throttling code: a token that waits for
406406- a slot to free before allowing the next recursive call
407407-- This is the Amamiya throttle idea, but implemented in software
408408- (compiler-inserted dataflow logic) rather than hardware
409409-410410-**Runtime overflow (safety net):**
411411-- If a token arrives for a full matching store (shouldn't happen with
412412- correct compilation), the PE stalls the input FIFO until a slot frees.
413413- Simplest, safest, most debuggable. If it fires, something is wrong
414414- and stalling surfaces the bug.
415415-416416-**Future: small CAM overflow buffer**
417417-- If runtime overflow becomes a real issue (genuinely unpredictable
418418- recursion depth), a small CAM (4-8 entries using 100142 chips or
419419- similar) per PE could catch overflow tokens
420420-- Sits between input FIFO and SRAM matching store, catches tokens
421421- that don't fit, retries when slots free up
422422-- Not needed for v0. The input FIFO interface doesn't change.
423423-- 100142 chips (4 words x 4 bits) could give a 4-entry overflow buffer
424424- at maybe 6-8 chips per PE. Small but might handle 95% of overflow
425425- cases where a slot frees within a few cycles.
426426-427427-## Context Slot Lifecycle
428428-429429-### Allocation: Bump Allocator
430430-- Counter + register per PE
431431-- On function activation: current counter value = new context slot ID
432432-- Counter increments (wraps around to 0 after max slot)
433433-- On wrap: checks occupied bitmap to find next free slot
434434-- Hardware: binary counter + bitmap register + priority encoder for
435435- free-slot finding. ~8-10 TTL chips.
436436-- Alternative: small FIFO of free slot IDs, populated at init and on
437437- deallocation. Avoids bitmap scan. ~5-8 chips.
438438-439439-### Deallocation
440440-- Compiler inserts explicit "free" instruction on every exit path
441441-- Free instruction clears the slot's occupied bits (all entries in
442442- the slot) and returns the slot ID to the free pool
443443-- Multiple frees are idempotent / harmless
444444-- Freed slots are immediately available for reallocation
445445-446446-### ABA Protection
447447-- 2-bit generation counter per context slot
448448-- Incremented on each reallocation
449449-- Tokens carry the generation they were created under
450450-- On match attempt: if token's generation != slot's current generation,
451451- the token is stale and discarded
452452-- 4 generations before wraparound; stale tokens drain in 2-5 cycles,
453453- so wraparound collision is effectively impossible
454454-- Hardware cost: 2-bit counter + 2-bit comparator per slot. Trivial.
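The allocation, deallocation, and generation mechanics above can be sketched together. A behavioural model (the generation bump is placed at free time, so the next reallocation of a slot sees a new generation — equivalent to bumping on reallocation):

```python
# Bump allocator + idempotent free + 2-bit generation counters.
# N_SLOTS = 16 assumed (Config B). Free-slot-FIFO alternative would
# replace the scan with a queue pop.

N_SLOTS = 16

class SlotAllocator:
    def __init__(self):
        self.next = 0                        # bump counter
        self.free = [True] * N_SLOTS         # free bitmap
        self.gen = [0] * N_SLOTS             # 2-bit generation per slot

    def alloc(self):
        """Return (slot_id, generation), or None when saturated
        (where the hardware throttle would stall allocation)."""
        for i in range(N_SLOTS):             # priority-encoder scan
            slot = (self.next + i) % N_SLOTS
            if self.free[slot]:
                self.free[slot] = False
                self.next = (slot + 1) % N_SLOTS
                return (slot, self.gen[slot])
        return None

    def free_slot(self, slot):
        """Idempotent: compiler inserts FREE on every exit path."""
        if not self.free[slot]:
            self.free[slot] = True
            self.gen[slot] = (self.gen[slot] + 1) & 0b11   # 2-bit wrap
```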
455455-456456-### Throttle
457457-- Saturating counter tracks number of active (occupied) slots per PE
458458-- When counter = max slots, stalls new allocations until a free occurs
459459-- Prevents matching store overflow
460460-- Hardware cost: counter + comparator + gate. ~10 TTL chips.
461461-- With compiler-controlled scheduling, the throttle should rarely fire.
462462- It's a safety net, not a performance mechanism.
463463-464464-## Open Design Questions
465465-466466-1. **Context slot sizing**: Config B (16x32) vs Config C (32x32)?
467467- Depends on realistic concurrent activation counts for target programs.
468468- Need to compile some test programs and measure.
469469-2. **Matching store metadata storage**: flip-flop register file for
470470- occupied/port/gen, or tiny SRAM? Depends on slot count and available
471471- chip count budget per PE.
472472-3. **Instruction memory layout**: dyadic-first packing (option b) seems
473473- clean. Any cases where this constraint causes the compiler grief?
474474-4. **Free slot tracking**: bump allocator + bitmap + priority encoder?
475475- Or free-slot FIFO?
476476-5. **Instruction encoding**: operation set, format, how wide. Not yet
477477- specified. Must be wide enough to hold opcode + destination PE + dest
478478- offset + dest ctx_slot + any literals.
479479-6. **Function splitting heuristics**: how does the compiler decide where
480480- to split? Minimise cross-PE traffic? Balance slot usage across PEs?
481481- Hardware constraints (slot count, entry count) drive it.
482482-7. **Token format ctx_slot field width**: 4 bits (current, 16 slots)
483483- or 5 bits (32 slots, costs one bit from elsewhere)?
484484-485485-## Key References
486486-487487-- `17407_17358.pdf` — DFM evaluation: OM structure (1024 CAM blocks,
488488- 32 words each, 8 entries of 4 words, 4-way set-associative within
489489- entry). Function activation via CCU requesting least-loaded PE, then
490490- getting instance name from target PE's free instance table. IM is
491491- 8KW/PE, identical across all PEs. Critical for understanding why
492492- Amamiya's OM is so large and why ours can be much smaller.
493493-- `gurd1985.pdf` — Manchester matching unit: 16 parallel hash banks,
494494- 64K tokens each, 54-bit comparators, 180ns clock period. Overflow
495495- unit emulated in software. Shows the cost of general-purpose matching.
496496-- `Dataflow_Machine_Architecture.pdf` — Veen survey: matching store
497497- analysis, tag space management, overflow handling across multiple
498498- architectures.
499499-- `amamiya1982.pdf` — Original DFM paper: semi-CAM concept, IM/OM
500500- split, execution control mechanism with associative IM fetch.
501501- Partial function body execution (begin executing when first argument
502502- arrives, don't wait for all arguments).
-560
design-notes/versions/pe-design(2).md
···11-# Dynamic Dataflow CPU — PE (Processing Element) Design
22-33-Covers the CM (Control Module) pipeline, matching store, instruction memory,
44-context slot management, and per-PE identity.
55-66-See `architecture-overview.md` for token format and module taxonomy.
77-See `network-and-communication.md` for how tokens enter/leave the PE.
88-99-## Design Philosophy: Static Assignment, Compiler-Driven Sizing
1010-1111-This design diverges significantly from both Manchester and Amamiya in how
1212-PEs are used. Understanding the difference is critical to understanding why
1313-the matching store can be so much smaller here.
1414-1515-**Amamiya DFM (1982/17407 papers):** every PE has ALL function bodies
1616-pre-loaded in instruction memory (8KW, 58 bits/word per PE, identical
1717-contents across all PEs). Function *instances* are dynamically assigned to
1818-PEs at runtime by a CCU (Cluster Control Unit) that picks the least-loaded
1919-PE. The OM (operand matching memory) needs 1024 CAM blocks per PE because
2020-any function can run anywhere, and deep Lisp recursion means many
2121-simultaneous activations. The "semi-CAM" was their solution to making this
2222-affordable — instance name directly addresses a block, then 4-way
2323-set-associative lookup within the block on instruction identifier.
2424-2525-**Manchester (Gurd 1985):** similar story but with hashing instead of
2626-semi-CAM. 16 parallel 64K-token memory banks per PE for set-associative
2727-hash lookup. 1M token capacity matching store. Plus an overflow unit
2828-(initially emulated on the host). The matching unit alone was 16 memory
2929-boards per PE.
3030-3131-Both machines sized their matching stores for worst-case dynamic scheduling
3232-of arbitrary programs. The whole program lives in every PE (or in a single
3333-PE's matching unit), and any activation can land anywhere. That's why
3434-those matching stores are enormous.
3535-3636-**This design:** the compiler statically assigns function bodies (or chunks
3737-of them) to specific PEs. Different PEs have different instruction memory
3838-contents. The compiler knows at compile time which functions run where,
3939-and can calculate maximum concurrent activations per PE. This means:
4040-4141-- Instruction memory is NOT replicated — each PE only holds its assigned
4242- function bodies. IM can be much smaller.
4343-- The matching store only needs enough context slots for the maximum
4444- concurrent activations the compiler predicts for that specific PE.
4545- Not 1024. Probably 16-32.
4646-- No CCU needed for dynamic PE allocation. Scheduling decisions are
4747- made at compile time.
4848-- The tradeoff is scheduling flexibility — you can't dynamically
4949- rebalance load at runtime. The compiler must get it roughly right.
5050-5151-### Function Splitting Across PEs
5252-5353-A "function" in the source language does NOT need to map 1:1 to a
5454-contiguous block on one PE. The compiler can split a function body at
5555-any data-dependency boundary. The token network doesn't know or care
5656-whether two instructions are "in the same function" — it just sees tokens
5757-with destinations.
5858-5959-A 40-instruction function body could be split into three chunks of ~13
6060-instructions across three PEs, each chunk fitting in a smaller context
6161-slot. The "function" as the architecture sees it is really "a set of
6262-instructions that share a context slot ID on this PE." The compiler
6363-defines what that grouping means.
6464-6565-This is a powerful lever for keeping context slots small: if a function
6666-body is too big for the slot size, the compiler splits it. The split
6767-introduces inter-PE token traffic (extra network hops), but keeps
6868-per-PE hardware simple. The compiler can optimise the split points to
6969-minimise cross-PE traffic.
7070-7171-**Implication for context slot semantics:** a context slot doesn't mean
7272-"one function activation." It means "one chunk of work sharing a local
7373-operand namespace on this PE." Multiple context slots on different PEs
7474-might collectively represent one function activation. The token's ctx_slot
7575-field scopes operand matching to a local context, nothing more.
7676-7777-**Implication for the compiler:** this architecture actively wants either
7878-small functions or functions distributed across PEs. The compiler is free
7979-to treat any subgraph of the dataflow graph as a "chunk" and assign it to
8080-a PE, regardless of source-level function boundaries. Loop bodies, branch
8181-arms, pipeline stages — all valid chunk boundaries. The grain of
8282-scheduling is the subgraph, not the function.
8383-8484-## PE Identity
8585-8686-Each PE has a unique ID used for routing. Two mechanisms, not mutually
8787-exclusive:
8888-8989-**EEPROM-based**: the instruction decoder EEPROM already contains
9090-per-PE truth tables. The PE ID can be encoded as additional input bits
9191-to the EEPROM, meaning the EEPROM contents are unique per PE but the
9292-circuit board is identical. The instruction decoder "knows" which PE
9393-it is because its EEPROM was burned with that ID.
9494-9595-**DIP switches**: 3-4 switches give 8-16 PE addresses. Better for early
9696-prototyping — reconfigurable without reflashing. Can coexist with the
9797-EEPROM approach (switches provide ID bits that feed into the EEPROM
9898-address lines).
9999-100100-The PE ID is needed in two places:
101101-1. Input token filtering: "is this token addressed to me?"
102102-2. Output token formatting: "set the source PE field" (if result tokens
103103- carry source info for return routing)
104104-105105-## PE Pipeline (5-stage sketch)
106106-107107-```
108108-Stage 1: TOKEN INPUT
109109- - Receive token from network
110110- - Classify: type 00/01 (normal), type 11 subtype 01 (config write)
111111- - Normal tokens -> pipeline FIFO
112112- - Config writes -> instruction memory write port (stalls pipeline)
113113- - Buffer in small FIFO (8-deep, 32-bit)
114114- - ~1K transistors (flip-flops) or use small SRAM
115115-116116-Stage 2: MATCH / BYPASS
117117- - Type 00 (dyadic): direct-index into context slot array
118118- - Check generation counter: mismatch = stale, discard
119119- - First operand: store in slot, advance to wait state
120120- - Second operand: read partner from slot, both proceed
121121- - Type 01 (monadic): bypass matching entirely, proceed directly
122122- - Single cycle for all cases (no hash path, no CAM search —
123123- direct indexing only, see matching store section below)
124124- - Estimated: ~200-300 transistors + SRAM
125125-126126-Stage 3: INSTRUCTION FETCH
127127- - Use local offset to read from PE's instruction SRAM
128128- - External SRAM chip, so just address generation logic
129129- - ~200 transistors of logic
130130- - NOTE: instruction memory is shared between pipeline reads and
131131- network config writes — see "Instruction Memory" section below
132132-133133-Stage 4: EXECUTE
134134- - 8/16-bit ALU
135135- - ~500-2000 transistors depending on width and features
136136-137137-Stage 5: TOKEN OUTPUT
138138- - Form result token with routing prefix (type, destination PE/SM,
139139- offset, context, etc.)
140140- - Inject into network via output FIFO
141141- - ~300 transistors
142142-```
143143-144144-Pipeline registers between stages: ~500 transistors
145145-Control logic (state machine, handshaking): ~500-1000 transistors
146146-147147-**Per-PE total: ~3-5K transistors of logic + SRAM chips**
148148-149149-(Revised down from earlier 5-8K estimate. The matching stage is dramatically
150150-simpler than originally sketched now that hash fallback is removed from the
151151-primary pipeline. See matching store section.)
152152-153153-## Instruction Memory
154154-155155-### Static Assignment, Per-PE Contents
156156-157157-Unlike Amamiya where every PE has identical IM contents (full program),
158158-each PE here holds only the function bodies (or function chunks) assigned
159159-to it by the compiler. This means:
160160-161161-- IM is smaller per PE (only assigned code, not the whole program)
162162-- Different PEs have different IM contents (loaded at bootstrap)
163163-- The compiler emits a per-PE instruction image as part of the program
164164-165165-### Runtime Writability
166166-167167-Instruction memory is **not** read-only. It is writable from the network
168168-via type-11 subtype-01 (config/extended address) packets. This serves
169169-two purposes:
170170-171171-1. **Bootstrap**: loading programs before execution starts
172172-2. **Runtime reprogramming**: loading new function bodies while other PEs
173173- continue executing (future capability, not needed for v0)
174174-175175-Runtime writability also means instruction memory size is not a hard
176176-architectural limit — if a program needs more code than fits in one PE's
177177-IM, the runtime (or a management PE) could swap function bodies in and
178178-out. Very speculative, but the hardware path exists.
179179-180180-### Implementation
181181-182182-Instruction memory is external SRAM. The PE pipeline reads from it during
183183-Stage 3 (instruction fetch). The network can write to it via config
184184-packets received at Stage 1.
185185-186186-Shared SRAM means arbitration between two users:
187187-- Pipeline reads (instruction fetch): high frequency, performance-critical
188188-- Network writes (config): low frequency, can tolerate delay
189189-190190-**Arbitration approach**: network writes get priority when they arrive
191191-(they're rare and bursty during bootstrap). When a config write is in
192192-progress, the pipeline stalls for one cycle at Stage 3. Hardware cost:
193193-mux on SRAM address/data buses + write-enable gating + stall signal to
194194-pipeline. Roughly 5-8 TTL chips.
195195-196196-**Async-compatible arbitration**: defined as request/grant interface.
197197-Synchronous implementation: priority mux resolved on clock edge. Async
198198-implementation: mutual exclusion element (Seitz arbiter). Interface is
199199-the same in both cases. See `network-and-communication.md` for clocking
200200-discipline.
201201-202202-### EEPROM-Based Instruction Decoding
203203-204204-The instruction decoder can be implemented as an EEPROM acting like a PLD.
205205-Input bits = instruction opcode fields + PE ID bits. Output bits = control
206206-signals for the ALU, matching store, token output formatter, etc.
207207-208208-This gives significant flexibility:
209209-- Instruction set can be changed by reflashing the EEPROM (no board changes)
210210-- Per-PE customisation (different PEs could theoretically have different
211211- instruction subsets, though unlikely for v0)
212212-- The PE ID is "free" — it's just more EEPROM address bits
213213-214214-## Matching Store Design
215215-216216-### Why It Can Be Small
217217-218218-The matching store is the highest-risk component in any dataflow machine.
219219-Manchester needed 16 memory boards per PE. Amamiya needed 1024 CAM blocks
220220-(32KW at 43 bits/word) per PE. Both were sized for worst-case dynamic
221221-scheduling of arbitrary programs.
222222-223223-This design avoids that because:
224224-225225-1. **Static PE assignment**: the compiler knows which functions run on
226226- which PE and can calculate maximum concurrent activations per PE.
227227-2. **Function splitting**: the compiler can split large function bodies
228228- across PEs so no single PE needs a huge context slot.
229229-3. **Compiler-controlled slot allocation**: the compiler assigns context
230230- slot IDs at compile time for statically-known activations. Only
231231- genuinely dynamic activations (runtime-determined recursion depth)
232232- need runtime allocation.
233233-234234-The matching store size is therefore a *compiler parameter*, not an
235235-architectural constant. The hardware provides N context slots of M entries
236236-each. The compiler must generate code that fits within those limits,
237237-splitting and scheduling accordingly.
238238-239239-### Architecture: Pure Direct-Indexed Context Slots
240240-241241-No hash fallback. No CAM search. No set-associative lookup.
242242-243243-The matching operation is:
244244-245245-```
246246-SRAM_address = [ctx_slot (from token) : match_entry (from token or instr)]
247247-read SRAM at that address
248248-check generation counter: mismatch = stale, discard token
249249-250250-if occupied bit set:
251251- -> match found, read stored operand, proceed to instruction fetch
252252- -> clear occupied bit
253253-else:
254254- -> no match, write incoming operand, set occupied bit
255255- -> token consumed, advance to next input token
256256-```
257257-258258-Single cycle. Always. The only comparison is the generation counter check
259259-(2-bit XOR, trivial). There is no "miss" path that requires multi-cycle
260260-recovery, because the address is deterministic — the compiler guaranteed
261261-that ctx_slot + match_entry uniquely identifies this matching location.
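
The match step above can be modelled in a few lines of Python (class and field names are illustrative, not the emulator's actual API; Config B geometry assumed):

```python
# Sketch of the single-cycle match step: deterministic address, no search.
from dataclasses import dataclass, field

SLOTS, ENTRIES = 16, 32  # Config B

@dataclass
class MatchingStore:
    values: list = field(default_factory=lambda: [0] * (SLOTS * ENTRIES))
    occupied: list = field(default_factory=lambda: [False] * (SLOTS * ENTRIES))
    gen: list = field(default_factory=lambda: [0] * SLOTS)  # 2-bit, per slot

    def present(self, ctx_slot, match_entry, token_gen, operand):
        """Return the partner operand on a match; None if stored or stale."""
        if token_gen != self.gen[ctx_slot]:
            return None                          # stale token: discard
        addr = ctx_slot * ENTRIES + match_entry  # [ctx_slot : match_entry]
        if self.occupied[addr]:
            self.occupied[addr] = False          # match: clear occupied bit
            return self.values[addr]             # proceed to instruction fetch
        self.values[addr] = operand              # first operand: store it
        self.occupied[addr] = True
        return None                              # wait for the partner
```

Note there is no failure path: the address is computed, not searched for.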
262262-263263-Hardware cost:
264264-- One SRAM chip (matching store data)
265265-- Small register file or SRAM (occupied bitmap + port flags)
266266-- Address generation: concatenate ctx_slot bits and match_entry bits (wires)
267267-- Read/write control: occupied bit check + generation compare (~3-4 chips)
268268-- Generation counter storage: 2 bits per context slot (~1 chip for 32 slots)
269269-270270-Total matching logic per PE: **~200-300 transistors + one SRAM chip + bitmap.**
271271-Order of magnitude less than the 2-3K transistors estimated when the
272272-design included a hash fallback path.
273273-274274-### Context Slot Sizing
275275-276276-The slot count (N) vs entry count (M) tradeoff maps to:
277277-- **N (slots)**: how many concurrent activations can this PE handle
278278-- **M (entries per slot)**: how many dyadic instructions per function chunk
279279-280280-Both are compiler-controllable. More slots = more parallelism headroom.
281281-More entries per slot = bigger function bodies without splitting. The
282282-compiler balances these.
283283-284284-**Candidate configurations (targeting clean SRAM utilisation):**
285285-286286-```
287287-Config A: 32 slots x 16 entries = 512 cells
288288- - 9-bit SRAM address (5 ctx + 4 offset)
289289- - 16-bit values: 1KB exactly in one 8Kbit SRAM chip
290290- - Good concurrency headroom, smaller function chunks
291291-292292-Config B: 16 slots x 32 entries = 512 cells
293293- - 9-bit SRAM address (4 ctx + 5 offset)
294294- - 16-bit values: 1KB exactly
295295- - Matches current 4-bit ctx_slot token format
296296- - Fewer concurrent activations, bigger function chunks
297297-298298-Config C: 32 slots x 32 entries = 1024 cells
299299- - 10-bit SRAM address (5 ctx + 5 offset)
300300- - 16-bit values: 2KB, fits in one 16Kbit SRAM chip
301301- - Most headroom in both dimensions
302302- - Probably the sweet spot for SRAM utilisation vs headroom
303303-304304-Config D: 64 slots x 16 entries = 1024 cells
305305- - 10-bit SRAM address (6 ctx + 4 offset)
306306- - 16-bit values: 2KB
307307- - Favours concurrency over function chunk size
308308- - 64 concurrent activations likely overkill for v0 but future-proof
309309-```
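
The figures in the table follow directly from the slot/entry counts; a small sketch to check them (pure arithmetic, assuming power-of-two dimensions and 16-bit values):

```python
# Sketch: derive cell count, SRAM address width, and byte size per config.
def sram_geometry(slots: int, entries: int, value_bits: int = 16):
    cells = slots * entries
    addr_bits = (slots.bit_length() - 1) + (entries.bit_length() - 1)
    return cells, addr_bits, cells * value_bits // 8  # (cells, bits, bytes)

assert sram_geometry(32, 16) == (512, 9, 1024)    # Config A
assert sram_geometry(16, 32) == (512, 9, 1024)    # Config B
assert sram_geometry(32, 32) == (1024, 10, 2048)  # Config C
assert sram_geometry(64, 16) == (1024, 10, 2048)  # Config D
```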
310310-311311-**SRAM layout:** store 16-bit operand values in the main SRAM chip
312312-(standard 8-bit-wide chips, 2 bytes per entry, sequential access or
313313-use 16-bit-wide SRAM if available). Occupied flags + port indicators
314314-stored separately:
315315-316316-- **Occupied bitmap**: 1 bit per cell. 512 cells = 64 bytes (trivially
317317- small — a few flip-flops or one tiny SRAM). 1024 cells = 128 bytes.
318318-- **Port indicator**: 1 bit per cell (left/right operand). Same size as
319319- occupied bitmap. Can share the same storage.
320320-- **Generation counters**: 2 bits per *slot* (not per cell). 32 slots =
321321- 8 bytes. Trivial — a small register file or a handful of flip-flops.
322322-323323-This separation means the main SRAM stores only 16-bit values at clean
324324-power-of-two addresses. No awkward 18-bit word widths. The metadata
325325-(occupied, port, gen) is tiny and stored in dedicated fast-access
326326-registers alongside the SRAM.
327327-328328-**Recommendation for v0**: start with Config B (16 slots x 32 entries =
329329-1KB) to match the current 4-bit ctx_slot token field. Upgrade to Config C
330330-(32 x 32 = 2KB, needs 5-bit ctx_slot) if 16 concurrent activations
331331-proves too tight. The physical SRAM chip doesn't change between these
332332-configs — just the address generation logic.
333333-334334-### Instruction Address vs Matching Store Address
335335-336336-These are NOT the same thing, and this distinction matters:
337337-338338-- **Instruction address** (used in Stage 3): indexes into instruction
339339- memory SRAM. 7-8 bits (128-256 instructions per PE). Used by ALL
340340- token types. This is the "offset" field in the token.
341341-- **Matching store address** (used in Stage 2): indexes into matching
342342- store SRAM. Composed of [ctx_slot : match_entry]. Only used by
343343- dyadic tokens.
344344-345345-The compiler maintains the mapping. For dyadic instructions, the
346346-instruction word in IM includes a "match_entry" field that tells the
347347-hardware which matching store entry corresponds to this instruction.
348348-349349-This means the matching store is dense with respect to dyadic instructions
350350-— no gaps for monadic instructions. A function chunk with 20 instructions,
351351-8 of which are dyadic, uses 8 matching store entries, not 20.
352352-353353-**Simplest v0 approach:** the token carries the instruction memory offset
(for Stage 3). It is tempting to have the instruction word fetched in
Stage 3 carry the match_entry index — but Stage 2 happens BEFORE Stage 3,
so the match_entry must come from the token, not the instruction word.
This means either:
362362-363363-(a) The token carries both an instruction offset AND a match entry index.
364364- Costs token bits. May require the offset field to be split or the
365365- match_entry to be packed into unused bits.
366366-367367-(b) The match_entry IS the instruction offset, and the instruction memory
368368- is laid out so that the offset of a dyadic instruction is also its
369369- matching store entry within the slot. This works if the compiler
370370- assigns offsets such that dyadic instruction offsets are dense (0, 1,
371371- 2, ...) and monadic instruction offsets are in a separate range.
372372-373373-(c) A small lookup ROM/SRAM alongside the matching store maps instruction
374374- offset -> match_entry. This adds a read before the match SRAM access
375375- (serial, adds latency) or requires a second SRAM port (parallel, adds
376376- hardware).
377377-378378-Option (b) is the simplest if the compiler can make it work. The
379379-instruction memory layout would be: dyadic instructions at offsets 0..M-1,
380380-monadic instructions at offsets M..N-1. The token's offset field directly
381381-indexes both the matching store (for dyadic) and the instruction memory
382382-(for everything). The matching store just doesn't get accessed for offsets
383383->= M (monadic range).
384384-385385-This constrains instruction memory layout — dyadic instructions must be
386386-packed at the low end. But the compiler controls the layout, so this is
387387-achievable.
388388-389389-**Recommendation for v0:** option (b). Dyadic instructions packed at
390390-offsets 0..M-1 in instruction memory, monadic at M..N-1. Token offset
391391-directly serves as both instruction address and (for offsets < M)
392392-matching store entry within the context slot. Clean, no extra bits,
393393-no extra lookup, single cycle. Constraint on the compiler, not on the
394394-hardware.
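
The option (b) scheme reduces to one comparison. A sketch, with illustrative names and an assumed dyadic-region size M:

```python
# Sketch of option (b): the token's offset doubles as the matching store
# entry for dyadic instructions; offsets >= M (monadic) bypass matching.
M = 8  # dyadic instructions packed at offsets 0..M-1 (compiler-enforced)

def route_token(offset: int, ctx_slot: int, entries_per_slot: int = 32):
    """Return (needs_match, match_addr) for an incoming token."""
    if offset < M:
        return True, ctx_slot * entries_per_slot + offset
    return False, None  # monadic: straight to instruction fetch

assert route_token(3, ctx_slot=2) == (True, 67)    # 2*32 + 3
assert route_token(12, ctx_slot=2) == (False, None)
```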
395395-396396-### What About Overflow?
397397-398398-If the matching store is full (all slots occupied) or a function body
399399-exceeds M dyadic instructions:
400400-401401-**Compile-time prevention (primary strategy):**
402402-- The compiler knows the slot count and entry count
403403-- It splits functions and schedules activations to fit
404404-- If a program genuinely can't fit (unbounded recursion deeper than N
405405- slots), the compiler inserts throttling code: a token that waits for
406406- a slot to free before allowing the next recursive call
407407-- This is the Amamiya throttle idea, but implemented in software
408408- (compiler-inserted dataflow logic) rather than hardware
409409-410410-**Runtime overflow (safety net):**
411411-- If a token arrives for a full matching store (shouldn't happen with
412412- correct compilation), the PE stalls the input FIFO until a slot frees.
413413- Simplest, safest, most debuggable. If it fires, something is wrong
414414- and stalling surfaces the bug.
415415-416416-**Future: small CAM overflow buffer**
417417-- If runtime overflow becomes a real issue (genuinely unpredictable
418418- recursion depth), a small CAM (4-8 entries using 100142 chips or
419419- similar) per PE could catch overflow tokens
420420-- Sits between input FIFO and SRAM matching store, catches tokens
421421- that don't fit, retries when slots free up
422422-- Not needed for v0. The input FIFO interface doesn't change.
423423-- 100142 chips (4 words x 4 bits) could give a 4-entry overflow buffer
424424- at maybe 6-8 chips per PE. Small but might handle 95% of overflow
425425- cases where a slot frees within a few cycles.
426426-427427-## Context Slot Lifecycle
428428-429429-### Allocation: Bump Allocator
430430-- Counter + register per PE
431431-- On function activation: current counter value = new context slot ID
432432-- Counter increments (wraps around to 0 after max slot)
433433-- On wrap: checks occupied bitmap to find next free slot
434434-- Hardware: binary counter + bitmap register + priority encoder for
435435- free-slot finding. ~8-10 TTL chips.
436436-- Alternative: small FIFO of free slot IDs, populated at init and on
437437- deallocation. Avoids bitmap scan. ~5-8 chips.
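
A behavioural sketch of the bump allocator with wraparound over the occupied bitmap, including the generation bump on reallocation (illustrative model, not the emulator's actual resource class):

```python
# Sketch: bump allocator + occupied bitmap + 2-bit generation per slot.
class SlotAllocator:
    def __init__(self, n_slots: int = 16):
        self.n = n_slots
        self.counter = 0
        self.in_use = [False] * n_slots
        self.gen = [0] * n_slots           # 2-bit generation per slot

    def alloc(self):
        """Scan from the counter for a free slot (the priority-encoder role)."""
        for i in range(self.n):
            slot = (self.counter + i) % self.n
            if not self.in_use[slot]:
                self.in_use[slot] = True
                self.gen[slot] = (self.gen[slot] + 1) & 0b11  # ABA protection
                self.counter = (slot + 1) % self.n
                return slot, self.gen[slot]
        return None                        # full: the throttle stalls here

    def free(self, slot):
        self.in_use[slot] = False          # idempotent by construction
```

The free-slot FIFO alternative replaces the scan loop with a pop from a queue; the interface stays the same.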
438438-439439-### Deallocation
440440-- Compiler inserts explicit "free" instruction on every exit path
441441-- Free instruction clears the slot's occupied bits (all entries in
442442- the slot) and returns the slot ID to the free pool
443443-- Multiple frees are idempotent / harmless
444444-- Freed slots are immediately available for reallocation
445445-446446-### ABA Protection
447447-- 2-bit generation counter per context slot
448448-- Incremented on each reallocation
449449-- Tokens carry the generation they were created under
450450-- On match attempt: if token's generation != slot's current generation,
451451- the token is stale and discarded
452452-- 4 generations before wraparound; stale tokens drain in 2-5 cycles,
453453- so wraparound collision is effectively impossible
454454-- Hardware cost: 2-bit counter + 2-bit comparator per slot. Trivial.
455455-456456-### Throttle
457457-- Saturating counter tracks number of active (occupied) slots per PE
458458-- When counter = max slots, stalls new allocations until a free occurs
459459-- Prevents matching store overflow
460460-- Hardware cost: counter + comparator + gate. ~10 TTL chips.
461461-- With compiler-controlled scheduling, the throttle should rarely fire.
462462- It's a safety net, not a performance mechanism.
463463-464464-## Open Design Questions
465465-466466-1. **Context slot sizing**: Config B (16x32) vs Config C (32x32)?
467467- Depends on realistic concurrent activation counts for target programs.
468468- Need to compile some test programs and measure.
469469-2. **Matching store metadata storage**: flip-flop register file for
470470- occupied/port/gen, or tiny SRAM? Depends on slot count and available
471471- chip count budget per PE.
472472-3. **Instruction memory layout**: dyadic-first packing (option b) seems
473473- clean. Any cases where this constraint causes the compiler grief?
474474-4. **Free slot tracking**: bump allocator + bitmap + priority encoder?
475475- Or free-slot FIFO?
476476-5. **Instruction encoding**: operation set, format, how wide. Not yet
477477- specified. Must be wide enough to hold opcode + destination PE + dest
478478- offset + dest ctx_slot + any literals.
479479-6. **Function splitting heuristics**: how does the compiler decide where
480480- to split? Minimise cross-PE traffic? Balance slot usage across PEs?
481481- Hardware constraints (slot count, entry count) drive it.
482482-7. **Token format ctx_slot field width**: 4 bits (current, 16 slots)
483483- or 5 bits (32 slots, costs one bit from elsewhere)?
484484-485485-## Dynamic Scheduling: Future Capability
486486-487487-The architecture is **policy-agnostic** on whether PE assignment is fully
488488-static (compiler decides everything) or partially dynamic (a scheduler
places activations at runtime). The mechanism — tokens carry destination
PE + ctx_slot, PEs have writable IRAM, matching store is addressed by
ctx_slot — supports either policy.
492492-493493-### Static Assignment (v0)

Compiler decides everything at compile time. Each PE gets specific
function fragments loaded at bootstrap. No runtime decisions about
placement. Simplest, no scheduler hardware or firmware needed.
498498-499499-### Dynamic Scheduling (future)
500500-501501-A CCU-like scheduler (could be firmware on a dedicated PE, a small
502502-fixed-function unit, or distributed logic) decides at runtime where to
503503-place new activations, based on PE load, IRAM contents, etc.
504504-505505-The tension: dynamic scheduling wants **wide IRAM** (so the target PE
506506-already has the function body loaded), while cheap PEs want **narrow
507507-IRAM**. Amamiya resolved this by replicating the entire program into
every PE's IRAM. That's one approach, but it costs a lot of memory.
509509-510510-The middle ground is a **working set model**: keep hot function bodies
511511-loaded, swap cold ones via type-11 config writes when the scheduler
wants to place an activation on a PE that doesn't have the code yet.
This is demand paging for instruction memory.

- **Miss latency**: significant (network round-trip to load code from
  flash/SM/another PE's IRAM). Much worse than Amamiya's "already there."
- **Miss rate**: depends on scheduler affinity policy. If the scheduler
  prefers placing activations on PEs that already have the code, misses
  should be rare. A small "IRAM directory" (which PE has which function
  body loaded) lets the scheduler make this decision cheaply.
- **Coordination**: drain in-flight tokens for the old fragment before
  overwriting IRAM. The throttle stalls new activations for that fragment,
  existing ones complete, then overwrite. A coarse-grained context switch.
524524-525525-The hardware path is already there — writable IRAM + type-11 config
writes + throttle. The missing piece is the scheduler, which is a
software/firmware problem. Nothing in the v0 hardware prevents adding
this later.
529529-530530-### What Changes If You Want Dynamic Scheduling
531531-532532-The main hardware implication: if the same function body might run on
533533-different PEs at different times, the **instruction memory needs to be
534534-large enough to hold a useful working set**, not just one program's
worth of fragments. This argues for bigger IRAM per PE (4Kx8 or 8Kx8
536536-instead of 2Kx8) even if v0 programs don't need it. SRAM is cheap;
537537-leaving headroom costs one chip size bump, not a redesign.

The matching store size is less affected — context slot count is about
concurrency, not code size. 16-32 slots handles most realistic
activation depths regardless of whether assignment is static or dynamic.
542542-543543-## Key References
544544-545545-- `17407_17358.pdf` — DFM evaluation: OM structure (1024 CAM blocks,
546546- 32 words each, 8 entries of 4 words, 4-way set-associative within
547547- entry). Function activation via CCU requesting least-loaded PE, then
548548- getting instance name from target PE's free instance table. IM is
549549- 8KW/PE, identical across all PEs. Critical for understanding why
550550- Amamiya's OM is so large and why ours can be much smaller.
551551-- `gurd1985.pdf` — Manchester matching unit: 16 parallel hash banks,
552552- 64K tokens each, 54-bit comparators, 180ns clock period. Overflow
553553- unit emulated in software. Shows the cost of general-purpose matching.
554554-- `Dataflow_Machine_Architecture.pdf` — Veen survey: matching store
555555- analysis, tag space management, overflow handling across multiple
556556- architectures.
557557-- `amamiya1982.pdf` — Original DFM paper: semi-CAM concept, IM/OM
558558- split, execution control mechanism with associative IM fetch.
559559- Partial function body execution (begin executing when first argument
560560- arrives, don't wait for all arguments).
-212
design-notes/versions/pe-design.md
···11-# Dynamic Dataflow CPU — PE (Processing Element) Design
22-33-Covers the CM (Control Module) pipeline, matching store, instruction memory,
44-context slot management, and per-PE identity.
55-66-See `architecture-overview.md` for token format and module taxonomy.
77-See `network-and-communication.md` for how tokens enter/leave the PE.
88-99-## PE Identity

Each PE has a unique ID used for routing. Two mechanisms, not mutually
exclusive:

**EEPROM-based**: the instruction decoder EEPROM already contains
per-PE truth tables. The PE ID can be encoded as additional input bits
to the EEPROM, meaning the EEPROM contents are unique per PE but the
circuit board is identical. The instruction decoder "knows" which PE
it is because its EEPROM was burned with that ID.

**DIP switches**: 3-4 switches give 8-16 PE addresses. Better for early
prototyping — reconfigurable without reflashing. Can coexist with the
EEPROM approach (switches provide ID bits that feed into the EEPROM
address lines).

The PE ID is needed in two places:
1. Input token filtering: "is this token addressed to me?"
2. Output token formatting: "set the source PE field" (if result tokens
   carry source info for return routing)
2929-3030-## PE Pipeline (5-stage sketch)
3131-3232-```
3333-Stage 1: TOKEN INPUT
3434- - Receive token from network
3535- - Classify: type 00/01 (normal), type 11 subtype 01 (config write)
3636- - Normal tokens -> pipeline FIFO
3737- - Config writes -> instruction memory write port (stalls pipeline)
3838- - Buffer in small FIFO (8-deep, 32-bit)
3939- - ~1K transistors (flip-flops) or use small SRAM
4040-4141-Stage 2: MATCH / BYPASS
4242- - Type 00 (dyadic): direct-index into context slot array
4343- - Check generation counter: mismatch = stale, discard
4444- - First operand: store in slot, advance to wait state
4545- - Second operand: read partner from slot, both proceed
4646- - Type 01 (monadic): bypass matching entirely, proceed directly
4747- - Common case (direct index): single cycle
4848- - Hash fallback path for dynamic/overflow: multi-cycle
4949- - Estimated: 2-3K transistors + SRAM
5050-5151-Stage 3: INSTRUCTION FETCH
5252- - Use local offset to read from PE's instruction SRAM
5353- - External SRAM chip, so just address generation logic
5454- - ~200 transistors of logic
5555- - NOTE: instruction memory is shared between pipeline reads and
5656- network config writes — see "Instruction Memory" section below
5757-5858-Stage 4: EXECUTE
5959- - 8/16-bit ALU
6060- - ~500-2000 transistors depending on width and features
6161-6262-Stage 5: TOKEN OUTPUT
6363- - Form result token with routing prefix (type, destination PE/SM,
6464- offset, context, etc.)
6565- - Inject into network via output FIFO
6666- - ~300 transistors
6767-```
6868-6969-Pipeline registers between stages: ~500 transistors
7070-Control logic (state machine, handshaking): ~500-1000 transistors
7171-7272-**Per-PE total: ~5-8K transistors of logic + SRAM chips**
7373-7474-## Instruction Memory
7575-7676-### Runtime Writability

Instruction memory is **not** read-only. It is writable from the network
via type-11 subtype-01 (config/extended address) packets. This serves
two purposes:
8181-8282-1. **Bootstrap**: loading programs before execution starts
8383-2. **Runtime reprogramming**: loading new function bodies while other PEs
8484- continue executing (future capability, not needed for v0)
8585-8686-### Implementation

Instruction memory is external SRAM. The PE pipeline reads from it during
Stage 3 (instruction fetch). The network can write to it via config
packets received at Stage 1.
9191-9292-Shared SRAM means arbitration between two users:
9393-- Pipeline reads (instruction fetch): high frequency, performance-critical
9494-- Network writes (config): low frequency, can tolerate delay

**Arbitration approach**: network writes get priority when they arrive
(they're rare, and bursty during bootstrap). When a config write is in
progress, the pipeline stalls for one cycle at Stage 3. Hardware cost:
mux on SRAM address/data buses + write-enable gating + stall signal to
the pipeline.

**Async-compatible arbitration**: defined as a request/grant interface.
Synchronous implementation: priority mux resolved on clock edge. Async
implementation: mutual exclusion element (Seitz arbiter). The interface is
the same in both cases. See `network-and-communication.md` for clocking
discipline.
107107-108108-### EEPROM-Based Instruction Decoding

The instruction decoder can be implemented as an EEPROM acting like a PLD.
Input bits = instruction opcode fields + PE ID bits. Output bits = control
signals for the ALU, matching store, token output formatter, etc.

This gives significant flexibility:
- Instruction set can be changed by reflashing the EEPROM (no board changes)
- Per-PE customisation (different PEs could theoretically have different
  instruction subsets, though this is unlikely for v0)
- The PE ID is "free" — it's just more EEPROM address bits
119119-120120-## Context Slot Lifecycle

See `architecture-overview.md` for the high-level description. Detailed
hardware design below.
124124-125125-### Allocation: Bump Allocator
126126-- Counter + register per PE
127127-- On function activation: current counter value = new context slot ID
128128-- Counter increments
129129-- Hardware: binary counter + output register. ~5 TTL chips.
130130-131131-### Deallocation
132132-- Compiler inserts explicit "free" instruction on every exit path
133133-- Free instruction resets the slot's "occupied" bit
134134-- Multiple frees are idempotent / harmless
135135-- Freed slots are available for reuse by the bump allocator
136136- (allocator wraps around and checks "occupied" bits, or a free list
137137- is maintained — TBD)
138138-139139-### ABA Protection
140140-- 2-bit generation counter per context slot
141141-- Incremented on each reallocation
142142-- Tokens carry the generation they were created under
143143-- On match attempt: if token's generation != slot's current generation,
144144- the token is stale and discarded
145145-- 4 generations before wraparound; stale tokens drain in 2-5 cycles,
146146- so wraparound collision is effectively impossible
- Hardware cost: 2-bit counter + 2-bit comparator per slot. Small.
148148-149149-### Throttle
150150-- Saturating counter tracks number of active (occupied) slots per PE
151151-- When counter = max slots, stalls new allocations until a free occurs
152152-- Prevents matching store overflow
153153-- Hardware cost: counter + comparator + gate. ~10 TTL chips.
154154-155155-## Matching Store Design (highest-risk component)
156156-157157-### Primary Path: Direct-Indexed Context Slots (Amamiya semi-CAM)
158158-159159-- Bump allocator assigns context slot IDs to function activations
160160-- Context slot ID directly addresses a bank of SRAM
161161-- Instruction offset within function body used as direct address within
162162- that bank
163163-- **Single-cycle matching for the common case** — no hashing, no search
- This is the critical performance path. If this works well, the PE is
  competitive. If it doesn't, nothing else matters.
166166-167167-### Fallback Path: Hash-Based Matching
168168-169169-For dynamic or overflow cases where direct indexing doesn't apply:
170170-171171-- Multiplicative hashing: `(a * K) >> (w - m)` — simple to implement
172172- in hardware (shift register + adder chain, or lookup table)
173173-- Multi-bank (4-8 banks) checked in parallel for collision tolerance
174174- (Manchester-style set-associative)
175175-- Overflow to linked list or dedicated overflow buffer for worst case
176176-- This path is multi-cycle — acceptable because it's the uncommon case
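
The multiplicative hash named above is a one-liner to model; the constant and widths here are illustrative (Knuth's golden-ratio multiplier scaled to 16 bits):

```python
# Sketch of h = (a * K) >> (w - m): w = tag width, m = log2(bucket count).
W = 16                 # tag width in bits
M = 6                  # 2^6 = 64 buckets per bank
K = 40503              # odd multiplier ~ 2^16 / golden ratio

def mhash(tag: int) -> int:
    return ((tag * K) & ((1 << W) - 1)) >> (W - M)
```

In hardware the multiply is a shift-and-add chain (or a lookup table) and the shift is just wiring, which is why this function is attractive for a TTL-era build.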
177177-178178-### Compiler-Assisted Tag Assignment
179179-180180-- Static-lifetime values get contiguous, dense tags — sequential readout,
181181- no hashing
182182-- Dynamic activations get allocated tags via bump allocator
183183-- Potential for hybrid: half of matching store uses precalculated tags,
184184- half uses runtime hash
185185-186186-### Monadic/Dyadic Optimisation (deferred)
187187-188188-- Compiler assigns matching store indices only to dyadic nodes
189189-- Monadic nodes bypass matching, don't consume matching store cells
190190-- Requires indirection: matching store cell includes instruction address
191191- pointer
192192-- Cell width increases (~8 bits for instr_addr) but cell count decreases
193193- (~60% fewer)
194194-- local_offset in token = matching store index, NOT instruction address
195195-- **Deferred for v0**: simpler to have local_offset = instruction address
196196- = matching store address
197197-198198-## Open Design Questions

1. **Context slot count per CM** — 4 bits = 16 slots. Each slot needs
   enough SRAM to hold one operand per instruction in the function body.
   If function bodies are up to 64 instructions, each slot is 64 x 16-bit
   = 128 bytes. 16 slots = 2KB SRAM for the matching store. Is 16 enough?
2. **Free slot tracking** — bump allocator with wraparound + occupied bits?
   Or an explicit free list (small FIFO of freed slot IDs)?
3. **Hash fallback** — how many banks? What hash function exactly? Worth
   prototyping in FPGA first (see `design-alternatives.md`)?
4. **Instruction encoding** — operation set, format, how wide. This
   determines Stage 3 and Stage 4 design. Not yet specified.
5. **Instruction memory write protocol** — exact handshake between
   "config write arrived" and "pipeline stalled, writing SRAM." Needs
   to be fully specified before building.
-208
design-notes/versions/sm-design.md
···11-# Dynamic Dataflow CPU — SM (Structure Memory) Design
22-33-Covers the SM interface protocol, operation set, banking scheme, address
44-space extension, and hardware architecture.
55-66-See `architecture-overview.md` for module taxonomy and token format.
77-See `network-and-communication.md` for how SM connects to the bus.
88-99-## Role

SM stores structured data (arrays, lists, heap) and performs operations on
it. It is NOT used for I/O mapping — I/O lives in the type-11 subsystem
(see `io-and-bootstrap.md`).

SM is a pure data store with embedded functional units for atomic operations.
From a CM's perspective: send a type-10 request, get a result token back
eventually. Split-phase, asynchronous relative to the requesting CM.
1818-1919-## Interface Protocol
2020-2121-Stateless request handling: the request token carries its own return routing
2222-info in the bits that are unused by that operation type. SM never maintains
2323-pending-request state — result packets are self-addressed.
2424-2525-### Request Formats (type 10, received on AN)
2626-2727-```
2828-READ request (data field repurposed for return routing):
2929-[type:2][SM_id:2][op:3][address:9][ret_CM:2][ret_offset:8][ret_ctx:4][ret_port:1][pad:1]
3030-3131-WRITE request (data field carries write data, no response needed):
3232-[type:2][SM_id:2][op:3][address:9][data:16]
3333-3434-READ_INC / READ_DEC (same as READ format — return routing in data field):
3535-[type:2][SM_id:2][op:3][address:9][ret_CM:2][ret_offset:8][ret_ctx:4][ret_port:1][pad:1]
3636-3737-CAS — compare-and-swap (two-flit operation):
3838-Flit 1: [type:2][SM_id:2][op:3][address:9][expected_value:16]
3939-Flit 2: [new_value:16][ret_CM:2][ret_offset:8][ret_ctx:4][ret_port:1][pad:1]
4040-```
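
The READ layout above can be checked by packing it MSB-first (helper name is illustrative; field order and widths are taken from the format table):

```python
# Sketch: pack a READ request into the 32-bit format, MSB-first.
def pack_read(sm_id, op, address, ret_cm, ret_offset, ret_ctx, ret_port):
    fields = [(0b10, 2), (sm_id, 2), (op, 3), (address, 9),
              (ret_cm, 2), (ret_offset, 8), (ret_ctx, 4),
              (ret_port, 1), (0, 1)]               # type=10, pad=0
    word = 0
    for value, width in fields:
        word = (word << width) | (value & ((1 << width) - 1))
    return word

req = pack_read(sm_id=1, op=0b000, address=42, ret_cm=2,
                ret_offset=17, ret_ctx=3, ret_port=1)
assert req >> 30 == 0b10           # type field in the top two bits
assert (req >> 16) & 0x1FF == 42   # address field recovers cleanly
```

The widths sum to exactly 32 bits, which is the check that matters when juggling these formats.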
4141-4242-### Result Format (on DN, repackaged as type 00 or 01)
4343-4444-SM extracts return routing from the request and constructs a normal token:
4545-4646-```
4747-Result -> type 00 (dyadic) or type 01 (monadic) token:
4848-[type:2][ret_CM:2][ret_ctx:4][gen:?][ret_offset:7/8][ret_port:1][fetched_data:14/20]
4949-```

The requesting CM specified where this result should land (which context
slot, which offset, which port). SM just repackages. The result looks
like any other token arriving at the CM — the CM doesn't know or care
that it came from SM.

**Open question**: the return routing in READ requests carries ret_ctx (4 bits)
but not gen (2 bits). The result token needs gen if it's type 00 (dyadic).
Either: (a) SM result tokens are always monadic (type 01, no gen needed),
or (b) the ret_ctx field is widened to include gen bits (eating into
padding), or (c) the requesting CM stores the gen locally and the result
matches without it. Option (a) is simplest — SM results bypass matching
and go straight to instruction fetch. This means SM results always feed
monadic instruction inputs.
6464-6565-## Operation Set (3-bit opcode, 8 slots)
6666-6767-```
6868-000: READ — read address, return data via DN
6969-001: WRITE — write data to address (no DN response)
7070-010: READ_INC — atomic fetch-and-add(+1), return old value
7171-011: READ_DEC — atomic fetch-and-add(-1), return old value
7272-100: CAS — compare-and-swap (two-flit), return old value + success bit
7373-101: ALLOC — (future) allocate N cells, return base address
7474-110: FREE — (future) mark cells as available
7575-111: RESERVED
7676-```

READ_INC / READ_DEC are fetch-and-add primitives. They give atomic pointer
operations and reference counting without dedicated refcount hardware. The CM
checks the returned value for zero (refcount exhausted) using its normal ALU.

CAS is the general-purpose atomic primitive. Two-flit: the first flit carries
the expected value, the second flit carries the new value + return routing.
SM compares memory contents with expected, swaps if they match, and returns
the old value either way. Success/fail can be inferred by comparing the
returned old value with expected (the CM does this with its ALU).
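
The SM-side semantics of the atomics fit in a few lines (16-bit cells assumed; the CM-side success check is the final comparison):

```python
# Sketch: SM-side semantics for READ_INC/READ_DEC and CAS.
MASK = 0xFFFF  # 16-bit cells

def fetch_add(mem: list, addr: int, delta: int) -> int:
    old = mem[addr]
    mem[addr] = (old + delta) & MASK   # READ_INC: +1, READ_DEC: -1
    return old                         # old value returned via DN

def cas(mem: list, addr: int, expected: int, new: int) -> int:
    old = mem[addr]
    if old == expected:
        mem[addr] = new & MASK
    return old                         # returned either way; CM compares

mem = [0] * 512
assert fetch_add(mem, 7, +1) == 0 and mem[7] == 1
assert cas(mem, 7, expected=1, new=99) == 1 and mem[7] == 99
assert cas(mem, 7, expected=1, new=5) == 99 and mem[7] == 99  # failed swap
```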

ALLOC / FREE are placeholders for heap management, deferred to post-v0. They
could be implemented as firmware (a small state machine in the SM that
manages a free list) or as software (the CM program manages allocation
using READ/WRITE/CAS).
9292-9393-## Hardware Architecture
9494-9595-```
9696-Input Interface Output Interface
9797- (receive type-10 request) (send result as type 00/01)
9898- | ^
9999- v |
100100- [Request FIFO] [Result FIFO]
101101- | ^
102102- v |
103103- [Op Decoder]----+ [Result Formatter]
104104- | | ^
105105- v v |
106106- [Addr Decode] [ALU for inc/dec/cas] [Bank Read Data]
107107- | | ^
108108- v v |
109109- [SRAM Bank 0] [SRAM Bank 1] ... [SRAM Bank N]
110110-```
### Banking

- Start with 2 banks (1 address bit selects bank) for v0
- 9-bit address = 512 cells per SM = 1KB at 16-bit data width
- Each bank is one SRAM chip with room to spare
- Banking allows pipelining: one bank can be reading while another is
  being written (for RMW ops, or overlapping independent requests)
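The bank-select split can be sketched as follows. Using the low address bit is an assumption: it interleaves consecutive addresses across banks, so sequential accesses alternate banks and can overlap in the pipeline; selecting on a high bit would block-partition the address space instead.

```python
def bank_select(addr: int, n_bank_bits: int = 1) -> tuple[int, int]:
    """Split a 9-bit cell address into (bank, offset within bank).

    Low-bit interleaving assumed; v0 has 2 banks (n_bank_bits=1).
    """
    bank = addr & ((1 << n_bank_bits) - 1)   # low bit(s) pick the bank
    offset = addr >> n_bank_bits             # rest index within the bank
    return bank, offset
```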
### Internal Components

**Request FIFO**: buffers incoming type-10 packets. Depth TBD (4-8 deep is
probably sufficient for v0). Handles bursty traffic from multiple CMs.

**Op decoder**: extracts the opcode and determines:
- Read, write, or RMW?
- One-flit or two-flit? (CAS is two-flit)
- Does it need a DN response?
- How to pack the result?

**Address decode**: selects the SRAM bank from the address bits.
**ALU**: minimal — increment, decrement, compare. NOT a full ALU, just
enough for the atomic operations. Hardware cost: 16-bit incrementer +
16-bit comparator + mux, roughly 10-15 TTL chips.

**Result formatter**: extracts the return routing from the original request
(ret_CM, ret_offset, ret_ctx, ret_port), combines it with the read data,
and constructs a type 00/01 token. This is where the SM-to-DN format
conversion happens.
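The formatter's job is simple enough to state in a few lines. In this sketch the field names follow the request's return-routing fields; the `ResultToken` class, its bit widths, and the monadic (type 01) choice are assumptions pending the open question below.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResultToken:
    """A DN-format (type 00/01) result.  Packing into wire bits is
    left abstract; the encoding is specified elsewhere in the doc."""
    token_type: int  # 0b01 monadic assumed (see open questions)
    dest_cm: int
    dest_offset: int
    dest_ctx: int
    dest_port: int
    data: int

def format_result(req: dict, read_data: int) -> ResultToken:
    """What the result formatter does: copy the return routing from
    the original request and attach the bank read data."""
    return ResultToken(
        token_type=0b01,
        dest_cm=req["ret_CM"],
        dest_offset=req["ret_offset"],
        dest_ctx=req["ret_ctx"],
        dest_port=req["ret_port"],
        data=read_data,
    )
```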
## Address Space Extension

The 9-bit address in the compact structure token (type 10) gives only
512 cells per SM. Three mechanisms extend it:
### 1. Page Register (recommended for v0)

- SM has a writable config register: "page base" (8-16 bits)
- The 9-bit token address is treated as an offset, added to the page base
- Gives up to 64K+ addressable cells per SM
- CM sets the page with a WRITE to a reserved config address before
  issuing a burst of reads/writes to a region
- Hardware cost: ~3 chips (latch for page register + adder)
- Programming model: familiar bank-switching, like 8-bit micros
- Tradeoff: page switch costs one extra token; compiler batches accesses
  to same page to amortise
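The translation itself is one adder. A plain add (cell-granular page base) is assumed in this sketch; a shifted, page-aligned base register would work equally well with a slightly different adder hookup.

```python
OFFSET_BITS = 9  # width of the compact token's address field

def effective_address(page_base: int, token_addr: int) -> int:
    """Page-register translation: the 9-bit token address is an
    offset added to the page base (plain-add scheme assumed)."""
    assert 0 <= token_addr < (1 << OFFSET_BITS)
    return page_base + token_addr
```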
### 2. Banking as Implicit Address Bits

- SM_id field (2 bits) gives 4 SMs = 4 x 512 = 2K cells system-wide
- Not contiguous from a programming perspective, but the compiler can
  distribute data structures across SMs for both capacity and parallelism
- Essentially free — already in the token format
- Combine with page registers for 4 x 64K = 256K cells system-wide
### 3. Extended Structure Tokens (via type 11)

- Use type-11 (system) packets with a structure-extended subtype for
  structure ops needing wide addresses
- Full 16-24 bit address space, at the cost of 2-cycle token transmission
- Use for: large heap, external RAM chip
- Compact type-10 tokens remain the fast path for common/local accesses
### Practical Address Space with All Three Combined

- Fast path (type 10 + page register): 64K per SM, single-flit
- Medium path (type 10 across SMs): 4 x 64K = 256K, single-flit
- Slow path (type 11 extended): up to 16M+ with wide addresses, two-flit
## V0 Test Plan

- Drive input with microcontroller (RP2040 / Arduino)
- Microcontroller formats 32-bit request packets, clocks into request FIFO
- Read 32-bit result packets from output FIFO
- Test suite:
  - Sequential read/write
  - Random access
  - READ_INC sequences (verify atomicity, verify returned old value)
  - READ_DEC to zero (verify underflow behaviour)
  - CAS success and failure cases
  - Bank contention (same bank back-to-back)
  - Page register set + offset access
  - Boundary conditions (address 0, address 511, page wraparound)
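A couple of the suite's cases, sketched as host-side tests against a toy behavioural model. The real emulator lives in `emu/`; this standalone `SMModel` and its 0xFFFF underflow wrap are assumptions, stated here so the hardware behaviour has something concrete to be checked against.

```python
class SMModel:
    """Minimal behavioural model of one SM, mirroring what the
    microcontroller driver would exercise on the real hardware."""
    def __init__(self, cells: int = 512):
        self.mem = [0] * cells

    def read(self, addr: int) -> int:
        return self.mem[addr]

    def write(self, addr: int, data: int) -> None:
        self.mem[addr] = data & 0xFFFF

    def read_dec(self, addr: int) -> int:
        old = self.mem[addr]
        self.mem[addr] = (old - 1) & 0xFFFF  # assumed wrap at underflow
        return old

def test_read_dec_to_zero():
    sm = SMModel()
    sm.write(3, 1)
    assert sm.read_dec(3) == 1      # returns the old value
    assert sm.read_dec(3) == 0      # hits zero
    assert sm.read(3) == 0xFFFF     # underflow wraps (the choice to verify)

def test_boundary_addresses():
    sm = SMModel()
    sm.write(0, 7)
    sm.write(511, 8)
    assert (sm.read(0), sm.read(511)) == (7, 8)
```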
## Open Design Questions

1. **Result token type** — always monadic (type 01)? Or sometimes dyadic?
   See the "open question" in the interface protocol section above.
2. **CAS two-flit handling** — how does the request FIFO handle two-flit
   ops? Does it buffer both flits before dispatching, or pipeline them?
3. **Page register per-CM or global?** — if multiple CMs access the same
   SM, do they share a page register (contention) or each have their own
   (more hardware, more config)? Probably global for v0.
4. **Banking vs pipeline depth** — with 2 banks, can we overlap a read to
   bank 0 with a write to bank 1? Is that worth the control complexity
   for v0?
5. **SRAM chip selection** — specific part numbers, speed grades, package.
   Needs to match the target clock frequency.