Dynamic Dataflow CPU — Architecture Overview#
Master reference document. For detailed design of individual subsystems, see
companion documents. For rejected/deferred approaches and decision rationale,
see design-alternatives.md.
Companion Documents#
- pe-design.md — PE pipeline, matching, frames, instruction encoding, 670 subsystem, stall analysis
- alu-and-output-design.md — ALU operation set, instruction decoder, output formatter
- sm-design.md — structure memory interface, operations, banking, address space
- bus-interconnect-design.md — physical bus design, arbitration, routing nodes, backpressure
- network-and-communication.md — interconnect topology, routing, clocking, handshaking
- io-and-bootstrap.md — IO as memory-mapped SM, bootstrap via SM00 EXEC
- design-alternatives.md — rejected/deferred approaches with rationale
Project Goals#
- Dynamic dataflow CPU achievable with discrete logic (74-series TTL + SRAM)
- Multi-PE design targeting superscalar-equivalent IPC
- "Period-plausible" transistor budget: ~25-35K logic transistors + SRAM chips
- Comparable to a 68000 or a couple of Z80s in logic complexity
- Reference builds for physical scale: Fabian Schuiki's superscalar CPU, James Sharman's pipelined CPU
- Must be able to load and execute a binary over serial without a substantial conventional control core
- Incremental build plan: single PE first, expand to multi-PE
- Architecture must not rule out future evolution: specifically, must preserve design space for asynchronous operation, network topology changes, and runtime reprogramming
Key Architectural Decisions#
Execution Model#
- Dynamic dataflow (tagged-token), not static like the Electron E1
- Compiler performs static PE assignment and routing configuration (E1-like)
- Matching store operates dynamically within each PE for concurrent activations
- This is a hybrid: static routing topology, dynamic operand matching
Influences / Reference Architectures#
- Manchester Dataflow Machine (Gurd 1985): pipeline structure, matching unit design, overflow handling
- DFM / Amamiya 1982: semi-CAM concept, computational locality, function-instance-based addressing, CM/SM split, TTL prototype
- Pao et al. (IP lookup): subtree bit-vector parallel search via bitwise AND — useful for collision resolution or routing
- Electron E1: compile-time spatial mapping, tile-based PEs, control core for bootstrap
- Yang et al. (DDR SDRAM IP lookup): hash + small CAM for collision overflow
Width Domains#
The architecture separates into independent width domains, each sized for its own constraints. There is no requirement that they match.
| Domain | Width | Driven By |
|---|---|---|
| External bus (inter-module) | 16-bit | routing trace count, physical buildability |
| Token format (logical) | variable-length flits | encoding needs per token type |
| IRAM (instruction memory) | 16-bit single-half read | opcode + mode + frame reference |
| Frame store entries | 16-bit data (or 32-bit with wide=1) | operand, constant, and destination storage |
| PE pipeline registers | wide, decomposed | parallel data path + control path |
| SM internal datapath | 16-bit | SRAM word size |
Width conversion occurs at FIFO boundaries between domains, using serialisers/deserialisers (shift register + toggle). This is cheap in TTL and naturally integrates with the clock domain crossing FIFOs that already exist in the GALS clocking design.
The 16-bit external bus halves data traces (16 data + handful of control vs 32 + control), halves routing node width (comparators, latches, muxes all narrower), and halves FIFO storage width at network boundaries. Most token traffic in a well-compiled program stays PE-local; the external bus is the slow path, where SM access latency and cross-PE routing latency dominate over bus serialisation.
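The width conversion at a FIFO boundary can be modelled behaviourally. This is a minimal sketch of splitting a 32-bit wide value into two 16-bit flits and reassembling it; the high-half-first ordering and the helper names are assumptions for illustration, not the hardware interface.

```python
def serialise_wide(value32):
    """Split a 32-bit wide value into two 16-bit flits, high half first (assumed order)."""
    assert 0 <= value32 < 1 << 32
    return [(value32 >> 16) & 0xFFFF, value32 & 0xFFFF]

def deserialise_wide(flits):
    """Reassemble two 16-bit flits into one 32-bit value on the far side of the FIFO."""
    hi, lo = flits
    return (hi << 16) | lo

# Round trip across the width-domain boundary
assert deserialise_wide(serialise_wide(0x12345678)) == 0x12345678
```

In hardware this is the shift register plus toggle flip-flop; the model only captures the flit ordering contract both sides must agree on.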
Tentative: 16-bit is the working assumption for the emulator and assembler, but the 8-bit vs 16-bit ALU datapath decision is not fully resolved. The hardware design notes (alu-and-output-design.md) detail an 8-bit v0 datapath with 16-bit as a future upgrade. The emulator operates at 16-bit. Final decision depends on transistor budget tradeoffs during the physical build.
Token Format (type-tagged, flit-based)#
All inter-module communication is serialised into 16-bit flits on the external bus. The first flit of any packet contains the type field and routing information, enabling routing nodes to make forwarding decisions after receiving only the first flit. Subsequent flits are forwarded blindly to the same destination. The number of flits is determined by the prefix bits in flit 1 — routing nodes and receivers can predict packet length from the first flit alone.
Critical architectural point: tokens do not carry opcodes. A token is pure data-in-motion — it carries a destination address, activation information, and a data value. The opcode lives in the destination PE's instruction memory (IRAM) and is fetched locally (IFETCH stage, before matching). This decouples token width from instruction width, and means IRAM width is completely independent of bus width.
Instruction Deduplication#
Because IRAM entries are activation-independent templates and frames
provide all per-activation state, many activations can share the same
physical instruction. A series of comparisons against different thresholds
uses one cmp instruction at a single IRAM offset — each activation
provides different operand pairs and constants via different frames. The
number of IRAM entries a fragment needs is the number of unique operation
shapes, not the total number of operations executed. This keeps IRAM
small even for moderately complex function fragments.
Top-Level Discriminator: 1-Bit SM/CM Split#
BIT[15] = 1: SM TOKEN — destination is an SM bank. carries a memory
operation request (read, write, atomic RMW, bulk ops, etc.).
BIT[15] = 0: CM TOKEN — destination is a CM (PE). carries operand data
for compute instructions, or IRAM write commands.
IO is memory-mapped into SM address space (typically SM00 at v0). IRAM writes are a CM misc-bucket subtype. Bootstrap uses EXEC on SM00, reading pre-formed tokens from ROM. Debug/trace can use a reserved SM address range or a spare misc-bucket subtype.
CM Token Prefix Encoding#
BIT[15:14] = 00: DYADIC WIDE — hot path. 2 flits, 16-bit data.
offset:8 = 256 instruction addresses.
act_id:3 = 8 activation IDs.
BIT[15:13] = 010: MONADIC NORMAL — 2 flits, 16-bit data.
offset:8 = 256 instruction addresses.
BIT[15:13] = 011: MISC BUCKET — infrequent CM formats.
sub:2 discriminates:
011+00: FRAME CONTROL (2 flits, ALLOC/FREE)
011+01: PE-LOCAL WRITE (2 flits, IRAM + frame write)
011+10: MONADIC INLINE (1 flit, trigger only, offset:7)
011+11: SPARE (reserved)
The 3-bit activation_id field identifies the active frame for this token.
With 3 bits allocated to act_id, dyadic wide and monadic normal tokens both
have 8-bit offset fields (256 instruction addresses). The misc-bucket
subtypes cover frame lifecycle (011+00 for frame control) and PE-local
writes (011+01 for IRAM and frame slot writes).
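The prefix scheme above can be sketched as a behavioural encoder/classifier. This follows the prefix encodings listed here and the bit positions given in the flit diagrams (PE_id at [12:11], offset at [10:3], act_id at [2:0]); the function names are hypothetical, not the emulator's real API.

```python
def pack_dyadic_wide(port, pe, offset, act_id):
    """Hot-path dyadic wide flit 1: prefix 00 at bits [15:14]."""
    assert port < 2 and pe < 4 and offset < 256 and act_id < 8
    return (port << 13) | (pe << 11) | (offset << 3) | act_id

def pack_monadic_normal(pe, offset, act_id):
    """Monadic normal flit 1: prefix 010 at bits [15:13]."""
    assert pe < 4 and offset < 256 and act_id < 8
    return (0b010 << 13) | (pe << 11) | (offset << 3) | act_id

def classify(flit1):
    """Walk the prefix bits the same way the decode gates do."""
    if (flit1 >> 15) & 1:
        return 'SM'
    if not (flit1 >> 14) & 1:
        return 'DYADIC_WIDE'        # prefix 00: hot path
    if not (flit1 >> 13) & 1:
        return 'MONADIC_NORMAL'     # prefix 010
    sub = (flit1 >> 9) & 0b11       # prefix 011: misc bucket, sub at bits [10:9]
    return ('FRAME_CONTROL', 'PE_LOCAL_WRITE', 'MONADIC_INLINE', 'SPARE')[sub]
```

Note how `classify` mirrors the gate-level decode order: one test per prefix bit, with the misc-bucket subtype resolved last.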
Flit-1 Bit Allocation with Bit Positions#
Flit 1 is 16 bits. The network routing requirement is minimal: bit[15] (SM/CM split) plus the 2-bit destination ID. These 3 bits determine where the flit goes. The remaining bits are opaque payload — the network forwards them blindly. The destination endpoint (PE or SM) decodes the full 16 bits.
DYADIC WIDE: [0][0][port:1][PE:2][offset:8][act_id:3]
15 14 13 12-11 10-3 2-0
MONADIC NORM: [0][1][0][PE:2][offset:8][act_id:3]
15 14 13 12-11 10-3 2-0
FRAME CONTROL: [0][1][1][PE:2][00][op:1][act_id:3][spare:3]
                15 14 13 12-11 10-9 8   7-5       4-2   (note 1; bits [1:0] unused)
op=0 ALLOC Associate act_id with next free frame.
flit 2 = return routing for confirmation/error token.
op=1 FREE Release frame associated with act_id.
flit 2 = unused (or diagnostic).
PE-LOCAL WRITE: [0][1][1][PE:2][01][region:1][slot:5][act_id:3]
                 15 14 13 12-11 10-9 8        7-3     2-0   (note 2)
region=0 IRAM write. act_id ignored. slot = IRAM address (within
current bank).
region=1 Frame write. act_id resolved to frame_id by PE's tag store.
slot = frame slot index within that activation's frame.
MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2]
15 14 13 12-11 10-9 8-2 1-0
SPARE: [0][1][1][PE:2][11][...] reserved for future use.
Note 1: act_id is at bits [7:5] for frame control, with spare at [4:2] and bits [1:0] unused. Note 2: PE-local write packs slot and act_id into the low byte; slot:5 plus act_id:3 fills bits [7:0] exactly, so a spare bit at [7] appears only if the slot field narrows to 4 bits. Exact bit positions depend on final slot width allocation.
Routing Field Extraction#
Flit 1 bit [15] — SM/CM discriminator:
0 = CM token -> route to PE identified by PE_id
1 = SM token -> route to SM identified by SM_id
For CM tokens (bit[15]=0):
bits [12:11] = PE_id for all CM token formats (the flit diagrams place
PE_id at [12:11] for dyadic wide, monadic normal, and misc bucket alike)
For SM tokens (bit[15]=1):
bits [14:13] = SM_id (0-3)
Everything below the destination ID is endpoint-decoded, not network-decoded. The routing node's job is to extract the destination ID from the appropriate bit position based on bit[15].
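The extraction rule is small enough to state as code. A sketch of the routing-node decision, following the flit diagrams, which place SM_id at bits [14:13] and PE_id at bits [12:11] in every CM format:

```python
def route_destination(flit1):
    """Routing-node view of flit 1: SM/CM from bit[15], then the 2-bit destination ID."""
    if (flit1 >> 15) & 1:
        return ('SM', (flit1 >> 13) & 0b11)   # SM_id at bits [14:13]
    return ('CM', (flit1 >> 11) & 0b11)       # PE_id at bits [12:11]
```

Everything else in the flit is opaque to the node; only the endpoint decodes it.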
Flit-1 Field Alignment Invariants#
Several field positions are architecturally fixed across CM token formats, enabling shared decode hardware:
- PE_id is always at bits [12:11], for dyadic wide (prefix 00) as well as monadic normal and misc bucket tokens (prefix 01x), as the flit diagrams show. The routing node extracts PE_id from a single fixed position once bit[15] identifies a CM token.
- act_id is always bits [2:0] on dyadic wide and monadic normal tokens. This allows the 670-based act_id-to-frame_id lookup to begin immediately from the low 3 bits of flit 1, regardless of token type.
- offset is bits [10:3] for both dyadic wide and monadic normal tokens (8 bits). This allows the IRAM address latch to wire directly to flit 1 bits [10:3] for the two hot-path formats.
For the misc bucket formats (011+xx), field positions vary by subtype. These are decoded after the 5-bit prefix, not on the hot path.
Hot Path Decode#
- bit[15] splits SM/CM: one gate
- bit[14] splits dyadic-wide from everything else: one gate
- bit[13] splits monadic normal from misc bucket: one more gate
- the misc bucket is three gates deep, but nothing there is latency-critical
On the hot path (dyadic wide, prefix 00), the PE begins the IRAM read on
offset[7:0] and the 670 act_id-to-frame_id resolution on act_id[2:0]
simultaneously with decode. By the time the prefix is fully resolved
(2 gates, ~10 ns), both lookups are already in flight.
SM Token Format#
SM token (2 flits, standard):
flit 1: [1][SM_id:2][op:3-5][addr:8-10] = 16 bits
flit 2: [data:16] or [return_routing:16] = 16 bits
15 bits available after the SM discriminator. SM_id (2 bits) selects one of 4 SMs. The remaining 13 bits are split between opcode and address using variable-width encoding:
op[2:1] != 11: 3-bit opcode, 10-bit addr (1024 cells)
read, write, alloc, free, exec, ext
op[2:1] == 11: extends to 5-bit opcode, 8-bit payload (256 cells or inline data)
rd_inc, rd_dec, cas, raw_rd, clear, set_pg, write_im, (spare)
One decode gate on op[2:1] discriminates the two tiers.
Return routing in flit 2: for read and other result-producing ops,
flit 2 carries a pre-formed CM token template. The SM's result
formatter latches the template, prepends it as flit 1, and appends the
read data as flit 2.
IO is memory-mapped SM: IO devices are mapped into SM address space
(typically SM00 at v0). IO operations use the standard SM token format.
I-structure semantics provide natural interrupt-free IO: a read from an
IO device that has no data defers until data arrives.
See sm-design.md for the full opcode table, extended addressing, and
cas handling.
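A behavioural sketch of the two-tier decode: the op[2:1] == 11 escape is from the text above, but the exact placement of op and addr within the 13 bits below SM_id (op in the top bits, addr below) is an assumption pending sm-design.md.

```python
def decode_sm_flit1(flit1):
    """Split an SM flit 1 into (sm_id, opcode tier+value, addr/payload)."""
    assert (flit1 >> 15) & 1, "not an SM token"
    sm_id = (flit1 >> 13) & 0b11
    if (flit1 >> 11) & 0b11 == 0b11:                   # op[2:1] == 11: extended tier
        return sm_id, ('op5', (flit1 >> 8) & 0b11111), flit1 & 0xFF    # 8-bit payload
    return sm_id, ('op3', (flit1 >> 10) & 0b111), flit1 & 0x3FF        # 10-bit addr
```

The one-gate tier discrimination in hardware corresponds to the single comparison on bits [12:11] here.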
Variable-Length Token Summary#
| Token Type | Prefix | Flits | Data | Offset | Act ID | Port |
|---|---|---|---|---|---|---|
| Dyadic wide | 00 | 2 | 16-bit | 8 (256) | 3 | flit1 |
| Monadic normal | 010 | 2 | 16-bit | 8 (256) | 3 | -- |
| Frame control | 011+00 | 2 | 16-bit payload | -- | 3 | -- |
| PE-local write | 011+01 | 2 | 16-bit | -- | 3 | -- |
| Monadic inline | 011+10 | 1 | none | 7 (128) | -- | -- |
| SM standard | 1 | 2 | 16-bit | 8-10 | -- | -- |
The common case is 2 flits. Inline monadic (1 flit) is the fast path for control-flow tokens. PE-local writes are 2 flits and infrequent during execution.
Key Design Rationale#
- Opcodes don't travel: tokens carry destination addresses, not opcodes. IRAM is fetched PE-locally. With the reversed pipeline (IFETCH before MATCH), the instruction word drives match behaviour. Instruction width is completely independent of bus width.
- 1-bit top-level split: bit[15] discriminates SM from CM traffic. One gate. The network routes on bit[15] + the 2-bit destination ID (PE_id or SM_id). Everything below that is endpoint-decoded.
- Hot path decode is shallow: dyadic wide (the dominant format) is identified by two bits (bit[15]=0, bit[14]=0). The PE can begin activation_id resolution (74LS670 lookup) and IRAM read in parallel on flit 1 latch.
- PE-local writes are unified: IRAM writes and frame writes share the same misc-bucket format (011+01), differing only in the region bit. No separate system token type for either.
- Compact activation ID: 3-bit act_id identifies one of up to 4 concurrent frames per PE, with sufficient ABA distance. The compact encoding leaves 8 bits for instruction offset in both dyadic and monadic hot-path formats.
- Width domains are independent: bus width (16-bit), token format (variable flit count), IRAM width (16-bit single-half read), and PE pipeline width (wider, decomposed) are each sized for their own constraints.
- Instruction deduplication: IRAM entries are templates shared across activations. Per-activation data (constants, destinations) lives in frames. The number of IRAM entries needed is unique operation shapes, not total operations executed.
Spare Bits and Future Use#
The spare bits are explicitly reserved, not accidentally unused:
- Frame control spare:3 — future candidates: extended op field (up to 4 lifecycle operations), diagnostic flags, priority level.
- Monadic inline spare:2 — future candidates: reduced act_id (2 bits), extended interpretation flag.
- Misc bucket sub=11 — entire format reserved for future use.
The spare bits provide escape hatches for architectural evolution without changing the base format. v0 should treat them as must-be-zero on transmit, ignored on receive.
Module Taxonomy#
CM (Control Module) — execution and matching#
- Instruction memory (IM / IRAM): stores activation-independent instruction templates (function bodies)
  - Width decoupled from bus: 16-bit single-half read (one 8-bit chip pair per PE). Instruction word format: [type:1][opcode:5][mode:3][wide:1][fref:6]. See pe-design.md for detailed field definitions and mode table.
  - Runtime-writable via PE-local write tokens (prefix 011+01, region=0)
  - Write from network stalls the pipeline (acceptable for config operations)
  - Enables runtime reprogramming and eliminates the need for a separate config bus
  - Constants and destinations live in per-activation frame slots, not in the instruction word. This doubles IRAM density and enables template deduplication across activations.
- Per-activation frame storage: each active computation gets a frame — a flat array of 16-bit SRAM slots holding pending match operands, constants, destinations, accumulators, and wide values. The instruction template references frame slots by index via the fref field.
  - Runtime-writable via PE-local write tokens (prefix 011+01, region=1)
  - Frame lifecycle managed by frame control tokens (prefix 011+00, ALLOC/FREE)
- Matching via 74LS670 register-file lookup: activation_id indexes a lookup table to get frame_id, then presence/port metadata in additional 670s is checked for dyadic matching. All combinational (~35 ns), no SRAM cycle needed. 8 matchable offsets per frame (assembler-enforced).
- Pipeline order: IFETCH then MATCH. The instruction word drives match behaviour; activation_id resolution runs in parallel with the IRAM read.
- Receives CM tokens (bit[15]=0) from CN and DN (SM results repackaged as CM tokens), produces tokens to CN and AN
- Each PE has a unique ID, set via EEPROM (instruction decoder doubles as ID store) or DIP switches during prototyping
- See pe-design.md for PE internals, pipeline staging, and frame details
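The matching path can be modelled behaviourally. This is a sketch, not the 670 wiring: the class and method names are hypothetical, but the widths follow the text (8 act_ids, 4 frames per PE, 8 matchable offsets per frame) and the lifecycle follows the ALLOC/FREE frame control tokens.

```python
class MatchUnit:
    """Behavioural model of the 670-based act_id -> frame_id lookup plus dyadic matching."""

    def __init__(self):
        self.frame_of = [None] * 8                       # 670 lookup: act_id -> frame_id
        self.presence = [[False] * 8 for _ in range(4)]  # per frame: 8 matchable offsets
        self.free_frames = [0, 1, 2, 3]

    def alloc(self, act_id):
        """ALLOC frame control token: bind act_id to the next free frame."""
        self.frame_of[act_id] = self.free_frames.pop(0)

    def free(self, act_id):
        """FREE: clear the tag store entry and return the frame to the pool."""
        f = self.frame_of[act_id]
        self.frame_of[act_id] = None
        self.presence[f] = [False] * 8
        self.free_frames.append(f)

    def dyadic_arrival(self, act_id, slot):
        """First operand sets the presence bit and waits; its partner clears it and fires."""
        f = self.frame_of[act_id]
        if self.presence[f][slot]:
            self.presence[f][slot] = False
            return ('fire', f)
        self.presence[f][slot] = True
        return ('wait', f)
```

In hardware all three lookups are combinational; the model only captures the state machine the presence bits implement.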
SM (Structure Memory) — data storage, structure operations, and IO#
- Banked data memory (cells) for arrays, lists, heap data
- Embedded functional units for structure operations (read, write, atomic RMW, bulk ops via EXEC/ITERATE/COPY_RANGE)
- Receives operation requests via AN (bit[15]=1 tokens), returns results via DN (repackaged as CM tokens)
- Operates asynchronously from CMs — split-phase memory access
- IO is memory-mapped into SM address space. An SM (typically SM00 at v0) maps IO devices into its address range. I-structure semantics provide natural interrupt-free IO: a READ from an IO device that has no data defers until data arrives.
- SM00 has bootstrap responsibility: wired to the system reset signal, it calls EXEC on a predetermined ROM address to load the system. At runtime, SM00 behaves as a standard SM; only the reset-vector wiring is special. See sm-design.md for details.
- Memory tiers: SM address space supports regions with different semantics — tier 0 (raw, no presence bits), tier 1 (I-structure), and tier 2 (wide/bulk with is_wide tag). Tier selection is by address range.
- See sm-design.md for interface, banking, and tier details
Two Logical Interconnects (shared physical bus for v0)#
CN (Communication Network): CM <-> CM, bit[15]=0
AN (Arbitration Network): CM -> SM, bit[15]=1
DN (Distribution Network): SM -> CM, SM results repackaged as bit[15]=0
For v0 (4 PEs + 2-4 SMs), all traffic shares a single physical 16-bit bus with bit[15]-based routing. Routing nodes inspect flit 1 (bit[15] plus the destination ID at bits [14:13] for SM tokens or [12:11] for CM tokens) and forward the entire multi-flit packet to the appropriate destination. Multiple packets can be in flight simultaneously if the bus is pipelined with latches at each stage.
The AN/DN can be split onto separate physical paths later if SM access contention becomes a bottleneck. The bit[15]-based routing means this is a topology change, not a protocol change — no module interfaces need to change.
See network-and-communication.md for routing, clocking, and scaling details.
Function Calls#
The Problem#
A function call in this architecture must solve:
- Code residency — callee instructions in IRAM on the right PEs.
- Activation isolation — fresh frame for the new activation.
- Argument injection — N argument values tagged into callee's activation.
- Return linkage — callee knows where to send results.
- Activation teardown — free frame(s) when activation completes.
How Frames Support Calls#
The frame-based design keeps function call machinery simple:
- Destinations are pre-formed. Output destinations are pre-formed flit 1 values stored in frame slots. Each destination already contains the target PE, offset, and act_id. No special override mechanism or context-mode encoding is needed in the instruction word.
- Call descriptors live in frames. Pre-formed flit 1 values (call descriptors, return addresses) are loaded into frame slots during activation setup. The callee's return address is just another frame constant, loaded before execution begins. No SM round-trip at call time.
- Frame allocation is PE-local. Activation setup sends an ALLOC frame control token to the target PE. The PE allocates a frame locally and returns a confirmation token. No SM round-trip for allocation.
Static Calls#
For non-recursive calls with compiler-known call graphs, all activation assignments are compile-time constants. The compiler assigns act_ids and frame slot layouts when laying out IRAM.
Argument passing: The caller's output instructions have their destination frame slots pre-loaded with flit 1 values targeting the callee's (PE, offset, act_id). Arguments flow across activation boundaries as normal token routing. No special call instructions — just frame setup.
Return: The callee's return instruction uses a destination read from its frame (mode 0 or 1). The return address is a frame constant loaded during activation setup — a pre-formed flit 1 value pointing back to the caller's (PE, offset, act_id). For single-call-site functions, this is a compile-time constant loaded once.
Multiple call sites: If a function is called from N sites, each call site loads different return routing into the callee's frame during setup. The callee's return instruction is identical across all call sites — it just reads frame[fref] for the destination, and the frame contents determine where the result goes. No per-call-site IRAM duplication needed.
Activation allocation: Compile-time for static calls. The compiler assigns non-overlapping act_id values. The ALLOC frame control token establishes the frame before any compute tokens arrive.
Activation teardown: The callee's final instruction (or a separate cleanup path) executes FREE_FRAME, which clears the tag store entry and returns the frame to the free pool.
Total overhead for static calls: frame setup tokens (ALLOC + PE-local writes to load constants and destinations). No extra compute instructions. The setup cost is proportional to the number of frame slots that differ between call sites.
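The setup stream a call site emits can be sketched as follows. The token tuples are illustrative placeholders, not the wire format; the structure (one ALLOC, then one PE-local write per differing frame slot) is from the text.

```python
def static_call_setup(pe, act_id, frame_inits):
    """Build the setup token stream for one static call site.

    frame_inits maps frame slot -> pre-formed value (constants, destinations,
    return routing). Setup cost is proportional to len(frame_inits).
    """
    tokens = [('ALLOC', pe, act_id)]
    for slot in sorted(frame_inits):
        tokens.append(('PE_LOCAL_WRITE', pe, act_id, slot, frame_inits[slot]))
    return tokens
```

Two call sites to the same function would differ only in `frame_inits`; the callee's IRAM entries are shared.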
Dynamic Calls#
For recursive calls, indirect calls (function pointers, trait objects), and functions with multiple call sites that need dynamic return routing.
Primitives:
| Primitive | Type | Hardware | Purpose |
|---|---|---|---|
| CHANGE_TAG | dyadic CM (mode 4/5) | ~4 chips/PE | Output routing from data operand |
| EXTRACT_TAG | monadic CM | ~2 chips/PE | Capture runtime act_id + offset as data |
| ALLOC | frame control token | 0 PE chips | Runtime frame allocation on target PE |
| FREE | frame control token | 0 PE chips | Frame deallocation |
CHANGE_TAG (mode 4/5): Left operand is a 16-bit packed tag (a pre-formed flit 1 value). Right operand is the data payload. Output token's flit 1 = left operand verbatim. Flit 2 = right operand. Enables sending a value to any destination computed at runtime. The output stage is a mux: frame dest vs left operand, selected by mode[2]. Hardware: left operand bypass latch (~2 chips) preserves the left operand value past the ALU. Stage 5 flit 1 mux (~2 chips) selects between assembled flit and raw data.
EXTRACT_TAG: Monadic instruction. Captures the executing token's identity as a 16-bit data value (a return continuation). The return offset comes from the frame (or an instruction-derived constant); PE_id from hardware; act_id from the pipeline activation latch. Output is a packed flit 1 value that can be stored in a callee's frame or passed to CHANGE_TAG for dynamic return routing.
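The two primitives reduce to simple functions on 16-bit values. A sketch: `change_tag` follows the text exactly (flit 1 = left operand verbatim, flit 2 = data); packing the continuation in monadic-normal layout is an assumption, since the text only requires EXTRACT_TAG to emit a pre-formed flit 1 value.

```python
def change_tag(left_tag, right_data):
    """CHANGE_TAG: output flit 1 is the left operand verbatim, flit 2 the data payload."""
    return [left_tag & 0xFFFF, right_data & 0xFFFF]

def extract_tag(pe_id, act_id, ret_offset):
    """EXTRACT_TAG: pack (PE, offset, act_id) as a 16-bit return continuation.

    Monadic-normal layout (prefix 010) is assumed for the packed tag.
    """
    return (0b010 << 13) | (pe_id << 11) | (ret_offset << 3) | act_id

# Dynamic return: caller captures its identity, callee routes the result back
ret = extract_tag(0, 2, 0x40)
packet = change_tag(ret, 1234)
assert packet == [ret, 1234]
```

The packed continuation can be stored in the callee's frame or fed straight to CHANGE_TAG, exactly as the call sequence below uses it.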
Call descriptor tables: Pre-formed flit 1 values for callee argument destinations, stored in SM at boot (loaded via EXEC). During activation setup, these descriptors are loaded from SM into the callee's frame slots via PE-local write tokens.
Runtime activation allocation: An ALLOC frame control token is sent to each target PE. The PE allocates a frame locally and returns a confirmation token. Purely PE-local — no SM round-trip required. Multiple PEs can allocate frames in parallel.
Dynamic Call Sequence#
Caller (PE0, act=2) calls foo(a, b) -> result dynamically:
[Setup phase — can be overlapped with caller computation]
ALLOC(PE1, act=5) ; allocate frame on callee's PE
PE-local write(PE1, act=5, slot=0, const_val) ; load constant
PE-local write(PE1, act=5, slot=1, dest_ret) ; load return routing
PE-local write(PE1, act=5, slot=2, dest_out) ; load output dest
[Argument injection — from caller's compute path]
EXTRACT_TAG ; pack (PE0, act=2, ret_offset) as data
-> store in callee frame via PE-local write to ret slot
Route arg_a to (PE1, offset_a, act=5) ; normal mode 0 output
Route arg_b to (PE1, offset_b, act=5) ; normal mode 0 output
Callee (PE1, act=5, receives args via normal matching):
; ... compute result ...
; return instruction is mode 0, reads dest from frame[ret_slot]
; frame[ret_slot] = pre-formed flit 1 targeting caller
; result routes back to caller automatically
[Teardown]
FREE_FRAME ; release callee's frame
Setup tokens can be pipelined. ALLOC, PE-local writes, and argument tokens can fire in parallel across different PEs. Effective critical path: ALLOC confirmation latency + argument delivery. Comparable to a conventional function call with register setup + jump.
Partial Execution#
The dataflow execution model supports Amamiya-style partial function execution naturally. If the callee's argument entry points are independent instructions (not a single multi-input "begin" node), arguments arriving early begin executing the callee's body before all arguments are present. No special hardware support — the compiler structures the callee's dataflow graph to expose this parallelism.
Tail Calls#
If the callee reuses the caller's frame (no ALLOC, no new act_id), the call is a tail call. The compiler routes arguments with the inherited activation. No CHANGE_TAG needed, no allocation, no teardown. Falls out of mode 0 (INHERIT) naturally — the caller's frame destinations simply point to the callee's instruction offsets within the same activation.
Output Token Context Source (Summary)#
The 3-bit mode field in the instruction word controls how flit 1 of the
output token is sourced:
- INHERIT (modes 0-3): flit 1 comes from a pre-formed destination in the frame. The frame constant IS the output flit. No token formation logic.
- CHANGE_TAG (modes 4-5): flit 1 comes from the left operand (a runtime-computed packed tag). Enables dynamic routing.
- SINK (modes 6-7): no output token. ALU result written back to a frame slot.
See pe-design.md for the full mode table with frame slot access patterns,
bit-level decode, and hardware cost.
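The three-way choice above amounts to a two-level mux on the mode bits. A minimal sketch of the selection implied by the mode groupings (INHERIT 0-3, CHANGE_TAG 4-5, SINK 6-7); the real decode and hardware cost are in pe-design.md.

```python
def output_flit1(mode, frame_dest, left_operand):
    """Select the flit-1 source for the output token from the 3-bit mode field."""
    if mode & 0b100 == 0:
        return frame_dest       # modes 0-3, INHERIT: pre-formed destination from the frame
    if mode & 0b010 == 0:
        return left_operand     # modes 4-5, CHANGE_TAG: runtime-computed packed tag
    return None                 # modes 6-7, SINK: no output token; result written to a frame slot
```

Note that INHERIT needs no token-formation logic at all: the frame constant is the output flit.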
Bus Protocol#
Bus Signals#
Bus signals (per link):
data:16 flit data
valid:1 flit is present (handshake)
ready:1 receiver can accept (backpressure)
more:1 more flits follow in this packet (framing)
Atomic Packet Delivery#
The output FIFO guarantees atomic packet delivery on the bus. A multi-flit token is written to the output FIFO as a complete unit. The FIFO drains flits to the bus sequentially, and the bus arbiter (or handshake protocol) holds the bus for the full packet duration.
The "more flits follow" signal (the more wire) accompanies the data
bus. The emitter asserts it on every flit except the last flit of a packet.
Routing nodes and receivers watch this signal to know when a packet is
complete. They do not need to decode the token type or inspect flit 1 —
framing is entirely at the bus level.
Format-Agnostic Network#
Routing nodes forward flits transparently. They latch flit 1 for routing
decisions, then forward subsequent flits (while more is asserted) to
the same destination. No per-flit routing lookup after the first flit.
This means the network is completely agnostic to token format. It does
not know or care whether a packet is 1, 2, 3, or 4 flits. It follows
the more signal. All format intelligence is at the endpoints (emitting
PE's formatter and receiving PE's deserialiser).
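The forwarding behaviour above can be sketched as a stream transformer: latch the destination on flit 1, stream until `more` drops. `route` stands in for the routing-node extraction; the model never inspects token format or packet length.

```python
def forward(flits_in, route):
    """Format-agnostic routing node: (flit, more) pairs in, (dest, flit) pairs out."""
    out, dest = [], None
    for data, more in flits_in:
        if dest is None:
            dest = route(data)   # routing decision on flit 1 only
        out.append((dest, data))
        if not more:
            dest = None          # packet complete; the next flit is a new flit 1
    return out
```

A 2-flit CM packet followed by a 1-flit SM packet routes correctly with no knowledge of either format:

```python
stream = [(0x0800, True), (0x1234, False), (0x8001, False)]
hops = forward(stream, lambda f: 'SM' if f >> 15 else 'CM')
# both CM flits go to the same destination; the SM flit starts a fresh packet
```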
See bus-interconnect-design.md for physical bus implementation and
network-and-communication.md for routing topology.
Transistor Budget Estimate (4-PE system)#
| Component | Transistors |
|---|---|
| 4x PE logic | 20-32K |
| Routing network (4 PEs) | 2-3K |
| SM bootstrap/EXEC sequencer | ~1-2K |
| Total logic | ~25-35K |
| SRAM chips (instruction mem, matching stores, token queues) | 8-16 chips |
Bootstrap is handled by SM00's EXEC sequencer reading pre-formed tokens from ROM, or by an external microcontroller during early prototyping.
IPC / Performance Expectations#
- "Superscalar" is the wrong term for dataflow — there's no single instruction stream
- With 4 PEs and pipelined execution, peak is 4 ops/clock
- Realistic sustained throughput depends on:
- Network crossing frequency (adds routing latency)
- Frame-based matching latency (combinational via 74LS670 register-file lookup — no SRAM cycle for match metadata)
- Available parallelism in the program
- Network contention (shared bus at v0 scale)
- Frame SRAM access scheduling (constant reads, destination reads, and operand stores each take one SRAM cycle)
- Parallel workloads (matrix multiply, FFT): near peak
- Sequential/pointer-chasing code: ~0.5-1 ops/clock (still competitive with 6502)
- Key insight: matching store performance is the primary bottleneck, as Manchester discovered. The 670-based approach moves match metadata off the SRAM bus entirely, enabling pipeline overlap between IRAM fetch and match resolution
Build Order#
Phase 0: SM (Structure Memory) — BUILD FIRST#
- Self-contained module, testable in isolation
- Drive with microcontroller (Arduino/RP2040) for testing
- Defined interface: receive operation request, process, return result
- Key deliverables:
- Banked SRAM with address decoding
- Simple operation unit (read/write at minimum, cons/car/cdr stretch goals)
- Input interface (receive request packets)
- Output interface (send result packets)
- Test harness: microcontroller sends requests, validates responses
Phase 1: CM (Control Module) — single PE#
- Instruction memory (SRAM, 16-bit single-half, one 8-bit chip pair)
- Frame storage (SRAM, shared chip pair with IRAM or separate)
- 74LS670-based activation_id resolution and match metadata (~8 chips)
- 16-bit ALU
- Token FIFO (input)
- Token output formatting (frame destination slots are pre-formed flit 1)
- Test with microcontroller injecting tokens, verify IFETCH -> MATCH pipeline, frame allocation/deallocation, and execution
Phase 2: CM + SM pair#
- Connect via shared bus with bit[15]-based routing
- Load a program using microcontroller (external, via IRAM write tokens or direct SRAM programming)
- Execute a dataflow graph that uses structure memory
- First real program: fibonacci, small FFT, or similar
Phase 3: Multi-module#
- Second CM, routing network
- Prove cross-PE token routing works
- Demonstrate actual parallel execution speedup
Phase 4: System#
- Expand to 4 CMs + 2-4 SMs
- SM00 bootstrap via EXEC from ROM
- IO memory-mapped into SM00 address space (UART, etc.)
- ISR equivalent via I-structure semantics: READ from IO device defers until data arrives, triggering the receiving node in the dataflow graph
- Performance benchmarking vs period-equivalent CPUs
Open Questions / Next Steps#
- SM internal design — banking scheme, bulk op sequencer, tier boundary configuration, wide pointer cell format (partially specified, see sm-design.md)
- Context slot count per CM — Resolved. 3-bit activation_id with 4 concurrent frames per PE. See pe-design.md.
- Instruction encoding — Resolved. 16-bit single-half format: [type:1][opcode:5][mode:3][wide:1][fref:6]. Constants and destinations live in frame slots. See pe-design.md for the full encoding and mode table. The emulator and assembler still use Python IntEnum values as opcode placeholders. These do NOT represent final hardware bit encodings — a hardware encoding pass mapping to the 5-bit opcode field is still needed.
- IO address space allocation — which SM_id is reserved for IO? How much of SM00's address space is mapped to IO vs general-purpose storage? SM00 is special only at boot for now; further specialisation deferred until profiling shows the standard opcodes are insufficient.
- Compiler / assembler — Resolved. The asm/ package implements a 7-stage assembler pipeline (parse -> lower -> expand -> resolve -> place -> allocate -> codegen). Produces PEConfig/SMConfig + seed tokens or a bootstrap token stream. The expand pass handles macro expansion (#macro definitions with ${param} substitution, variadic repetition, constant arithmetic) and function call wiring (cross-context edges, trampoline nodes, context teardown). Built-in macros for common patterns (counted loops, permit injection, reduction trees) are automatically available. See assembler-architecture.md for architecture. Grammar is dfasm.lark (Lark/Earley parser). Auto-placement via greedy bin-packing with locality heuristic. Remaining work: frame layout allocation, new token format support, optimisation passes, binary output.
- Mode B clock ratio — exactly 2x, or design for arbitrary integer ratios?
- Instruction residency — small IRAM per PE means programs larger than IRAM need runtime code loading. With 16-bit instructions, IRAM density has doubled (4096 instructions with bank switching on 8Kx8 SRAM), reducing pressure but not eliminating the issue for large programs. Code storage hierarchy: external storage -> SM -> IRAM. See the pe-design.md Instruction Residency section for detailed options.
- 74LS670 supply — the register-file lookup depends on 74LS670 availability. Fallback options exist (discrete flip-flops, SRAM-based indexing) but increase chip count or add latency. See pe-design.md for details.
- Assembler updates for frame model — the assembler needs frame layout allocation (assigning constants, destinations, match operands to frame slots), PE-local write token generation for frame setup, and enforcement of the 8-matchable-offset constraint.
Key Papers in Project#
- gurd1985.pdf — Manchester Dataflow Machine (matching unit details, overflow, pipeline)
- Dataflow_Machine_Architecture.pdf — Veen survey (comprehensive overview, matching space analysis)
- amamiya1982.pdf — DFM architecture (semi-CAM, structure memory, TTL prototype)
- 17407_17358.pdf — DFM evaluation (implementation details, benchmarks, VLSI projection)
- efficienthardwarearchitectureforfastipaddresslookup.pdf — Pao et al. (binary-trie partitioning, bit-vector parallel search, SRAM pipeline)
- mclaughlin2005.pdf — IP lookup survey (comparison of trie vs hash approaches in hardware)
- HighperformanceIPlookupcircuitusingDDRSDRAM.pdf — Yang et al. (hash + CAM overflow, DDR burst for multi-bank)
- NonStrict_Execution_in_Parallel_and_Distributed_C.pdf — non-strict execution, split-phase memory
- NATLS219821.pdf — National Semiconductor 100142 CAM chip (4x4-bit, reference for discrete CAM scale)
- MOSES071271.pdf — Motorola MCM69C233 CAM (32-bit match width, reference for CAM interface design)
- yuba1983.pdf — Yuba et al. (PE pipeline sections, pseudo-result handling, packet formats)