Dynamic Dataflow CPU — Architecture Overview#
Master reference document. For detailed design of individual subsystems, see
companion documents. For rejected/deferred approaches and decision rationale,
see design-alternatives.md.
Companion Documents#
- pe-design.md — PE pipeline, matching, frames, instruction encoding, 670 subsystem, stall analysis
- alu-and-output-design.md — ALU operation set, instruction decoder, output formatter
- sm-design.md — structure memory interface, operations, banking, address space
- bus-interconnect-design.md — physical bus design, arbitration, routing nodes, backpressure
- network-and-communication.md — interconnect topology, routing, clocking, handshaking
- io-and-bootstrap.md — IO as memory-mapped SM, bootstrap via SM00 EXEC
- design-alternatives.md — rejected/deferred approaches with rationale
Project Goals#
- Dynamic dataflow CPU achievable with discrete logic (74-series TTL + SRAM)
- Multi-PE design targeting superscalar-equivalent IPC
- "Period-plausible" transistor budget: ~25-35K logic transistors + SRAM chips
- Comparable to a 68000 or a couple of Z80s in logic complexity
- Reference builds for physical scale: Fabian Schuiki's superscalar CPU, James Sharman's pipelined CPU
- Must be able to load and execute a binary over serial without a substantial conventional control core
- Incremental build plan: single PE first, expand to multi-PE
- Architecture must not rule out future evolution: specifically, must preserve design space for asynchronous operation, network topology changes, and runtime reprogramming
Key Architectural Decisions#
Execution Model#
- Dynamic dataflow (tagged-token), not static like the Electron E1
- Compiler performs static PE assignment and routing configuration (E1-like)
- Matching store operates dynamically within each PE for concurrent activations
- This is a hybrid: static routing topology, dynamic operand matching
Influences / Reference Architectures#
- Manchester Dataflow Machine (Gurd 1985): pipeline structure, matching unit design, overflow handling
- DFM / Amamiya 1982: semi-CAM concept, computational locality, function-instance-based addressing, CM/SM split, TTL prototype
- Pao et al. (IP lookup): subtree bit-vector parallel search via bitwise AND — useful for collision resolution or routing
- Electron E1: compile-time spatial mapping, tile-based PEs, control core for bootstrap
- Yang et al. (DDR SDRAM IP lookup): hash + small CAM for collision overflow
Width Domains#
The architecture separates into independent width domains, each sized for its own constraints. There is no requirement that they match.
| Domain | Width | Driven By |
|---|---|---|
| External bus (inter-module) | 16-bit | routing trace count, physical buildability |
| Token format (logical) | variable-length flits | encoding needs per token type |
| IRAM (instruction memory) | 16-bit single-half read | opcode + mode + frame reference |
| Frame store entries | 16-bit data (or 32-bit with wide=1) | operand, constant, and destination storage |
| PE pipeline registers | wide, decomposed | parallel data path + control path |
| SM internal datapath | 16-bit | SRAM word size |
Width conversion occurs at FIFO boundaries between domains, using serialisers/deserialisers (shift register + toggle). This is cheap in TTL and naturally integrates with the clock domain crossing FIFOs that already exist in the GALS clocking design.
The 16-bit external bus halves data traces (16 data + handful of control vs 32 + control), halves routing node width (comparators, latches, muxes all narrower), and halves FIFO storage width at network boundaries. Most token traffic in a well-compiled program stays PE-local; the external bus is the slow path, where SM access latency and cross-PE routing latency dominate over bus serialisation.
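The width conversion at a FIFO boundary can be modelled behaviourally. This is a minimal sketch of splitting a 32-bit wide value into two 16-bit flits and reassembling it; the high-half-first ordering and the helper names are assumptions for illustration, not the hardware interface.

```python
def serialise_wide(value32):
    """Split a 32-bit wide value into two 16-bit flits, high half first (assumed order)."""
    assert 0 <= value32 < 1 << 32
    return [(value32 >> 16) & 0xFFFF, value32 & 0xFFFF]

def deserialise_wide(flits):
    """Reassemble two 16-bit flits into one 32-bit value on the far side of the FIFO."""
    hi, lo = flits
    return (hi << 16) | lo

# Round trip across the width-domain boundary
assert deserialise_wide(serialise_wide(0x12345678)) == 0x12345678
```

In hardware this is the shift register plus toggle flip-flop; the model only captures the flit ordering contract both sides must agree on.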
Tentative: 16-bit is the working assumption for the emulator and assembler, but the 8-bit vs 16-bit ALU datapath decision is not fully resolved. The hardware design notes (alu-and-output-design.md) detail an 8-bit v0 datapath with 16-bit as a future upgrade. The emulator operates at 16-bit. Final decision depends on transistor budget tradeoffs during the physical build.
Token Format (type-tagged, flit-based)#
All inter-module communication is serialised into 16-bit flits on the external bus. The first flit of any packet contains the type field and routing information, enabling routing nodes to make forwarding decisions after receiving only the first flit. Subsequent flits are forwarded blindly to the same destination. The number of flits is determined by the prefix bits in flit 1 — routing nodes and receivers can predict packet length from the first flit alone.
Critical architectural point: tokens do not carry opcodes. A token is pure data-in-motion — it carries a destination address, activation information, and a data value. The opcode lives in the destination PE's instruction memory (IRAM) and is fetched locally (IFETCH stage, before matching). This decouples token width from instruction width, and means IRAM width is completely independent of bus width.
Instruction Deduplication#
Because IRAM entries are activation-independent templates and frames
provide all per-activation state, many activations can share the same
physical instruction. A series of comparisons against different thresholds
uses one cmp instruction at a single IRAM offset — each activation
provides different operand pairs and constants via different frames. The
number of IRAM entries a fragment needs is the number of unique operation
shapes, not the total number of operations executed. This keeps IRAM
small even for moderately complex function fragments.
Top-Level Discriminator: 1-Bit SM/CM Split#
BIT[15] = 1: SM TOKEN — destination is an SM bank. carries a memory
operation request (read, write, atomic RMW, bulk ops, etc.).
BIT[15] = 0: CM TOKEN — destination is a CM (PE). carries operand data
for compute instructions, or IRAM write commands.
IO is memory-mapped into SM address space (typically SM00 at v0). IRAM writes are a CM misc-bucket subtype. Bootstrap uses EXEC on SM00, reading pre-formed tokens from ROM. Debug/trace can use a reserved SM address range or a spare misc-bucket subtype.
CM Token Prefix Encoding#
BIT[15:14] = 00: DYADIC WIDE — hot path. 2 flits, 16-bit data.
offset:8 = 256 instruction addresses.
act_id:3 = 8 activation IDs.
BIT[15:13] = 010: MONADIC NORMAL — 2 flits, 16-bit data.
offset:8 = 256 instruction addresses.
BIT[15:13] = 011: MISC BUCKET — infrequent CM formats.
sub:2 discriminates:
011+00: FRAME CONTROL (2 flits, ALLOC/FREE)
011+01: PE-LOCAL WRITE (2 flits, IRAM + frame write)
011+10: MONADIC INLINE (1 flit, trigger only, offset:7)
011+11: SPARE (reserved)
The 3-bit activation_id field identifies the active frame for this token.
With 3 bits allocated to act_id, dyadic wide and monadic normal tokens both
have 8-bit offset fields (256 instruction addresses). The misc-bucket
subtypes cover frame lifecycle (011+00 for frame control) and PE-local
writes (011+01 for IRAM and frame slot writes).
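The prefix scheme above can be sketched as a behavioural encoder/classifier. This follows the prefix encodings listed here and the bit positions given in the flit diagrams (PE_id at [12:11], offset at [10:3], act_id at [2:0]); the function names are hypothetical, not the emulator's real API.

```python
def pack_dyadic_wide(port, pe, offset, act_id):
    """Hot-path dyadic wide flit 1: prefix 00 at bits [15:14]."""
    assert port < 2 and pe < 4 and offset < 256 and act_id < 8
    return (port << 13) | (pe << 11) | (offset << 3) | act_id

def pack_monadic_normal(pe, offset, act_id):
    """Monadic normal flit 1: prefix 010 at bits [15:13]."""
    assert pe < 4 and offset < 256 and act_id < 8
    return (0b010 << 13) | (pe << 11) | (offset << 3) | act_id

def classify(flit1):
    """Walk the prefix bits the same way the decode gates do."""
    if (flit1 >> 15) & 1:
        return 'SM'
    if not (flit1 >> 14) & 1:
        return 'DYADIC_WIDE'        # prefix 00: hot path
    if not (flit1 >> 13) & 1:
        return 'MONADIC_NORMAL'     # prefix 010
    sub = (flit1 >> 9) & 0b11       # prefix 011: misc bucket, sub at bits [10:9]
    return ('FRAME_CONTROL', 'PE_LOCAL_WRITE', 'MONADIC_INLINE', 'SPARE')[sub]
```

Note how `classify` mirrors the gate-level decode order: one test per prefix bit, with the misc-bucket subtype resolved last.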
Flit-1 Bit Allocation with Bit Positions#
Flit 1 is 16 bits. The network routing requirement is minimal: bit[15] (SM/CM split) plus the 2-bit destination ID. These 3 bits determine where the flit goes. The remaining bits are opaque payload — the network forwards them blindly. The destination endpoint (PE or SM) decodes the full 16 bits.
DYADIC WIDE: [0][0][port:1][PE:2][offset:8][act_id:3]
15 14 13 12-11 10-3 2-0
MONADIC NORM: [0][1][0][PE:2][offset:8][act_id:3]
15 14 13 12-11 10-3 2-0
FRAME CONTROL: [0][1][1][PE:2][00][op:1][act_id:3][spare:3]
                15 14 13 12-11 10-9 8   7-5       4-2   (note 1; bits [1:0] unused)
op=0 ALLOC Associate act_id with next free frame.
flit 2 = return routing for confirmation/error token.
op=1 FREE Release frame associated with act_id.
flit 2 = unused (or diagnostic).
PE-LOCAL WRITE: [0][1][1][PE:2][01][region:1][slot:5][act_id:3]
                 15 14 13 12-11 10-9 8        7-3     2-0   (note 2)
region=0 IRAM write. act_id ignored. slot = IRAM address (within
current bank).
region=1 Frame write. act_id resolved to frame_id by PE's tag store.
slot = frame slot index within that activation's frame.
MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2]
15 14 13 12-11 10-9 8-2 1-0
SPARE: [0][1][1][PE:2][11][...] reserved for future use.
Note 1: act_id is at bits [7:5] for frame control, with spare at [4:2] and bits [1:0] unused. Note 2: PE-local write packs slot and act_id into the low byte; slot:5 plus act_id:3 fills bits [7:0] exactly, so a spare bit at [7] appears only if the slot field narrows to 4 bits. Exact bit positions depend on final slot width allocation.
Routing Field Extraction#
Flit 1 bit [15] — SM/CM discriminator:
0 = CM token -> route to PE identified by PE_id
1 = SM token -> route to SM identified by SM_id
For CM tokens (bit[15]=0):
bits [12:11] = PE_id for all CM token formats (the flit diagrams place
PE_id at [12:11] for dyadic wide, monadic normal, and misc bucket alike)
For SM tokens (bit[15]=1):
bits [14:13] = SM_id (0-3)
Everything below the destination ID is endpoint-decoded, not network-decoded. The routing node's job is to extract the destination ID from the appropriate bit position based on bit[15].
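The extraction rule is small enough to state as code. A sketch of the routing-node decision, following the flit diagrams, which place SM_id at bits [14:13] and PE_id at bits [12:11] in every CM format:

```python
def route_destination(flit1):
    """Routing-node view of flit 1: SM/CM from bit[15], then the 2-bit destination ID."""
    if (flit1 >> 15) & 1:
        return ('SM', (flit1 >> 13) & 0b11)   # SM_id at bits [14:13]
    return ('CM', (flit1 >> 11) & 0b11)       # PE_id at bits [12:11]
```

Everything else in the flit is opaque to the node; only the endpoint decodes it.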
Flit-1 Field Alignment Invariants#
Several field positions are architecturally fixed across CM token formats, enabling shared decode hardware:
- PE_id is always at bits [12:11], for dyadic wide (prefix 00) as well as monadic normal and misc bucket tokens (prefix 01x), as the flit diagrams show. The routing node extracts PE_id from a single fixed position once bit[15] identifies a CM token.
- act_id is always bits [2:0] on dyadic wide and monadic normal tokens. This allows the 670-based act_id-to-frame_id lookup to begin immediately from the low 3 bits of flit 1, regardless of token type.
- offset is bits [10:3] for both dyadic wide and monadic normal tokens (8 bits). This allows the IRAM address latch to wire directly to flit 1 bits [10:3] for the two hot-path formats.
For the misc bucket formats (011+xx), field positions vary by subtype. These are decoded after the 5-bit prefix, not on the hot path.
Hot Path Decode#
- bit[15] splits SM/CM: one gate
- bit[14] splits dyadic-wide from everything else: one gate
- bit[13] splits monadic normal from misc bucket: one more gate
- the misc bucket is three gates deep, but nothing there is latency-critical
On the hot path (dyadic wide, prefix 00), the PE begins the IRAM read on
offset[7:0] and the 670 act_id-to-frame_id resolution on act_id[2:0]
simultaneously with decode. By the time the prefix is fully resolved
(2 gates, ~10 ns), both lookups are already in flight.
SM Token Format#
SM token (2 flits, standard):
flit 1: [1][SM_id:2][op:3-5][addr:8-10] = 16 bits
flit 2: [data:16] or [return_routing:16] = 16 bits
15 bits available after the SM discriminator. SM_id (2 bits) selects one of 4 SMs. The remaining 13 bits are split between opcode and address using variable-width encoding:
op[2:1] != 11: 3-bit opcode, 10-bit addr (1024 cells)
read, write, alloc, free, exec, ext
op[2:1] == 11: extends to 5-bit opcode, 8-bit payload (256 cells or inline data)
rd_inc, rd_dec, cas, raw_rd, clear, set_pg, write_im, (spare)
One decode gate on op[2:1] discriminates the two tiers.
Return routing in flit 2: for read and other result-producing ops,
flit 2 carries a pre-formed CM token template. The SM's result
formatter latches the template, prepends it as flit 1, and appends the
read data as flit 2.
IO is memory-mapped SM: IO devices are mapped into SM address space
(typically SM00 at v0). IO operations use the standard SM token format.
I-structure semantics provide natural interrupt-free IO: a read from an
IO device that has no data defers until data arrives.
See sm-design.md for the full opcode table, extended addressing, and
cas handling.
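A behavioural sketch of the two-tier decode: the op[2:1] == 11 escape is from the text above, but the exact placement of op and addr within the 13 bits below SM_id (op in the top bits, addr below) is an assumption pending sm-design.md.

```python
def decode_sm_flit1(flit1):
    """Split an SM flit 1 into (sm_id, opcode tier+value, addr/payload)."""
    assert (flit1 >> 15) & 1, "not an SM token"
    sm_id = (flit1 >> 13) & 0b11
    if (flit1 >> 11) & 0b11 == 0b11:                   # op[2:1] == 11: extended tier
        return sm_id, ('op5', (flit1 >> 8) & 0b11111), flit1 & 0xFF    # 8-bit payload
    return sm_id, ('op3', (flit1 >> 10) & 0b111), flit1 & 0x3FF        # 10-bit addr
```

The one-gate tier discrimination in hardware corresponds to the single comparison on bits [12:11] here.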
Variable-Length Token Summary#
| Token Type | Prefix | Flits | Data | Offset | Act ID | Port |
|---|---|---|---|---|---|---|
| Dyadic wide | 00 | 2 | 16-bit | 8 (256) | 3 | flit1 |
| Monadic normal | 010 | 2 | 16-bit | 8 (256) | 3 | -- |
| Frame control | 011+00 | 2 | 16-bit payload | -- | 3 | -- |
| PE-local write | 011+01 | 2 | 16-bit | -- | 3 | -- |
| Monadic inline | 011+10 | 1 | none | 7 (128) | -- | -- |
| SM standard | 1 | 2 | 16-bit | 8-10 | -- | -- |
The common case is 2 flits. Inline monadic (1 flit) is the fast path for control-flow tokens. PE-local writes are 2 flits and infrequent during execution.
Key Design Rationale#
- Opcodes don't travel: tokens carry destination addresses, not opcodes. IRAM is fetched PE-locally. With the reversed pipeline (IFETCH before MATCH), the instruction word drives match behaviour. Instruction width is completely independent of bus width.
- 1-bit top-level split: bit[15] discriminates SM from CM traffic. One gate. The network routes on bit[15] + the 2-bit destination ID (PE_id or SM_id). Everything below that is endpoint-decoded.
- Hot path decode is shallow: dyadic wide (the dominant format) is identified by two bits (bit[15]=0, bit[14]=0). The PE can begin activation_id resolution (74LS670 lookup) and IRAM read in parallel on flit 1 latch.
- PE-local writes are unified: IRAM writes and frame writes share the same misc-bucket format (011+01), differing only in the region bit. No separate system token type for either.
- Compact activation ID: 3-bit act_id identifies one of up to 4 concurrent frames per PE, with sufficient ABA distance. The compact encoding leaves 8 bits for instruction offset in both dyadic and monadic hot-path formats.
- Width domains are independent: bus width (16-bit), token format (variable flit count), IRAM width (16-bit single-half read), and PE pipeline width (wider, decomposed) are each sized for their own constraints.
- Instruction deduplication: IRAM entries are templates shared across activations. Per-activation data (constants, destinations) lives in frames. The number of IRAM entries needed is unique operation shapes, not total operations executed.
Spare Bits and Future Use#
The spare bits are explicitly reserved, not accidentally unused:
- Frame control spare:3 — future candidates: extended op field (up to 4 lifecycle operations), diagnostic flags, priority level.
- Monadic inline spare:2 — future candidates: reduced act_id (2 bits), extended interpretation flag.
- Misc bucket sub=11 — entire format reserved for future use.
The spare bits provide escape hatches for architectural evolution without changing the base format. v0 should treat them as must-be-zero on transmit, ignored on receive.
Module Taxonomy#
CM (Control Module) — execution and matching#
- Instruction memory (IM / IRAM): stores activation-independent instruction templates (function bodies)
  - Width decoupled from bus: 16-bit single-half read (one 8-bit chip pair per PE). Instruction word format: [type:1][opcode:5][mode:3][wide:1][fref:6]. See pe-design.md for detailed field definitions and mode table.
  - Runtime-writable via PE-local write tokens (prefix 011+01, region=0)
  - Write from network stalls the pipeline (acceptable for config operations)
  - Enables runtime reprogramming and eliminates the need for a separate config bus
  - Constants and destinations live in per-activation frame slots, not in the instruction word. This doubles IRAM density and enables template deduplication across activations.
- Per-activation frame storage: each active computation gets a frame — a flat array of 16-bit SRAM slots holding pending match operands, constants, destinations, accumulators, and wide values. The instruction template references frame slots by index via the fref field.
  - Runtime-writable via PE-local write tokens (prefix 011+01, region=1)
  - Frame lifecycle managed by frame control tokens (prefix 011+00, ALLOC/FREE)
- Matching via 74LS670 register-file lookup: activation_id indexes a lookup table to get frame_id, then presence/port metadata in additional 670s is checked for dyadic matching. All combinational (~35 ns), no SRAM cycle needed. 8 matchable offsets per frame (assembler-enforced).
- Pipeline order: IFETCH then MATCH. The instruction word drives match behaviour; activation_id resolution runs in parallel with the IRAM read.
- Receives CM tokens (bit[15]=0) from CN and DN (SM results repackaged as CM tokens), produces tokens to CN and AN
- Each PE has a unique ID, set via EEPROM (instruction decoder doubles as ID store) or DIP switches during prototyping
- See pe-design.md for PE internals, pipeline staging, and frame details
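The matching path can be modelled behaviourally. This is a sketch, not the 670 wiring: the class and method names are hypothetical, but the widths follow the text (8 act_ids, 4 frames per PE, 8 matchable offsets per frame) and the lifecycle follows the ALLOC/FREE frame control tokens.

```python
class MatchUnit:
    """Behavioural model of the 670-based act_id -> frame_id lookup plus dyadic matching."""

    def __init__(self):
        self.frame_of = [None] * 8                       # 670 lookup: act_id -> frame_id
        self.presence = [[False] * 8 for _ in range(4)]  # per frame: 8 matchable offsets
        self.free_frames = [0, 1, 2, 3]

    def alloc(self, act_id):
        """ALLOC frame control token: bind act_id to the next free frame."""
        self.frame_of[act_id] = self.free_frames.pop(0)

    def free(self, act_id):
        """FREE: clear the tag store entry and return the frame to the pool."""
        f = self.frame_of[act_id]
        self.frame_of[act_id] = None
        self.presence[f] = [False] * 8
        self.free_frames.append(f)

    def dyadic_arrival(self, act_id, slot):
        """First operand sets the presence bit and waits; its partner clears it and fires."""
        f = self.frame_of[act_id]
        if self.presence[f][slot]:
            self.presence[f][slot] = False
            return ('fire', f)
        self.presence[f][slot] = True
        return ('wait', f)
```

In hardware all three lookups are combinational; the model only captures the state machine the presence bits implement.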
SM (Structure Memory) — data storage, structure operations, and IO#
- Banked data memory (cells) for arrays, lists, heap data
- Embedded functional units for structure operations (read, write, atomic RMW, bulk ops via EXEC/ITERATE/COPY_RANGE)
- Receives operation requests via AN (bit[15]=1 tokens), returns results via DN (repackaged as CM tokens)
- Operates asynchronously from CMs — split-phase memory access
- IO is memory-mapped into SM address space. An SM (typically SM00 at v0) maps IO devices into its address range. I-structure semantics provide natural interrupt-free IO: a READ from an IO device that has no data defers until data arrives.
- SM00 has bootstrap responsibility: wired to the system reset signal, it calls EXEC on a predetermined ROM address to load the system. At runtime, SM00 behaves as a standard SM; only the reset-vector wiring is special. See sm-design.md for details.
- Memory tiers: SM address space supports regions with different semantics — tier 0 (raw, no presence bits), tier 1 (I-structure), and tier 2 (wide/bulk with is_wide tag). Tier selection is by address range.
- See sm-design.md for interface, banking, and tier details
Two Logical Interconnects (shared physical bus for v0)#
CN (Communication Network): CM <-> CM, bit[15]=0
AN (Arbitration Network): CM -> SM, bit[15]=1
DN (Distribution Network): SM -> CM, SM results repackaged as bit[15]=0
For v0 (4 PEs + 2-4 SMs), all traffic shares a single physical 16-bit bus with bit[15]-based routing. Routing nodes inspect flit 1 (bit[15] plus the destination ID at bits [14:13] for SM tokens or [12:11] for CM tokens) and forward the entire multi-flit packet to the appropriate destination. Multiple packets can be in flight simultaneously if the bus is pipelined with latches at each stage.
The AN/DN can be split onto separate physical paths later if SM access contention becomes a bottleneck. The bit[15]-based routing means this is a topology change, not a protocol change — no module interfaces need to change.
See network-and-communication.md for routing, clocking, and scaling details.
Function Calls#
The Problem#
A function call in this architecture must solve:
- Code residency — callee instructions in IRAM on the right PEs.
- Activation isolation — fresh frame for the new activation.
- Argument injection — N argument values tagged into callee's activation.
- Return linkage — callee knows where to send results.
- Activation teardown — free frame(s) when activation completes.
How Frames Support Calls#
The frame-based design keeps function call machinery simple:
- Destinations are pre-formed. Output destinations are pre-formed flit 1 values stored in frame slots. Each destination already contains the target PE, offset, and act_id. No special override mechanism or context-mode encoding is needed in the instruction word.
- Call descriptors live in frames. Pre-formed flit 1 values (call descriptors, return addresses) are loaded into frame slots during activation setup. The callee's return address is just another frame constant, loaded before execution begins. No SM round-trip at call time.
- Frame allocation is PE-local. Activation setup sends an ALLOC frame control token to the target PE. The PE allocates a frame locally and returns a confirmation token. No SM round-trip for allocation.
Static Calls#
For non-recursive calls with compiler-known call graphs, all activation assignments are compile-time constants. The compiler assigns act_ids and frame slot layouts when laying out IRAM.
Argument passing: The caller's output instructions have their destination frame slots pre-loaded with flit 1 values targeting the callee's (PE, offset, act_id). Arguments flow across activation boundaries as normal token routing. No special call instructions — just frame setup.
Return: The callee's return instruction uses a destination read from its frame (mode 0 or 1). The return address is a frame constant loaded during activation setup — a pre-formed flit 1 value pointing back to the caller's (PE, offset, act_id). For single-call-site functions, this is a compile-time constant loaded once.
Multiple call sites: If a function is called from N sites, each call site loads different return routing into the callee's frame during setup. The callee's return instruction is identical across all call sites — it just reads frame[fref] for the destination, and the frame contents determine where the result goes. No per-call-site IRAM duplication needed.
Activation allocation: Compile-time for static calls. The compiler assigns non-overlapping act_id values. The ALLOC frame control token establishes the frame before any compute tokens arrive.
Activation teardown: The callee's final instruction (or a separate cleanup path) executes FREE_FRAME, which clears the tag store entry and returns the frame to the free pool.
Total overhead for static calls: frame setup tokens (ALLOC + PE-local writes to load constants and destinations). No extra compute instructions. The setup cost is proportional to the number of frame slots that differ between call sites.
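The setup stream a call site emits can be sketched as follows. The token tuples are illustrative placeholders, not the wire format; the structure (one ALLOC, then one PE-local write per differing frame slot) is from the text.

```python
def static_call_setup(pe, act_id, frame_inits):
    """Build the setup token stream for one static call site.

    frame_inits maps frame slot -> pre-formed value (constants, destinations,
    return routing). Setup cost is proportional to len(frame_inits).
    """
    tokens = [('ALLOC', pe, act_id)]
    for slot in sorted(frame_inits):
        tokens.append(('PE_LOCAL_WRITE', pe, act_id, slot, frame_inits[slot]))
    return tokens
```

Two call sites to the same function would differ only in `frame_inits`; the callee's IRAM entries are shared.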
Dynamic Calls#
For recursive calls, indirect calls (function pointers, trait objects), and functions with multiple call sites that need dynamic return routing.
Primitives:
| Primitive | Type | Hardware | Purpose |
|---|---|---|---|
| CHANGE_TAG | dyadic CM (mode 4/5) | ~4 chips/PE | Output routing from data operand |
| EXTRACT_TAG | monadic CM | ~2 chips/PE | Capture runtime act_id + offset as data |
| ALLOC | frame control token | 0 PE chips | Runtime frame allocation on target PE |
| FREE | frame control token | 0 PE chips | Frame deallocation |
CHANGE_TAG (mode 4/5): Left operand is a 16-bit packed tag (a pre-formed flit 1 value). Right operand is the data payload. Output token's flit 1 = left operand verbatim. Flit 2 = right operand. Enables sending a value to any destination computed at runtime. The output stage is a mux: frame dest vs left operand, selected by mode[2]. Hardware: left operand bypass latch (~2 chips) preserves the left operand value past the ALU. Stage 5 flit 1 mux (~2 chips) selects between assembled flit and raw data.
EXTRACT_TAG: Monadic instruction. Captures the executing token's identity as a 16-bit data value (a return continuation). The return offset comes from the frame (or an instruction-derived constant); PE_id from hardware; act_id from the pipeline activation latch. Output is a packed flit 1 value that can be stored in a callee's frame or passed to CHANGE_TAG for dynamic return routing.
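The two primitives reduce to simple functions on 16-bit values. A sketch: `change_tag` follows the text exactly (flit 1 = left operand verbatim, flit 2 = data); packing the continuation in monadic-normal layout is an assumption, since the text only requires EXTRACT_TAG to emit a pre-formed flit 1 value.

```python
def change_tag(left_tag, right_data):
    """CHANGE_TAG: output flit 1 is the left operand verbatim, flit 2 the data payload."""
    return [left_tag & 0xFFFF, right_data & 0xFFFF]

def extract_tag(pe_id, act_id, ret_offset):
    """EXTRACT_TAG: pack (PE, offset, act_id) as a 16-bit return continuation.

    Monadic-normal layout (prefix 010) is assumed for the packed tag.
    """
    return (0b010 << 13) | (pe_id << 11) | (ret_offset << 3) | act_id

# Dynamic return: caller captures its identity, callee routes the result back
ret = extract_tag(0, 2, 0x40)
packet = change_tag(ret, 1234)
assert packet == [ret, 1234]
```

The packed continuation can be stored in the callee's frame or fed straight to CHANGE_TAG, exactly as the call sequence below uses it.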
Call descriptor tables: Pre-formed flit 1 values for callee argument destinations, stored in SM at boot (loaded via EXEC). During activation setup, these descriptors are loaded from SM into the callee's frame slots via PE-local write tokens.
Runtime activation allocation: An ALLOC frame control token is sent to each target PE. The PE allocates a frame locally and returns a confirmation token. Purely PE-local — no SM round-trip required. Multiple PEs can allocate frames in parallel.
Dynamic Call Sequence#
Caller (PE0, act=2) calls foo(a, b) -> result dynamically:
[Setup phase — can be overlapped with caller computation]
ALLOC(PE1, act=5) ; allocate frame on callee's PE
PE-local write(PE1, act=5, slot=0, const_val) ; load constant
PE-local write(PE1, act=5, slot=1, dest_ret) ; load return routing
PE-local write(PE1, act=5, slot=2, dest_out) ; load output dest
[Argument injection — from caller's compute path]
EXTRACT_TAG ; pack (PE0, act=2, ret_offset) as data
-> store in callee frame via PE-local write to ret slot
Route arg_a to (PE1, offset_a, act=5) ; normal mode 0 output
Route arg_b to (PE1, offset_b, act=5) ; normal mode 0 output
Callee (PE1, act=5, receives args via normal matching):
; ... compute result ...
; return instruction is mode 0, reads dest from frame[ret_slot]
; frame[ret_slot] = pre-formed flit 1 targeting caller
; result routes back to caller automatically
[Teardown]
FREE_FRAME ; release callee's frame
Setup tokens can be pipelined. ALLOC, PE-local writes, and argument tokens can fire in parallel across different PEs. Effective critical path: ALLOC confirmation latency + argument delivery. Comparable to a conventional function call with register setup + jump.
Partial Execution#
The dataflow execution model supports Amamiya-style partial function execution naturally. If the callee's argument entry points are independent instructions (not a single multi-input "begin" node), arguments arriving early begin executing the callee's body before all arguments are present. No special hardware support — the compiler structures the callee's dataflow graph to expose this parallelism.
Tail Calls#
If the callee reuses the caller's frame (no ALLOC, no new act_id), the call is a tail call. The compiler routes arguments with the inherited activation. No CHANGE_TAG needed, no allocation, no teardown. Falls out of mode 0 (INHERIT) naturally — the caller's frame destinations simply point to the callee's instruction offsets within the same activation.
Output Token Context Source (Summary)#
The 3-bit mode field in the instruction word controls how flit 1 of the
output token is sourced:
- INHERIT (modes 0-3): flit 1 comes from a pre-formed destination in the frame. The frame constant IS the output flit. No token formation logic.
- CHANGE_TAG (modes 4-5): flit 1 comes from the left operand (a runtime-computed packed tag). Enables dynamic routing.
- SINK (modes 6-7): no output token. ALU result written back to a frame slot.
See pe-design.md for the full mode table with frame slot access patterns,
bit-level decode, and hardware cost.
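The three-way choice above amounts to a two-level mux on the mode bits. A minimal sketch of the selection implied by the mode groupings (INHERIT 0-3, CHANGE_TAG 4-5, SINK 6-7); the real decode and hardware cost are in pe-design.md.

```python
def output_flit1(mode, frame_dest, left_operand):
    """Select the flit-1 source for the output token from the 3-bit mode field."""
    if mode & 0b100 == 0:
        return frame_dest       # modes 0-3, INHERIT: pre-formed destination from the frame
    if mode & 0b010 == 0:
        return left_operand     # modes 4-5, CHANGE_TAG: runtime-computed packed tag
    return None                 # modes 6-7, SINK: no output token; result written to a frame slot
```

Note that INHERIT needs no token-formation logic at all: the frame constant is the output flit.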
Bus Protocol#
Bus Signals#
Bus signals (per link):
data:16 flit data
valid:1 flit is present (handshake)
ready:1 receiver can accept (backpressure)
more:1 more flits follow in this packet (framing)
Atomic Packet Delivery#
The output FIFO guarantees atomic packet delivery on the bus. A multi-flit token is written to the output FIFO as a complete unit. The FIFO drains flits to the bus sequentially, and the bus arbiter (or handshake protocol) holds the bus for the full packet duration.
The "more flits follow" signal (the more wire) accompanies the data
bus. The emitter asserts it on every flit except the last flit of a packet.
Routing nodes and receivers watch this signal to know when a packet is
complete. They do not need to decode the token type or inspect flit 1 —
framing is entirely at the bus level.
Format-Agnostic Network#
Routing nodes forward flits transparently. They latch flit 1 for routing
decisions, then forward subsequent flits (while more is asserted) to
the same destination. No per-flit routing lookup after the first flit.
This means the network is completely agnostic to token format. It does
not know or care whether a packet is 1, 2, 3, or 4 flits. It follows
the more signal. All format intelligence is at the endpoints (emitting
PE's formatter and receiving PE's deserialiser).
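The forwarding behaviour above can be sketched as a stream transformer: latch the destination on flit 1, stream until `more` drops. `route` stands in for the routing-node extraction; the model never inspects token format or packet length.

```python
def forward(flits_in, route):
    """Format-agnostic routing node: (flit, more) pairs in, (dest, flit) pairs out."""
    out, dest = [], None
    for data, more in flits_in:
        if dest is None:
            dest = route(data)   # routing decision on flit 1 only
        out.append((dest, data))
        if not more:
            dest = None          # packet complete; the next flit is a new flit 1
    return out
```

A 2-flit CM packet followed by a 1-flit SM packet routes correctly with no knowledge of either format:

```python
stream = [(0x0800, True), (0x1234, False), (0x8001, False)]
hops = forward(stream, lambda f: 'SM' if f >> 15 else 'CM')
# both CM flits go to the same destination; the SM flit starts a fresh packet
```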
See bus-interconnect-design.md for physical bus implementation and
network-and-communication.md for routing topology.
Transistor Budget Estimate (4-PE system)#
| Component | Transistors |
|---|---|
| 4x PE logic | 20-32K |
| Routing network (4 PEs) | 2-3K |
| SM bootstrap/EXEC sequencer | ~1-2K |
| Total logic | ~25-35K |
| SRAM chips (instruction mem, matching stores, token queues) | 8-16 chips |
Bootstrap is handled by SM00's EXEC sequencer reading pre-formed tokens from ROM, or by an external microcontroller during early prototyping.
IPC / Performance Expectations#
- "Superscalar" is the wrong term for dataflow — there's no single instruction stream
- With 4 PEs and pipelined execution, peak is 4 ops/clock
- Realistic sustained throughput depends on:
- Network crossing frequency (adds routing latency)
- Frame-based matching latency (combinational via 74LS670 register-file lookup — no SRAM cycle for match metadata)
- Available parallelism in the program
- Network contention (shared bus at v0 scale)
- Frame SRAM access scheduling (constant reads, destination reads, and operand stores each take one SRAM cycle)
- Parallel workloads (matrix multiply, FFT): near peak
- Sequential/pointer-chasing code: ~0.5-1 ops/clock (still competitive with 6502)
- Key insight: matching store performance is the primary bottleneck, as Manchester discovered. The 670-based approach moves match metadata off the SRAM bus entirely, enabling pipeline overlap between IRAM fetch and match resolution
Build Order#
Phase 0: SM (Structure Memory) — BUILD FIRST#
- Self-contained module, testable in isolation
- Drive with microcontroller (Arduino/RP2040) for testing
- Defined interface: receive operation request, process, return result
- Key deliverables:
- Banked SRAM with address decoding
- Simple operation unit (read/write at minimum, cons/car/cdr stretch goals)
- Input interface (receive request packets)
- Output interface (send result packets)
- Test harness: microcontroller sends requests, validates responses
Phase 1: CM (Control Module) — single PE#
- Instruction memory (SRAM, 16-bit single-half, one 8-bit chip pair)
- Frame storage (SRAM, shared chip pair with IRAM or separate)
- 74LS670-based activation_id resolution and match metadata (~8 chips)
- 16-bit ALU
- Token FIFO (input)
- Token output formatting (frame destination slots are pre-formed flit 1)
- Test with microcontroller injecting tokens, verify IFETCH -> MATCH pipeline, frame allocation/deallocation, and execution
Phase 2: CM + SM pair#
- Connect via shared bus with bit[15]-based routing
- Load a program using microcontroller (external, via IRAM write tokens or direct SRAM programming)
- Execute a dataflow graph that uses structure memory
- First real program: fibonacci, small FFT, or similar
Phase 3: Multi-module#
- Second CM, routing network
- Prove cross-PE token routing works
- Demonstrate actual parallel execution speedup
Phase 4: System#
- Expand to 4 CMs + 2-4 SMs
- SM00 bootstrap via EXEC from ROM
- IO memory-mapped into SM00 address space (UART, etc.)
- ISR equivalent via I-structure semantics: READ from IO device defers until data arrives, triggering the receiving node in the dataflow graph
- Performance benchmarking vs period-equivalent CPUs
Open Questions / Next Steps#
- SM internal design — banking scheme, bulk op sequencer, tier boundary configuration, wide pointer cell format (partially specified, see sm-design.md)
- Context slot count per CM — Resolved. 3-bit activation_id with 4 concurrent frames per PE. See pe-design.md.
- Instruction encoding — Resolved. 16-bit single-half format: [type:1][opcode:5][mode:3][wide:1][fref:6]. Constants and destinations live in frame slots. See pe-design.md for the full encoding and mode table. The emulator and assembler still use Python IntEnum values as opcode placeholders. These do NOT represent final hardware bit encodings — a hardware encoding pass mapping to the 5-bit opcode field is still needed.
- IO address space allocation — which SM_id is reserved for IO? How much of SM00's address space is mapped to IO vs general-purpose storage? SM00 is special only at boot for now; further specialisation deferred until profiling shows the standard opcodes are insufficient.
- Compiler / assembler — Resolved. The asm/ package implements a 7-stage assembler pipeline (parse -> lower -> expand -> resolve -> place -> allocate -> codegen). Produces PEConfig/SMConfig + seed tokens or a bootstrap token stream. The expand pass handles macro expansion (#macro definitions with ${param} substitution, variadic repetition, constant arithmetic) and function call wiring (cross-context edges, trampoline nodes, context teardown). Built-in macros for common patterns (counted loops, permit injection, reduction trees) are automatically available. See assembler-architecture.md for architecture. Grammar is dfasm.lark (Lark/Earley parser). Auto-placement via greedy bin-packing with locality heuristic. Remaining work: frame layout allocation, new token format support, optimisation passes, binary output.
- Mode B clock ratio — exactly 2x, or design for arbitrary integer ratios?
- Instruction residency — small IRAM per PE means programs larger than IRAM need runtime code loading. With 16-bit instructions, IRAM density has doubled (4096 instructions with bank switching on 8Kx8 SRAM), reducing pressure but not eliminating the issue for large programs. Code storage hierarchy: external storage -> SM -> IRAM. See the pe-design.md Instruction Residency section for detailed options.
- 74LS670 supply — the register-file lookup depends on 74LS670 availability. Fallback options exist (discrete flip-flops, SRAM-based indexing) but increase chip count or add latency. See pe-design.md for details.
- Assembler updates for frame model — the assembler needs frame layout allocation (assigning constants, destinations, match operands to frame slots), PE-local write token generation for frame setup, and enforcement of the 8-matchable-offset constraint.
Key Papers in Project#
- gurd1985.pdf — Manchester Dataflow Machine (matching unit details, overflow, pipeline)
- Dataflow_Machine_Architecture.pdf — Veen survey (comprehensive overview, matching space analysis)
- amamiya1982.pdf — DFM architecture (semi-CAM, structure memory, TTL prototype)
- 17407_17358.pdf — DFM evaluation (implementation details, benchmarks, VLSI projection)
- efficienthardwarearchitectureforfastipaddresslookup.pdf — Pao et al. (binary-trie partitioning, bit-vector parallel search, SRAM pipeline)
- mclaughlin2005.pdf — IP lookup survey (comparison of trie vs hash approaches in hardware)
- HighperformanceIPlookupcircuitusingDDRSDRAM.pdf — Yang et al. (hash + CAM overflow, DDR burst for multi-bank)
- NonStrict_Execution_in_Parallel_and_Distributed_C.pdf — non-strict execution, split-phase memory
- NATLS219821.pdf — National Semiconductor 100142 CAM chip (4x4-bit, reference for discrete CAM scale)
- MOSES071271.pdf — Motorola MCM69C233 CAM (32-bit match width, reference for CAM interface design)
- yuba1983.pdf — Yuba et al. (PE pipeline sections, pseudo-result handling, packet formats)