Dynamic Dataflow CPU - ALU & Output Formatter Design#
Covers the PE's functional unit: ALU operation set, instruction decoder, output token formation, and fan-out/routing logic.
See pe-design.md for overall PE pipeline and matching store.
See pe-design.md for the frame-based PE architecture,
iram-and-function-calls.md for instruction encoding and mode table,
and pe-pipelining-and-multiplexing.md for pipeline stages.
See bus-architecture-and-width-decoupling.md for token format definitions
and flit layouts.
See bus-interconnect-design.md for bus topology, loopback bypass, and
physical interconnect design.
Design Context#
The ALU and output formatter together form the "execute + emit" stages of the PE pipeline (stages 4 and 5 in the revised pipeline; see pe-pipelining-and-multiplexing.md). They are stateless relative to the rest of the PE — whatever control lines the IRAM instruction word sets up, plus whatever data the matching store (or token, for monadic) provides, produces the corresponding output after combinational gate delay. No microcode, no multi-cycle operations, no internal state that persists between instructions. The only sequential element is the output formatter's small FSM for serialising multi-token output sequences.
How This Differs From a Von Neumann ALU#
- No addressing arithmetic. No PC increment, no stack pointer, no base+offset calculations. All addressing is compile-time token routing, with destinations stored in per-activation frame slots rather than inline in the instruction word.
- No flags register. Comparison results are data values emitted as tokens. There is no shared condition code register persisting between instructions. Each instruction's boolean output travels as data through the token network, scoped to an activation. (A per-activation predicate store is a future option — see Future Extensions.)
- No branch prediction. Conditional execution is token routing: a SWITCH instruction sends data to one of two destinations based on a boolean input. The ALU does not interact with a branch predictor or program counter.
- Fan-out is explicit. An instruction can produce 0, 1, or 2 output tokens to different destinations. Fan-out is encoded as mode 2/3 in the instruction word; destinations are read from consecutive frame slots addressed by
fref. The output formatter serialises them into the output FIFO. - Constants come from frame slots. The IRAM instruction word contains no inline constant field. Constants of any width (16-bit, or 32-bit with
wide=1) live in frame slots, referenced by the instruction'sfrefindex. This separates the instruction template from per-activation data. - Comparison results are just data. A
cmpinstruction produces an 8-bit (or 1-bit packed in 8) result token that travels on the network like any other value. The downstream SWITCH or GATE instruction receives it as a normal operand.
Operation Set#
5-bit opcode (32 slots), ~16-18 used in v0 with room for expansion. Organised by functional group.
Arithmetic (dyadic unless noted)#
ADD A + B.
SUB A - B.
INC A + 1 (monadic). Uses adder carry-in, no second operand.
DEC A - 1 (monadic). Adder with inverted carry.
INC/DEC could be synthesized with ADD + CONST(1), but they're common
enough in loop counters and reference counting (SM rd_inc/rd_dec
return values that feed INC/DEC chains) that monadic single-token
variants save matching store entries and bus traffic.
No hardware multiply for v0. Multi-step multiply is synthesized in software as a sub-graph of shift-and-add instructions distributed across the PE network. A dedicated multiplier PE is a future option if arithmetic-heavy workloads demand it.
Logic (dyadic unless noted)#
AND Bitwise A & B.
OR Bitwise A | B.
XOR Bitwise A ^ B.
NOT Bitwise ~A (monadic).
Shift#
SHL Shift A left by N bits. N from frame constant (0-7).
SHR Logical shift A right by N bits. Zero-fill. N from frame constant.
ASR Arithmetic shift A right by N bits. Sign-extend. N from frame constant.
Shift amount comes from a frame constant slot (addressed by fref), not
a second operand. This means SHL/SHR/ASR are monadic operations — the
data comes from the token, the shift amount comes from the frame. Using
a frame constant rather than an IRAM immediate means the same instruction
template can shift by different amounts in different activations.
A data-dependent shift (shift by a runtime-computed amount) would require
a dyadic instruction; deferred for now — the compiler can decompose
variable shifts into conditional chains of fixed shifts if needed.
Hardware: 3-stage mux barrel shifter (shift by 1, 2, 4). Gives arbitrary 0-7 bit shifts in a single combinational pass through three layers of 2:1 muxes. Each stage is controlled by one bit of the 3-bit shift amount.
Comparison (dyadic)#
EQ A == B. Result: 0x01 if true, 0x00 if false.
LT A < B (signed). Result: 0x01 / 0x00.
LTE A ≤ B (signed). Result: 0x01 / 0x00.
GT A > B (signed). Result: 0x01 / 0x00.
GTE A ≥ B (signed). Result: 0x01 / 0x00.
All comparisons produce a result token with value 0x0000 or 0x0001. This token is data — it enters the network and arrives at a downstream SWITCH or GATE instruction as a normal operand. The bool_out signal (internal to the PE, derived from the comparator output) is also available to the output formatter for same-instruction conditional routing, but the primary mechanism is to emit the boolean as a data token.
⚠ Note: The emulator operates with 16-bit data throughout, so comparison results are 16-bit values (0x0000 or 0x0001). The 8-bit hardware datapath sketch below produces 8-bit results (0x00 or 0x01). Both representations are semantically identical — only bit 0 carries information.
All comparisons interpret operands as signed 2's complement. The comparator hardware (74LS85) natively does unsigned magnitude comparison. Signed comparison is derived by XORing the sign bits of both inputs before feeding the comparator. LTE and GTE are synthesised from the existing comparator outputs (LTE = LT OR EQ, GTE = GT OR EQ) via the decoder EEPROM's control signal selection.
Routing (dyadic)#
Branch operations (BREQ, BRGT, BRGE, BROF): compare L and R operands, then route the data to the taken or not-taken destination. Both destinations receive the data value; the branch condition selects which destination gets it first (taken) vs second (not-taken).
Switch operations (SWEQ, SWGT, SWGE, SWOF): compare L and R operands, then route data to the taken side and a trigger token (value 0) to the not-taken side. See the Output Formatter section for the not-taken trigger semantics (FREE vs NOOP).
GATE Conditional pass/suppress. Takes data (port L) and boolean (port R).
If bool == true: emit data to dest1 (and optionally dest2).
If bool == false: emit nothing. Both tokens consumed silently.
SEL Select between inputs based on a condition.
MRGE Merge two token streams (non-deterministic).
Branch and switch ops always emit exactly 2 tokens: one data token to the taken side, one token to the not-taken side (data for branch, inline monadic trigger for switch). See the Output Formatter section for details.
GATE emits 0 or 1-2 tokens depending on the boolean. When suppressed (bool false), both input operands are consumed with no output — the computation terminates at this point in the graph.
Data Movement#
PASS Identity (monadic). Output = input. Used for routing tokens
through a PE to fan out to multiple destinations without
transforming the data.
CONST Constant load (monadic). Output = frame constant value (from
frame slot at fref). Input token is a trigger only — its data
value is ignored. Natural consumer of inline monadic tokens.
Frame Management#
FREE_FRAME Deallocate frame (monadic). Clears the tag store entry for
the executing activation and returns the physical frame to
the free pool. Compiler-inserted on every function exit
path. The input token is a trigger; no data output. Output
mode is SUPPRESS (no tokens emitted).
ALLOC_REMOTE Construct and emit a frame control token targeting another
PE (monadic). Target PE and activation_id come from frame
constants. Used for remote function call setup — the
calling PE allocates a frame on the callee PE before
sending operands.
Operation Set Summary#
⚠ Preliminary: The binary opcode encodings below are a draft layout, not a committed hardware encoding. The Python emulator and assembler use IntEnum ordinal values that do NOT correspond to these bit patterns. Final hardware encoding will be determined during physical build.
| Opcode | Mnemonic | Arity | Output Mode | Description |
|---|---|---|---|---|
| 00000 | ADD | dyadic | mode-driven | A + B |
| 00001 | SUB | dyadic | mode-driven | A - B |
| 00010 | INC | monadic | mode-driven | A + 1 |
| 00011 | DEC | monadic | mode-driven | A - 1 |
| 00100 | AND | dyadic | mode-driven | A & B |
| 00101 | OR | dyadic | mode-driven | A | B |
| 00110 | XOR | dyadic | mode-driven | A ^ B |
| 00111 | NOT | monadic | mode-driven | ~A |
| 01000 | SHL | monadic | mode-driven | A << N (frame const) |
| 01001 | SHR | monadic | mode-driven | A >> N (frame const, logical) |
| 01010 | ASR | monadic | mode-driven | A >> N (frame const, arith) |
| 01011 | EQ | dyadic | mode-driven | A == B -> bool |
| 01100 | LT | dyadic | mode-driven | A < B signed -> bool |
| 01101 | LTE | dyadic | mode-driven | A <= B signed -> bool |
| 01110 | GT | dyadic | mode-driven | A > B signed -> bool |
| 01111 | GTE | dyadic | mode-driven | A >= B signed -> bool |
| 10000 | BREQ | dyadic | SWITCH | branch if L == R |
| 10001 | BRGT | dyadic | SWITCH | branch if L > R |
| 10010 | BRGE | dyadic | SWITCH | branch if L >= R |
| 10011 | BROF | dyadic | SWITCH | branch on overflow |
| 10100 | SWEQ | dyadic | SWITCH | switch if L == R |
| 10101 | SWGT | dyadic | SWITCH | switch if L > R |
| 10110 | SWGE | dyadic | SWITCH | switch if L >= R |
| 10111 | SWOF | dyadic | SWITCH | switch on overflow |
| 11000 | GATE | dyadic | GATE | pass or suppress by bool |
| 11001 | SEL | dyadic | mode-driven | select between inputs |
| 11010 | MRGE | dyadic | mode-driven | merge two token streams |
| 11011 | PASS | monadic | mode-driven | identity |
| 11100 | CONST | monadic | mode-driven | output = frame constant |
| 11101 | FREE_FRAME | monadic | SUPPRESS | deallocate frame |
| 11110 | ALLOC_REMOTE | monadic | special | remote frame alloc |
| 11111 | — | — | — | reserved for expansion |
⚠ Preliminary: The binary opcode encodings above are a draft layout, not a committed hardware encoding. The Python emulator and assembler use IntEnum ordinal values that do NOT correspond to these bit patterns. Final hardware encoding will be determined during physical build.
The output mode column indicates the default. "Mode-driven" means the
instruction's 3-bit mode field (from the IRAM instruction word)
determines whether the output is single (modes 0-1), fan-out (modes
2-3), change-tag (modes 4-5), or sink (modes 6-7). See the Output
Formatter section for the full mode table.
Branch vs Switch: Branch operations (BR*) compare their two operands and route the data value to the taken or not-taken destination. Switch operations (SW*) are similar but emit a trigger token (value 0) to the not-taken side instead of the data value. See the Output Formatter section for the not-taken trigger semantics.
Reserved opcode space (1 slot): candidates include MAP_PAGE
(future, when 610 is installed; writes '610 mapping register),
SET_PAGE (future, when 610 is installed; selects active IRAM bank
via page latch -- see pe-design.md IRAM Bank Switching section),
hardware multiply, predicate store read/write, and debug/trace
instructions. Constants for MAP_PAGE and SET_PAGE come from frame
slots. MAP_PAGE + SET_PAGE together cost 2 opcode slots but
enable banked IRAM with one '610 + one latch per PE; these are not
part of the v0 opcode set (v0 uses a jumper wire in place of the
610).
SM Instruction Dispatch#
The operation set above covers CM compute instructions (IRAM type
bit = 0). When type = 1, the PE dispatches an SM operation instead
of an ALU computation. CM and SM instructions share the same 16-bit
instruction format ([type:1][opcode:5][mode:3][wide:1][fref:6]) but
with independent 5-bit opcode spaces.
In SM mode:
- The ALU is bypassed (or used for address arithmetic in indexed ops)
- Stage 4 constructs SM token flits from the SM opcode, operand data, and frame slot contents (SM target address, return routing)
- Stage 5 emits the SM token to the target SM via the SM bus
SM parameters (target SM_id, cell address, return routing) come from
frame slots addressed by fref, not from inline IRAM fields. A frame
target slot packs [SM_id:2][addr:8-10][spare] in 16 bits. Return
routing occupies the next consecutive frame slot as a pre-formed
response token flit 1.
The flit 2 source for SM tokens is selected by a mux driven by the instruction decoder:
source select use case
------- ------ -----------------------------------------
ALU out 00 SM write (data passes through ALU). Default.
R oper 01 SM scatter write (R operand bypasses ALU to flit 2).
SM CAS flit 2 (expected value = L operand).
Frame 10 SM read / rd_inc / rd_dec / raw_rd / alloc return
routing (frame[fref+1] = pre-formed response flit 1).
(spare) 11 reserved
See iram-and-function-calls.md for the full SM operation mapping
table, CAS 3-flit encoding, and indexed address computation.
Instruction Decoder#
The instruction decoder translates the opcode field from the IRAM word into ALU control signals and output formatter control. It is implemented as a single EEPROM used as a lookup table — the opcode bits address the EEPROM, the data outputs are the control signals.
Decoder EEPROM#
Address inputs (active during stage 2, instruction fetch):
opcode:5 from IRAM instruction word
type:1 CM (0) vs SM (1) instruction
mode:3 output/frame-access mode from IRAM instruction word
Data outputs (active during stage 4, execute + output):
alu_func:3 selects ALU functional unit (add/sub/logic/shift/cmp/pass)
alu_sub_mode:2 sub-function (e.g., AND vs OR vs XOR within logic unit)
alu_src_sel:1 left input source: port_L data vs frame constant
carry_in:1 for SUB, DEC (invert + carry), or INC (carry-in = 1)
shift_dir:1 left vs right (for shift unit)
shift_arith:1 arithmetic vs logical (for right shifts)
cmp_signed:1 signed vs unsigned comparison mode
output_enable:1 modes 0-3: emit output token(s)
output_fanout:1 modes 2-3: emit second output token
sink_write_en:1 modes 6-7: write ALU result to frame[fref]
output_data_sel:1 result source: ALU output vs passthrough input (for SWITCH/PASS)
flit2_source:2 SM flit 2 mux: ALU out (00) / R operand (01) / Frame (10)
bool_out_en:1 whether this instruction produces a meaningful bool_out
---
Total: ~18 output bits
The output mode encoding is now driven by the 3-bit mode field from
the instruction word, replacing the old has_dest2 + ctx_mode
combination. The mode field feeds directly into the EEPROM address
alongside the opcode, so the EEPROM output signals reflect the full
mode semantics in a single lookup.
One EEPROM chip (e.g., 28C64: 8K x 8, or 28C256: 32K x 8). With opcode (5 bits) + type (1 bit) + mode (3 bits) = 9 address bits, a 28C64 (13 address lines) leaves 4 spare address lines. These can serve as:
- PE_id inputs (2-3 bits): the same EEPROM encodes per-PE customised decode tables. Allows heterogeneous PEs from identical circuit boards with different EEPROM contents.
- bool_out input (1 bit): enables GATE suppression and SWITCH routing decisions to be resolved in the EEPROM lookup rather than downstream logic.
The EEPROM is purely combinational from the pipeline's perspective — the address is stable (latched from IRAM read), the output settles within EEPROM access time (~150ns for standard parts, well within a pipeline stage at 5-10 MHz).
ALU Hardware#
⚠ Tentative: The 8-bit vs 16-bit ALU datapath decision is not fully resolved. The emulator operates at 16-bit. The 8-bit design below is the current hardware sketch for minimising chip count; the 16-bit variant (see below) roughly doubles the datapath. Final decision depends on transistor budget during physical build.
8-bit Datapath (v0)#
┌──────────────────────────────────────────┐
data_L ───────►│ source mux (data_L vs frame constant) │
└──────────────┬───────────────────────────┘
│ A input
V
┌──────────────────────────────────────────┐
│ Adder/Subtractor │
data_R ──┬────►│ 2x 74LS283 (4-bit CLA, cascaded) │──► add_result
│ │ + 2x 74LS86 (XOR for SUB: invert B) │
│ │ carry_in from decoder │
│ └──────────────────────────────────────────┘
│
│ ┌──────────────────────────────────────────┐
├────►│ Logic Unit │
│ │ 2x 74LS32 (OR) │
A input ─┤ │ (XOR reused from adder) │
│ │ sub-mode mux selects AND/OR/XOR output │
│ └──────────────────────────────────────────┘
│
│ ┌──────────────────────────────────────────┐
├────►│ Barrel Shifter │
│ │ 3 stages of 2x 74LS157 (quad 2:1 mux) │──► shift_result
│ │ stage 0: shift by 1 if imm[0] │
│ │ stage 1: shift by 2 if imm[1] │
│ │ stage 2: shift by 4 if imm[2] │
│ │ shift_dir and shift_arith from decoder │
│ └──────────────────────────────────────────┘
│
│ ┌──────────────────────────────────────────┐
└────►│ Comparator │
│ 2x 74LS85 (4-bit magnitude, cascaded) │──► cmp_result (0x00/0x01)
A input ──────►│ sign-bit XOR for signed mode │ + bool_out (1 bit)
└──────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────┐
│ Result Mux │
│ 2x 74LS157 (quad 2:1 mux, cascaded for 8-bit 4:1 select) │
│ │
│ alu_func:3 ──► selects: add_result / logic_result / │
│ shift_result / cmp_result / passthrough │
└──────────────────────────────────────────┬────────────────────────┘
│
V
result:8
bool_out:1
Chip Count (8-bit)#
Instruction decoder EEPROM 1
Adder: 2x 74LS283 (4-bit CLA) 2
SUB inversion: 2x 74LS86 (XOR, shared with logic unit) 2
Logic AND: 2x 74LS08 2
Logic OR: 2x 74LS32 2
Barrel shifter: 6x 74LS157 (3 stages x 2 chips) 6
Comparator: 2x 74LS85 (4-bit magnitude, cascaded) 2
Result mux: 2x 74LS157 (or 74LS153 for 4:1) 2
Source mux: 2x 74LS157 (A input: data_L vs frame const) 2
─────────
Total ALU: ~21 chips
Without barrel shifter (shift-by-1 only): ~15 chips. The barrel shifter is the single most expensive functional unit but is worth it — bit manipulation without multi-bit shifts is very painful, and shift-by-1 chains through the pipeline are slow (each shift costs a full token round-trip).
16-bit Datapath (future)#
Roughly double the datapath chips: 4 adder, 4 XOR, 4 AND, 4 OR, 12 shifter muxes, 4 comparator, 4 result mux, 4 source mux. The decoder EEPROM stays the same. Estimate: ~38-42 chips. Buildable but substantially larger.
ALU Outputs#
The ALU produces these signals, consumed by the output formatter:
result:8 computed data value (or passthrough for SWITCH/PASS)
bool_out:1 boolean/comparison result:
- from comparator for EQ/LT/LTE/GT/GTE
- from comparator for BR*/SW* (inline comparison)
- from data_R bit 0 for GATE (boolean input)
- undefined for arithmetic/logic ops
bool_out for SWITCH and GATE comes from the boolean operand (port R by convention), not from the ALU computation. The decoder's output_data_sel signal controls whether result carries the ALU output or the passthrough data_L input (for SWITCH, the data being routed).
Output Formatter#
The output formatter takes the ALU result, frame slot data, and control signals from the decoder, and produces 0-2 output tokens serialised into the PE's output FIFO. Under the frame-based architecture, the formatter is significantly simpler than the original design: destination routing is a pre-formed flit 1 read from a frame slot, not assembled field-by-field from IRAM destination bits.
Inputs#
From ALU:
result:8/16 computed or passthrough data value
bool_out:1 boolean signal for SWITCH/GATE
From frame SRAM (read during stage 5):
dest1:16 pre-formed flit 1 from frame[fref] or frame[fref+1]
dest2:16 (fan-out modes only) from frame[fref+1] or frame[fref+2]
From left operand bypass latch (modes 4-5, CHANGE_TAG):
left_operand:16 entire flit 1 comes from left operand data
From IRAM instruction word:
mode:3 3-bit output/frame-access mode
From pipeline latches:
act_id:3 activation_id of executing token (for EXTRACT_TAG
and similar operations)
From decoder EEPROM:
output_enable:1 emit output token(s) (modes 0-3)
output_fanout:1 emit second output token (modes 2-3)
sink_write_en:1 write ALU result to frame (modes 6-7)
output_data_sel:1 result = ALU output vs passthrough input
flit2_source:2 SM flit 2 source mux select
bool_out_en:1 whether this instruction produces a meaningful bool_out
See iram-and-function-calls.md for the 16-bit instruction word
layout, mode table, and frame slot formats.
Output Modes#
The 3-bit mode field from the instruction word drives output behaviour.
All 8 encodings are useful; there are no wasted mode values.
mode tag behaviour frame access at fref output behaviour
---- ------------- ------------------------- --------------------------
0 INHERIT read dest single output, no constant
1 INHERIT read const, read dest single output with constant
2 INHERIT read dest1, read dest2 fan-out, no constant
3 INHERIT read const, dest1, dest2 fan-out with constant
4 CHANGE_TAG (none) dynamic routing, no constant
5 CHANGE_TAG read const dynamic routing with constant
6 SINK write result -> frame[fref] no output token
7 SINK+CONST read frame[fref], local accumulate / RMW
write result -> frame[fref]
Bit-level decode (directly from mode[2:1:0]):
output_enable = NOT mode[2] modes 0-3: read dest, emit token
change_tag = mode[2] AND NOT mode[1] modes 4-5: routing from left operand
sink = mode[2] AND mode[1] modes 6-7: no output, write to frame
has_const = mode[0] modes 1, 3, 5, 7: read constant
has_fanout = mode[1] AND NOT mode[2] modes 2-3: read two destinations
SUPPRESS: not a mode encoding but an opcode-driven override. FREE_FRAME opcode, or GATE with bool_out=false, forces output_enable=0 and sink_write_en=0. Both input tokens consumed silently, pipeline advances immediately.
INHERIT, single output (modes 0-1): one token to dest read from frame. One formatter cycle. The frame slot provides the complete flit 1 (pre-formed with PE, offset, act_id, port). The ALU result becomes flit 2.
INHERIT, fan-out (modes 2-3): two tokens, same data, different destinations. Two formatter cycles. Dest1 from frame[fref] (or frame[fref+1] if has_const), then dest2 from the next consecutive slot. Both carry the ALU result as flit 2.
CHANGE_TAG (modes 4-5): one token whose flit 1 is the left operand data (a 16-bit packed tag), bypassing normal frame dest reads. No frame destination read needed. The ALU result becomes flit 2. Used for return routing where the return address was carried as data through the function body.
SINK (modes 6-7): no output token. The ALU result is written back to frame[fref]. Mode 7 (SINK+CONST) reads frame[fref] as a constant input to the ALU first, then writes the result back — a read-modify- write cycle for frame-local accumulation.
SWITCH: branch and switch opcodes (BR*, SW*) override the normal mode semantics for destination selection. Two tokens, different roles. Two formatter cycles.
Cycle 0 -- data token (taken side):
destination: bool_out ? dest1 : dest2 (from frame slots)
flit 1: pre-formed dest from frame
flit 2: ALU result (passthrough data_L for SWITCH)
Cycle 1 -- trigger token (not-taken side):
destination: bool_out ? dest2 : dest1 (from frame slots)
format: ALWAYS monadic inline (prefix 011+10), regardless
of the destination's original token format
data: none (1-flit token)
The formatter always emits an inline monadic trigger to the not-taken side. Whether that trigger causes a NOOP, a FREE_FRAME, or anything else depends on the instruction at the target offset in the target PE's IRAM. This is a compiler decision, not a hardware one.
GATE Behaviour#
GATE is simpler than SWITCH:
bool_out = true (gate passes):
Same as modes 0-3 depending on instruction mode.
Data token(s) emitted normally.
bool_out = false (gate suppresses):
Output becomes SUPPRESS.
No tokens emitted. Both inputs consumed.
The decoder can implement this by having GATE's output_enable be conditional on bool_out. In the EEPROM this is achieved by including bool_out as an address input, so the EEPROM output directly reflects the suppression decision.
Formatter State Machine#
The formatter is a small FSM (2-4 states) that serialises output tokens into the PE's output FIFO. Frame SRAM reads for destinations occur in stage 5 and drive the flit 1 output directly.
States:
IDLE -- waiting for ALU completion signal
EMIT_FIRST -- read dest1 from frame SRAM, push flit 1 + flit 2
EMIT_SECOND -- read dest2 from frame SRAM, push flit 1 + flit 2
(fan-out modes only)
SINK_WRITE -- write ALU result to frame[fref]
(sink modes only)
Transitions:
IDLE -> EMIT_FIRST on ALU done, if output_enable
IDLE -> SINK_WRITE on ALU done, if sink_write_en
IDLE -> IDLE on ALU done, if SUPPRESS
EMIT_FIRST -> EMIT_SECOND if has_fanout or SWITCH mode
EMIT_FIRST -> IDLE otherwise
EMIT_SECOND -> IDLE always
SINK_WRITE -> IDLE always
Each EMIT state takes 1-2 bus cycles depending on whether the output token is 1-flit (inline monadic) or 2-flit (all other formats). The formatter pushes flits to the output FIFO sequentially. The pipeline is stalled during EMIT and SINK_WRITE states.
EMIT_FIRST reads dest1 from frame SRAM (1 cycle). EMIT_SECOND reads dest2 from the next frame slot (1 additional cycle). For CHANGE_TAG modes, no frame read occurs — flit 1 is sourced from the left operand bypass latch, so EMIT_FIRST completes without an SRAM access.
Token Packing Mux#
Under the frame-based architecture, the token packing mux is dramatically simpler than the original design. Flit 1 no longer requires field-by-field assembly of PE, offset, act_id, port, and gen from separate sources. Instead:
Flit 1 source (2:1 mux, selected by mode[2]):
mode[2] = 0 (INHERIT, modes 0-3):
flit 1 = frame dest slot (pre-formed, read from SRAM)
format: [prefix:2-3][port:0-1][PE:2][offset:8][act_id:3]
The frame slot IS the output flit. No repacking needed.
mode[2] = 1, mode[1] = 0 (CHANGE_TAG, modes 4-5):
flit 1 = left operand data (16-bit packed tag from bypass latch)
Flit 2 (always):
flit 2 = ALU result (16-bit data value)
The packing mux is physically a single layer of 2:1 muxes selecting
between frame dest and left operand, controlled by mode[2]. This
replaces the multi-chip format-dependent field-selection mux from the
original design. Estimate: 2-4 TTL chips (2-4x 74LS157 for 16-bit
2:1 mux).
For SWITCH cycle 1 (not-taken trigger), the mux is forced to monadic inline format (prefix 011+10). The formatter extracts the PE and offset fields from the frame dest slot and packs them into the inline format. This override is a small amount of combinational logic during EMIT_SECOND when in SWITCH mode.
For SM operations, the flit 2 source mux selects between ALU output, R operand, and frame slot data (see SM Instruction Dispatch above). This adds one additional 3:1 mux layer on the flit 2 path (~4-6 chips for 16-bit width).
Source of Output Routing#
Output routing is controlled by the 3-bit mode field. There are two
routing modes (the old CTX_OVRD mode is eliminated):
-
INHERIT (modes 0-3): routing comes from the frame destination slot. The dest slot contains a pre-formed flit 1 with PE, offset, act_id, and port all baked in at frame setup time. The compiler pre-computes the complete destination routing and loads it into the frame during activation setup. Cross-activation routing (the old CTX_OVRD use case) is handled by storing the appropriate act_id in the frame constant at setup time — no separate mode needed.
-
CHANGE_TAG (modes 4-5): routing comes from the left operand data, which carries a 16-bit packed tag (a complete flit 1 value). Used for dynamic return routing where the return address was carried as data through the function body. The ALU result becomes flit 2.
Pipeline latches still carry act_id (3 bits) through the pipeline
for use in EXTRACT_TAG and similar operations that need to reference
the executing activation's identity.
Bus Framing and Non-Interleaving#
The output FIFO guarantees atomic packet delivery on the bus. A multi-flit token is written to the output FIFO as a complete unit. The FIFO drains flits to the bus sequentially, and the bus arbiter (or handshake protocol) holds the bus for the full packet duration.
A "more flits follow" signal (1 wire) accompanies the data bus. The emitter asserts it on every flit except the last flit of a packet. Routing nodes and receivers watch this signal to know when a packet is complete. They do not need to decode the token type or inspect flit 1's inline bit — framing is entirely at the bus level.
Bus signals (per link):
data:16 flit data
valid:1 flit is present (handshake)
ready:1 receiver can accept (backpressure)
more:1 more flits follow in this packet (framing)
Routing nodes forward flits transparently. They latch flit 1 for routing
decisions, then forward subsequent flits (while more is asserted) to
the same destination. No per-flit routing lookup after the first flit.
This means the network is completely agnostic to token format. It does
not know or care whether a packet is 1, 2, 3, or 4 flits. It follows
the more signal. All format intelligence is at the endpoints (emitting
PE's formatter and receiving PE's deserialiser).
Local Token Routing (v0 vs Future)#
v0: all tokens go through the external bus, even when source PE == destination PE. The output formatter forms the token, pushes it to the output FIFO, it goes out on the bus, the bus routes it back to the same PE's input FIFO. Full round-trip.
This is deliberately simple. It avoids:
- A second write port on the input FIFO (arbitration between bus input and local loopback)
- Priority logic (should local tokens jump the queue ahead of external?)
- Bypass detection (comparing output dest_PE against own PE_id, which is trivial logic but adds a path)
The cost is latency: a local token pays the full bus round-trip instead of staying internal. For v0 with 1-2 PEs, the bus is mostly idle anyway.
Future optimisation: local bypass. The formatter checks dest_PE == own PE_id. If match, the token is written directly to the input FIFO (or a dedicated local loopback FIFO) without touching the external bus. This is the monadic-chain optimisation from the Papadopoulos & Traub paper — if A feeds B feeds C and all are on the same PE, keeping the token internal saves 3 bus round-trips. Significant latency win for sequential chains.
See bus-interconnect-design.md for the physical implementation of the
loopback bypass design, including FIFO arbitration, priority logic, and
chip-level detail.
Hardware Summary (v0, 8-bit datapath)#
Component Chips
─────────────────────────────────────────────────────────
Instruction decoder EEPROM 1
Source mux (data_L vs frame const) 2
Adder (2x 74LS283) 2
SUB/XOR (2x 74LS86) 2
AND (2x 74LS08) 2
OR (2x 74LS32) 2
Barrel shifter (6x 74LS157, 3 stages) 6
Comparator (2x 74LS85) 2
Result mux (2x 74LS157) 2
Output data mux (result vs passthrough, 2x 74LS157) 2
Flit 1 source mux (frame dest vs left operand) 2-4
Flit 2 source mux (ALU / R oper / frame, for SM ops) 4-6
Formatter FSM (state machine + SRAM addr control) 2
─────────────────────────────────────────────────────────
Total ALU + output formatter: ~31-35 chips
The output formatter's flit 1 path is simpler than the original design (a single 2:1 mux layer replacing the multi-format field-assembly mux), but the flit 2 path adds a 3:1 source mux for SM operations. The net chip count is comparable.
This does not include the output FIFO (shared with bus interface), the
matching store, frame SRAM, or IRAM — those are other parts of the PE.
See pe-pipelining-and-multiplexing.md and pe-design.md for the
full per-PE chip count estimate.
Without barrel shifter: ~25-29 chips.
Future Extensions (not in v0)#
Per-Activation Predicate Store#
A small RAM (e.g., 4-entry x 4-bit, one entry per concurrent frame) indexed by frame_id. Comparison instructions optionally write their bool_out to a named predicate channel (2-bit channel select from the instruction word). SWITCH and GATE instructions optionally read a predicate channel as their control input instead of consuming a boolean data token.
This enables three-logical-input operations (data_L, data_R, control) without triadic matching — the control arrives via the predicate store rather than as a third token. Eliminates one token hop for the common compare-then-branch pattern.
The predicate store is scoped per-activation (indexed by frame_id), so concurrent activations on the same PE do not interfere. Four predicate channels per frame handle up to four levels of nested conditional control flow; deeper nesting requires the compiler to decompose via two-stage token routing.
Hardware cost: 1 chip (small RAM or 74LS670) + 1 chip (read mux) + minor decode.
Shared Predicate Register (Sequential Mode)#
A single shared predicate register (not per-activation) for use in a "pseudo-sequential" execution mode where the compiler guarantees only one activation is using the predicate at a time. Useful for maximising locality on hot loops. Simpler than the per-activation store but dangerous if the single-activation invariant is violated. Purely an optimisation for specific code patterns, not a general mechanism.
Triadic Matching#
Matching store entries gain a second presence bit and second data slot. Three tokens must arrive before the instruction fires. Adds complexity to the matching logic (3-way port arbitration, wider data read-out) and roughly doubles match data storage width for triadic-capable entries. Deferred unless profiling shows the two-stage decomposition overhead is a real bottleneck.
Local Bypass#
Output formatter detects dest_PE == own PE and routes the token directly
to the input FIFO without external bus round-trip. Significant latency
reduction for PE-local computation chains. See Local Token Routing above
and bus-interconnect-design.md for physical implementation details.
Hardware Multiply#
Dedicated multiply unit or multiply-capable PE. Shift-and-add is feasible as a software subgraph, but a hardware 8x8 -> 16-bit multiplier would dramatically accelerate DSP-style workloads. Could be implemented as a specialised PE in the network rather than adding multiply logic to every PE.