Dynamic Dataflow CPU - ALU & Output Formatter Design#

Covers the PE's functional unit: ALU operation set, instruction decoder, output token formation, and fan-out/routing logic.

See pe-design.md for overall PE pipeline and matching store. See pe-design.md for the frame-based PE architecture, iram-and-function-calls.md for instruction encoding and mode table, and pe-pipelining-and-multiplexing.md for pipeline stages. See bus-architecture-and-width-decoupling.md for token format definitions and flit layouts. See bus-interconnect-design.md for bus topology, loopback bypass, and physical interconnect design.

Design Context#

The ALU and output formatter together form the "execute + emit" stages of the PE pipeline (stages 4 and 5 in the revised pipeline; see pe-pipelining-and-multiplexing.md). They are stateless relative to the rest of the PE — whatever control lines the IRAM instruction word sets up, plus whatever data the matching store (or token, for monadic) provides, produces the corresponding output after combinational gate delay. No microcode, no multi-cycle operations, no internal state that persists between instructions. The only sequential element is the output formatter's small FSM for serialising multi-token output sequences.

How This Differs From a Von Neumann ALU#

No addressing arithmetic. No PC increment, no stack pointer, no base+offset calculations. All addressing is compile-time token routing, with destinations stored in per-activation frame slots rather than inline in the instruction word.
No flags register. Comparison results are data values emitted as tokens. There is no shared condition code register persisting between instructions. Each instruction's boolean output travels as data through the token network, scoped to an activation. (A per-activation predicate store is a future option — see Future Extensions.)
No branch prediction. Conditional execution is token routing: a SWITCH instruction sends data to one of two destinations based on a boolean input. The ALU does not interact with a branch predictor or program counter.
Fan-out is explicit. An instruction can produce 0, 1, or 2 output tokens to different destinations. Fan-out is encoded as mode 2/3 in the instruction word; destinations are read from consecutive frame slots addressed by fref. The output formatter serialises them into the output FIFO.
Constants come from frame slots. The IRAM instruction word contains no inline constant field. Constants of any width (16-bit, or 32-bit with wide=1) live in frame slots, referenced by the instruction's fref index. This separates the instruction template from per-activation data.
Comparison results are just data. A cmp instruction produces an 8-bit (or 1-bit packed in 8) result token that travels on the network like any other value. The downstream SWITCH or GATE instruction receives it as a normal operand.

Operation Set#

5-bit opcode (32 slots), ~16-18 used in v0 with room for expansion. Organised by functional group.

Arithmetic (dyadic unless noted)#

ADD      A + B.
SUB      A - B.
INC      A + 1 (monadic). Uses adder carry-in, no second operand.
DEC      A - 1 (monadic). Adder with inverted carry.

INC/DEC could be synthesized with ADD + CONST(1), but they're common enough in loop counters and reference counting (SM rd_inc/rd_dec return values that feed INC/DEC chains) that monadic single-token variants save matching store entries and bus traffic.

No hardware multiply for v0. Multi-step multiply is synthesized in software as a sub-graph of shift-and-add instructions distributed across the PE network. A dedicated multiplier PE is a future option if arithmetic-heavy workloads demand it.

Logic (dyadic unless noted)#

AND      Bitwise A & B.
OR       Bitwise A | B.
XOR      Bitwise A ^ B.
NOT      Bitwise ~A (monadic).

Shift#

SHL      Shift A left by N bits. N from frame constant (0-7).
SHR      Logical shift A right by N bits. Zero-fill. N from frame constant.
ASR      Arithmetic shift A right by N bits. Sign-extend. N from frame constant.

Shift amount comes from a frame constant slot (addressed by fref), not a second operand. This means SHL/SHR/ASR are monadic operations — the data comes from the token, the shift amount comes from the frame. Using a frame constant rather than an IRAM immediate means the same instruction template can shift by different amounts in different activations. A data-dependent shift (shift by a runtime-computed amount) would require a dyadic instruction; deferred for now — the compiler can decompose variable shifts into conditional chains of fixed shifts if needed.

Hardware: 3-stage mux barrel shifter (shift by 1, 2, 4). Gives arbitrary 0-7 bit shifts in a single combinational pass through three layers of 2:1 muxes. Each stage is controlled by one bit of the 3-bit shift amount.

Comparison (dyadic)#

EQ       A == B. Result: 0x01 if true, 0x00 if false.
LT       A < B (signed). Result: 0x01 / 0x00.
LTE      A ≤ B (signed). Result: 0x01 / 0x00.
GT       A > B (signed). Result: 0x01 / 0x00.
GTE      A ≥ B (signed). Result: 0x01 / 0x00.

All comparisons produce a result token with value 0x0000 or 0x0001. This token is data — it enters the network and arrives at a downstream SWITCH or GATE instruction as a normal operand. The bool_out signal (internal to the PE, derived from the comparator output) is also available to the output formatter for same-instruction conditional routing, but the primary mechanism is to emit the boolean as a data token.

⚠ Note: The emulator operates with 16-bit data throughout, so comparison results are 16-bit values (0x0000 or 0x0001). The 8-bit hardware datapath sketch below produces 8-bit results (0x00 or 0x01). Both representations are semantically identical — only bit 0 carries information.

All comparisons interpret operands as signed 2's complement. The comparator hardware (74LS85) natively does unsigned magnitude comparison. Signed comparison is derived by XORing the sign bits of both inputs before feeding the comparator. LTE and GTE are synthesised from the existing comparator outputs (LTE = LT OR EQ, GTE = GT OR EQ) via the decoder EEPROM's control signal selection.

Routing (dyadic)#

Branch operations (BREQ, BRGT, BRGE, BROF): compare L and R operands, then route the data to the taken or not-taken destination. Both destinations receive the data value; the branch condition selects which destination gets it first (taken) vs second (not-taken).

Switch operations (SWEQ, SWGT, SWGE, SWOF): compare L and R operands, then route data to the taken side and a trigger token (value 0) to the not-taken side. See the Output Formatter section for the not-taken trigger semantics (FREE vs NOOP).

GATE     Conditional pass/suppress. Takes data (port L) and boolean (port R).
         If bool == true:  emit data to dest1 (and optionally dest2).
         If bool == false: emit nothing. Both tokens consumed silently.
SEL      Select between inputs based on a condition.
MRGE     Merge two token streams (non-deterministic).

Branch and switch ops always emit exactly 2 tokens: one data token to the taken side, one token to the not-taken side (data for branch, inline monadic trigger for switch). See the Output Formatter section for details.

GATE emits 0 or 1-2 tokens depending on the boolean. When suppressed (bool false), both input operands are consumed with no output — the computation terminates at this point in the graph.

Data Movement#

PASS     Identity (monadic). Output = input. Used for routing tokens
         through a PE to fan out to multiple destinations without
         transforming the data.
CONST    Constant load (monadic). Output = frame constant value (from
         frame slot at fref). Input token is a trigger only — its data
         value is ignored. Natural consumer of inline monadic tokens.

Frame Management#

FREE_FRAME  Deallocate frame (monadic). Clears the tag store entry for
            the executing activation and returns the physical frame to
            the free pool. Compiler-inserted on every function exit
            path. The input token is a trigger; no data output. Output
            mode is SUPPRESS (no tokens emitted).
ALLOC_REMOTE  Construct and emit a frame control token targeting another
              PE (monadic). Target PE and activation_id come from frame
              constants. Used for remote function call setup — the
              calling PE allocates a frame on the callee PE before
              sending operands.

Operation Set Summary#

⚠ Preliminary: The binary opcode encodings below are a draft layout, not a committed hardware encoding. The Python emulator and assembler use IntEnum ordinal values that do NOT correspond to these bit patterns. Final hardware encoding will be determined during physical build.

Opcode	Mnemonic	Arity	Output Mode	Description
00000	ADD	dyadic	mode-driven	A + B
00001	SUB	dyadic	mode-driven	A - B
00010	INC	monadic	mode-driven	A + 1
00011	DEC	monadic	mode-driven	A - 1
00100	AND	dyadic	mode-driven	A & B
00101	OR	dyadic	mode-driven	A \| B
00110	XOR	dyadic	mode-driven	A ^ B
00111	NOT	monadic	mode-driven	~A
01000	SHL	monadic	mode-driven	A << N (frame const)
01001	SHR	monadic	mode-driven	A >> N (frame const, logical)
01010	ASR	monadic	mode-driven	A >> N (frame const, arith)
01011	EQ	dyadic	mode-driven	A == B -> bool
01100	LT	dyadic	mode-driven	A < B signed -> bool
01101	LTE	dyadic	mode-driven	A <= B signed -> bool
01110	GT	dyadic	mode-driven	A > B signed -> bool
01111	GTE	dyadic	mode-driven	A >= B signed -> bool
10000	BREQ	dyadic	SWITCH	branch if L == R
10001	BRGT	dyadic	SWITCH	branch if L > R
10010	BRGE	dyadic	SWITCH	branch if L >= R
10011	BROF	dyadic	SWITCH	branch on overflow
10100	SWEQ	dyadic	SWITCH	switch if L == R
10101	SWGT	dyadic	SWITCH	switch if L > R
10110	SWGE	dyadic	SWITCH	switch if L >= R
10111	SWOF	dyadic	SWITCH	switch on overflow
11000	GATE	dyadic	GATE	pass or suppress by bool
11001	SEL	dyadic	mode-driven	select between inputs
11010	MRGE	dyadic	mode-driven	merge two token streams
11011	PASS	monadic	mode-driven	identity
11100	CONST	monadic	mode-driven	output = frame constant
11101	FREE_FRAME	monadic	SUPPRESS	deallocate frame
11110	ALLOC_REMOTE	monadic	special	remote frame alloc
11111	—	—	—	reserved for expansion

⚠ Preliminary: The binary opcode encodings above are a draft layout, not a committed hardware encoding. The Python emulator and assembler use IntEnum ordinal values that do NOT correspond to these bit patterns. Final hardware encoding will be determined during physical build.

The output mode column indicates the default. "Mode-driven" means the instruction's 3-bit mode field (from the IRAM instruction word) determines whether the output is single (modes 0-1), fan-out (modes 2-3), change-tag (modes 4-5), or sink (modes 6-7). See the Output Formatter section for the full mode table.

Branch vs Switch: Branch operations (BR*) compare their two operands and route the data value to the taken or not-taken destination. Switch operations (SW*) are similar but emit a trigger token (value 0) to the not-taken side instead of the data value. See the Output Formatter section for the not-taken trigger semantics.

Reserved opcode space (1 slot): candidates include MAP_PAGE (future, when 610 is installed; writes '610 mapping register), SET_PAGE (future, when 610 is installed; selects active IRAM bank via page latch -- see pe-design.md IRAM Bank Switching section), hardware multiply, predicate store read/write, and debug/trace instructions. Constants for MAP_PAGE and SET_PAGE come from frame slots. MAP_PAGE + SET_PAGE together cost 2 opcode slots but enable banked IRAM with one '610 + one latch per PE; these are not part of the v0 opcode set (v0 uses a jumper wire in place of the 610).

SM Instruction Dispatch#

The operation set above covers CM compute instructions (IRAM type bit = 0). When type = 1, the PE dispatches an SM operation instead of an ALU computation. CM and SM instructions share the same 16-bit instruction format ([type:1][opcode:5][mode:3][wide:1][fref:6]) but with independent 5-bit opcode spaces.

In SM mode:

The ALU is bypassed (or used for address arithmetic in indexed ops)
Stage 4 constructs SM token flits from the SM opcode, operand data, and frame slot contents (SM target address, return routing)
Stage 5 emits the SM token to the target SM via the SM bus

SM parameters (target SM_id, cell address, return routing) come from frame slots addressed by fref, not from inline IRAM fields. A frame target slot packs [SM_id:2][addr:8-10][spare] in 16 bits. Return routing occupies the next consecutive frame slot as a pre-formed response token flit 1.

The flit 2 source for SM tokens is selected by a mux driven by the instruction decoder:

source   select  use case
-------  ------  -----------------------------------------
ALU out  00      SM write (data passes through ALU). Default.
R oper   01      SM scatter write (R operand bypasses ALU to flit 2).
                 SM CAS flit 2 (expected value = L operand).
Frame    10      SM read / rd_inc / rd_dec / raw_rd / alloc return
                 routing (frame[fref+1] = pre-formed response flit 1).
(spare)  11      reserved

See iram-and-function-calls.md for the full SM operation mapping table, CAS 3-flit encoding, and indexed address computation.

Instruction Decoder#

The instruction decoder translates the opcode field from the IRAM word into ALU control signals and output formatter control. It is implemented as a single EEPROM used as a lookup table — the opcode bits address the EEPROM, the data outputs are the control signals.

Decoder EEPROM#

Address inputs (active during stage 2, instruction fetch):
  opcode:5            from IRAM instruction word
  type:1              CM (0) vs SM (1) instruction
  mode:3              output/frame-access mode from IRAM instruction word

Data outputs (active during stage 4, execute + output):
  alu_func:3          selects ALU functional unit (add/sub/logic/shift/cmp/pass)
  alu_sub_mode:2      sub-function (e.g., AND vs OR vs XOR within logic unit)
  alu_src_sel:1       left input source: port_L data vs frame constant
  carry_in:1          for SUB, DEC (invert + carry), or INC (carry-in = 1)
  shift_dir:1         left vs right (for shift unit)
  shift_arith:1       arithmetic vs logical (for right shifts)
  cmp_signed:1        signed vs unsigned comparison mode
  output_enable:1     modes 0-3: emit output token(s)
  output_fanout:1     modes 2-3: emit second output token
  sink_write_en:1     modes 6-7: write ALU result to frame[fref]
  output_data_sel:1   result source: ALU output vs passthrough input (for SWITCH/PASS)
  flit2_source:2      SM flit 2 mux: ALU out (00) / R operand (01) / Frame (10)
  bool_out_en:1       whether this instruction produces a meaningful bool_out
  ---
  Total: ~18 output bits

The output mode encoding is now driven by the 3-bit mode field from the instruction word, replacing the old has_dest2 + ctx_mode combination. The mode field feeds directly into the EEPROM address alongside the opcode, so the EEPROM output signals reflect the full mode semantics in a single lookup.

One EEPROM chip (e.g., 28C64: 8K x 8, or 28C256: 32K x 8). With opcode (5 bits) + type (1 bit) + mode (3 bits) = 9 address bits, a 28C64 (13 address lines) leaves 4 spare address lines. These can serve as:

PE_id inputs (2-3 bits): the same EEPROM encodes per-PE customised decode tables. Allows heterogeneous PEs from identical circuit boards with different EEPROM contents.
bool_out input (1 bit): enables GATE suppression and SWITCH routing decisions to be resolved in the EEPROM lookup rather than downstream logic.

The EEPROM is purely combinational from the pipeline's perspective — the address is stable (latched from IRAM read), the output settles within EEPROM access time (~150ns for standard parts, well within a pipeline stage at 5-10 MHz).

ALU Hardware#

⚠ Tentative: The 8-bit vs 16-bit ALU datapath decision is not fully resolved. The emulator operates at 16-bit. The 8-bit design below is the current hardware sketch for minimising chip count; the 16-bit variant (see below) roughly doubles the datapath. Final decision depends on transistor budget during physical build.

8-bit Datapath (v0)#

                  ┌──────────────────────────────────────────┐
  data_L ───────►│ source mux (data_L vs frame constant)    │
                  └──────────────┬───────────────────────────┘
                                 │ A input
                                 V
                  ┌──────────────────────────────────────────┐
                  │              Adder/Subtractor            │
  data_R ──┬────►│  2x 74LS283 (4-bit CLA, cascaded)        │──► add_result
           │      │  + 2x 74LS86 (XOR for SUB: invert B)     │
           │      │  carry_in from decoder                   │
           │      └──────────────────────────────────────────┘
           │
           │      ┌──────────────────────────────────────────┐
           ├────►│              Logic Unit                  │
           │      │  2x 74LS32 (OR)                          │
  A input ─┤      │  (XOR reused from adder)                 │
           │      │  sub-mode mux selects AND/OR/XOR output  │
           │      └──────────────────────────────────────────┘
           │
           │      ┌──────────────────────────────────────────┐
           ├────►│           Barrel Shifter                 │
           │      │  3 stages of 2x 74LS157 (quad 2:1 mux)   │──► shift_result
           │      │  stage 0: shift by 1 if imm[0]           │
           │      │  stage 1: shift by 2 if imm[1]           │
           │      │  stage 2: shift by 4 if imm[2]           │
           │      │  shift_dir and shift_arith from decoder  │
           │      └──────────────────────────────────────────┘
           │
           │      ┌──────────────────────────────────────────┐
           └────►│            Comparator                    │
                  │  2x 74LS85 (4-bit magnitude, cascaded)   │──► cmp_result (0x00/0x01)
  A input ──────►│  sign-bit XOR for signed mode            │    + bool_out (1 bit)
                  └──────────────────────────────────────────┘


  ┌───────────────────────────────────────────────────────────────────┐
  │                      Result Mux                                   │
  │  2x 74LS157 (quad 2:1 mux, cascaded for 8-bit 4:1 select)         │
  │                                                                   │
  │  alu_func:3 ──► selects: add_result / logic_result /             │
  │                          shift_result / cmp_result / passthrough  │
  └──────────────────────────────────────────┬────────────────────────┘
                                             │
                                             V
                                        result:8
                                        bool_out:1

Chip Count (8-bit)#

Instruction decoder EEPROM                                    1
Adder:           2x 74LS283 (4-bit CLA)                       2
SUB inversion:   2x 74LS86  (XOR, shared with logic unit)     2
Logic AND:       2x 74LS08                                    2
Logic OR:        2x 74LS32                                    2
Barrel shifter:  6x 74LS157 (3 stages x 2 chips)              6
Comparator:      2x 74LS85  (4-bit magnitude, cascaded)       2
Result mux:      2x 74LS157 (or 74LS153 for 4:1)              2
Source mux:      2x 74LS157 (A input: data_L vs frame const)   2
                                                         ─────────
Total ALU:                                                ~21 chips

Without barrel shifter (shift-by-1 only): ~15 chips. The barrel shifter is the single most expensive functional unit but is worth it — bit manipulation without multi-bit shifts is very painful, and shift-by-1 chains through the pipeline are slow (each shift costs a full token round-trip).

16-bit Datapath (future)#

Roughly double the datapath chips: 4 adder, 4 XOR, 4 AND, 4 OR, 12 shifter muxes, 4 comparator, 4 result mux, 4 source mux. The decoder EEPROM stays the same. Estimate: ~38-42 chips. Buildable but substantially larger.

ALU Outputs#

The ALU produces these signals, consumed by the output formatter:

result:8          computed data value (or passthrough for SWITCH/PASS)
bool_out:1        boolean/comparison result:
                    - from comparator for EQ/LT/LTE/GT/GTE
                    - from comparator for BR*/SW* (inline comparison)
                    - from data_R bit 0 for GATE (boolean input)
                    - undefined for arithmetic/logic ops

bool_out for SWITCH and GATE comes from the boolean operand (port R by convention), not from the ALU computation. The decoder's output_data_sel signal controls whether result carries the ALU output or the passthrough data_L input (for SWITCH, the data being routed).

Output Formatter#

The output formatter takes the ALU result, frame slot data, and control signals from the decoder, and produces 0-2 output tokens serialised into the PE's output FIFO. Under the frame-based architecture, the formatter is significantly simpler than the original design: destination routing is a pre-formed flit 1 read from a frame slot, not assembled field-by-field from IRAM destination bits.

Inputs#

From ALU:
  result:8/16             computed or passthrough data value
  bool_out:1              boolean signal for SWITCH/GATE

From frame SRAM (read during stage 5):
  dest1:16                pre-formed flit 1 from frame[fref] or frame[fref+1]
  dest2:16                (fan-out modes only) from frame[fref+1] or frame[fref+2]

From left operand bypass latch (modes 4-5, CHANGE_TAG):
  left_operand:16         entire flit 1 comes from left operand data

From IRAM instruction word:
  mode:3                  3-bit output/frame-access mode

From pipeline latches:
  act_id:3                activation_id of executing token (for EXTRACT_TAG
                          and similar operations)

From decoder EEPROM:
  output_enable:1         emit output token(s) (modes 0-3)
  output_fanout:1         emit second output token (modes 2-3)
  sink_write_en:1         write ALU result to frame (modes 6-7)
  output_data_sel:1       result = ALU output vs passthrough input
  flit2_source:2          SM flit 2 source mux select
  bool_out_en:1           whether this instruction produces a meaningful bool_out

See iram-and-function-calls.md for the 16-bit instruction word layout, mode table, and frame slot formats.

Output Modes#

The 3-bit mode field from the instruction word drives output behaviour. All 8 encodings are useful; there are no wasted mode values.

mode  tag behaviour   frame access at fref       output behaviour
----  -------------   -------------------------  --------------------------
  0   INHERIT         read dest                  single output, no constant
  1   INHERIT         read const, read dest      single output with constant
  2   INHERIT         read dest1, read dest2     fan-out, no constant
  3   INHERIT         read const, dest1, dest2   fan-out with constant
  4   CHANGE_TAG      (none)                     dynamic routing, no constant
  5   CHANGE_TAG      read const                 dynamic routing with constant
  6   SINK            write result -> frame[fref] no output token
  7   SINK+CONST      read frame[fref],          local accumulate / RMW
                      write result -> frame[fref]

Bit-level decode (directly from mode[2:1:0]):

output_enable  = NOT mode[2]              modes 0-3: read dest, emit token
change_tag     = mode[2] AND NOT mode[1]  modes 4-5: routing from left operand
sink           = mode[2] AND mode[1]      modes 6-7: no output, write to frame
has_const      = mode[0]                  modes 1, 3, 5, 7: read constant
has_fanout     = mode[1] AND NOT mode[2]  modes 2-3: read two destinations

SUPPRESS: not a mode encoding but an opcode-driven override. FREE_FRAME opcode, or GATE with bool_out=false, forces output_enable=0 and sink_write_en=0. Both input tokens consumed silently, pipeline advances immediately.

INHERIT, single output (modes 0-1): one token to dest read from frame. One formatter cycle. The frame slot provides the complete flit 1 (pre-formed with PE, offset, act_id, port). The ALU result becomes flit 2.

INHERIT, fan-out (modes 2-3): two tokens, same data, different destinations. Two formatter cycles. Dest1 from frame[fref] (or frame[fref+1] if has_const), then dest2 from the next consecutive slot. Both carry the ALU result as flit 2.

CHANGE_TAG (modes 4-5): one token whose flit 1 is the left operand data (a 16-bit packed tag), bypassing normal frame dest reads. No frame destination read needed. The ALU result becomes flit 2. Used for return routing where the return address was carried as data through the function body.

SINK (modes 6-7): no output token. The ALU result is written back to frame[fref]. Mode 7 (SINK+CONST) reads frame[fref] as a constant input to the ALU first, then writes the result back — a read-modify- write cycle for frame-local accumulation.

SWITCH: branch and switch opcodes (BR*, SW*) override the normal mode semantics for destination selection. Two tokens, different roles. Two formatter cycles.

Cycle 0 -- data token (taken side):
  destination: bool_out ? dest1 : dest2 (from frame slots)
  flit 1:      pre-formed dest from frame
  flit 2:      ALU result (passthrough data_L for SWITCH)

Cycle 1 -- trigger token (not-taken side):
  destination: bool_out ? dest2 : dest1 (from frame slots)
  format:      ALWAYS monadic inline (prefix 011+10), regardless
               of the destination's original token format
  data:        none (1-flit token)

The formatter always emits an inline monadic trigger to the not-taken side. Whether that trigger causes a NOOP, a FREE_FRAME, or anything else depends on the instruction at the target offset in the target PE's IRAM. This is a compiler decision, not a hardware one.

GATE Behaviour#

GATE is simpler than SWITCH:

bool_out = true (gate passes):
  Same as modes 0-3 depending on instruction mode.
  Data token(s) emitted normally.

bool_out = false (gate suppresses):
  Output becomes SUPPRESS.
  No tokens emitted. Both inputs consumed.

The decoder can implement this by having GATE's output_enable be conditional on bool_out. In the EEPROM this is achieved by including bool_out as an address input, so the EEPROM output directly reflects the suppression decision.

Formatter State Machine#

The formatter is a small FSM (2-4 states) that serialises output tokens into the PE's output FIFO. Frame SRAM reads for destinations occur in stage 5 and drive the flit 1 output directly.

States:
  IDLE        -- waiting for ALU completion signal
  EMIT_FIRST  -- read dest1 from frame SRAM, push flit 1 + flit 2
  EMIT_SECOND -- read dest2 from frame SRAM, push flit 1 + flit 2
                 (fan-out modes only)
  SINK_WRITE  -- write ALU result to frame[fref]
                 (sink modes only)

Transitions:
  IDLE -> EMIT_FIRST           on ALU done, if output_enable
  IDLE -> SINK_WRITE           on ALU done, if sink_write_en
  IDLE -> IDLE                 on ALU done, if SUPPRESS
  EMIT_FIRST -> EMIT_SECOND    if has_fanout or SWITCH mode
  EMIT_FIRST -> IDLE           otherwise
  EMIT_SECOND -> IDLE          always
  SINK_WRITE -> IDLE           always

Each EMIT state takes 1-2 bus cycles depending on whether the output token is 1-flit (inline monadic) or 2-flit (all other formats). The formatter pushes flits to the output FIFO sequentially. The pipeline is stalled during EMIT and SINK_WRITE states.

EMIT_FIRST reads dest1 from frame SRAM (1 cycle). EMIT_SECOND reads dest2 from the next frame slot (1 additional cycle). For CHANGE_TAG modes, no frame read occurs — flit 1 is sourced from the left operand bypass latch, so EMIT_FIRST completes without an SRAM access.

Token Packing Mux#

Under the frame-based architecture, the token packing mux is dramatically simpler than the original design. Flit 1 no longer requires field-by-field assembly of PE, offset, act_id, port, and gen from separate sources. Instead:

Flit 1 source (2:1 mux, selected by mode[2]):

  mode[2] = 0 (INHERIT, modes 0-3):
    flit 1 = frame dest slot (pre-formed, read from SRAM)
    format: [prefix:2-3][port:0-1][PE:2][offset:8][act_id:3]
    The frame slot IS the output flit. No repacking needed.

  mode[2] = 1, mode[1] = 0 (CHANGE_TAG, modes 4-5):
    flit 1 = left operand data (16-bit packed tag from bypass latch)

Flit 2 (always):
    flit 2 = ALU result (16-bit data value)

The packing mux is physically a single layer of 2:1 muxes selecting between frame dest and left operand, controlled by mode[2]. This replaces the multi-chip format-dependent field-selection mux from the original design. Estimate: 2-4 TTL chips (2-4x 74LS157 for 16-bit 2:1 mux).

For SWITCH cycle 1 (not-taken trigger), the mux is forced to monadic inline format (prefix 011+10). The formatter extracts the PE and offset fields from the frame dest slot and packs them into the inline format. This override is a small amount of combinational logic during EMIT_SECOND when in SWITCH mode.

For SM operations, the flit 2 source mux selects between ALU output, R operand, and frame slot data (see SM Instruction Dispatch above). This adds one additional 3:1 mux layer on the flit 2 path (~4-6 chips for 16-bit width).

Source of Output Routing#

Output routing is controlled by the 3-bit mode field. There are two routing modes (the old CTX_OVRD mode is eliminated):

INHERIT (modes 0-3): routing comes from the frame destination slot. The dest slot contains a pre-formed flit 1 with PE, offset, act_id, and port all baked in at frame setup time. The compiler pre-computes the complete destination routing and loads it into the frame during activation setup. Cross-activation routing (the old CTX_OVRD use case) is handled by storing the appropriate act_id in the frame constant at setup time — no separate mode needed.
CHANGE_TAG (modes 4-5): routing comes from the left operand data, which carries a 16-bit packed tag (a complete flit 1 value). Used for dynamic return routing where the return address was carried as data through the function body. The ALU result becomes flit 2.

Pipeline latches still carry act_id (3 bits) through the pipeline for use in EXTRACT_TAG and similar operations that need to reference the executing activation's identity.

Bus Framing and Non-Interleaving#

The output FIFO guarantees atomic packet delivery on the bus. A multi-flit token is written to the output FIFO as a complete unit. The FIFO drains flits to the bus sequentially, and the bus arbiter (or handshake protocol) holds the bus for the full packet duration.

A "more flits follow" signal (1 wire) accompanies the data bus. The emitter asserts it on every flit except the last flit of a packet. Routing nodes and receivers watch this signal to know when a packet is complete. They do not need to decode the token type or inspect flit 1's inline bit — framing is entirely at the bus level.

Bus signals (per link):
  data:16          flit data
  valid:1          flit is present (handshake)
  ready:1          receiver can accept (backpressure)
  more:1           more flits follow in this packet (framing)

Routing nodes forward flits transparently. They latch flit 1 for routing decisions, then forward subsequent flits (while more is asserted) to the same destination. No per-flit routing lookup after the first flit.

This means the network is completely agnostic to token format. It does not know or care whether a packet is 1, 2, 3, or 4 flits. It follows the more signal. All format intelligence is at the endpoints (emitting PE's formatter and receiving PE's deserialiser).

Local Token Routing (v0 vs Future)#

v0: all tokens go through the external bus, even when source PE == destination PE. The output formatter forms the token, pushes it to the output FIFO, it goes out on the bus, the bus routes it back to the same PE's input FIFO. Full round-trip.

This is deliberately simple. It avoids:

A second write port on the input FIFO (arbitration between bus input and local loopback)
Priority logic (should local tokens jump the queue ahead of external?)
Bypass detection (comparing output dest_PE against own PE_id, which is trivial logic but adds a path)

The cost is latency: a local token pays the full bus round-trip instead of staying internal. For v0 with 1-2 PEs, the bus is mostly idle anyway.

Future optimisation: local bypass. The formatter checks dest_PE == own PE_id. If match, the token is written directly to the input FIFO (or a dedicated local loopback FIFO) without touching the external bus. This is the monadic-chain optimisation from the Papadopoulos & Traub paper — if A feeds B feeds C and all are on the same PE, keeping the token internal saves 3 bus round-trips. Significant latency win for sequential chains.

See bus-interconnect-design.md for the physical implementation of the loopback bypass design, including FIFO arbitration, priority logic, and chip-level detail.

Hardware Summary (v0, 8-bit datapath)#

Component                                           Chips
─────────────────────────────────────────────────────────
Instruction decoder EEPROM                             1
Source mux (data_L vs frame const)                     2
Adder (2x 74LS283)                                     2
SUB/XOR (2x 74LS86)                                    2
AND (2x 74LS08)                                        2
OR (2x 74LS32)                                         2
Barrel shifter (6x 74LS157, 3 stages)                  6
Comparator (2x 74LS85)                                 2
Result mux (2x 74LS157)                                2
Output data mux (result vs passthrough, 2x 74LS157)    2
Flit 1 source mux (frame dest vs left operand)        2-4
Flit 2 source mux (ALU / R oper / frame, for SM ops)  4-6
Formatter FSM (state machine + SRAM addr control)      2
─────────────────────────────────────────────────────────
Total ALU + output formatter:                      ~31-35 chips

The output formatter's flit 1 path is simpler than the original design (a single 2:1 mux layer replacing the multi-format field-assembly mux), but the flit 2 path adds a 3:1 source mux for SM operations. The net chip count is comparable.

This does not include the output FIFO (shared with bus interface), the matching store, frame SRAM, or IRAM — those are other parts of the PE. See pe-pipelining-and-multiplexing.md and pe-design.md for the full per-PE chip count estimate.

Without barrel shifter: ~25-29 chips.

Future Extensions (not in v0)#

Per-Activation Predicate Store#

A small RAM (e.g., 4-entry x 4-bit, one entry per concurrent frame) indexed by frame_id. Comparison instructions optionally write their bool_out to a named predicate channel (2-bit channel select from the instruction word). SWITCH and GATE instructions optionally read a predicate channel as their control input instead of consuming a boolean data token.

This enables three-logical-input operations (data_L, data_R, control) without triadic matching — the control arrives via the predicate store rather than as a third token. Eliminates one token hop for the common compare-then-branch pattern.

The predicate store is scoped per-activation (indexed by frame_id), so concurrent activations on the same PE do not interfere. Four predicate channels per frame handle up to four levels of nested conditional control flow; deeper nesting requires the compiler to decompose via two-stage token routing.

Hardware cost: 1 chip (small RAM or 74LS670) + 1 chip (read mux) + minor decode.

Shared Predicate Register (Sequential Mode)#

A single shared predicate register (not per-activation) for use in a "pseudo-sequential" execution mode where the compiler guarantees only one activation is using the predicate at a time. Useful for maximising locality on hot loops. Simpler than the per-activation store but dangerous if the single-activation invariant is violated. Purely an optimisation for specific code patterns, not a general mechanism.

Triadic Matching#

Matching store entries gain a second presence bit and second data slot. Three tokens must arrive before the instruction fires. Adds complexity to the matching logic (3-way port arbitration, wider data read-out) and roughly doubles match data storage width for triadic-capable entries. Deferred unless profiling shows the two-stage decomposition overhead is a real bottleneck.

Local Bypass#

Output formatter detects dest_PE == own PE and routes the token directly to the input FIFO without external bus round-trip. Significant latency reduction for PE-local computation chains. See Local Token Routing above and bus-interconnect-design.md for physical implementation details.

Hardware Multiply#

Dedicated multiply unit or multiply-capable PE. Shift-and-add is feasible as a software subgraph, but a hardware 8x8 -> 16-bit multiplier would dramatically accelerate DSP-style workloads. Could be implemented as a specialised PE in the network rather than adding multiply logic to every PE.