# PE Pipelining and Metadata/SC Register Multiplexing
Design notes on pipelining the PE's token processing path under the revised frame-based architecture, the frame SRAM contention problem between pipeline stages, and the 74LS670-based subsystem that serves as both the act_id resolution / presence metadata store (dataflow mode) and an SC block register file (sequential mode).
See pe-design.md for the frame-based PE architecture and matching
store, architecture-overview.md for the token format, and
iram-and-function-calls.md for instruction encoding. The approach
comparison is in section 3 of this document. This document focuses on
pipeline timing, stall analysis, and the 670 subsystem design.
## 1. Pipeline Stages (Reversed Order)
The PE pipeline has been reversed relative to the original design:
IFETCH now precedes MATCH. The instruction word drives match
behaviour, frame access patterns, and output routing. The token's
activation_id drives associative lookup in parallel with the IRAM
read, hiding resolution latency behind SRAM access time.
```
Stage 1: INPUT        Deserialise flits from bus, classify token type
Stage 2: IFETCH       IRAM read + act_id resolution (parallel, via 670)
Stage 3: MATCH/FRAME  Match check + constant read (variable cycles)
Stage 4: EXECUTE      ALU operation (no SRAM access)
Stage 5: OUTPUT       Destination read from frame + token emission
```
Why IFETCH before MATCH. In the original design, the pipeline was MATCH -> IFETCH -> EXECUTE -> OUTPUT. The match stage ran first, using token prefix bits to decide whether the operand was first or second. The instruction fetch happened only after matching succeeded.
With the frame-based architecture, the instruction word determines how matching works: whether the instruction is dyadic or monadic, which frame slots to read for operands and constants, whether to write back to the frame (sink modes), and how many destinations to read at output. Fetching the instruction first gives the pipeline controller all the information it needs to sequence stage 3's SRAM accesses efficiently.
The token's dyadic/monadic prefix (retained from the original format) enables parallel work: when the prefix indicates "dyadic," stage 2 starts act_id -> frame_id resolution via the 670s simultaneously with the IRAM read. By the time stage 3 begins, both the instruction word and the frame_id / presence / port metadata are available, and the only remaining SRAM work is reading or writing actual operand data and constants.
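As a behavioural sketch of that overlap (the function and container names here are illustrative stand-ins for the IRAM and the 670 lookup hardware, not part of the design):

```python
def stage2(token, iram, actid_table):
    """Behavioural model of stage 2 (IFETCH): the IRAM read and the
    act_id -> frame_id resolution happen in the same clock cycle.
    The 670 read port is combinational (~35 ns), so it hides entirely
    under the IRAM access; neither result depends on the other."""
    instr = iram[token["offset"]]                    # IRAM read
    valid, frame_id = actid_table[token["act_id"]]   # 670 read port, in parallel
    return instr, valid, frame_id
```

By stage 3, both results are latched and the controller can sequence frame SRAM accesses without waiting on metadata.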
Unpipelined throughput: 3-7 cycles per token depending on instruction mode (see section 3 for detailed cycle counts). This is the baseline against which pipeline overlap improvements are measured.
## 2. The Frame SRAM Contention Problem

### The Shifted Bottleneck
The original design's critical-path problem was a read-modify-write on the matching store SRAM: read the cell, check metadata, then either store or retrieve an operand, all within a single cycle. With 200 ns 2114 SRAM, only one operation (read OR write) fit per clock.
The frame-based redesign eliminates this problem entirely. With Approach C (670 lookup), act_id -> frame_id resolution is combinational (~35 ns via 670 read port), and the presence/port check is also combinational (~35 ns from a second set of 670s). There is no read-modify-write on SRAM for metadata.
The new bottleneck is frame SRAM contention between stage 3 and stage 5. Both stages access the same single-ported SRAM chip pair:
- Stage 3 reads/writes operand data (dyadic match) and reads constants (modes with has_const=1).
- Stage 5 reads destinations (modes 0-3), or writes results back to the frame (sink modes 6-7).
When two pipelined tokens have stage 3 and stage 5 active in the same cycle, the SRAM can serve only one. The other stalls.
### The Pipeline Hazard
The classic RAW hazard still exists but takes a different form. Two consecutive tokens targeting the same frame slot (e.g., two mode 7 read-modify-write operations on the same accumulator slot) create a data dependency: the second token's stage 3 read must see the first token's stage 5 write.
Detection requires comparing (act_id, fref) of the incoming token against in-flight pipeline latches at stages 3-5. Hardware cost: ~2 chips (9-bit comparator + AND gate). Alternatively, the assembler can guarantee this never happens by never emitting consecutive mode 7 tokens to the same slot on the same PE.
As with the original design, this hazard is statistically uncommon in dataflow execution. Two operands arriving back-to-back at the exact same frame slot require coincidental timing. The bypass path is cheap insurance that fires infrequently.
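A behavioural sketch of the comparator check described above (the dataclass and function names are mine, chosen for illustration, not from the hardware design):

```python
from dataclasses import dataclass

@dataclass
class InFlight:
    act_id: int         # 3-bit activation id
    fref: int           # 6-bit frame slot reference
    writes_frame: bool  # True for sink modes 6-7 (stage 5 frame write)

def needs_bypass(act_id, fref, latches):
    """Mirror of the 9-bit comparator: flag a RAW hazard when the
    incoming token's (act_id, fref) matches an in-flight token that
    will write the frame at stage 5."""
    return any(t.writes_frame and t.act_id == act_id and t.fref == fref
               for t in latches)
```

The assembler-side alternative replaces this check with a static guarantee, trading emulator/hardware simplicity for a scheduling constraint.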
## 3. Pipeline Solutions and Cycle Counts

### SRAM Contention Model
The frame SRAM chip is single-ported (one access per clock cycle at 5 MHz with 55 ns SRAM). The primary stall source is contention between stage 3 (frame reads for operand data and constants) and stage 5 (frame reads for destinations, or frame writes for sink modes).
Contention arises only when:
- Token A is at stage 5, needing a frame SRAM read (dest) or write (sink), AND
- Token B is at stage 3, needing a frame SRAM read (match operand, constant, or tag word).
Contention does NOT arise when:
- Token A's stage 5 is mode 4/5 (change_tag -- no SRAM access).
- Token B's stage 3 is zero-cycle (monadic no-const, or match data in register file with no const).
- Token A was a dyadic miss (terminated at stage 3, never reaches stage 5).
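These conditions reduce to a simple predicate. A sketch of the arbitration rule as it might appear in the behavioural emulator (function names are mine, not from the design docs):

```python
def stage5_needs_sram(mode, dyadic_miss=False):
    """Stage 5 uses frame SRAM for destination reads (modes 0-3) and
    sink writes (modes 6-7); change_tag modes 4-5 skip it, and a
    dyadic miss terminates at stage 3 and never reaches stage 5."""
    return (not dyadic_miss) and mode not in (4, 5)

def stage3_needs_sram(dyadic, has_const, match_in_regfile=False):
    """Stage 3 uses frame SRAM for the match operand (unless operands
    live in a register file, as in Approach B) and for constants."""
    return (dyadic and not match_in_regfile) or has_const

def frame_sram_conflict(a, b):
    """True when token a (at stage 5) and token b (at stage 3) both
    need the single-ported frame SRAM in the same cycle."""
    return stage5_needs_sram(**a) and stage3_needs_sram(**b)
```

Note how Approach B's `match_in_regfile` flag removes one of stage 3's SRAM demands, which is exactly why it pipelines better in the overlap analysis below.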
### Cycle Counts by Instruction Type
Approach C (74LS670 lookup, recommended v0):
```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)          1     1     0     1     0      3
monadic mode 0 (dest only)         1     1     0     1     1      4
monadic mode 6 (sink)              1     1     0     1     1      4
monadic mode 1 (const+dest)        1     1     1     1     1      5
monadic mode 7 (RMW)               1     1     1     1     1      5
dyadic miss                        1     1     1    --    --      3
dyadic hit, mode 0                 1     1     1     1     1      5
dyadic hit, mode 1                 1     1     2     1     1      6
dyadic hit, mode 3 (fan+const)     1     1     2     1     2      7
```
Stage 3 breakdown for Approach C:
- Dyadic hit: 1 SRAM cycle to read stored operand (frame_id and presence already known from 670). +1 cycle for constant if has_const=1.
- Dyadic miss: 1 SRAM cycle to write operand data. 670 write port sets presence bit combinationally in parallel.
- Monadic: 0 SRAM cycles (no match), +1 for constant if has_const=1.
Approach B (register-file match pool):
```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)          1     1     0     1     0      3
monadic mode 0 (dest only)         1     1     0     1     1      4
monadic mode 6 (sink)              1     1     0     1     1      4
monadic mode 1 (const+dest)        1     1     1     1     1      5
monadic mode 7 (RMW)               1     1     1     1     1      5
dyadic miss                        1     1     1    --    --      3
dyadic hit, mode 0                 1     1     1     1     1      5
dyadic hit, mode 1                 1     1     2     1     1      6
dyadic hit, mode 3 (fan+const)     1     1     2     1     2      7
```
Approaches B and C produce identical single-token cycle counts. The difference emerges under pipelining: Approach B's match data never touches the frame SRAM (operands stored in a dedicated register file), so stage 3's only SRAM access is the constant read. This reduces stage 3 vs stage 5 SRAM contention.
Approach A (set-associative tags in SRAM, minimal chips):
```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)          1     1     0     1     0      3
monadic mode 0 (dest only)         1     1     0     1     1      4
monadic mode 6 (sink)              1     1     0     1     1      4
monadic mode 1 (const+dest)        1     1     1     1     1      5
monadic mode 7 (RMW)               1     1     1     1     1      5
dyadic miss                        1     1     2    --    --      4
dyadic hit, mode 0                 1     1     2     1     1      6
dyadic hit, mode 1                 1     1     3     1     1      7
dyadic hit, mode 3 (fan+const)     1     1     3     1     2      8
```
Approach A adds 1 extra SRAM cycle per dyadic operation (tag word read + associative compare) because act_id resolution is not combinational.
### Pipeline Overlap Analysis
With single-port frame SRAM at 5 MHz, the pipeline controller must arbitrate between stage 3 and stage 5. When both need SRAM in the same cycle, stage 3 stalls.
Approach B, two consecutive dyadic-hit mode 1 tokens:
```
cycle 0: A.stg1
cycle 1: A.stg2 (IRAM)
cycle 2: A.stg3 match (reg file)   -- frame SRAM FREE
cycle 3: A.stg3 const (SRAM)
cycle 4: A.stg4 (ALU)              -- frame SRAM FREE
cycle 5: A.stg5 dest (SRAM)        B.stg3 match (reg file) -- NO CONFLICT
cycle 6: (A done)                  B.stg3 const (SRAM)
cycle 7:                           B.stg4 (ALU)
cycle 8:                           B.stg5 dest (SRAM)      -- NO CONFLICT
```
Token spacing: 4 cycles. Approach A under the same conditions: ~6-7 cycles due to additional SRAM contention in stage 3.
### Throughput Summary (per PE, at 5 MHz, single-port frame SRAM)
| Instruction mix profile | Approach A | Approach C | Approach B |
|---|---|---|---|
| Monadic-heavy (mode 0/4/6) | ~1.25 MIPS | ~1.67 MIPS | ~1.67 MIPS |
| Mixed (40% dyadic mode 1, 30% monadic, 30% misc) | ~833 KIPS | ~1.25 MIPS | ~1.25 MIPS |
| Dyadic-heavy with constants | ~714 KIPS | ~1.00 MIPS | ~1.00 MIPS |
| Worst case (mode 3, const+fanout) | ~625 KIPS | ~714 KIPS | ~714 KIPS |
4-PE system: multiply by 4. Realistic mixed workload: ~3.3-5.0 MIPS (A) or ~5.0-6.7 MIPS (B or C). For reference: the original Amamiya DFM prototype (TTL, 1982) achieved 1.8 MIPS per PE. The EM-4 prototype (VLSI gate array, 1990) achieved 12.5 MIPS per PE. This design sits between the two, closer to the DFM, which is historically appropriate for a discrete TTL build.
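The table entries follow directly from clock rate divided by steady-state token spacing. A quick sanity check (the profile-to-spacing mapping is read off the Approach C cycle counts above):

```python
def throughput_mips(clock_hz, cycles_per_token):
    """Tokens retired per second, in millions, at a given steady-state
    spacing (cycles between successive token completions)."""
    return clock_hz / cycles_per_token / 1e6

CLOCK_HZ = 5_000_000  # 200 ns cycle

# Approach C: monadic-heavy ~3-cycle spacing, mixed ~4, dyadic+const ~5,
# worst case (mode 3, const+fanout) ~7 -- matching the table rows above.
for cycles, expected in [(3, 1.67), (4, 1.25), (5, 1.00), (7, 0.71)]:
    assert round(throughput_mips(CLOCK_HZ, cycles), 2) == expected
```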
## 4. SRAM Configuration

### Unified SRAM Chip Pair
The PE uses a single 32Kx8 chip pair (2 chips for 16-bit data width) for both IRAM and frame storage, with address partitioning via a single decode bit. The recommended part is the AS6C62256 (55 ns, 32Kx8, DIP-28) or equivalent. 55 ns access time fits comfortably within a 200 ns clock period at 5 MHz, with margin for address setup and data hold.
The 2114 (1Kx4, 200 ns) from the original design is no longer used. The unified SRAM approach eliminates the chip proliferation problem: one chip pair per PE replaces the 4-6 SRAM chips previously needed for separate matching store and IRAM.
### Address Map
v0 address space (simple decode, no 610):

```
IRAM region:  [0][offset:8]              instruction templates
              offset from token
              capacity: 256 instructions (512 bytes)
Frame region: [1][frame_id:2][slot:6]    per-activation storage
              frame_id from tag store resolution
              capacity: 4 frames x 64 slots = 256 entries (512 bytes)
```

Future address space (with 610 bank switching):

```
IRAM region:  [0][bank:4][offset:8]      bank-switched templates
              bank from 610 mapper
              capacity: 16 banks x 256 instructions = 4096 entries
Frame region: [1][frame_id:2][slot:6]    (unchanged)
```
Total v0 SRAM utilisation: under 1.5 KB used out of a 32Kx8 chip pair (64 KB). Ample room for future expansion without changing chips. SRAM address lines are pre-routed to a 74LS610 socket with a jumper wire in place of the chip; when bank switching is needed for programs exceeding 256 instructions per PE, the 610 drops in with no board changes.
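The v0 decode reduces to a single bit of address logic. A sketch of the field packing (helper name and interface are mine, for illustration):

```python
def v0_sram_address(region, offset=0, frame_id=0, slot=0):
    """v0 unified SRAM decode: address bit 8 selects IRAM (0) vs
    frame (1); field widths follow the address map above."""
    if region == "iram":
        assert 0 <= offset < 256                  # [0][offset:8]
        return offset
    assert 0 <= frame_id < 4 and 0 <= slot < 64
    return (1 << 8) | (frame_id << 6) | slot      # [1][frame_id:2][slot:6]
```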
## 5. The 670 Subsystem: Act ID Lookup, Match Metadata, and SC Register File

### Role in the Frame-Based Architecture
The 74LS670s are no longer a metadata cache for the matching store (as in the original design). Instead, they serve two critical functions:
1. **act_id -> frame_id lookup table.** Indexed by the token's 3-bit activation_id, outputs `{valid:1, frame_id:2, spare:1}` in ~35 ns (combinational). This replaces what would otherwise be an SRAM cycle for associative tag comparison.
2. **Presence and port metadata store.** Indexed by frame_id, stores presence and port bits for all 8 matchable offsets across all 4 frames. Combinational read (~35 ns after frame_id settles, ~70 ns total from act_id presentation).
Both functions complete within stage 2, in parallel with the IRAM read. By the time stage 3 begins, the PE knows frame_id, presence, and port -- the only remaining SRAM access is the actual operand data.
### Hardware Configuration
act_id -> frame_id (2x 74LS670): addressed by act_id[1:0], with act_id[2] selecting between chips. Each chip holds 4 words x 4 bits. Output: `{valid:1, frame_id:2, spare:1}`.

```
ALLOC:  write {valid=1, frame_id} at address act_id (670 write port)
FREE:   write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns
```
The 670's independent read and write ports allow ALLOC to proceed while the pipeline reads -- zero conflict.
Presence + port metadata (4x 74LS670): each 670 word (4 bits) holds presence+port for 2 offsets: `{presence_N:1, port_N:1, presence_N+1:1, port_N+1:1}`. Read address = [frame_id:2]; output bits are selected by offset[2:0] via the bit-select mux.

```
670 chip 0 (offsets 0-1): word[frame_id] = {pres0, port0, pres1, port1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2, port2, pres3, port3}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4, port4, pres5, port5}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6, port6, pres7, port7}

offset[2:1] selects chip, offset[0] selects which pair of bits
within the 4-bit output (a 2:1 mux -- one gate).
```
The 670's simultaneous read/write is critical: during stage 3, when a first operand stores and sets presence, the write port updates the presence 670 while the read port remains available for the next pipeline stage's lookup. No read-modify-write sequencing needed.
Bit select mux (1-2 chips):
Offset-based selection of the relevant presence and port bits from the 670 outputs.
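The chip/bit selection can be modelled in a few lines (bit ordering within the 4-bit word is assumed MSB-first per the layout above; the function name is mine):

```python
def presence_port(chips, frame_id, offset):
    """Model of the 4x 74LS670 presence/port array plus bit-select mux.
    chips[c][frame_id] is the 4-bit word {pres,port,pres,port} covering
    offsets 2c and 2c+1; offset[2:1] picks the chip, offset[0] the pair."""
    word = chips[offset >> 1][frame_id]      # offset[2:1] selects chip
    if offset & 1:                           # odd offset: low bit pair
        return (word >> 1) & 1, word & 1
    return (word >> 3) & 1, (word >> 2) & 1  # even offset: high bit pair
```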
### Chip Budget
| Component | Chips | Function |
|---|---|---|
| act_id -> frame_id lookup | 2 | 74LS670, indexed by act_id |
| Presence + port metadata | 4 | 74LS670, indexed by frame_id |
| Bit select mux | 1-2 | offset-based selection |
| Total match metadata | ~8 | |
### SC Register File (Mode-Switched)
During dataflow mode, the PE uses act_id resolution and presence metadata constantly but the SC register file is idle (no SC block executing). During SC mode, the PE uses the register file constantly but act_id lookup and presence tracking are idle (SC block has exclusive PE access; no tokens enter matching).
Some of the 670s can be repurposed for register storage during SC mode. The exact mapping depends on the SC block design:
- The 4 presence+port 670s (indexed by frame_id in dataflow mode) can be re-addressed by instruction register fields during SC mode, providing 4 chips x 4 words x 4 bits = 64 bits of register storage. Combined across chips, this gives 4 registers x 16 bits (4 bits per chip, 4 chips for width).
- With additional mux logic, all 6 shared 670s (excluding the act_id lookup pair, which may need to remain active for frame lifecycle management) could provide 6 registers x 16 bits during SC mode.
The act_id lookup 670s may need to remain in their dataflow role even during SC mode if the PE must handle frame control tokens (ALLOC/FREE) arriving during SC block execution. Whether to share them depends on the SC block entry/exit protocol.
### The Predicate Slice
One of the 670s can be permanently dedicated as a predicate register rather than participating in the mode-switched pool:
- 4 entries x 4 bits = 16 predicate bits, always available
- Useful for: conditional token routing (SWITCH), loop termination flags, SC block branch conditions, I-structure status flags
- Does not reduce the metadata capacity significantly: the remaining 3 presence+port 670s still cover 6 of the 8 matchable offsets; the 2 uncovered offsets can fall back to SRAM-based presence or simply constrain the assembler to 6 dyadic offsets per frame
The predicate register is always readable and writable regardless of mode, since it's a dedicated chip with its own address/enable lines. Instructions can test or set predicate bits without going through the matching store or the ALU result path.
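A behavioural model of the predicate slice (class and method names are mine; the instruction-level encoding for test/set is still an open item, see section 9):

```python
class PredicateSlice:
    """One dedicated 74LS670 as a 4x4 predicate register: 16 bits,
    addressed as (word, bit), readable and writable in any mode."""
    def __init__(self):
        self.words = [0, 0, 0, 0]          # 4 words x 4 bits

    def set_bit(self, word, bit, value):
        mask = 1 << bit
        if value:
            self.words[word] |= mask       # 670 write port
        else:
            self.words[word] &= ~mask

    def test_bit(self, word, bit):
        return (self.words[word] >> bit) & 1   # 670 read port
```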
### Mode Switching
When transitioning from dataflow mode to SC mode:
- Save metadata from the shared 670s to spill storage.
- Load initial SC register values (matched operand pair that triggered the SC block) into the 670s.
- Switch address mux: 670 address lines now driven by instruction register fields instead of frame_id / act_id.
- Switch IRAM to counter mode: sequential fetch via incrementing counter rather than token-directed offset.
When transitioning back:
- Emit final SC result as token (last instruction with OUT=1).
- Restore metadata from spill storage to the 670s.
- Switch address mux back to frame_id / act_id addressing.
- Resume token processing from input FIFO.
### Spill Storage Options
Metadata from the shared 670s (~64-96 bits depending on how many are shared) needs temporary storage during SC block execution.
Option A: Shift registers. 2x 74LS165 (parallel-in, serial-out) for save + 2x 74LS595 (serial-in, parallel-out) for restore. Total: 4 chips. Save/restore takes ~12 clock cycles each.
Option B: Dedicated spill 670. One additional 74LS670 (4x4 bits) holds 16 bits per save cycle; need ~4-6 write cycles to save all shared chips' contents. Total: 1 chip, ~4-6 cycles per save/restore.
Option C: Spill to frame SRAM. During SC mode, the frame SRAM has bandwidth available (no match operand reads). Write the 670 metadata contents into a reserved region of the frame SRAM address space. No extra chips needed. ~4-6 SRAM write cycles to save, ~4-6 to restore. The SRAM is single-ported but there's no contention because the pipeline is paused during mode switch.
Recommended: Option C. Zero additional chips. The save/restore overhead of ~4-6 cycles per transition is negligible compared to the SC block's execution savings (EM-4 data: 23 clocks pure dataflow vs 9 clocks SC for Fibonacci, so even with ~10 cycles of mode switch overhead, you break even at ~5-7 SC instructions).
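The break-even claim can be checked with a little arithmetic. A sketch, assuming per-instruction costs of ~3 clocks pipelined dataflow vs 1 clock in SC mode (these per-instruction figures are my reading of the cycle counts in section 3, not measured values):

```python
def sc_break_even(df_clocks_per_instr, sc_clocks_per_instr, switch_overhead):
    """Smallest SC block length for which SC mode, including the
    mode-switch overhead, beats pure dataflow execution."""
    n = 1
    while n * sc_clocks_per_instr + switch_overhead >= n * df_clocks_per_instr:
        n += 1
    return n

# ~10 cycles of switch overhead, 3 vs 1 clocks per instruction:
assert sc_break_even(3, 1, 10) == 6   # within the ~5-7 range quoted above
```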
## 6. Pipeline Timing by Era

### Key Insight
With the 670-based matching subsystem (Approach C), act_id resolution and presence/port checking are combinational (~35-70 ns) regardless of era. These never become the timing bottleneck.
The era-dependent part is SRAM access time for frame reads and writes. This determines how many SRAM operations fit per clock cycle and thus how much stage 3 vs stage 5 contention exists.
### 1979-1983 (5 MHz, 55 ns SRAM)

```
670 metadata: combinational (~35-70 ns), well within 200 ns cycle
Frame SRAM: one access per 200 ns cycle (55 ns access + setup/hold margin)
Bottleneck: frame SRAM single-port, stage 3 vs stage 5 contention
SC block throughput: ~1 instruction per clock (670 dual-port)
Overall token throughput: ~1 token per 3-5 clocks (pipelined, mode-dependent)
```
### 1984-1990 (5-10 MHz, dual-port SRAM)

```
670 metadata: combinational (unchanged)
Frame SRAM: dual-port (IDT7132 or similar), port A for stage 3, port B for stage 5
Bottleneck: eliminated -- both stages access SRAM simultaneously
SC block throughput: ~1 instruction per clock
Overall token throughput: approaches 1 token per 3 clocks for most modes
```
Dual-port SRAM eliminates the primary stall source. The pipeline becomes instruction-latency-limited rather than SRAM-contention-limited.
### Modern Parts (5 MHz clock, 15 ns SRAM)

```
670 metadata: combinational (unchanged)
Frame SRAM: 15 ns access, ~13 accesses fit in 200 ns cycle
Practical: 2-3 sub-cycle accesses via time-division multiplexing
Bottleneck: none -- frame SRAM has excess bandwidth
Token throughput: 1 token per 3 clocks (pipeline-stage-limited, not SRAM-limited)
```
With 15 ns AS7C256B-15PIN (DIP, currently available at ~$3), two sub-cycle accesses fit within a 200 ns clock period. This achieves TDM-like parallelism without additional MUX logic, effectively giving the pipeline a dual-port view of a single-port chip.
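The sub-cycle access counts above are simple division. A sketch (the 60 ns per-access overhead in the second check is an illustrative margin for address setup and mux settling, not a measured figure):

```python
def subcycle_accesses(clock_period_ns, access_ns, overhead_ns=0.0):
    """How many back-to-back SRAM accesses fit in one clock period."""
    return int(clock_period_ns // (access_ns + overhead_ns))

assert subcycle_accesses(200, 15) == 13       # theoretical ceiling
assert subcycle_accesses(200, 15, 60) == 2    # with generous per-access margin,
                                              # the practical 2-3 figure above
```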
### Integrated (on-chip SRAM, sub-ns access)

```
670 equivalent: on-chip multi-ported register file, ~200 transistors
Frame SRAM: on-chip, sub-cycle access trivially
Token throughput: 1 per 3 clocks, potentially faster with deeper pipelining
```
## 7. Interaction with PE-to-PE Pipelining
When multiple PEs are chained for software-pipelined loops (see architecture overview), the per-PE pipeline throughput determines the overall chain throughput.
With the pipelined design (1 token per 3-5 clocks depending on instruction mix and era), the inter-PE hop cost becomes the critical path for chained execution:
| Interconnect | Hop latency | Viable? |
|---|---|---|
| Shared bus (discrete build) | 5-8 cycles | Marginal -- chain overhead dominates |
| Dedicated FIFO between adjacent PEs | 2-3 cycles | Worthwhile for tight loops |
| On-chip wide parallel link (integrated) | 1-2 cycles | Competitive with intra-PE SC block |
For the discrete v0 build, dedicated inter-PE FIFOs (bypassing the shared bus) would enable PE chaining at reasonable cost. This is a low-chip-count addition (~2-4 chips per PE pair) that unlocks software-pipelined loop execution.
Loopback bypass. When a PE emits a token destined for itself
(common in iterative computations), the token can be looped back
internally without traversing the bus at all. See
bus-interconnect-design.md for the loopback bypass design, which
eliminates the bus hop latency entirely for self-targeted tokens.
## 8. The Execution Mode Spectrum
The pipelined PE with frame-based storage, SC blocks, and predicate register supports a spectrum of execution modes, selectable by the compiler per-region:
| Mode | Pipeline behaviour | Throughput | When to use |
|---|---|---|---|
| Pure dataflow | Token -> ifetch -> match/frame -> exec -> output | 1 token / 3-7 clocks (mode-dependent) | Parallel regions, independent ops |
| SC block (register) | Sequential IRAM fetch, 670 register file | ~1 instr / clock | Short sequential regions |
| SC block + predicate | As above, with conditional skip/branch via predicate bits | ~1 instr / clock | Conditional sequential regions |
| PE chain (software pipeline) | Tokens flow PE0->PE1->PE2, each PE handles one stage | 1 iteration / PE-pipeline-depth clocks | Loop bodies across PEs |
| SM-mediated sequential | Tokens to/from SM for memory-intensive work | SM-bandwidth-limited | Array/structure traversal |
The compiler partitions the program graph and selects the best mode for each region. This spectrum is arguably more expressive than what a modern OoO core offers (which has exactly one mode: "pretend to be sequential, discover parallelism at runtime").
## 9. Open Items
- **Approach selection for v0.** Approach C (670 lookup) is recommended as the starting point: combinational metadata at ~8 chips. Approach B (register-file match pool) eliminates the last SRAM cycle from matching at the cost of ~16-18 chips. Approach A (SRAM tags) is the fallback if 670 supply is a problem. The choice depends on whether chip count or pipeline throughput is the binding constraint for the initial build. See section 3 above for the full approach comparison and cycle counts.
- **Frame SRAM contention under realistic workloads.** The pipeline stall analysis in section 3 uses worst-case consecutive tokens. Simulate representative dataflow programs in the behavioural emulator to measure actual stage 3 vs stage 5 contention rates and determine whether dual-port SRAM or faster SRAM is justified for v0.
- **SC block register capacity.** With 4-6 registers available from repurposed 670s (depending on how many are shared), what is the longest SC block the compiler can generate before register pressure forces a spill? Evaluate empirically on target workloads.
- **Predicate register encoding.** Document specific instruction encodings for predicate test/set/clear, and how SWITCH instructions interact with predicate bits. The predicate register may subsume some of the cancel-bit functionality planned for the token format.
- **Mode switch latency measurement.** Build a cycle-accurate model of the save-to-SRAM / restore-from-SRAM path and determine exact overhead. Target: <=10 cycles per transition.
- **Assembler stall analysis.** The assembler can statically detect instruction pairs whose output tokens may cause frame SRAM contention on the same PE. For hot loops, the assembler can insert mode 4 NOP tokens (zero frame access) as pipeline padding. Validate static stall estimates against emulator simulation, since runtime arrival timing depends on network latency and SM response times.
- **8-offset matchable constraint validation.** The 670-based presence metadata limits dyadic instructions to offsets 0-7 per frame. Evaluate whether this is sufficient for compiled programs. If tight, the hybrid upgrade path (offset[3]=0 checks 670s, offset[3]=1 falls back to SRAM tags) adds ~4-6 chips of SRAM tag logic for offsets 8-15+.