OR-1 dataflow CPU sketch
# PE Pipelining and Metadata/SC Register Multiplexing

Design notes on pipelining the PE's token processing path under the
revised frame-based architecture, the frame SRAM contention problem
between pipeline stages, and the 74LS670-based subsystem that serves
as both the act_id resolution / presence metadata store (dataflow
mode) and an SC block register file (sequential mode).

See `pe-design.md` for the frame-based PE architecture and matching
store, `architecture-overview.md` for the token format, and
`iram-and-function-calls.md` for instruction encoding. This document
focuses on pipeline timing, stall analysis, and the 670 subsystem
design; the comparison of matching approaches is in section 3.

---

## 1. Pipeline Stages (Reversed Order)

The PE pipeline has been reversed relative to the original design:
IFETCH now precedes MATCH. The instruction word drives match
behaviour, frame access patterns, and output routing. The token's
`activation_id` drives associative lookup in parallel with the IRAM
read, hiding resolution latency behind SRAM access time.

```
Stage 1: INPUT        Deserialise flits from bus, classify token type
Stage 2: IFETCH       IRAM read + act_id resolution (parallel, via 670)
Stage 3: MATCH/FRAME  Match check + constant read (variable cycles)
Stage 4: EXECUTE      ALU operation (no SRAM access)
Stage 5: OUTPUT       Destination read from frame + token emission
```

**Why IFETCH before MATCH.** In the original design, the pipeline was
MATCH -> IFETCH -> EXECUTE -> OUTPUT. The match stage ran first,
using token prefix bits to decide whether the operand was first or
second. The instruction fetch happened only after matching succeeded.
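The reversed order can be made concrete with a small behavioural sketch (Python; the token fields, IRAM indexing, and helper names are illustrative assumptions, not the emulator's actual format):

```python
# Stage 2 performs the IRAM read and, for dyadic tokens, the 670
# act_id lookup in the same cycle; both results are ready before
# stage 3 touches the frame SRAM.

def process_token(token, iram, lookup_670):
    # Stage 1: INPUT -- classify from the dyadic/monadic prefix bit
    dyadic = token["prefix"] == "dyadic"

    # Stage 2: IFETCH -- IRAM read, with act_id -> frame_id resolution
    # running in parallel (combinational 670 read, no extra cycle)
    instr = iram[token["offset"]]
    meta = lookup_670[token["act_id"]] if dyadic else None

    # Stage 3 onward now has everything except operand/constant data.
    return instr, meta
```

Monadic tokens skip the lookup entirely, which is consistent with the zero-cycle stage 3 rows in the section 3 tables.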
With the frame-based architecture, the instruction word determines
*how* matching works: whether the instruction is dyadic or monadic,
which frame slots to read for operands and constants, whether to
write back to the frame (sink modes), and how many destinations to
read at output. Fetching the instruction first gives the pipeline
controller all the information it needs to sequence stage 3's SRAM
accesses efficiently.

The token's dyadic/monadic prefix (retained from the original format)
enables parallel work: when the prefix indicates "dyadic," stage 2
starts act_id -> frame_id resolution via the 670s simultaneously with
the IRAM read. By the time stage 3 begins, both the instruction word
and the frame_id / presence / port metadata are available, and the
only remaining SRAM work is reading or writing actual operand data
and constants.

**Unpipelined throughput:** one token every 3-7 cycles, depending on
instruction mode (see section 3 for detailed cycle counts). This is
the baseline against which pipeline overlap improvements are measured.

---

## 2. The Frame SRAM Contention Problem

### The Shifted Bottleneck

The original design's critical-path problem was a read-modify-write
on the matching store SRAM: read the cell, check metadata, then
either store or retrieve an operand, all within a single cycle. With
200 ns 2114 SRAM, only one operation (read OR write) fit per clock.

The frame-based redesign eliminates this problem entirely. With
Approach C (670 lookup), act_id -> frame_id resolution is
combinational (~35 ns via the 670 read port), and the presence/port
check is also combinational (~35 ns from a second set of 670s).
There is no read-modify-write on SRAM for metadata.

The new bottleneck is **frame SRAM contention between stage 3 and
stage 5**. Both stages access the same single-ported SRAM chip pair:

- **Stage 3** reads/writes operand data (dyadic match) and reads
  constants (modes with has_const=1).
- **Stage 5** reads destinations (modes 0-3), or writes results back
  to the frame (sink modes 6-7).

When two pipelined tokens have stage 3 and stage 5 active in the
same cycle, the SRAM can serve only one. The other stalls.

### The Pipeline Hazard

The classic RAW hazard still exists but takes a different form. Two
consecutive tokens targeting the same frame slot (e.g., two mode 7
read-modify-write operations on the same accumulator slot) create a
data dependency: the second token's stage 3 read must see the first
token's stage 5 write.

Detection requires comparing (act_id, fref) of the incoming token
against the in-flight pipeline latches at stages 3-5. Hardware cost:
~2 chips (9-bit comparator + AND gate). Alternatively, the assembler
can guarantee the hazard never arises by never emitting consecutive
mode 7 tokens to the same slot on the same PE.

As with the original design, this hazard is **statistically
uncommon** in dataflow execution. Two operands arriving back-to-back
at the exact same frame slot requires coincidental timing. The
bypass path is cheap insurance that fires infrequently.

---

## 3. Pipeline Solutions and Cycle Counts

### SRAM Contention Model

The frame SRAM chip is single-ported (one access per clock cycle at
5 MHz with 55 ns SRAM). The primary stall source is contention
between stage 3 (frame reads for operand data and constants) and
stage 5 (frame reads for destinations, or frame writes for sink
modes).

**Contention arises only when:**

- Token A is at stage 5, needing a frame SRAM read (dest) or write
  (sink), AND
- Token B is at stage 3, needing a frame SRAM read (match operand,
  constant, or tag word).
**Contention does NOT arise when:**

- Token A's stage 5 is mode 4/5 (change_tag -- no SRAM access).
- Token B's stage 3 is zero-cycle (monadic no-const, or match data
  in the register file with no const).
- Token A was a dyadic miss (terminated at stage 3, never reaches
  stage 5).

### Cycle Counts by Instruction Type

**Approach C (74LS670 lookup, recommended v0):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     1     --    --    3
dyadic hit, mode 0               1     1     1     1     1     5
dyadic hit, mode 1               1     1     2     1     1     6
dyadic hit, mode 3 (fan+const)   1     1     2     1     2     7
```

Stage 3 breakdown for Approach C:

- Dyadic hit: 1 SRAM cycle to read the stored operand (frame_id and
  presence already known from the 670). +1 cycle for the constant if
  has_const=1.
- Dyadic miss: 1 SRAM cycle to write operand data. The 670 write port
  sets the presence bit combinationally in parallel.
- Monadic: 0 SRAM cycles (no match), +1 for the constant if
  has_const=1.

**Approach B (register-file match pool):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     1     --    --    3
dyadic hit, mode 0               1     1     1     1     1     5
dyadic hit, mode 1               1     1     2     1     1     6
dyadic hit, mode 3 (fan+const)   1     1     2     1     2     7
```

Approaches B and C produce identical single-token cycle counts. The
difference emerges under pipelining: Approach B's match data never
touches the frame SRAM (operands are stored in a dedicated register
file), so stage 3's only SRAM access is the constant read. This
reduces stage 3 vs stage 5 SRAM contention.
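The Approach B/C totals can be reproduced from the stage breakdown. A sketch (Python; the boolean/mode arguments are an interpretive encoding of the table rows -- in particular, folding the mode 7 slot read into the same stage 3 cycle as a constant read is an assumption made to match the totals):

```python
# Per-token cycle total for Approach C (and B, which is identical
# unpipelined), mirroring the tables above.

def cycles_approach_c(dyadic, hit, has_const, mode):
    total = 1 + 1                       # stg1 INPUT + stg2 IFETCH, always 1 each
    if dyadic:
        stg3 = 1                        # operand read (hit) or write (miss)
        if not hit:
            return total + stg3         # miss: token terminates at stage 3
        stg3 += 1 if has_const else 0   # extra SRAM cycle for the constant
    else:
        stg3 = 1 if (has_const or mode == 7) else 0  # const read / RMW slot read
    stg5 = 0 if mode in (4, 5) else (2 if mode == 3 else 1)  # dests / sink write
    return total + stg3 + 1 + stg5      # +1 for stg4 EXECUTE
```

A helper like this is also a convenient cross-check for the emulator's pipelined cycle accounting.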
**Approach A (set-associative tags in SRAM, minimal chips):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     2     --    --    4
dyadic hit, mode 0               1     1     2     1     1     6
dyadic hit, mode 1               1     1     3     1     1     7
dyadic hit, mode 3 (fan+const)   1     1     3     1     2     8
```

Approach A adds 1 extra SRAM cycle per dyadic operation (tag word
read + associative compare) because act_id resolution is not
combinational.

### Pipeline Overlap Analysis

With single-port frame SRAM at 5 MHz, the pipeline controller must
arbitrate between stage 3 and stage 5. When both need SRAM in the
same cycle, stage 3 stalls.

**Approach B, two consecutive dyadic-hit mode 1 tokens:**

```
cycle 0: A.stg1
cycle 1: A.stg2 (IRAM)
cycle 2: A.stg3 match (reg file)  -- frame SRAM FREE
cycle 3: A.stg3 const (SRAM)
cycle 4: A.stg4 (ALU)             -- frame SRAM FREE
cycle 5: A.stg5 dest (SRAM)   B.stg3 match (reg file)  -- NO CONFLICT
cycle 6: (A done)             B.stg3 const (SRAM)
cycle 7:                      B.stg4 (ALU)
cycle 8:                      B.stg5 dest (SRAM)       -- NO CONFLICT
```

Token spacing: 4 cycles. Approach A under the same conditions: ~6-7
cycles due to additional SRAM contention in stage 3.

### Throughput Summary (per PE, at 5 MHz, single-port frame SRAM)

| Instruction mix profile | Approach A | Approach C | Approach B |
|-------------------------|------------|------------|------------|
| Monadic-heavy (mode 0/4/6) | ~1.25 MIPS | ~1.67 MIPS | ~1.67 MIPS |
| Mixed (40% dyadic mode 1, 30% monadic, 30% misc) | ~833 KIPS | ~1.25 MIPS | ~1.25 MIPS |
| Dyadic-heavy with constants | ~714 KIPS | ~1.00 MIPS | ~1.00 MIPS |
| Worst case (mode 3, const+fanout) | ~625 KIPS | ~714 KIPS | ~714 KIPS |

4-PE system: multiply by 4. Realistic mixed workload: ~3.3-5.0 MIPS
(A) or ~5.0-6.7 MIPS (B or C). For reference: the original Amamiya
DFM prototype (TTL, 1982) achieved 1.8 MIPS per PE; the EM-4
prototype (VLSI gate array, 1990) achieved 12.5 MIPS per PE. This
design sits between the two, closer to the DFM, which is
historically appropriate for a discrete TTL build.

---

## 4. SRAM Configuration

### Unified SRAM Chip Pair

The PE uses a single 32Kx8 chip pair (2 chips for 16-bit data width)
for both IRAM and frame storage, with address partitioning via a
single decode bit. The recommended part is the AS6C62256 (55 ns,
32Kx8, DIP-28) or equivalent. 55 ns access time fits comfortably
within a 200 ns clock period at 5 MHz, with margin for address setup
and data hold.

The 2114 (1Kx4, 200 ns) from the original design is no longer used.
The unified SRAM approach eliminates the chip proliferation problem:
one chip pair per PE replaces the 4-6 SRAM chips previously needed
for separate matching store and IRAM.

### Address Map

```
v0 address space (simple decode, no 610):

  IRAM region:   [0][offset:8]            instruction templates
                                          offset from token
                                          capacity: 256 instructions (512 bytes)

  Frame region:  [1][frame_id:2][slot:6]  per-activation storage
                                          frame_id from tag store resolution
                                          capacity: 4 frames x 64 slots
                                                    = 256 entries (512 bytes)

Future address space (with 610 bank switching):

  IRAM region:   [0][bank:4][offset:8]    bank-switched templates
                                          bank from 610 mapper
                                          capacity: 16 banks x 256 instructions
                                                    = 4096 entries

  Frame region:  [1][frame_id:2][slot:6]  (unchanged)
```

Total v0 SRAM utilisation: under 1.5 KB used out of a 32Kx8 chip
pair (64 KB). Ample room for future expansion without changing chips.
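A sketch of the v0 address composition implied by the map above (Python as a bit-level model; helper names are hypothetical, field widths as documented):

```python
# One decode bit (bit 8) selects IRAM vs frame region in the v0 map.

def iram_addr(offset):
    assert 0 <= offset < 256                  # [0][offset:8]
    return offset

def frame_addr(frame_id, slot):
    assert 0 <= frame_id < 4 and 0 <= slot < 64
    return (1 << 8) | (frame_id << 6) | slot  # [1][frame_id:2][slot:6]
```

Each region spans 256 16-bit words (512 bytes), consistent with the under-1.5 KB utilisation figure.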
SRAM address lines are pre-routed to a 74LS610 socket with a jumper
wire in place of the chip; when bank switching is needed for programs
exceeding 256 instructions per PE, the 610 drops in with no board
changes.

---

## 5. The 670 Subsystem: Act ID Lookup, Match Metadata, and SC Register File

### Role in the Frame-Based Architecture

The 74LS670s are no longer a metadata cache for the matching store
(as in the original design). Instead, they serve two critical
functions:

1. **act_id -> frame_id lookup table.** Indexed by the token's 3-bit
   `activation_id`, outputs `{valid:1, frame_id:2, spare:1}` in
   ~35 ns (combinational). This replaces what would otherwise be an
   SRAM cycle for associative tag comparison.

2. **Presence and port metadata store.** Indexed by `frame_id`,
   stores presence and port bits for all 8 matchable offsets across
   all 4 frames. Combinational read (~35 ns after frame_id settles,
   ~70 ns total from act_id presentation).

Both functions complete within stage 2, in parallel with the IRAM
read. By the time stage 3 begins, the PE knows frame_id, presence,
and port -- the only remaining SRAM access is the actual operand
data.

### Hardware Configuration

**act_id -> frame_id (2x 74LS670):**

Addressed by `act_id[1:0]`, with `act_id[2]` selecting between chips.
Each chip holds 4 words x 4 bits. Output: `{valid:1, frame_id:2,
spare:1}`.

```
ALLOC:  write {valid=1, frame_id} at address act_id (670 write port)
FREE:   write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns
```

The 670's independent read and write ports allow ALLOC to proceed
while the pipeline reads -- zero conflict.

**Presence + port metadata (4x 74LS670):**

Each 670 word (4 bits) holds presence+port for 2 offsets:
`{presence_N:1, port_N:1, presence_N+1:1, port_N+1:1}`.
Read address = `[frame_id:2]`. Output bits selected by
`offset[2:0]` via a bit-select mux.

```
670 chip 0 (offsets 0-1): word[frame_id] = {pres0, port0, pres1, port1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2, port2, pres3, port3}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4, port4, pres5, port5}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6, port6, pres7, port7}
```

`offset[2:1]` selects the chip, `offset[0]` selects which pair of
bits within the 4-bit output (a 2:1 mux -- one gate).

The 670's simultaneous read/write is critical: during stage 3, when
a first operand stores and sets presence, the write port updates the
presence 670 while the read port remains available for the next
pipeline stage's lookup. No read-modify-write sequencing is needed.

**Bit-select mux (1-2 chips):**

Offset-based selection of the relevant presence and port bits from
the 670 outputs.

### Chip Budget

| Component                 | Chips  | Function                     |
|---------------------------|--------|------------------------------|
| act_id -> frame_id lookup | 2      | 74LS670, indexed by act_id   |
| Presence + port metadata  | 4      | 74LS670, indexed by frame_id |
| Bit-select mux            | 1-2    | offset-based selection       |
| **Total match metadata**  | **~8** |                              |

### SC Register File (Mode-Switched)

During **dataflow mode**, the PE uses act_id resolution and presence
metadata constantly, but the SC register file is idle (no SC block
executing). During **SC mode**, the PE uses the register file
constantly, but act_id lookup and presence tracking are idle (the SC
block has exclusive PE access; no tokens enter matching).

Some of the 670s can be repurposed for register storage during SC
mode.
The exact mapping depends on the SC block design:

- The 4 presence+port 670s (indexed by frame_id in dataflow mode) can
  be re-addressed by instruction register fields during SC mode,
  providing 4 chips x 4 words x 4 bits = 64 bits of register storage.
  Combined across chips, this gives **4 registers x 16 bits** (4 bits
  per chip, 4 chips for width).

- With additional mux logic, all 6 670s (including the act_id lookup
  pair, if it need not remain active for frame lifecycle management)
  could provide **6 registers x 16 bits** during SC mode.

The act_id lookup 670s may need to remain in their dataflow role even
during SC mode if the PE must handle frame control tokens (ALLOC/FREE)
arriving during SC block execution. Whether to share them depends on
the SC block entry/exit protocol.

### The Predicate Slice

One of the 670s can be **permanently dedicated as a predicate
register** rather than participating in the mode-switched pool:

- 4 entries x 4 bits = 16 predicate bits, always available.
- Useful for: conditional token routing (SWITCH), loop termination
  flags, SC block branch conditions, I-structure status flags.
- Does not reduce metadata capacity significantly: the remaining
  3 presence+port 670s still cover 6 of the 8 matchable offsets;
  the 2 uncovered offsets can fall back to SRAM-based presence, or
  the assembler can simply be constrained to 6 dyadic offsets per
  frame.

The predicate register is always readable and writable regardless of
mode, since it is a dedicated chip with its own address/enable lines.
Instructions can test or set predicate bits without going through the
matching store or the ALU result path.

### Mode Switching

When transitioning from dataflow mode to SC mode:

1. **Save metadata** from the shared 670s to spill storage.
2. **Load initial SC register values** (the matched operand pair that
   triggered the SC block) into the 670s.
3. **Switch the address mux**: 670 address lines are now driven by
   instruction register fields instead of frame_id / act_id.
4. **Switch IRAM to counter mode**: sequential fetch via an
   incrementing counter rather than token-directed offset.

When transitioning back:

1. **Emit the final SC result** as a token (last instruction with
   OUT=1).
2. **Restore metadata** from spill storage to the 670s.
3. **Switch the address mux back** to frame_id / act_id addressing.
4. **Resume token processing** from the input FIFO.

### Spill Storage Options

Metadata from the shared 670s (~64-96 bits depending on how many
are shared) needs temporary storage during SC block execution.

**Option A: Shift registers.** 2x 74LS165 (parallel-in, serial-out)
for save + 2x 74LS595 (serial-in, parallel-out) for restore. Total:
4 chips. Save/restore takes ~12 clock cycles each.

**Option B: Dedicated spill 670s.** Each additional 74LS670 (4x4
bits) holds only 16 bits, so covering the full 64-96 bits of shared
metadata takes 4-6 spill chips, written one word per cycle. Total:
4-6 chips, ~4-6 cycles per save/restore.

**Option C: Spill to frame SRAM.** During SC mode, the frame SRAM
has bandwidth available (no match operand reads). Write the 670
metadata contents into a reserved region of the frame SRAM address
space. No extra chips needed. ~4-6 SRAM write cycles to save, ~4-6
to restore. The SRAM is single-ported, but there is no contention
because the pipeline is paused during the mode switch.

**Recommended: Option C.** Zero additional chips. The save/restore
overhead of ~4-6 cycles per transition is negligible compared to the
SC block's execution savings (EM-4 data: 23 clocks pure dataflow vs
9 clocks SC for Fibonacci, so even with ~10 cycles of mode switch
overhead, you break even at ~5-7 SC instructions).

---

## 6. Pipeline Timing by Era

### Key Insight

With the 670-based matching subsystem (Approach C), act_id
resolution and presence/port checking are combinational (~35-70 ns)
**regardless of era**. These never become the timing bottleneck.

The era-dependent part is **SRAM access time** for frame reads and
writes. This determines how many SRAM operations fit per clock cycle
and thus how much stage 3 vs stage 5 contention exists.

### 1979-1983 (5 MHz, 55 ns SRAM)

```
670 metadata:  combinational (~35-70 ns), well within 200 ns cycle
Frame SRAM:    one access per 200 ns cycle (55 ns access + setup/hold margin)
Bottleneck:    frame SRAM single-port, stage 3 vs stage 5 contention
SC block throughput:      ~1 instruction per clock (670 dual-port)
Overall token throughput: ~1 token per 3-5 clocks (pipelined, mode-dependent)
```

### 1984-1990 (5-10 MHz, dual-port SRAM)

```
670 metadata:  combinational (unchanged)
Frame SRAM:    dual-port (IDT7132 or similar), port A for stage 3,
               port B for stage 5
Bottleneck:    eliminated -- both stages access SRAM simultaneously
SC block throughput:      ~1 instruction per clock
Overall token throughput: approaches 1 token per 3 clocks for most modes
```

Dual-port SRAM eliminates the primary stall source. The pipeline
becomes instruction-latency-limited rather than SRAM-contention-limited.

### Modern Parts (5 MHz clock, 15 ns SRAM)

```
670 metadata:  combinational (unchanged)
Frame SRAM:    15 ns access, ~13 accesses fit in a 200 ns cycle
Practical:     2-3 sub-cycle accesses via time-division multiplexing
Bottleneck:    none -- frame SRAM has excess bandwidth
Token throughput: 1 token per 3 clocks (pipeline-stage-limited,
                  not SRAM-limited)
```

With 15 ns AS7C256B-15PIN (DIP, currently available at ~$3), two
sub-cycle accesses fit within a 200 ns clock period.
This achieves TDM-like parallelism without additional MUX logic,
effectively giving the pipeline a dual-port view of a single-port
chip.

### Integrated (on-chip SRAM, sub-ns access)

```
670 equivalent: on-chip multi-ported register file, ~200 transistors
Frame SRAM:     on-chip, sub-cycle access trivially
Token throughput: 1 per 3 clocks, potentially faster with deeper pipelining
```

---

## 7. Interaction with PE-to-PE Pipelining

When multiple PEs are chained for software-pipelined loops (see the
architecture overview), the per-PE pipeline throughput determines the
overall chain throughput.

With the pipelined design (1 token per 3-5 clocks depending on
instruction mix and era), the inter-PE hop cost becomes the critical
path for chained execution:

| Interconnect | Hop latency | Viable? |
|--------------|-------------|---------|
| Shared bus (discrete build) | 5-8 cycles | Marginal -- chain overhead dominates |
| Dedicated FIFO between adjacent PEs | 2-3 cycles | Worthwhile for tight loops |
| On-chip wide parallel link (integrated) | 1-2 cycles | Competitive with intra-PE SC block |

For the discrete v0 build, dedicated inter-PE FIFOs (bypassing the
shared bus) would enable PE chaining at reasonable cost. This is a
low-chip-count addition (~2-4 chips per PE pair) that unlocks
software-pipelined loop execution.

**Loopback bypass.** When a PE emits a token destined for itself
(common in iterative computations), the token can be looped back
internally without traversing the bus at all. See
`bus-interconnect-design.md` for the loopback bypass design, which
eliminates the bus hop latency entirely for self-targeted tokens.

---

## 8. The Execution Mode Spectrum

The pipelined PE with frame-based storage, SC blocks, and the
predicate register supports a spectrum of execution modes, selectable
by the compiler per region:

| Mode | Pipeline behaviour | Throughput | When to use |
|------|--------------------|------------|-------------|
| Pure dataflow | Token -> ifetch -> match/frame -> exec -> output | 1 token / 3-7 clocks (mode-dependent) | Parallel regions, independent ops |
| SC block (register) | Sequential IRAM fetch, 670 register file | ~1 instr / clock | Short sequential regions |
| SC block + predicate | As above, with conditional skip/branch via predicate bits | ~1 instr / clock | Conditional sequential regions |
| PE chain (software pipeline) | Tokens flow PE0 -> PE1 -> PE2, each PE handles one stage | 1 iteration / PE-pipeline-depth clocks | Loop bodies across PEs |
| SM-mediated sequential | Tokens to/from SM for memory-intensive work | SM-bandwidth-limited | Array/structure traversal |

The compiler partitions the program graph and selects the best mode
for each region. This spectrum is arguably more expressive than what
a modern OoO core offers (which has exactly one mode: "pretend to be
sequential, discover parallelism at runtime").

---

## 9. Open Items

1. **Approach selection for v0.** Approach C (670 lookup) is
   recommended as the starting point: combinational metadata at ~8
   chips. Approach B (register-file match pool) eliminates the last
   SRAM cycle from matching at the cost of ~16-18 chips. Approach A
   (SRAM tags) is the fallback if 670 supply is a problem. The
   choice depends on whether chip count or pipeline throughput is
   the binding constraint for the initial build. See section 3 above
   for the full approach comparison and cycle counts.

2. **Frame SRAM contention under realistic workloads.** The pipeline
   stall analysis in section 3 uses worst-case consecutive tokens.
   Simulate representative dataflow programs in the behavioural
   emulator to measure actual stage 3 vs stage 5 contention rates
   and determine whether dual-port SRAM or faster SRAM is justified
   for v0.

3. **SC block register capacity.** With 4-6 registers available from
   repurposed 670s (depending on how many are shared), what is the
   longest SC block the compiler can generate before register
   pressure forces a spill? Evaluate empirically on target workloads.

4. **Predicate register encoding.** Document specific instruction
   encodings for predicate test/set/clear, and how SWITCH
   instructions interact with predicate bits. The predicate register
   may subsume some of the cancel-bit functionality planned for the
   token format.

5. **Mode switch latency measurement.** Build a cycle-accurate model
   of the save-to-SRAM / restore-from-SRAM path and determine the
   exact overhead. Target: <=10 cycles per transition.

6. **Assembler stall analysis.** The assembler can statically detect
   instruction pairs whose output tokens may cause frame SRAM
   contention on the same PE. For hot loops, the assembler can
   insert mode 4 NOP tokens (zero frame access) as pipeline padding.
   Validate static stall estimates against emulator simulation, since
   runtime arrival timing depends on network latency and SM response
   times.

7. **8-offset matchable constraint validation.** The 670-based
   presence metadata limits dyadic instructions to offsets 0-7 per
   frame. Evaluate whether this is sufficient for compiled programs.
   If tight, the hybrid upgrade path (offset[3]=0 checks the 670s,
   offset[3]=1 falls back to SRAM tags) adds ~4-6 chips of SRAM tag
   logic for offsets 8-15+.
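The padding pass from open item 6 can be prototyped in a few lines (Python; `contends` is a hypothetical predicate produced by the static stall analysis, and the instruction dicts are illustrative, not the assembler's real IR):

```python
# Insert mode 4 NOP tokens (zero frame access) between instruction
# pairs flagged as likely stage 3 / stage 5 SRAM contenders.

def pad_with_nops(instrs, contends, nop={"mode": 4, "op": "NOP"}):
    out = []
    for ins in instrs:
        if out and contends(out[-1], ins):
            out.append(dict(nop))   # pipeline padding: no frame SRAM access
        out.append(ins)
    return out
```

Validating the `contends` predicate against emulator traces is exactly the open item: static detection is conservative, since runtime arrival timing depends on network latency and SM response times.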