OR-1 dataflow CPU sketch
# PE Pipelining and Metadata/SC Register Multiplexing

Design notes on pipelining the PE's token processing path under the
revised frame-based architecture, the frame SRAM contention problem
between pipeline stages, and the 74LS670-based subsystem that serves
as both the act_id resolution / presence metadata store (dataflow
mode) and an SC block register file (sequential mode).

See `pe-design.md` for the frame-based PE architecture and matching
store, `architecture-overview.md` for the token format, and
`iram-and-function-calls.md` for instruction encoding. This document
focuses on pipeline timing, stall analysis, and the 670 subsystem
design; the comparison of matching approaches is in section 3.

---

## 1. Pipeline Stages (Reversed Order)

The PE pipeline has been reversed relative to the original design:
IFETCH now precedes MATCH. The instruction word drives match
behaviour, frame access patterns, and output routing. The token's
`activation_id` drives associative lookup in parallel with the IRAM
read, hiding resolution latency behind SRAM access time.

```
Stage 1: INPUT        Deserialise flits from bus, classify token type
Stage 2: IFETCH       IRAM read + act_id resolution (parallel, via 670)
Stage 3: MATCH/FRAME  Match check + constant read (variable cycles)
Stage 4: EXECUTE      ALU operation (no SRAM access)
Stage 5: OUTPUT       Destination read from frame + token emission
```

**Why IFETCH before MATCH.** In the original design, the pipeline was
MATCH -> IFETCH -> EXECUTE -> OUTPUT. The match stage ran first,
using token prefix bits to decide whether the operand was first or
second. The instruction fetch happened only after matching succeeded.
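The reversed order can be made concrete with a small behavioural sketch (Python; the token fields, IRAM indexing, and helper names are illustrative assumptions, not the emulator's actual format):

```python
# Stage 2 performs the IRAM read and, for dyadic tokens, the 670
# act_id lookup in the same cycle; both results are ready before
# stage 3 touches the frame SRAM.

def process_token(token, iram, lookup_670):
    # Stage 1: INPUT -- classify from the dyadic/monadic prefix bit
    dyadic = token["prefix"] == "dyadic"

    # Stage 2: IFETCH -- IRAM read, with act_id -> frame_id resolution
    # running in parallel (combinational 670 read, no extra cycle)
    instr = iram[token["offset"]]
    meta = lookup_670[token["act_id"]] if dyadic else None

    # Stage 3 onward now has everything except operand/constant data.
    return instr, meta
```

Monadic tokens skip the lookup entirely, which is consistent with the zero-cycle stage 3 rows in the section 3 tables.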
With the frame-based architecture, the instruction word determines
*how* matching works: whether the instruction is dyadic or monadic,
which frame slots to read for operands and constants, whether to
write back to the frame (sink modes), and how many destinations to
read at output. Fetching the instruction first gives the pipeline
controller all the information it needs to sequence stage 3's SRAM
accesses efficiently.

The token's dyadic/monadic prefix (retained from the original format)
enables parallel work: when the prefix indicates "dyadic," stage 2
starts act_id -> frame_id resolution via the 670s simultaneously with
the IRAM read. By the time stage 3 begins, both the instruction word
and the frame_id / presence / port metadata are available, and the
only remaining SRAM work is reading or writing actual operand data
and constants.

**Unpipelined throughput:** one token every 3-7 cycles, depending on
instruction mode (see section 3 for detailed cycle counts). This is
the baseline against which pipeline overlap improvements are measured.

---

## 2. The Frame SRAM Contention Problem

### The Shifted Bottleneck

The original design's critical-path problem was a read-modify-write
on the matching store SRAM: read the cell, check metadata, then
either store or retrieve an operand, all within a single cycle. With
200 ns 2114 SRAM, only one operation (read OR write) fit per clock.

The frame-based redesign eliminates this problem entirely. With
Approach C (670 lookup), act_id -> frame_id resolution is
combinational (~35 ns via the 670 read port), and the presence/port
check is also combinational (~35 ns from a second set of 670s).
There is no read-modify-write on SRAM for metadata.

The new bottleneck is **frame SRAM contention between stage 3 and
stage 5**. Both stages access the same single-ported SRAM chip pair:

- **Stage 3** reads/writes operand data (dyadic match) and reads
  constants (modes with has_const=1).
- **Stage 5** reads destinations (modes 0-3), or writes results back
  to the frame (sink modes 6-7).

When two pipelined tokens have stage 3 and stage 5 active in the
same cycle, the SRAM can serve only one. The other stalls.

### The Pipeline Hazard

The classic RAW hazard still exists but takes a different form. Two
consecutive tokens targeting the same frame slot (e.g., two mode 7
read-modify-write operations on the same accumulator slot) create a
data dependency: the second token's stage 3 read must see the first
token's stage 5 write.

Detection requires comparing (act_id, fref) of the incoming token
against the in-flight pipeline latches at stages 3-5. Hardware cost:
~2 chips (9-bit comparator + AND gate). Alternatively, the assembler
can guarantee the hazard never arises by never emitting consecutive
mode 7 tokens to the same slot on the same PE.

As with the original design, this hazard is **statistically
uncommon** in dataflow execution. Two operands arriving back-to-back
at the exact same frame slot requires coincidental timing. The
bypass path is cheap insurance that fires infrequently.

---

## 3. Pipeline Solutions and Cycle Counts

### SRAM Contention Model

The frame SRAM chip is single-ported (one access per clock cycle at
5 MHz with 55 ns SRAM). The primary stall source is contention
between stage 3 (frame reads for operand data and constants) and
stage 5 (frame reads for destinations, or frame writes for sink
modes).

**Contention arises only when:**

- Token A is at stage 5, needing a frame SRAM read (dest) or write
  (sink), AND
- Token B is at stage 3, needing a frame SRAM read (match operand,
  constant, or tag word).
**Contention does NOT arise when:**

- Token A's stage 5 is mode 4/5 (change_tag -- no SRAM access).
- Token B's stage 3 is zero-cycle (monadic no-const, or match data
  in the register file with no const).
- Token A was a dyadic miss (terminated at stage 3, never reaches
  stage 5).

### Cycle Counts by Instruction Type

**Approach C (74LS670 lookup, recommended v0):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     1     --    --    3
dyadic hit, mode 0               1     1     1     1     1     5
dyadic hit, mode 1               1     1     2     1     1     6
dyadic hit, mode 3 (fan+const)   1     1     2     1     2     7
```

Stage 3 breakdown for Approach C:

- Dyadic hit: 1 SRAM cycle to read the stored operand (frame_id and
  presence already known from the 670). +1 cycle for the constant if
  has_const=1.
- Dyadic miss: 1 SRAM cycle to write operand data. The 670 write port
  sets the presence bit combinationally in parallel.
- Monadic: 0 SRAM cycles (no match), +1 for the constant if
  has_const=1.

**Approach B (register-file match pool):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     1     --    --    3
dyadic hit, mode 0               1     1     1     1     1     5
dyadic hit, mode 1               1     1     2     1     1     6
dyadic hit, mode 3 (fan+const)   1     1     2     1     2     7
```

Approaches B and C produce identical single-token cycle counts. The
difference emerges under pipelining: Approach B's match data never
touches the frame SRAM (operands are stored in a dedicated register
file), so stage 3's only SRAM access is the constant read. This
reduces stage 3 vs stage 5 SRAM contention.
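The Approach B/C totals can be reproduced from the stage breakdown. A sketch (Python; the boolean/mode arguments are an interpretive encoding of the table rows -- in particular, folding the mode 7 slot read into the same stage 3 cycle as a constant read is an assumption made to match the totals):

```python
# Per-token cycle total for Approach C (and B, which is identical
# unpipelined), mirroring the tables above.

def cycles_approach_c(dyadic, hit, has_const, mode):
    total = 1 + 1                       # stg1 INPUT + stg2 IFETCH, always 1 each
    if dyadic:
        stg3 = 1                        # operand read (hit) or write (miss)
        if not hit:
            return total + stg3         # miss: token terminates at stage 3
        stg3 += 1 if has_const else 0   # extra SRAM cycle for the constant
    else:
        stg3 = 1 if (has_const or mode == 7) else 0  # const read / RMW slot read
    stg5 = 0 if mode in (4, 5) else (2 if mode == 3 else 1)  # dests / sink write
    return total + stg3 + 1 + stg5      # +1 for stg4 EXECUTE
```

A helper like this is also a convenient cross-check for the emulator's pipelined cycle accounting.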
**Approach A (set-associative tags in SRAM, minimal chips):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     2     --    --    4
dyadic hit, mode 0               1     1     2     1     1     6
dyadic hit, mode 1               1     1     3     1     1     7
dyadic hit, mode 3 (fan+const)   1     1     3     1     2     8
```

Approach A adds 1 extra SRAM cycle per dyadic operation (tag word
read + associative compare) because act_id resolution is not
combinational.

### Pipeline Overlap Analysis

With single-port frame SRAM at 5 MHz, the pipeline controller must
arbitrate between stage 3 and stage 5. When both need SRAM in the
same cycle, stage 3 stalls.

**Approach B, two consecutive dyadic-hit mode 1 tokens:**

```
cycle 0: A.stg1
cycle 1: A.stg2 (IRAM)
cycle 2: A.stg3 match (reg file)  -- frame SRAM FREE
cycle 3: A.stg3 const (SRAM)
cycle 4: A.stg4 (ALU)             -- frame SRAM FREE
cycle 5: A.stg5 dest (SRAM)   B.stg3 match (reg file)  -- NO CONFLICT
cycle 6: (A done)             B.stg3 const (SRAM)
cycle 7:                      B.stg4 (ALU)
cycle 8:                      B.stg5 dest (SRAM)       -- NO CONFLICT
```

Token spacing: 4 cycles. Approach A under the same conditions: ~6-7
cycles due to additional SRAM contention in stage 3.

### Throughput Summary (per PE, at 5 MHz, single-port frame SRAM)

| Instruction mix profile | Approach A | Approach C | Approach B |
|-------------------------|------------|------------|------------|
| Monadic-heavy (mode 0/4/6) | ~1.25 MIPS | ~1.67 MIPS | ~1.67 MIPS |
| Mixed (40% dyadic mode 1, 30% monadic, 30% misc) | ~833 KIPS | ~1.25 MIPS | ~1.25 MIPS |
| Dyadic-heavy with constants | ~714 KIPS | ~1.00 MIPS | ~1.00 MIPS |
| Worst case (mode 3, const+fanout) | ~625 KIPS | ~714 KIPS | ~714 KIPS |

4-PE system: multiply by 4. Realistic mixed workload: ~3.3-5.0 MIPS
(A) or ~5.0-6.7 MIPS (B or C). For reference: the original Amamiya
DFM prototype (TTL, 1982) achieved 1.8 MIPS per PE; the EM-4
prototype (VLSI gate array, 1990) achieved 12.5 MIPS per PE. This
design sits between the two, closer to the DFM, which is
historically appropriate for a discrete TTL build.

---

## 4. SRAM Configuration

### Unified SRAM Chip Pair

The PE uses a single 32Kx8 chip pair (2 chips for 16-bit data width)
for both IRAM and frame storage, with address partitioning via a
single decode bit. The recommended part is the AS6C62256 (55 ns,
32Kx8, DIP-28) or equivalent. 55 ns access time fits comfortably
within a 200 ns clock period at 5 MHz, with margin for address setup
and data hold.

The 2114 (1Kx4, 200 ns) from the original design is no longer used.
The unified SRAM approach eliminates the chip proliferation problem:
one chip pair per PE replaces the 4-6 SRAM chips previously needed
for separate matching store and IRAM.

### Address Map

```
v0 address space (simple decode, no 610):

  IRAM region:   [0][offset:8]            instruction templates
                                          offset from token
                                          capacity: 256 instructions (512 bytes)

  Frame region:  [1][frame_id:2][slot:6]  per-activation storage
                                          frame_id from tag store resolution
                                          capacity: 4 frames x 64 slots
                                                    = 256 entries (512 bytes)

Future address space (with 610 bank switching):

  IRAM region:   [0][bank:4][offset:8]    bank-switched templates
                                          bank from 610 mapper
                                          capacity: 16 banks x 256 instructions
                                                    = 4096 entries

  Frame region:  [1][frame_id:2][slot:6]  (unchanged)
```

Total v0 SRAM utilisation: under 1.5 KB used out of a 32Kx8 chip
pair (64 KB). Ample room for future expansion without changing chips.
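A sketch of the v0 address composition implied by the map above (Python as a bit-level model; helper names are hypothetical, field widths as documented):

```python
# One decode bit (bit 8) selects IRAM vs frame region in the v0 map.

def iram_addr(offset):
    assert 0 <= offset < 256                  # [0][offset:8]
    return offset

def frame_addr(frame_id, slot):
    assert 0 <= frame_id < 4 and 0 <= slot < 64
    return (1 << 8) | (frame_id << 6) | slot  # [1][frame_id:2][slot:6]
```

Each region spans 256 16-bit words (512 bytes), consistent with the under-1.5 KB utilisation figure.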
SRAM address lines are pre-routed to a 74LS610 socket with a jumper
wire in place of the chip; when bank switching is needed for programs
exceeding 256 instructions per PE, the 610 drops in with no board
changes.

---

## 5. The 670 Subsystem: Act ID Lookup, Match Metadata, and SC Register File

### Role in the Frame-Based Architecture

The 74LS670s are no longer a metadata cache for the matching store
(as in the original design). Instead, they serve two critical
functions:

1. **act_id -> frame_id lookup table.** Indexed by the token's 3-bit
   `activation_id`, outputs `{valid:1, frame_id:2, spare:1}` in
   ~35 ns (combinational). This replaces what would otherwise be an
   SRAM cycle for associative tag comparison.

2. **Presence and port metadata store.** Indexed by `frame_id`,
   stores presence and port bits for all 8 matchable offsets across
   all 4 frames. Combinational read (~35 ns after frame_id settles,
   ~70 ns total from act_id presentation).

Both functions complete within stage 2, in parallel with the IRAM
read. By the time stage 3 begins, the PE knows frame_id, presence,
and port -- the only remaining SRAM access is the actual operand
data.

### Hardware Configuration

**act_id -> frame_id (2x 74LS670):**

Addressed by `act_id[1:0]`, with `act_id[2]` selecting between chips.
Each chip holds 4 words x 4 bits. Output: `{valid:1, frame_id:2,
spare:1}`.

```
ALLOC:  write {valid=1, frame_id} at address act_id (670 write port)
FREE:   write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns
```

The 670's independent read and write ports allow ALLOC to proceed
while the pipeline reads -- zero conflict.

**Presence + port metadata (4x 74LS670):**

Each 670 word (4 bits) holds presence+port for 2 offsets:
`{presence_N:1, port_N:1, presence_N+1:1, port_N+1:1}`.
Read address = `[frame_id:2]`. Output bits selected by
`offset[2:0]` via a bit-select mux.

```
670 chip 0 (offsets 0-1): word[frame_id] = {pres0, port0, pres1, port1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2, port2, pres3, port3}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4, port4, pres5, port5}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6, port6, pres7, port7}
```

`offset[2:1]` selects the chip, `offset[0]` selects which pair of
bits within the 4-bit output (a 2:1 mux -- one gate).

The 670's simultaneous read/write is critical: during stage 3, when
a first operand stores and sets presence, the write port updates the
presence 670 while the read port remains available for the next
pipeline stage's lookup. No read-modify-write sequencing is needed.

**Bit-select mux (1-2 chips):**

Offset-based selection of the relevant presence and port bits from
the 670 outputs.

### Chip Budget

| Component                 | Chips  | Function                     |
|---------------------------|--------|------------------------------|
| act_id -> frame_id lookup | 2      | 74LS670, indexed by act_id   |
| Presence + port metadata  | 4      | 74LS670, indexed by frame_id |
| Bit-select mux            | 1-2    | offset-based selection       |
| **Total match metadata**  | **~8** |                              |

### SC Register File (Mode-Switched)

During **dataflow mode**, the PE uses act_id resolution and presence
metadata constantly, but the SC register file is idle (no SC block
executing). During **SC mode**, the PE uses the register file
constantly, but act_id lookup and presence tracking are idle (the SC
block has exclusive PE access; no tokens enter matching).

Some of the 670s can be repurposed for register storage during SC
mode.
The exact mapping depends on the SC block design:

- The 4 presence+port 670s (indexed by frame_id in dataflow mode) can
  be re-addressed by instruction register fields during SC mode,
  providing 4 chips x 4 words x 4 bits = 64 bits of register storage.
  Combined across chips, this gives **4 registers x 16 bits** (4 bits
  per chip, 4 chips for width).

- With additional mux logic, all 6 670s (including the act_id lookup
  pair, if it need not remain active for frame lifecycle management)
  could provide **6 registers x 16 bits** during SC mode.

The act_id lookup 670s may need to remain in their dataflow role even
during SC mode if the PE must handle frame control tokens (ALLOC/FREE)
arriving during SC block execution. Whether to share them depends on
the SC block entry/exit protocol.

### The Predicate Slice

One of the 670s can be **permanently dedicated as a predicate
register** rather than participating in the mode-switched pool:

- 4 entries x 4 bits = 16 predicate bits, always available.
- Useful for: conditional token routing (SWITCH), loop termination
  flags, SC block branch conditions, I-structure status flags.
- Does not reduce metadata capacity significantly: the remaining
  3 presence+port 670s still cover 6 of the 8 matchable offsets;
  the 2 uncovered offsets can fall back to SRAM-based presence, or
  the assembler can simply be constrained to 6 dyadic offsets per
  frame.

The predicate register is always readable and writable regardless of
mode, since it is a dedicated chip with its own address/enable lines.
Instructions can test or set predicate bits without going through the
matching store or the ALU result path.

### Mode Switching

When transitioning from dataflow mode to SC mode:

1. **Save metadata** from the shared 670s to spill storage.
2. **Load initial SC register values** (the matched operand pair that
   triggered the SC block) into the 670s.
3. **Switch the address mux**: 670 address lines are now driven by
   instruction register fields instead of frame_id / act_id.
4. **Switch IRAM to counter mode**: sequential fetch via an
   incrementing counter rather than token-directed offset.

When transitioning back:

1. **Emit the final SC result** as a token (last instruction with
   OUT=1).
2. **Restore metadata** from spill storage to the 670s.
3. **Switch the address mux back** to frame_id / act_id addressing.
4. **Resume token processing** from the input FIFO.

### Spill Storage Options

Metadata from the shared 670s (~64-96 bits depending on how many
are shared) needs temporary storage during SC block execution.

**Option A: Shift registers.** 2x 74LS165 (parallel-in, serial-out)
for save + 2x 74LS595 (serial-in, parallel-out) for restore. Total:
4 chips. Save/restore takes ~12 clock cycles each.

**Option B: Dedicated spill 670s.** Each additional 74LS670 (4x4
bits) holds only 16 bits, so covering the full 64-96 bits of shared
metadata takes 4-6 spill chips, written one word per cycle. Total:
4-6 chips, ~4-6 cycles per save/restore.

**Option C: Spill to frame SRAM.** During SC mode, the frame SRAM
has bandwidth available (no match operand reads). Write the 670
metadata contents into a reserved region of the frame SRAM address
space. No extra chips needed. ~4-6 SRAM write cycles to save, ~4-6
to restore. The SRAM is single-ported, but there is no contention
because the pipeline is paused during the mode switch.

**Recommended: Option C.** Zero additional chips. The save/restore
overhead of ~4-6 cycles per transition is negligible compared to the
SC block's execution savings (EM-4 data: 23 clocks pure dataflow vs
9 clocks SC for Fibonacci, so even with ~10 cycles of mode switch
overhead, you break even at ~5-7 SC instructions).

---

## 6. Pipeline Timing by Era

### Key Insight

With the 670-based matching subsystem (Approach C), act_id
resolution and presence/port checking are combinational (~35-70 ns)
**regardless of era**. These never become the timing bottleneck.

The era-dependent part is **SRAM access time** for frame reads and
writes. This determines how many SRAM operations fit per clock cycle
and thus how much stage 3 vs stage 5 contention exists.

### 1979-1983 (5 MHz, 55 ns SRAM)

```
670 metadata:  combinational (~35-70 ns), well within 200 ns cycle
Frame SRAM:    one access per 200 ns cycle (55 ns access + setup/hold margin)
Bottleneck:    frame SRAM single-port, stage 3 vs stage 5 contention
SC block throughput:      ~1 instruction per clock (670 dual-port)
Overall token throughput: ~1 token per 3-5 clocks (pipelined, mode-dependent)
```

### 1984-1990 (5-10 MHz, dual-port SRAM)

```
670 metadata:  combinational (unchanged)
Frame SRAM:    dual-port (IDT7132 or similar), port A for stage 3,
               port B for stage 5
Bottleneck:    eliminated -- both stages access SRAM simultaneously
SC block throughput:      ~1 instruction per clock
Overall token throughput: approaches 1 token per 3 clocks for most modes
```

Dual-port SRAM eliminates the primary stall source. The pipeline
becomes instruction-latency-limited rather than SRAM-contention-limited.

### Modern Parts (5 MHz clock, 15 ns SRAM)

```
670 metadata:  combinational (unchanged)
Frame SRAM:    15 ns access, ~13 accesses fit in a 200 ns cycle
Practical:     2-3 sub-cycle accesses via time-division multiplexing
Bottleneck:    none -- frame SRAM has excess bandwidth
Token throughput: 1 token per 3 clocks (pipeline-stage-limited,
                  not SRAM-limited)
```

With 15 ns AS7C256B-15PIN (DIP, currently available at ~$3), two
sub-cycle accesses fit within a 200 ns clock period.
This achieves TDM-like parallelism without additional MUX logic,
effectively giving the pipeline a dual-port view of a single-port
chip.

### Integrated (on-chip SRAM, sub-ns access)

```
670 equivalent: on-chip multi-ported register file, ~200 transistors
Frame SRAM:     on-chip, sub-cycle access trivially
Token throughput: 1 per 3 clocks, potentially faster with deeper pipelining
```

---

## 7. Interaction with PE-to-PE Pipelining

When multiple PEs are chained for software-pipelined loops (see the
architecture overview), the per-PE pipeline throughput determines the
overall chain throughput.

With the pipelined design (1 token per 3-5 clocks depending on
instruction mix and era), the inter-PE hop cost becomes the critical
path for chained execution:

| Interconnect | Hop latency | Viable? |
|--------------|-------------|---------|
| Shared bus (discrete build) | 5-8 cycles | Marginal -- chain overhead dominates |
| Dedicated FIFO between adjacent PEs | 2-3 cycles | Worthwhile for tight loops |
| On-chip wide parallel link (integrated) | 1-2 cycles | Competitive with intra-PE SC block |

For the discrete v0 build, dedicated inter-PE FIFOs (bypassing the
shared bus) would enable PE chaining at reasonable cost. This is a
low-chip-count addition (~2-4 chips per PE pair) that unlocks
software-pipelined loop execution.

**Loopback bypass.** When a PE emits a token destined for itself
(common in iterative computations), the token can be looped back
internally without traversing the bus at all. See
`bus-interconnect-design.md` for the loopback bypass design, which
eliminates the bus hop latency entirely for self-targeted tokens.

---

## 8. The Execution Mode Spectrum

The pipelined PE with frame-based storage, SC blocks, and the
predicate register supports a spectrum of execution modes, selectable
by the compiler per region:

| Mode | Pipeline behaviour | Throughput | When to use |
|------|--------------------|------------|-------------|
| Pure dataflow | Token -> ifetch -> match/frame -> exec -> output | 1 token / 3-7 clocks (mode-dependent) | Parallel regions, independent ops |
| SC block (register) | Sequential IRAM fetch, 670 register file | ~1 instr / clock | Short sequential regions |
| SC block + predicate | As above, with conditional skip/branch via predicate bits | ~1 instr / clock | Conditional sequential regions |
| PE chain (software pipeline) | Tokens flow PE0 -> PE1 -> PE2, each PE handles one stage | 1 iteration / PE-pipeline-depth clocks | Loop bodies across PEs |
| SM-mediated sequential | Tokens to/from SM for memory-intensive work | SM-bandwidth-limited | Array/structure traversal |

The compiler partitions the program graph and selects the best mode
for each region. This spectrum is arguably more expressive than what
a modern OoO core offers (which has exactly one mode: "pretend to be
sequential, discover parallelism at runtime").

---

## 9. Open Items

1. **Approach selection for v0.** Approach C (670 lookup) is
   recommended as the starting point: combinational metadata at ~8
   chips. Approach B (register-file match pool) eliminates the last
   SRAM cycle from matching at the cost of ~16-18 chips. Approach A
   (SRAM tags) is the fallback if 670 supply is a problem. The
   choice depends on whether chip count or pipeline throughput is
   the binding constraint for the initial build. See section 3 above
   for the full approach comparison and cycle counts.

2. **Frame SRAM contention under realistic workloads.** The pipeline
   stall analysis in section 3 uses worst-case consecutive tokens.
   Simulate representative dataflow programs in the behavioural
   emulator to measure actual stage 3 vs stage 5 contention rates
   and determine whether dual-port SRAM or faster SRAM is justified
   for v0.

3. **SC block register capacity.** With 4-6 registers available from
   repurposed 670s (depending on how many are shared), what is the
   longest SC block the compiler can generate before register
   pressure forces a spill? Evaluate empirically on target workloads.

4. **Predicate register encoding.** Document specific instruction
   encodings for predicate test/set/clear, and how SWITCH
   instructions interact with predicate bits. The predicate register
   may subsume some of the cancel-bit functionality planned for the
   token format.

5. **Mode switch latency measurement.** Build a cycle-accurate model
   of the save-to-SRAM / restore-from-SRAM path and determine the
   exact overhead. Target: <=10 cycles per transition.

6. **Assembler stall analysis.** The assembler can statically detect
   instruction pairs whose output tokens may cause frame SRAM
   contention on the same PE. For hot loops, the assembler can
   insert mode 4 NOP tokens (zero frame access) as pipeline padding.
   Validate static stall estimates against emulator simulation, since
   runtime arrival timing depends on network latency and SM response
   times.

7. **8-offset matchable constraint validation.** The 670-based
   presence metadata limits dyadic instructions to offsets 0-7 per
   frame. Evaluate whether this is sufficient for compiled programs.
   If tight, the hybrid upgrade path (offset[3]=0 checks the 670s,
   offset[3]=1 falls back to SRAM tags) adds ~4-6 chips of SRAM tag
   logic for offsets 8-15+.
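The padding pass from open item 6 can be prototyped in a few lines (Python; `contends` is a hypothetical predicate produced by the static stall analysis, and the instruction dicts are illustrative, not the assembler's real IR):

```python
# Insert mode 4 NOP tokens (zero frame access) between instruction
# pairs flagged as likely stage 3 / stage 5 SRAM contenders.

def pad_with_nops(instrs, contends, nop={"mode": 4, "op": "NOP"}):
    out = []
    for ins in instrs:
        if out and contends(out[-1], ins):
            out.append(dict(nop))   # pipeline padding: no frame SRAM access
        out.append(ins)
    return out
```

Validating the `contends` predicate against emulator traces is exactly the open item: static detection is conservative, since runtime arrival timing depends on network latency and SM response times.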