OR-1 dataflow CPU sketch
# Historical Plausibility: A Dataflow Microcomputer circa 1979–1984

Research notes on transistor budgets, memory technology, and the
counterfactual case for a multi-PE dataflow system built with
period-appropriate technology.

---

## 1. Contemporary Processor Reference Points

| Chip | Year | Transistors | Process | On-chip storage | Notes |
|------|------|-------------|---------|-----------------|-------|
| MOS 6502 | 1975 | ~3,510 (logic only) | 8 µm NMOS | A, X, Y internal regs only | Minimal transistor budget |
| Z80 | 1976 | ~8,500 | 4 µm NMOS | ~20 registers (two banks + specials) | |
| TMS9900 | 1976 | ~8,000 est. | NMOS | 3 internal regs; **16 GP regs in external RAM** | Minicomputer heritage |
| Am2901 | 1975 | ~1,000 | Bipolar (TTL/ECL) | 16 × 4-bit register file | Bit-slice, 16 MHz |
| Intel 8086 | 1978 | ~29,000 | 3.2 µm NMOS | 14 registers; large microcode ROM | |
| Motorola 68000 | 1979 | ~68,000 | 3.5 µm NMOS | 8 data + 7 addr regs (32-bit) | Clean 32-bit ISA |
| Intel 80286 | 1982 | ~134,000 | 1.5 µm | No cache | |
| ARM1 | 1985 | ~25,000 | 3 µm CMOS | 25 × 32-bit registers | RISC; 50 mm² die |
| Motorola 68020 | 1984 | ~190,000 | 2 µm CMOS | **256-byte instruction cache** | First µP with on-chip cache |
| INMOS T414 | 1985 | ~200,000* | 1.5 µm CMOS | **2 KB on-chip SRAM** | Transputer; 4 serial links |
| INMOS T800 | 1987 | ~250,000+ | CMOS | **4 KB on-chip SRAM** + FPU | |
| EM-4 (EMC-R) | 1990 | ~45,788 gates | 1.5 µm CMOS | 1.31 MB SRAM per PE | 80-PE prototype built |

\*The T414's ~200K figure likely includes SRAM cell transistors. The logic
core was probably 50–80K transistors, with on-chip SRAM accounting for a
large fraction of the total.

**Our PE target: ~3–5K transistors of logic + external SRAM chips.**

This places each PE squarely in 6502-to-Z80 territory for logic
complexity.
The PE is not a complete general-purpose processor — it
trades away the program counter, complex instruction decoder, microcode
ROM, and sequential control flow machinery in exchange for matching
store logic and token handling.

---

## 2. The TMS9900 Precedent

The TMS9900 (1976) is the strongest historical precedent for the
external-storage PE model. It had only 3 internal registers (PC, WP, SR)
and accessed its 16 "general purpose registers" through external SRAM via
a workspace pointer. Every register access was a memory access.

At 3 MHz with ~300 ns SRAM, register accesses were single-cycle. The
penalty was real — a register-to-register ADD took 14 clock cycles due to
multiple memory round-trips — but the design worked, and context-switch
speed was unmatched (change one pointer, entire register set swaps).

Our PE has the same structural relationship to its SRAM: matching store,
instruction memory, and context slots all live in external SRAM, accessed
via direct address concatenation. The key difference is that our PE
doesn't need multiple memory round-trips per instruction — a token
arrives, we index into the matching store (one SRAM access), check the
occupied bit, and if matched, fetch the instruction (one SRAM access) and
execute. Arguably *fewer* memory accesses per useful operation than the
TMS9900.

The TMS9900 also demonstrates that the "registers in external memory"
approach was commercially viable and accepted by the market in the
mid-1970s. The 40-pin implementations (TMS9995 etc.) later included
128–256 bytes of fast on-chip RAM for registers, validating the
evolutionary path from external to on-chip storage.
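The token-handling sequence described in this section (index into the matching store, check the occupied bit, then fetch and execute on a match) can be sketched as a few lines of Python. The class, field names, and opcode set here are illustrative assumptions, not taken from the design documents:

```python
# One PE token-handling step: one SRAM access to probe the matching
# store, a second to fetch the instruction once the operand pair is
# complete. Names and widths are illustrative, not the actual design.

class PE:
    def __init__(self, n_entries=256):
        # matching store: one (occupied, parked_operand) pair per entry
        self.match_store = [(False, 0)] * n_entries
        # instruction RAM: entry index -> (opcode, destination address)
        self.iram = {}

    def receive(self, token):
        addr, value = token              # addr = context slot ++ entry index
        occupied, partner = self.match_store[addr]    # SRAM access 1
        if not occupied:
            self.match_store[addr] = (True, value)    # park first operand
            return None
        self.match_store[addr] = (False, 0)           # clear entry on match
        opcode, dest = self.iram[addr]                # SRAM access 2
        ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
        return (dest, ops[opcode](partner, value))    # emit result token

pe = PE()
pe.iram[5] = ("add", 9)
assert pe.receive((5, 2)) is None      # first operand parks, no output
assert pe.receive((5, 3)) == (9, 5)    # match fires 2 + 3, token to addr 9
```

Note that nothing in the step depends on instruction order: whichever token arrives first simply parks, which is the latency-tolerance property discussed later.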
---

## 3. SRAM Technology and Access Times

### Available Parts by Year

| Part | Organisation | Access Time | Approx. Availability |
|------|--------------|-------------|----------------------|
| Intel 2102 | 1K × 1 | 500–850 ns | 1972 |
| Intel 2114 | 1K × 4 | 200–450 ns | ~1977 |
| Intel 2147 | 4K × 1 | 55–70 ns | 1979 (bipolar, expensive) |
| HM6116 | 2K × 8 | 120–200 ns | ~1981 |
| HM6264 | 8K × 8 | 70–200 ns | ~1983–84 |
| HM62256 | 32K × 8 | 70–150 ns | mid-1980s |

### Clock Speed vs SRAM Access

| Clock | Period | Single-cycle SRAM needed | Available by |
|-------|--------|--------------------------|--------------|
| 4 MHz | 250 ns | 250 ns (2114 comfortably) | 1977 |
| 5 MHz | 200 ns | 200 ns (2114, tight) | 1978 |
| 8 MHz | 125 ns | 125 ns (6116 fast grades) | 1982 |
| 10 MHz | 100 ns | 100 ns (6264 fast grades) | 1984 |

At 5 MHz with 200 ns 2114s, single-cycle *read or write* is achievable.
Single-cycle read-modify-write (required for the matching store) is not —
the 2114 is single-ported and a 200 ns access fills the entire clock
period. This constrains matching store pipeline throughput to one
operation per 2 clock cycles in the 1979 scenario. See the companion
document on pipelining for approaches to mitigating this.

By 1981–82 with 150 ns 6116 parts at 5 MHz, a half-clock
read/write split becomes feasible (100 ns per half-cycle).
By 1984 with fast 6264 parts, 10 MHz pipelined operation is practical.

### The 74LS670 Register File

The SN74LS670 (4 × 4 register file with 3-state outputs) provides a
critical capability: **true simultaneous read and write to different
addresses**, with:

- Read access time: ~20–24 ns typical
- Write time: ~27 ns typical
- Separate read and write address/enable inputs (dual-ported)
- 16-pin DIP, ~98 gate equivalents, ~125 mW

This part was available in the LS family by the late 1970s (the LS
subfamily was comprehensively available by 1977–78).
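As a concreteness check on the 670's fit: these notes use a 4-bit per-entry metadata word, which is exactly one 670 location. A minimal pack/unpack sketch, where the particular bit assignment (bit 0 = occupied, bit 1 = port, bits 2–3 = generation counter) is an assumption for illustration:

```python
# Pack and unpack the 4-bit per-entry metadata nibble that a single
# 74LS670 location holds. Bit positions are an illustrative assumption:
# bit 0 = occupied, bit 1 = port, bits 2-3 = 2-bit generation counter.

def pack_meta(occupied: bool, port: bool, gen: int) -> int:
    assert 0 <= gen < 4, "generation counter is 2 bits"
    return int(occupied) | (int(port) << 1) | (gen << 2)

def unpack_meta(nibble: int):
    return bool(nibble & 1), bool((nibble >> 1) & 1), (nibble >> 2) & 3

meta = pack_meta(occupied=True, port=False, gen=2)
assert meta == 0b1001                         # fits one 4-bit 670 word
assert unpack_meta(meta) == (True, False, 2)
```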
At $2–4 per chip in
volume, the 670 is affordable for targeted use in pipeline bypass logic
and small register files.

The 670's 4-bit word width is an exact match for per-entry matching
store metadata (1 occupied bit + 1 port bit + 2 generation counter
bits), making it ideal for a write-through metadata cache. See the
companion pipelining document for the full design.

---

## 4. Per-PE Chip Count Analysis (1979 Scenario)

### Configuration: 8 context slots × 32 entries = 256 cells

Using 2114 (1K × 4) SRAM for bulk storage and 74LS670 for fast-path
and register file functions.

**Matching store data (16-bit operand values):**
256 entries × 16 bits = 512 bytes.
4 × 2114 in parallel (each contributes 4 bits of the 16-bit word,
using 256 of 1024 available locations).

**Matching store metadata:**
Handled by the shared 670-based cache / SC register file.
See the pipelining companion document.

**Instruction RAM (128 entries × 24 bits):**
128 × 24 bits = 384 bytes.
4 × 2114 if instructions are packed into 16-bit words (192 of 1024
locations used), or 6 × 2114 for the full 24-bit width in a single
access, depending on encoding.

**Shared metadata cache / SC register file / predicate register:**
8 × 74LS670 (see companion document).

**ALU + control logic:**
~15–20 TTL chips (adder, logic unit, comparators, muxes, shifter,
EEPROM decoder, pipeline state machine, bus serialiser/deserialiser).

**Per-PE total: ~31–36 chips (1979 parts)**

### 4-PE System Total

| Subsystem | Chips |
|-----------|-------|
| 4 × PE logic + SRAM | 124–144 |
| Interconnect (shared bus, arbitration) | ~10–15 |
| SM (structure memory, 4–8 banks) | ~20–40 |
| I/O + bootstrap | ~15–25 |
| **System total** | **~170–225 chips** |

This is comparable to a late-1970s minicomputer CPU board, or roughly
two S-100 boards' worth of components.
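The bill-of-materials arithmetic above can be tallied mechanically. The (low, high) ranges below are transcribed from this section, with the 4-chip instruction-RAM option taken as the baseline; only the summing is new:

```python
# Tally of the 1979-scenario chip counts from this section.
# All (low, high) ranges are copied from the text above.

PER_PE = {
    "matching store data (2114)": (4, 4),
    "instruction RAM (2114)": (4, 4),        # or 6 for full 24-bit width
    "metadata cache / SC regs (74LS670)": (8, 8),
    "ALU + control (TTL)": (15, 20),
}
pe_lo = sum(lo for lo, _ in PER_PE.values())
pe_hi = sum(hi for _, hi in PER_PE.values())
assert (pe_lo, pe_hi) == (31, 36)            # the per-PE total above

SYSTEM = {
    "4 x PE logic + SRAM": (4 * pe_lo, 4 * pe_hi),
    "interconnect (shared bus, arbitration)": (10, 15),
    "SM (structure memory banks)": (20, 40),
    "I/O + bootstrap": (15, 25),
}
sys_lo = sum(lo for lo, _ in SYSTEM.values())
sys_hi = sum(hi for _, hi in SYSTEM.values())
assert (sys_lo, sys_hi) == (169, 224)        # i.e. the ~170-225 chip figure
```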
Well within the engineering
capability and cost envelope of a minicomputer product.

---

## 5. The 68000 Comparison

The 68000 (1979) is the most apt contemporary comparison:

- **Instruction width**: 68000 uses 16-bit instruction words encoding a
  32-bit ISA. Our IRAM uses ~24-bit instruction words encoding dataflow
  operations. Comparable.
- **Data path**: 68000 has a 16-bit external bus, 32-bit internal paths.
  Our design has a 16-bit external bus, wider internal pipeline registers
  (~64–68 bits). Structurally similar.
- **Logic budget**: 68000 uses ~68,000 transistors, of which a huge
  fraction is microcode ROM for complex instruction decode. Our 4-PE
  system at ~3–5K logic transistors each = 12–20K transistors of PE
  logic. With interconnect and I/O, maybe 25–35K total. Roughly half a
  68000 in logic, or about one-third when counting the 68000's internal
  register file transistors.
- **SRAM dependency**: 68000 has on-chip registers (expensive in
  transistors). Our design uses external SRAM (cheap in silicon, more
  board space). The TMS9900 proved this trade-off was commercially
  viable three years earlier.

At 1979's 3.5 µm NMOS process, 25K transistors of logic fits in
~15–25 mm² of die area. The 68000 die was ~44 mm². A single-die
integration of 4 PEs (logic only, SRAM external) would be significantly
smaller and cheaper than a 68000.
---

## 6. The Transputer Comparison (1985)

The INMOS T414 transputer (1985) is the closest historical analogue to
what we're proposing, but approached from a different direction:

| | T414 Transputer | Our 4-PE Design |
|---|---|---|
| Architecture | Single complex PE | 4 simple PEs |
| Parallelism model | CSP message passing (explicit) | Dataflow (implicit) |
| On-chip storage | 2 KB SRAM | External SRAM |
| Transistors | ~200,000 | ~25–35K logic |
| Process | 1.5 µm CMOS | 3.5 µm NMOS (1979 target) |
| Inter-PE communication | Serial links (20 Mbit/s) | Shared bus or dedicated links |
| Programming model | occam (explicit distribution) | Compiler-managed graph |

The Transputer took the "one big smart PE with built-in message passing"
path. Our architecture takes the "many small dumb PEs with implicit
synchronisation" path. The Transputer's 200K transistors could fund
~40–60 of our PEs in raw logic. Even accounting for SRAM overhead,
an integrated version at Transputer-class process technology could pack
8–16 PEs on a die, which is qualitatively different from a single
Transputer — you're getting genuine fine-grained parallelism rather than
coarse-grained task parallelism.

---

## 7. Why Multi-Processor Microcomputers Didn't Happen (And Why Dataflow Changes This)

### The Historical Blockers

1. **Cache coherence**: von Neumann processors sharing memory need
   coherence protocols. These are complex and were not well understood
   until the mid-1980s.

2. **Software parallelism**: writing parallel software for shared-memory
   von Neumann machines was (and remains) brutally difficult. The
   installed base of sequential FORTRAN and C code was enormous.

3. **Instruction set compatibility**: the IBM 360 lesson — ISA
   compatibility wins markets. A parallel machine that can't run
   existing binaries starts with zero software.
4. **Single-thread performance**: for inherently sequential code, one
   big fast core beats multiple small slow cores. In 1979, most
   programs were deeply sequential.

### What Dataflow Changes

- **No cache coherence needed**: each PE has its own local IRAM and
  matching store. Data moves as tokens. There is no shared mutable
  state at the PE level (SM handles shared data with its own
  synchronisation protocol via I-structure semantics).

- **Implicit parallelism**: the compiler decomposes the program into a
  dataflow graph. Parallelism is inherent in the graph structure. The
  hardware handles synchronisation through token matching. No
  programmer effort required beyond writing the source code.

- **Software compatibility via compiler**: an LLVM backend targeting the
  dataflow ISA could compile standard C/Rust. The gap between LLVM's
  SSA-form IR and a dataflow graph is much smaller than the gap
  between 1979-era C compilers and dataflow assembly.

- **Latency tolerance**: the PE processes whatever tokens are ready.
  If one token is waiting on a slow SM access, the PE works on other
  tokens. This is inherent in the execution model — no special
  hardware needed.

### The Remaining Hard Problem: Compiler Technology

The biggest genuine blocker in 1979 was compiler technology. Dataflow
compilers need to partition programs into graphs, assign nodes to PEs,
manage context slots, and schedule token routes. In 1979, compilers
could barely optimise sequential code. By the mid-1980s, the Manchester
and Monsoon teams had working dataflow compilers, but these were
research efforts, not production tools.

Today, this is a solvable problem. LLVM already performs sophisticated
dependency analysis and loop vectorisation on a graph-based intermediate
representation. A dataflow backend is substantial but not unreasonable.
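The claim that the parallelism is already in the graph can be illustrated with a toy interpreter: nodes fire purely on operand availability, with no program counter and no explicit synchronisation. The graph encoding and node names below are invented for illustration, not a proposed format:

```python
from collections import deque

# Toy dataflow interpreter: a node fires when both its input tokens
# have arrived. Graph encoding (node -> (op, consumer list)) is
# invented for illustration. Computes (a + b) * (c - d); the ADD and
# SUB nodes can fire in either order.

graph = {
    "add": (lambda x, y: x + y, ["mul"]),
    "sub": (lambda x, y: x - y, ["mul"]),
    "mul": (lambda x, y: x * y, []),
}

def run(initial_tokens):
    waiting = {}                   # matching store: node -> parked operand
    ready = deque(initial_tokens)  # (node, value) tokens in flight
    result = None
    while ready:
        node, value = ready.popleft()
        if node not in waiting:            # first operand: park it
            waiting[node] = value
            continue
        op, consumers = graph[node]        # matched: fire the node
        out = op(waiting.pop(node), value)
        for c in consumers:
            ready.append((c, out))
        if not consumers:                  # sink node holds the answer
            result = out
    return result

# a=2, b=3, c=10, d=4  ->  (2 + 3) * (10 - 4) = 30
assert run([("add", 2), ("add", 3), ("sub", 10), ("sub", 4)]) == 30
```

Reordering the initial tokens changes nothing: synchronisation falls out of the matching rule, which is the "implicit parallelism" point above in miniature.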
---

## 8. The "Road Not Taken" Argument

Modern out-of-order superscalar processors are, at their core, dataflow
engines trapped inside a von Neumann straitjacket:

- **Register renaming** creates unique names for each value — this is
  exactly what tagged tokens do in a dataflow machine.
- **Reservation stations** (Tomasulo, 1967) are matching stores: an
  instruction waits for its operands to arrive, then fires.
- **The reorder buffer** exists solely to reconstruct sequential
  semantics from what is internally dataflow execution. It is the
  tax paid for pretending to be von Neumann.
- **Branch prediction** attempts to speculate about the dataflow graph's
  structure, because the sequential ISA doesn't encode it. A dataflow
  graph has no branches to predict — conditional execution is a SWITCH
  node that routes tokens deterministically.
- **Out-of-order execution** discovers at runtime the parallelism that
  was always present in the program but obscured by the sequential
  instruction stream. A dataflow compiler encodes this parallelism
  explicitly.

A modern high-performance core dedicates ~80–90% of its transistor
budget to the cache hierarchy and the OoO/speculation engine. The
actual ALU is a tiny fraction of the die. A dataflow PE is almost
entirely ALU and matching, because the execution model eliminates the
need for the translation layer.

As the memory wall has worsened (DRAM latency ~100–300 cycles on modern
systems vs 1–2 cycles in 1979), the overhead of the von Neumann
translation layer has grown proportionally. The dataflow model's
inherent latency tolerance — process whatever token is ready — becomes
more valuable as memory gets relatively slower.

This suggests that a dataflow architecture, while perhaps premature in
1979 due to compiler limitations, might actually age *better* than
von Neumann as the memory wall gets worse. The matching store never
misses.
The data arrives when it arrives. The PE does useful work in the
meantime. No cache hierarchy is needed in the PE pipeline — just fast
local SRAM for matching and instruction storage, with SM handling the
shared data.

---

## 9. Scaling Considerations for Integration

### 1979–1982: Discrete Logic (Prototype / Low-Volume Minicomputer)

- 4 PEs on 1–2 large PCBs
- ~170–225 TTL + SRAM chips total
- 5 MHz clock, 2-cycle matching store access
- Shared bus interconnect
- Competitive with a 68000 on parallel workloads

### 1983–1985: Single-Chip PE (1.5–2 µm CMOS)

- PE logic on-chip (~3–5K transistors)
- Matching store metadata on-chip (670-equivalent register file)
- Bulk SRAM external
- 4–8 PEs per board with external SRAM
- 8–10 MHz, single-cycle matching via half-clock split or on-chip dual-port

### 1986–1988: Multi-PE Chip (1–1.5 µm CMOS)

- 4–8 PE cores on one die with shared on-chip SRAM
- Wide parallel local interconnect between adjacent PEs (~1 cycle hop)
- External SM SRAM
- 15–20 MHz
- Competitive with Transputer systems at lower per-chip cost

### Modern: Many-PE Tile (Sub-100 nm)

- 64–256+ PEs per die with an on-chip SRAM hierarchy
- Network-on-chip interconnect
- I-structure SM cache with a simplified coherence protocol
  (write-once semantics reduce coherence to fill notifications)
- 1+ GHz
- LLVM-based compiler toolchain

---

## 10. Open Questions

1. **SM cache architecture for modern integration**: do I-structure
   write-once semantics enable a dramatically simpler coherence
   protocol than MESI? What does the cache hierarchy look like for
   SM in a many-PE chip?

2. **Compiler partitioning strategy**: how does matching store size
   (context slots × entries) interact with compiler code generation?
   What's the minimum matching store that supports "most" programs
   without excessive function splitting?
3. **Sequential performance floor**: what is the minimum acceptable
   single-thread performance for a dataflow PE, and how does the
   strongly connected block mechanism close the gap with conventional
   cores? See the companion pipelining document.

4. **Network topology at scale**: at what PE count does the shared bus
   become inadequate, and what's the right topology for 16–64 PEs?
   Ring, mesh, omega network, or hierarchical?