OR-1 dataflow CPU sketch

Historical Plausibility: A Dataflow Microcomputer circa 1979–1984#

Research notes on transistor budgets, memory technology, and the counterfactual case for a multi-PE dataflow system built with period-appropriate technology.


1. Contemporary Processor Reference Points#

| Chip | Year | Transistors | Process | On-chip storage | Notes |
|---|---|---|---|---|---|
| MOS 6502 | 1975 | ~3,510 (logic only) | 8 µm NMOS | A, X, Y internal regs only | Minimal transistor budget |
| Z80 | 1976 | ~8,500 | 4 µm NMOS | ~20 registers (two banks + specials) | |
| TMS9900 | 1976 | ~8,000 est. | NMOS | 3 internal regs; 16 GP regs in external RAM | Minicomputer heritage |
| Am2901 | 1975 | ~1,000 | Bipolar (TTL/ECL) | 16 × 4-bit register file | Bit-slice, 16 MHz |
| Intel 8086 | 1978 | ~29,000 | 3.2 µm NMOS | 14 registers; large microcode ROM | |
| Motorola 68000 | 1979 | ~68,000 | 3.5 µm NMOS | 8 data + 7 addr regs (32-bit) | Clean 32-bit ISA |
| Intel 80286 | 1982 | ~134,000 | 1.5 µm | No cache | |
| ARM1 | 1985 | ~25,000 | 3 µm CMOS | 25 × 32-bit registers | RISC; 50 mm² die |
| Motorola 68020 | 1984 | ~190,000 | 2 µm CMOS | 256-byte instruction cache | First µP with on-chip cache |
| INMOS T414 | 1985 | ~200,000* | 1.5 µm CMOS | 2 KB on-chip SRAM | Transputer; 4 serial links |
| INMOS T800 | 1987 | ~250,000+ | CMOS | 4 KB on-chip SRAM + FPU | |
| EM-4 (EMC-R) | 1990 | ~45,788 gates | 1.5 µm CMOS | 1.31 MB SRAM per PE | 80-PE prototype built |

*The T414's ~200K figure likely includes SRAM cell transistors. The logic core was probably 50–80K transistors, with on-chip SRAM accounting for a large fraction of the total.

Our PE target: ~3–5K transistors of logic + external SRAM chips.

This places each PE squarely in 6502-to-Z80 territory for logic complexity. The PE is not a complete general-purpose processor — it trades away the program counter, complex instruction decoder, microcode ROM, and sequential control flow machinery in exchange for matching store logic and token handling.


2. The TMS9900 Precedent#

The TMS9900 (1976) is the strongest historical precedent for the external-storage PE model. It had only 3 internal registers (PC, WP, SR) and accessed its 16 "general purpose registers" through external SRAM via a workspace pointer. Every register access was a memory access.

At 3 MHz with ~300 ns SRAM, register accesses were single-cycle. The penalty was real — a register-to-register ADD took 14 clock cycles due to multiple memory round-trips — but the design worked, and context-switch speed was unmatched (change one pointer, entire register set swaps).

Our PE has the same structural relationship to its SRAM: matching store, instruction memory, and context slots all live in external SRAM, accessed via direct address concatenation. The key difference is that our PE doesn't need multiple memory round-trips per instruction — a token arrives, we index into the matching store (one SRAM access), check the occupied bit, and if matched, fetch the instruction (one SRAM access) and execute. Arguably fewer memory accesses per useful operation than the TMS9900.
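The token-handling cycle above can be sketched in simulation form. This is a minimal model: the `Token` fields, the ADD-only instruction format, and the use of `None` as the empty-entry sentinel are illustrative assumptions, not the hardware encoding.

```python
from dataclasses import dataclass

@dataclass
class Token:
    slot: int    # context slot index (0..7)
    entry: int   # matching-store entry within the slot (0..31)
    value: int   # 16-bit operand

class PE:
    def __init__(self):
        # matching store: 8 context slots x 32 entries; None marks "not occupied"
        self.mstore = [[None] * 32 for _ in range(8)]
        # instruction RAM: entry index -> (opcode, destination entry)
        self.iram = {}

    def receive(self, tok):
        waiting = self.mstore[tok.slot][tok.entry]        # SRAM access 1: matching store
        if waiting is None:
            self.mstore[tok.slot][tok.entry] = tok.value  # first operand: park it
            return None
        self.mstore[tok.slot][tok.entry] = None           # matched: clear the entry
        op, dest = self.iram[tok.entry]                   # SRAM access 2: instruction fetch
        if op == "ADD":                                   # execute; emit result token
            return Token(tok.slot, dest, (waiting + tok.value) & 0xFFFF)
        raise NotImplementedError(op)
```

Note that each useful operation costs two SRAM accesses here (match lookup, instruction fetch), which is the basis of the comparison with the TMS9900's multiple round-trips.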

The TMS9900 also demonstrates that the "registers in external memory" approach was commercially viable and accepted by the market in the mid-1970s. The 40-pin implementations (TMS9995 etc.) later included 128–256 bytes of fast on-chip RAM for registers, validating the evolutionary path from external to on-chip storage.
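The workspace mechanism reduces to a one-line address formula: register n of the active workspace lives at WP + 2n. A toy sketch (the memory dict and the specific addresses are illustrative):

```python
# TMS9900 workspace addressing: register n of the active workspace lives at
# WP + 2n in external RAM (registers are 16-bit words in byte-addressed memory).
def reg_addr(wp, n):
    return wp + 2 * n

memory = {}                                # stand-in for external SRAM
wp_task_a = 0x8300                         # workspace pointer, task A (address illustrative)
memory[reg_addr(wp_task_a, 0)] = 0x1234    # task A's "R0"

# A context switch is one pointer update: the whole register set swaps.
wp_task_b = 0x8320
memory[reg_addr(wp_task_b, 0)] = 0x5678    # task B's "R0", untouched by task A
```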


3. SRAM Technology and Access Times#

Available Parts by Year#

| Part | Organisation | Access Time | Approx. Availability |
|---|---|---|---|
| Intel 2102 | 1K × 1 | 500–850 ns | 1972 |
| Intel 2114 | 1K × 4 | 200–450 ns | ~1977 |
| Intel 2147 | 4K × 1 | 55–70 ns | 1979 (bipolar, expensive) |
| HM6116 | 2K × 8 | 120–200 ns | ~1981 |
| HM6264 | 8K × 8 | 70–200 ns | ~1983–84 |
| HM62256 | 32K × 8 | 70–150 ns | mid-1980s |

Clock Speed vs SRAM Access#

| Clock | Period | Single-cycle SRAM needed | Available by |
|---|---|---|---|
| 4 MHz | 250 ns | 250 ns (2114 comfortably) | 1977 |
| 5 MHz | 200 ns | 200 ns (2114, tight) | 1978 |
| 8 MHz | 125 ns | 125 ns (6116 fast grades) | 1982 |
| 10 MHz | 100 ns | 100 ns (6264 fast grades) | 1984 |

At 5 MHz with 200 ns 2114s, single-cycle read or write is achievable. Single-cycle read-modify-write (required for matching store) is not — the 2114 is single-ported and 200 ns access fills the entire clock period. This constrains matching store pipeline throughput to one operation per 2 clock cycles in the 1979 scenario. See the companion document on pipelining for approaches to mitigating this.

By 1981–82, fast-grade 6116 parts at 5 MHz bring a half-clock read/write split within reach: the half-cycle budget is 100 ns, which only the fastest 6116 grades meet (a 150 ns part falls short). By 1984 with fast 6264 parts, 10 MHz pipelined operation is practical.
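The feasibility rows above reduce to a one-line timing check. This sketch ignores address setup and bus turnaround, which a real design must also budget:

```python
# Does an SRAM grade fit the clock scheme? (Access time only; no setup margin.)
def period_ns(clock_mhz):
    return 1000.0 / clock_mhz

def single_cycle_ok(clock_mhz, access_ns):
    # plain read or write: the access must fit one full clock period
    return access_ns <= period_ns(clock_mhz)

def half_cycle_ok(clock_mhz, access_ns):
    # half-clock read/write split: each access must fit half a period
    return access_ns <= period_ns(clock_mhz) / 2
```

For example, a 200 ns 2114 at 5 MHz passes the single-cycle check but fails the half-cycle one, which is exactly the 1979 matching-store constraint described above.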

The 74LS670 Register File#

The SN74LS670 (4 × 4 register file with 3-state outputs) provides a critical capability: true simultaneous read and write to different addresses, with:

  • Read access time: ~20–24 ns typical
  • Write time: ~27 ns typical
  • Separate read and write address/enable inputs (dual-ported)
  • 16-pin DIP, ~98 gate equivalents, ~125 mW

This part was available in the LS family by the late 1970s (the LS subfamily was comprehensively available by 1977–78). At $2–4 per chip in volume, it's affordable for targeted use in pipeline bypass logic and small register files.

The 670's 4-bit word width is an exact match for per-entry matching store metadata (1 occupied bit + 1 port bit + 2 generation counter bits), making it ideal for a write-through metadata cache. See the companion pipelining document for the full design.
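As a concrete illustration of the 4-bit metadata word (only the field widths come from the text; the bit ordering here is an assumption):

```python
# Per-entry matching-store metadata packed into one 74LS670 nibble:
# occupied (1 bit) | port (1 bit) | generation counter (2 bits).
OCCUPIED = 0b1000
PORT     = 0b0100
GEN_MASK = 0b0011

def pack(occupied, port, gen):
    return (OCCUPIED if occupied else 0) | (PORT if port else 0) | (gen & GEN_MASK)

def unpack(word):
    return bool(word & OCCUPIED), bool(word & PORT), word & GEN_MASK
```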


4. Per-PE Chip Count Analysis (1979 Scenario)#

Configuration: 8 context slots × 32 entries = 256 cells#

Using 2114 (1K × 4) SRAM for bulk storage and 74LS670 for fast-path and register file functions.

Matching store data (16-bit operand values): 256 entries × 16 bits = 512 bytes. 4 × 2114 in parallel (each contributes 4 bits of the 16-bit word, using 256 of 1024 available locations).

Matching store metadata: Handled by shared 670-based cache / SC register file. See pipelining companion document.

Instruction RAM (128 entries × 24 bits): 128 × 24 bits = 384 bytes. 4 × 2114 paralleled for width gives a 16-bit word (using 128 of 1024 locations in each chip), workable with a packed encoding or a two-access fetch; 6 × 2114 provides the full 24-bit width in a single access, depending on encoding.

Shared metadata cache / SC register file / predicate register: 8 × 74LS670 (see companion document).

ALU + control logic: ~15–20 TTL chips (adder, logic unit, comparators, muxes, shifter, EEPROM decoder, pipeline state machine, bus serialiser/deserialiser).

Per-PE total: ~31–36 chips (1979 parts)
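The per-PE total is just the sum of the component counts above (the helper is illustrative bookkeeping, not a design tool):

```python
# Per-PE chip budget for the 1979 configuration described above.
def pe_chip_count():
    matching_data  = 4         # 2114s: 256 x 16-bit operand store
    iram           = 4         # 2114s: 128 x 24-bit instructions, packed encoding
    metadata_cache = 8         # 74LS670s: metadata cache / SC regs / predicate
    alu_control    = (15, 20)  # TTL chip range for ALU + control logic
    fixed = matching_data + iram + metadata_cache
    return fixed + alu_control[0], fixed + alu_control[1]
```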

4-PE System Total#

| Subsystem | Chips |
|---|---|
| 4 × PE logic + SRAM | 124–144 |
| Interconnect (shared bus, arbitration) | ~10–15 |
| SM (structure memory, 4–8 banks) | ~20–40 |
| I/O + bootstrap | ~15–25 |
| System total | ~170–225 chips |

This is comparable to a late-1970s minicomputer CPU board, or roughly two S-100 boards' worth of components. Well within the engineering capability and cost envelope of a minicomputer product.


5. The 68000 Comparison#

The 68000 (1979) is the most apt contemporary comparison:

  • Instruction width: 68000 uses 16-bit instruction words encoding a 32-bit ISA. Our IRAM uses ~24-bit instruction words encoding dataflow operations. Comparable.
  • Data path: 68000 has 16-bit external bus, 32-bit internal paths. Our design has 16-bit external bus, wider internal pipeline registers (~64–68 bits). Structurally similar.
  • Logic budget: 68000 uses ~68,000 transistors, of which a huge fraction is microcode ROM for complex instruction decode. Our 4-PE system at ~3–5K logic transistors each = 12–20K transistors of PE logic. With interconnect and I/O, maybe 25–35K total. Roughly half a 68000 in logic, or about one-third when counting the 68000's internal register file transistors.
  • SRAM dependency: 68000 has on-chip registers (expensive in transistors). Our design uses external SRAM (cheap in silicon, more board space). The TMS9900 proved this trade-off was commercially viable three years earlier.

At 1979's 3.5 µm NMOS process, 25K transistors of logic fits in ~15–25 mm² of die area. The 68000 die was ~44 mm². A single-die integration of 4 PEs (logic only, SRAM external) would be significantly smaller and cheaper than a 68000.


6. The Transputer Comparison (1985)#

The INMOS T414 transputer (1985) is the closest historical analogue to what we're proposing, but approached from a different direction:

| | T414 Transputer | Our 4-PE Design |
|---|---|---|
| Architecture | Single complex PE | 4 simple PEs |
| Parallelism model | CSP message passing (explicit) | Dataflow (implicit) |
| On-chip storage | 2 KB SRAM | External SRAM |
| Transistors | ~200,000 | ~25–35K logic |
| Process | 1.5 µm CMOS | 3.5 µm NMOS (1979 target) |
| Inter-PE communication | Serial links (20 Mbit/s) | Shared bus or dedicated links |
| Programming model | occam (explicit distribution) | Compiler-managed graph |

The Transputer took the "one big smart PE with built-in message passing" path. Our architecture takes the "many small dumb PEs with implicit synchronisation" path. The Transputer's 200K transistors could fund ~40–60 of our PEs in raw logic. Even accounting for SRAM overhead, an integrated version at Transputer-class process technology could pack 8–16 PEs on a die, which is qualitatively different from a single Transputer — you're getting genuine fine-grained parallelism rather than coarse-grained task parallelism.


7. Why Multi-Processor Microcomputers Didn't Happen (And Why Dataflow Changes This)#

The Historical Blockers#

  1. Cache coherence: von Neumann processors sharing memory need coherence protocols. These are complex and were not well understood until the mid-1980s.

  2. Software parallelism: writing parallel software for shared-memory von Neumann machines was (and remains) brutally difficult. The installed base of sequential FORTRAN and C code was enormous.

  3. Instruction set compatibility: the IBM 360 lesson — ISA compatibility wins markets. A parallel machine that can't run existing binaries starts with zero software.

  4. Single-thread performance: for inherently sequential code, one big fast core beats multiple small slow cores. In 1979, most programs were deeply sequential.

What Dataflow Changes#

  • No cache coherence needed: each PE has its own local IRAM and matching store. Data moves as tokens. There is no shared mutable state at the PE level (SM handles shared data with its own synchronisation protocol via I-structure semantics).

  • Implicit parallelism: the compiler decomposes the program into a dataflow graph. Parallelism is inherent in the graph structure. The hardware handles synchronisation through token matching. No programmer effort required beyond writing the source code.

  • Software compatibility via compiler: an LLVM backend targeting the dataflow ISA could compile standard C/Rust. The gap between LLVM's SSA-form IR and a dataflow graph is much smaller than the gap between 1979-era C compilers and dataflow assembly.

  • Latency tolerance: the PE processes whatever tokens are ready. If one token is waiting on a slow SM access, the PE works on other tokens. This is inherent in the execution model — no special hardware needed.
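The I-structure synchronisation referred to above can be sketched as a write-once cell with deferred reads. Names and the callback style are illustrative; real SM hardware would queue read requests in memory, not Python closures:

```python
# An I-structure cell: write-once storage where reads of an empty cell are
# deferred, not failed, and are satisfied when the write finally arrives.
class ICell:
    def __init__(self):
        self.full = False
        self.value = None
        self.deferred = []   # read continuations parked on this cell

    def read(self, consumer):
        if self.full:
            consumer(self.value)            # data present: deliver immediately
        else:
            self.deferred.append(consumer)  # park the read; no busy-waiting

    def write(self, value):
        assert not self.full, "I-structure cells are write-once"
        self.full, self.value = True, value
        for consumer in self.deferred:      # wake every parked reader
            consumer(value)
        self.deferred.clear()
```

The write-once property is what removes the coherence problem: a cell's value never changes after it becomes visible, so there is nothing to invalidate.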

The Remaining Hard Problem: Compiler Technology#

The biggest genuine blocker in 1979 was compiler technology. Dataflow compilers need to partition programs into graphs, assign nodes to PEs, manage context slots, and schedule token routes. In 1979, compilers could barely optimise sequential code. By the mid-1980s, the Manchester and Monsoon teams had working dataflow compilers, but these were research efforts, not production tools.

Today, this is a solvable problem. LLVM already performs sophisticated dependency analysis, loop vectorisation, and graph-based intermediate representations. A dataflow backend is substantial but not unreasonable.


8. The "Road Not Taken" Argument#

Modern out-of-order superscalar processors are, at their core, dataflow engines trapped inside a von Neumann straitjacket:

  • Register renaming creates unique names for each value — this is exactly what tagged tokens do in a dataflow machine.
  • Reservation stations (Tomasulo, 1967) are matching stores: an instruction waits for its operands to arrive, then fires.
  • The reorder buffer exists solely to reconstruct sequential semantics from what is internally dataflow execution. It is the tax paid for pretending to be von Neumann.
  • Branch prediction attempts to speculate about the dataflow graph's structure, because the sequential ISA doesn't encode it. A dataflow graph has no branches to predict — conditional execution is a SWITCH node that routes tokens deterministically.
  • Out-of-order execution discovers at runtime the parallelism that was always present in the program but obscured by the sequential instruction stream. A dataflow compiler encodes this parallelism explicitly.
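The SWITCH-node point can be made concrete with a toy sketch (real SWITCH nodes route tagged tokens along graph arcs, not Python values):

```python
# A dataflow SWITCH node: a boolean control token selects which output arc
# the data token is routed to. Routing is a deterministic function of the
# control value -- there is nothing to predict.
def switch_node(data_token, control_token):
    # returns (true_arc, false_arc); exactly one carries the token
    if control_token:
        return data_token, None
    return None, data_token
```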

A modern high-performance core dedicates ~80–90% of its transistor budget to the cache hierarchy and the OoO/speculation engine. The actual ALU is a tiny fraction of the die. A dataflow PE is almost entirely ALU and matching, because the execution model eliminates the need for the translation layer.

As the memory wall has worsened (DRAM latency ~100–300 cycles on modern systems vs 1–2 cycles in 1979), the overhead of the von Neumann translation layer has grown proportionally. The dataflow model's inherent latency tolerance — process whatever token is ready — becomes more valuable as memory gets relatively slower.

This suggests that a dataflow architecture, while perhaps premature in 1979 due to compiler limitations, might actually age better than von Neumann as the memory wall gets worse. The matching store never misses. The data arrives when it arrives. The PE does useful work in the meantime. No cache hierarchy needed in the PE pipeline — just fast local SRAM for matching and instruction storage, with SM handling the shared data.


9. Scaling Considerations for Integration#

1979–1982: Discrete Logic (Prototype / Low-Volume Minicomputer)#

  • 4 PEs on 1–2 large PCBs
  • ~170–225 TTL + SRAM chips total
  • 5 MHz clock, 2-cycle matching store access
  • Shared bus interconnect
  • Competitive with a 68000 on parallel workloads

1983–1985: Single-Chip PE (1.5–2 µm CMOS)#

  • PE logic on-chip (~3–5K transistors)
  • Matching store metadata on-chip (670-equivalent register file)
  • Bulk SRAM external
  • 4–8 PEs per board with external SRAM
  • 8–10 MHz, single-cycle matching via half-clock or on-chip dual-port

1986–1988: Multi-PE Chip (1–1.5 µm CMOS)#

  • 4–8 PE cores on one die with shared on-chip SRAM
  • Wide parallel local interconnect between adjacent PEs (~1 cycle hop)
  • External SM SRAM
  • 15–20 MHz
  • Competitive with Transputer systems at lower per-chip cost

Modern: Many-PE Tile (Sub-100 nm)#

  • 64–256+ PEs per die with on-chip SRAM hierarchy
  • Network-on-chip interconnect
  • I-structure SM cache with simplified coherence protocol (write-once semantics reduce coherence to fill notifications)
  • 1+ GHz
  • LLVM-based compiler toolchain

10. Open Questions#

  1. SM cache architecture for modern integration: do I-structure write-once semantics enable a dramatically simpler coherence protocol than MESI? What does the cache hierarchy look like for SM in a many-PE chip?

  2. Compiler partitioning strategy: how does matching store size (context slots × entries) interact with compiler code generation? What's the minimum matching store that supports "most" programs without excessive function splitting?

  3. Sequential performance floor: what is the minimum acceptable single-thread performance for a dataflow PE, and how does the strongly connected block mechanism close the gap with conventional cores? See companion pipelining document.

  4. Network topology at scale: at what PE count does the shared bus become inadequate, and what's the right topology for 16–64 PEs? Ring, mesh, omega network, or hierarchical?