OR-1 dataflow CPU sketch
# Historical Plausibility: A Dataflow Microcomputer circa 1979–1984

Research notes on transistor budgets, memory technology, and the
counterfactual case for a multi-PE dataflow system built with
period-appropriate technology.

---

## 1. Contemporary Processor Reference Points

| Chip | Year | Transistors | Process | On-chip storage | Notes |
|------|------|-------------|---------|-----------------|-------|
| MOS 6502 | 1975 | ~3,510 (logic only) | 8 µm NMOS | A, X, Y internal regs only | Minimal transistor budget |
| Z80 | 1976 | ~8,500 | 4 µm NMOS | ~20 registers (two banks + specials) | |
| TMS9900 | 1976 | ~8,000 est. | NMOS | 3 internal regs; **16 GP regs in external RAM** | Minicomputer heritage |
| Am2901 | 1975 | ~1,000 | Bipolar (TTL/ECL) | 16 × 4-bit register file | Bit-slice, 16 MHz |
| Intel 8086 | 1978 | ~29,000 | 3.2 µm NMOS | 14 registers; large microcode ROM | |
| Motorola 68000 | 1979 | ~68,000 | 3.5 µm NMOS | 8 data + 7 addr regs (32-bit) | Clean 32-bit ISA |
| Intel 80286 | 1982 | ~134,000 | 1.5 µm | No cache | |
| ARM1 | 1985 | ~25,000 | 3 µm CMOS | 25 × 32-bit registers | RISC; 50 mm² die |
| Motorola 68020 | 1984 | ~190,000 | 2 µm CMOS | **256-byte instruction cache** | First µP with on-chip cache |
| INMOS T414 | 1985 | ~200,000* | 1.5 µm CMOS | **2 KB on-chip SRAM** | Transputer; 4 serial links |
| INMOS T800 | 1987 | ~250,000+ | CMOS | **4 KB on-chip SRAM** + FPU | |
| EM-4 (EMC-R) | 1990 | ~45,788 gates | 1.5 µm CMOS | 1.31 MB SRAM per PE | 80-PE prototype built |

\*The T414's ~200K figure likely includes SRAM cell transistors. The logic
core was probably 50–80K transistors, with on-chip SRAM accounting for a
large fraction of the total.

**Our PE target: ~3–5K transistors of logic + external SRAM chips.**

This places each PE squarely in 6502-to-Z80 territory for logic
complexity.
The PE is not a complete general-purpose processor — it
trades away the program counter, complex instruction decoder, microcode
ROM, and sequential control flow machinery in exchange for matching
store logic and token handling.

---

## 2. The TMS9900 Precedent

The TMS9900 (1976) is the strongest historical precedent for the
external-storage PE model. It had only 3 internal registers (PC, WP, SR)
and accessed its 16 "general purpose registers" through external SRAM via
a workspace pointer. Every register access was a memory access.

At 3 MHz with ~300 ns SRAM, register accesses were single-cycle. The
penalty was real — a register-to-register ADD took 14 clock cycles due to
multiple memory round-trips — but the design worked, and context-switch
speed was unmatched (change one pointer, entire register set swaps).

Our PE has the same structural relationship to its SRAM: matching store,
instruction memory, and context slots all live in external SRAM, accessed
via direct address concatenation. The key difference is that our PE
doesn't need multiple memory round-trips per instruction — a token
arrives, we index into the matching store (one SRAM access), check the
occupied bit, and if matched, fetch the instruction (one SRAM access) and
execute. Arguably *fewer* memory accesses per useful operation than the
TMS9900.

The TMS9900 also demonstrates that the "registers in external memory"
approach was commercially viable and accepted by the market in the
mid-1970s. The 40-pin implementations (TMS9995 etc.) later included
128–256 bytes of fast on-chip RAM for registers, validating the
evolutionary path from external to on-chip storage.
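The token-handling sequence described in this section (index into the matching store, check the occupied bit, then fetch and execute on a match) can be sketched as a few lines of Python. The class, field names, and opcode set here are illustrative assumptions, not taken from the design documents:

```python
# One PE token-handling step: one SRAM access to probe the matching
# store, a second to fetch the instruction once the operand pair is
# complete. Names and widths are illustrative, not the actual design.

class PE:
    def __init__(self, n_entries=256):
        # matching store: one (occupied, parked_operand) pair per entry
        self.match_store = [(False, 0)] * n_entries
        # instruction RAM: entry index -> (opcode, destination address)
        self.iram = {}

    def receive(self, token):
        addr, value = token              # addr = context slot ++ entry index
        occupied, partner = self.match_store[addr]    # SRAM access 1
        if not occupied:
            self.match_store[addr] = (True, value)    # park first operand
            return None
        self.match_store[addr] = (False, 0)           # clear entry on match
        opcode, dest = self.iram[addr]                # SRAM access 2
        ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
        return (dest, ops[opcode](partner, value))    # emit result token

pe = PE()
pe.iram[5] = ("add", 9)
assert pe.receive((5, 2)) is None      # first operand parks, no output
assert pe.receive((5, 3)) == (9, 5)    # match fires 2 + 3, token to addr 9
```

Note that nothing in the step depends on instruction order: whichever token arrives first simply parks, which is the latency-tolerance property discussed later.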
---

## 3. SRAM Technology and Access Times

### Available Parts by Year

| Part | Organisation | Access Time | Approx. Availability |
|------|--------------|-------------|----------------------|
| Intel 2102 | 1K × 1 | 500–850 ns | 1972 |
| Intel 2114 | 1K × 4 | 200–450 ns | ~1977 |
| Intel 2147 | 4K × 1 | 55–70 ns | 1979 (bipolar, expensive) |
| HM6116 | 2K × 8 | 120–200 ns | ~1981 |
| HM6264 | 8K × 8 | 70–200 ns | ~1983–84 |
| HM62256 | 32K × 8 | 70–150 ns | mid-1980s |

### Clock Speed vs SRAM Access

| Clock | Period | Single-cycle SRAM needed | Available by |
|-------|--------|--------------------------|--------------|
| 4 MHz | 250 ns | 250 ns (2114 comfortably) | 1977 |
| 5 MHz | 200 ns | 200 ns (2114, tight) | 1978 |
| 8 MHz | 125 ns | 125 ns (6116 fast grades) | 1982 |
| 10 MHz | 100 ns | 100 ns (6264 fast grades) | 1984 |

At 5 MHz with 200 ns 2114s, single-cycle *read or write* is achievable.
Single-cycle read-modify-write (required for the matching store) is not —
the 2114 is single-ported and a 200 ns access fills the entire clock
period. This constrains matching store pipeline throughput to one
operation per 2 clock cycles in the 1979 scenario. See the companion
document on pipelining for approaches to mitigating this.

By 1981–82 with 150 ns 6116 parts at 5 MHz, a half-clock
read/write split becomes feasible (100 ns per half-cycle).
By 1984 with fast 6264 parts, 10 MHz pipelined operation is practical.

### The 74LS670 Register File

The SN74LS670 (4 × 4 register file with 3-state outputs) provides a
critical capability: **true simultaneous read and write to different
addresses**, with:

- Read access time: ~20–24 ns typical
- Write time: ~27 ns typical
- Separate read and write address/enable inputs (dual-ported)
- 16-pin DIP, ~98 gate equivalents, ~125 mW

This part was available in the LS family by the late 1970s (the LS
subfamily was comprehensively available by 1977–78).
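As a concreteness check on the 670's fit: these notes use a 4-bit per-entry metadata word, which is exactly one 670 location. A minimal pack/unpack sketch, where the particular bit assignment (bit 0 = occupied, bit 1 = port, bits 2–3 = generation counter) is an assumption for illustration:

```python
# Pack and unpack the 4-bit per-entry metadata nibble that a single
# 74LS670 location holds. Bit positions are an illustrative assumption:
# bit 0 = occupied, bit 1 = port, bits 2-3 = 2-bit generation counter.

def pack_meta(occupied: bool, port: bool, gen: int) -> int:
    assert 0 <= gen < 4, "generation counter is 2 bits"
    return int(occupied) | (int(port) << 1) | (gen << 2)

def unpack_meta(nibble: int):
    return bool(nibble & 1), bool((nibble >> 1) & 1), (nibble >> 2) & 3

meta = pack_meta(occupied=True, port=False, gen=2)
assert meta == 0b1001                         # fits one 4-bit 670 word
assert unpack_meta(meta) == (True, False, 2)
```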
At $2–4 per chip in
volume, the 670 is affordable for targeted use in pipeline bypass logic
and small register files.

The 670's 4-bit word width is an exact match for per-entry matching
store metadata (1 occupied bit + 1 port bit + 2 generation counter
bits), making it ideal for a write-through metadata cache. See the
companion pipelining document for the full design.

---

## 4. Per-PE Chip Count Analysis (1979 Scenario)

### Configuration: 8 context slots × 32 entries = 256 cells

Using 2114 (1K × 4) SRAM for bulk storage and 74LS670 for fast-path
and register file functions.

**Matching store data (16-bit operand values):**
256 entries × 16 bits = 512 bytes.
4 × 2114 in parallel (each contributes 4 bits of the 16-bit word,
using 256 of 1024 available locations).

**Matching store metadata:**
Handled by the shared 670-based cache / SC register file.
See the pipelining companion document.

**Instruction RAM (128 entries × 24 bits):**
128 × 24 bits = 384 bytes.
4 × 2114 if instructions are packed into 16-bit words (192 of 1024
locations used), or 6 × 2114 for the full 24-bit width in a single
access, depending on encoding.

**Shared metadata cache / SC register file / predicate register:**
8 × 74LS670 (see companion document).

**ALU + control logic:**
~15–20 TTL chips (adder, logic unit, comparators, muxes, shifter,
EEPROM decoder, pipeline state machine, bus serialiser/deserialiser).

**Per-PE total: ~31–36 chips (1979 parts)**

### 4-PE System Total

| Subsystem | Chips |
|-----------|-------|
| 4 × PE logic + SRAM | 124–144 |
| Interconnect (shared bus, arbitration) | ~10–15 |
| SM (structure memory, 4–8 banks) | ~20–40 |
| I/O + bootstrap | ~15–25 |
| **System total** | **~170–225 chips** |

This is comparable to a late-1970s minicomputer CPU board, or roughly
two S-100 boards' worth of components.
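The bill-of-materials arithmetic above can be tallied mechanically. The (low, high) ranges below are transcribed from this section, with the 4-chip instruction-RAM option taken as the baseline; only the summing is new:

```python
# Tally of the 1979-scenario chip counts from this section.
# All (low, high) ranges are copied from the text above.

PER_PE = {
    "matching store data (2114)": (4, 4),
    "instruction RAM (2114)": (4, 4),        # or 6 for full 24-bit width
    "metadata cache / SC regs (74LS670)": (8, 8),
    "ALU + control (TTL)": (15, 20),
}
pe_lo = sum(lo for lo, _ in PER_PE.values())
pe_hi = sum(hi for _, hi in PER_PE.values())
assert (pe_lo, pe_hi) == (31, 36)            # the per-PE total above

SYSTEM = {
    "4 x PE logic + SRAM": (4 * pe_lo, 4 * pe_hi),
    "interconnect (shared bus, arbitration)": (10, 15),
    "SM (structure memory banks)": (20, 40),
    "I/O + bootstrap": (15, 25),
}
sys_lo = sum(lo for lo, _ in SYSTEM.values())
sys_hi = sum(hi for _, hi in SYSTEM.values())
assert (sys_lo, sys_hi) == (169, 224)        # i.e. the ~170-225 chip figure
```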
Well within the engineering
capability and cost envelope of a minicomputer product.

---

## 5. The 68000 Comparison

The 68000 (1979) is the most apt contemporary comparison:

- **Instruction width**: 68000 uses 16-bit instruction words encoding a
  32-bit ISA. Our IRAM uses ~24-bit instruction words encoding dataflow
  operations. Comparable.
- **Data path**: 68000 has a 16-bit external bus, 32-bit internal paths.
  Our design has a 16-bit external bus, wider internal pipeline registers
  (~64–68 bits). Structurally similar.
- **Logic budget**: 68000 uses ~68,000 transistors, of which a huge
  fraction is microcode ROM for complex instruction decode. Our 4-PE
  system at ~3–5K logic transistors each = 12–20K transistors of PE
  logic. With interconnect and I/O, maybe 25–35K total. Roughly half a
  68000 in logic, or about one-third when counting the 68000's internal
  register file transistors.
- **SRAM dependency**: 68000 has on-chip registers (expensive in
  transistors). Our design uses external SRAM (cheap in silicon, more
  board space). The TMS9900 proved this trade-off was commercially
  viable three years earlier.

At 1979's 3.5 µm NMOS process, 25K transistors of logic fits in
~15–25 mm² of die area. The 68000 die was ~44 mm². A single-die
integration of 4 PEs (logic only, SRAM external) would be significantly
smaller and cheaper than a 68000.
---

## 6. The Transputer Comparison (1985)

The INMOS T414 transputer (1985) is the closest historical analogue to
what we're proposing, but approached from a different direction:

| | T414 Transputer | Our 4-PE Design |
|---|---|---|
| Architecture | Single complex PE | 4 simple PEs |
| Parallelism model | CSP message passing (explicit) | Dataflow (implicit) |
| On-chip storage | 2 KB SRAM | External SRAM |
| Transistors | ~200,000 | ~25–35K logic |
| Process | 1.5 µm CMOS | 3.5 µm NMOS (1979 target) |
| Inter-PE communication | Serial links (20 Mbit/s) | Shared bus or dedicated links |
| Programming model | occam (explicit distribution) | Compiler-managed graph |

The Transputer took the "one big smart PE with built-in message passing"
path. Our architecture takes the "many small dumb PEs with implicit
synchronisation" path. The Transputer's 200K transistors could fund
~40–60 of our PEs in raw logic. Even accounting for SRAM overhead,
an integrated version at Transputer-class process technology could pack
8–16 PEs on a die, which is qualitatively different from a single
Transputer — you're getting genuine fine-grained parallelism rather than
coarse-grained task parallelism.

---

## 7. Why Multi-Processor Microcomputers Didn't Happen (And Why Dataflow Changes This)

### The Historical Blockers

1. **Cache coherence**: von Neumann processors sharing memory need
   coherence protocols. These are complex and were not well understood
   until the mid-1980s.

2. **Software parallelism**: writing parallel software for shared-memory
   von Neumann machines was (and remains) brutally difficult. The
   installed base of sequential FORTRAN and C code was enormous.

3. **Instruction set compatibility**: the IBM 360 lesson — ISA
   compatibility wins markets. A parallel machine that can't run
   existing binaries starts with zero software.
4. **Single-thread performance**: for inherently sequential code, one
   big fast core beats multiple small slow cores. In 1979, most
   programs were deeply sequential.

### What Dataflow Changes

- **No cache coherence needed**: each PE has its own local IRAM and
  matching store. Data moves as tokens. There is no shared mutable
  state at the PE level (SM handles shared data with its own
  synchronisation protocol via I-structure semantics).

- **Implicit parallelism**: the compiler decomposes the program into a
  dataflow graph. Parallelism is inherent in the graph structure. The
  hardware handles synchronisation through token matching. No
  programmer effort required beyond writing the source code.

- **Software compatibility via compiler**: an LLVM backend targeting the
  dataflow ISA could compile standard C/Rust. The gap between LLVM's
  SSA-form IR and a dataflow graph is much smaller than the gap
  between 1979-era C compilers and dataflow assembly.

- **Latency tolerance**: the PE processes whatever tokens are ready.
  If one token is waiting on a slow SM access, the PE works on other
  tokens. This is inherent in the execution model — no special
  hardware needed.

### The Remaining Hard Problem: Compiler Technology

The biggest genuine blocker in 1979 was compiler technology. Dataflow
compilers need to partition programs into graphs, assign nodes to PEs,
manage context slots, and schedule token routes. In 1979, compilers
could barely optimise sequential code. By the mid-1980s, the Manchester
and Monsoon teams had working dataflow compilers, but these were
research efforts, not production tools.

Today, this is a solvable problem. LLVM already performs sophisticated
dependency analysis and loop vectorisation on a graph-based intermediate
representation. A dataflow backend is substantial but not unreasonable.
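The claim that the parallelism is already in the graph can be illustrated with a toy interpreter: nodes fire purely on operand availability, with no program counter and no explicit synchronisation. The graph encoding and node names below are invented for illustration, not a proposed format:

```python
from collections import deque

# Toy dataflow interpreter: a node fires when both its input tokens
# have arrived. Graph encoding (node -> (op, consumer list)) is
# invented for illustration. Computes (a + b) * (c - d); the ADD and
# SUB nodes can fire in either order.

graph = {
    "add": (lambda x, y: x + y, ["mul"]),
    "sub": (lambda x, y: x - y, ["mul"]),
    "mul": (lambda x, y: x * y, []),
}

def run(initial_tokens):
    waiting = {}                   # matching store: node -> parked operand
    ready = deque(initial_tokens)  # (node, value) tokens in flight
    result = None
    while ready:
        node, value = ready.popleft()
        if node not in waiting:            # first operand: park it
            waiting[node] = value
            continue
        op, consumers = graph[node]        # matched: fire the node
        out = op(waiting.pop(node), value)
        for c in consumers:
            ready.append((c, out))
        if not consumers:                  # sink node holds the answer
            result = out
    return result

# a=2, b=3, c=10, d=4  ->  (2 + 3) * (10 - 4) = 30
assert run([("add", 2), ("add", 3), ("sub", 10), ("sub", 4)]) == 30
```

Reordering the initial tokens changes nothing: synchronisation falls out of the matching rule, which is the "implicit parallelism" point above in miniature.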
---

## 8. The "Road Not Taken" Argument

Modern out-of-order superscalar processors are, at their core, dataflow
engines trapped inside a von Neumann straitjacket:

- **Register renaming** creates unique names for each value — this is
  exactly what tagged tokens do in a dataflow machine.
- **Reservation stations** (Tomasulo, 1967) are matching stores: an
  instruction waits for its operands to arrive, then fires.
- **The reorder buffer** exists solely to reconstruct sequential
  semantics from what is internally dataflow execution. It is the
  tax paid for pretending to be von Neumann.
- **Branch prediction** attempts to speculate about the dataflow graph's
  structure, because the sequential ISA doesn't encode it. A dataflow
  graph has no branches to predict — conditional execution is a SWITCH
  node that routes tokens deterministically.
- **Out-of-order execution** discovers at runtime the parallelism that
  was always present in the program but obscured by the sequential
  instruction stream. A dataflow compiler encodes this parallelism
  explicitly.

A modern high-performance core dedicates ~80–90% of its transistor
budget to the cache hierarchy and the OoO/speculation engine. The
actual ALU is a tiny fraction of the die. A dataflow PE is almost
entirely ALU and matching, because the execution model eliminates the
need for the translation layer.

As the memory wall has worsened (DRAM latency ~100–300 cycles on modern
systems vs 1–2 cycles in 1979), the overhead of the von Neumann
translation layer has grown proportionally. The dataflow model's
inherent latency tolerance — process whatever token is ready — becomes
more valuable as memory gets relatively slower.

This suggests that a dataflow architecture, while perhaps premature in
1979 due to compiler limitations, might actually age *better* than
von Neumann as the memory wall gets worse. The matching store never
misses.
The data arrives when it arrives. The PE does useful work in the
meantime. No cache hierarchy is needed in the PE pipeline — just fast
local SRAM for matching and instruction storage, with SM handling the
shared data.

---

## 9. Scaling Considerations for Integration

### 1979–1982: Discrete Logic (Prototype / Low-Volume Minicomputer)

- 4 PEs on 1–2 large PCBs
- ~170–225 TTL + SRAM chips total
- 5 MHz clock, 2-cycle matching store access
- Shared bus interconnect
- Competitive with a 68000 on parallel workloads

### 1983–1985: Single-Chip PE (1.5–2 µm CMOS)

- PE logic on-chip (~3–5K transistors)
- Matching store metadata on-chip (670-equivalent register file)
- Bulk SRAM external
- 4–8 PEs per board with external SRAM
- 8–10 MHz, single-cycle matching via half-clock split or on-chip dual-port

### 1986–1988: Multi-PE Chip (1–1.5 µm CMOS)

- 4–8 PE cores on one die with shared on-chip SRAM
- Wide parallel local interconnect between adjacent PEs (~1 cycle hop)
- External SM SRAM
- 15–20 MHz
- Competitive with Transputer systems at lower per-chip cost

### Modern: Many-PE Tile (Sub-100 nm)

- 64–256+ PEs per die with an on-chip SRAM hierarchy
- Network-on-chip interconnect
- I-structure SM cache with a simplified coherence protocol
  (write-once semantics reduce coherence to fill notifications)
- 1+ GHz
- LLVM-based compiler toolchain

---

## 10. Open Questions

1. **SM cache architecture for modern integration**: do I-structure
   write-once semantics enable a dramatically simpler coherence
   protocol than MESI? What does the cache hierarchy look like for
   SM in a many-PE chip?

2. **Compiler partitioning strategy**: how does matching store size
   (context slots × entries) interact with compiler code generation?
   What's the minimum matching store that supports "most" programs
   without excessive function splitting?
3. **Sequential performance floor**: what is the minimum acceptable
   single-thread performance for a dataflow PE, and how does the
   strongly connected block mechanism close the gap with conventional
   cores? See the companion pipelining document.

4. **Network topology at scale**: at what PE count does the shared bus
   become inadequate, and what's the right topology for 16–64 PEs?
   Ring, mesh, omega network, or hierarchical?