OR-1#
The OR-1 is an experiment in applying the lessons we've learned since the 1980s to a computing concept largely abandoned after that decade in favour of doubling down on Turing machines and modified Harvard architectures: the dynamic dataflow machine.
Background#
There are a number of good reasons why dataflow CPUs never really took off. One is the challenge of memory locality. Original dynamic dataflow designs relied heavily on content-addressable memory for fast direct lookup, and for a host of reasons CAM didn't scale nearly as well as conventional SRAM and later DRAM. Designers quickly realized this and moved away from it. One way to build a pseudo-CAM is essentially a hardware hash table, but that has an obvious problem: it trades a vast amount of wasted memory space against a high rate of collisions. Later systems used more conventional addressing schemes, relying on the compiler or assembler to allocate addresses ahead of time, greatly reducing overhead.
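To make the hash-table tradeoff concrete, here's a quick software sketch (my own, not any historical design): a direct-mapped table where each tag hashes to exactly one slot, so the table is either huge and mostly empty or live tags keep landing on the same slot.

```python
# Illustrative sketch, not OR-1 hardware: a pseudo-CAM built as a
# direct-mapped hardware hash table. Each tag hashes to one slot; a
# second live tag landing on an occupied slot is a collision the
# hardware must somehow resolve (stall, overflow store, etc.).

def pseudo_cam_collisions(tags, table_size):
    """Count collisions when associating distinct tags into a fixed-size table."""
    occupied = {}                            # slot index -> tag stored there
    collisions = 0
    for tag in tags:
        slot = hash(tag) % table_size        # stand-in for a hardware hash
        if slot in occupied and occupied[slot] != tag:
            collisions += 1                  # slot conflict: must spill or stall
        else:
            occupied[slot] = tag
    return collisions

# 128 distinct (context, instruction) tags:
tags = [("ctx%d" % c, "instr%d" % i) for c in range(16) for i in range(8)]
small = pseudo_cam_collisions(tags, 64)      # overloaded: collisions guaranteed
large = pseudo_cam_collisions(tags, 4096)    # few collisions, mostly wasted slots
```

With 128 tags and 64 slots, at least half the insertions must collide; quadrupling past the working set mostly buys empty SRAM.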
However, fully dynamic dataflow machines still suffer from either non-locality in their instruction matching memory, or replication of instructions across execution units, or significant overhead in communications due to the extra metadata which must be passed around, or all of the above. The only modern dataflow machine I'm aware of is the Electron E1, a static dataflow machine, making extensive use of a smart compiler to handle allocation and routing configuration. This is directionally correct, but it constrains the CPU to require an external management processor, acting primarily as a re-programmable accelerator, like a less granular (and presumably more cost- and power-efficient) FPGA. That's all well and good, but it's not what I'm interested in.
So what am I interested in?
Building a dataflow CPU that actually makes sense: one that could reasonably have been manufactured contemporaneously with the early 16-bit systems or the Motorola 68000, could have run normal software written for it, and is at minimum not embarrassingly behind them in performance at a given clock speed.
Why?
There are two reasons. One makes me sound like a crank, the other makes me sound like an asshole. I'll start with the latter. I've wanted to build a breadboard CPU for a bit now, but I didn't want to just follow in someone else's footsteps. James Sharman has made perhaps the most polished and usable system, even writing some decent games for his JAM-1, and achieved high subscalar performance. Fabian Schuiki is targeting a superscalar system with out-of-order execution, backporting ideas from the late 90s and 2000s. I think I can do something way more out of distribution than either, and get competitive results.
The other reason is that I do genuinely think we might have missed something when we dropped dataflow machines because the PC, and thus the x86 CPU (and now the ARM CPU), ate the world. There are a bunch of concepts that, even within dataflow designs, got abandoned in the '90s in favour of moving closer to a multi-core Turing machine, just with hardware message passing and semaphores. And in the name of pushing performance on conventional CPUs higher, as memory got further and further away from the faster and faster cores, silicon designers ended up reproducing many of the core dataflow primitives via cache hierarchies, speculative and out-of-order execution, and branch prediction, except operating mostly at runtime and without complete information about the execution graph available in the binary.
Project Goals#
- Dynamic dataflow CPU achievable with discrete logic (74-series TTL or CMOS + SRAM)
- Multi-PE design targeting superscalar-equivalent IPC
- "Period-plausible" transistor budget: ~25-35K logic transistors + SRAM chips
- Comparable to a 68000 or a couple of Z80s in logic complexity
- Must be able to load and execute a binary over serial without a substantial conventional CPU-based control core
- Architecture must not rule out future evolution: specifically, must preserve design space for asynchronous operation, network topology changes, and runtime reprogramming
Architecture#
The OR-1 is a 16-bit machine, with a 16-bit core data path. Tokens and instruction words are 32-bit by default, serialized into two flits on the external bus. How much memory it can address is a somewhat complex question, due to the memory model, but the primary memory component, called the structure memory element, or SM, can potentially have a 16-bit address space: partially overlapping raw memory (ROM, memory-mapped IO, or RAM), and partially dedicated banks that provide "I-structure"-like memory semantics.
All main memory and memory-mapped IO is addressed asynchronously in request/response fashion, regardless of its support for the structure memory guarantees. Structure memory cells have extra metadata fields to determine if a cell is full (a read will return immediately), waiting for a write to fulfill a pending read, reserved, or empty.
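As a software model (my own framing, not a description of the hardware), a single structure-memory cell behaves roughly like this; RESERVED is listed for completeness but not exercised here:

```python
# Illustrative model of one structure-memory cell's presence states:
# EMPTY, RESERVED, WAITING (a read arrived before the write), and
# FULL (a read returns immediately). Field names are my own.

EMPTY, RESERVED, WAITING, FULL = range(4)

class SMCell:
    def __init__(self):
        self.state = EMPTY
        self.value = None
        self.deferred = []                   # reader tags parked on this cell

    def read(self, reader_tag):
        """Returns (tag, value) responses ready to send; possibly none yet."""
        if self.state == FULL:
            return [(reader_tag, self.value)]
        self.deferred.append(reader_tag)     # park the read until a write lands
        if self.state == EMPTY:
            self.state = WAITING
        return []

    def write(self, value):
        """Fulfils the cell and releases any parked readers."""
        self.value = value
        self.state = FULL
        released, self.deferred = self.deferred, []
        return [(tag, value) for tag in released]

cell = SMCell()
cell.read("t1")                  # arrives early: parked, no response yet
out = cell.write(42)             # releases t1
out += cell.read("t2")           # cell is FULL: immediate response
```

The asynchronous request/response framing falls out naturally: a read never blocks the bus, it just may not produce its response token until later.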
If you are familiar with futures, and in particular the 'completion-based' futures of systems like the Linux kernel's io_uring and JavaScript promises (as opposed to the poll-driven futures of Rust), or coroutines in languages like Python, Kotlin, or Go, you will have some of the right intuitions already. Rust's poll and waker primitives do provide a good intuition for how two-operand instructions are triggered, though. When the first operand arrives, the instruction returns Poll::Pending and the operand is saved into the matching store, along with some metadata. When the second shows up, the instruction fires, returning Poll::Ready with the result.
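To make that concrete, here's a minimal sketch of the matching step for a dyadic instruction in that Poll::Pending / Poll::Ready framing; the key structure and port naming are my assumptions:

```python
# Sketch of two-operand matching. The matching store is keyed on
# (context, instruction address); the first operand parks with its
# port metadata, the second completes the pair and fires.

matching_store = {}   # (ctx, instr_addr) -> (port, value) waiting for a partner

def token_arrives(ctx, instr_addr, port, value):
    """Returns None (Pending) on the first operand, (left, right) (Ready) on the second."""
    key = (ctx, instr_addr)
    if key not in matching_store:
        matching_store[key] = (port, value)          # Poll::Pending: park it
        return None
    first_port, first_value = matching_store.pop(key)  # Poll::Ready: fire
    if first_port == "L":
        return (first_value, value)
    return (value, first_value)

assert token_arrives(ctx=3, instr_addr=0x12, port="L", value=7) is None
assert token_arrives(ctx=3, instr_addr=0x12, port="R", value=5) == (7, 5)
```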
Inspirations and partial analogs#
- Amamiya parallel dataflow LISP machine
- Logical splitting of CM and SM, function-instance-based addressing
- Instructions in structure memory
- EM-4
- Direct addressing for instruction matching, compiler assigned
- Strongly connected arcs to help speed up sequential operation groups
- Monsoon ETS
- Frame slots and compiler-assigned addresses
- Presence bits
Things the OR-1 does differently#
- Very small instruction and operand storage in the CM (think register file or L1 cache, not RAM), at least relative to other dataflow computers
- This means that instructions must be fetched while running.
- Instructions don't travel. With the exception of SM tokens, tokens consist entirely of addressing (and other metadata) plus an operand.
- There is still no program counter or anything like it; loads are explicit. The compiler/assembler inserts loads as best it can.
- The 'exec' SM instruction offers a straightforward way to load a coherent block of code into the instruction cache at runtime and optionally trigger its execution.
- Structure memory is a hybrid of owned I-structure-esque memory and a standard shared address space with more typical guarantees.
- ROM and memory-mapped IO devices which do not need I-structure guarantees are generally mapped into the shared address space.
- Stronger guarantees over a block of raw memory space can be obtained in the typical way using synchronization primitives located in I-structure memory
Boot process#
One of the challenges in making a dataflow CPU without a dedicated (and more conventional) control unit is figuring out how to bootstrap the system. Instructions must be loaded into the control memory elements and the first seed tokens must be emitted. The OR-1 solves this by giving one structure memory element responsibility for bootstrapping the system. It reuses the SM 'exec' instruction circuitry with a hardwired address to clock tokens stored in ROM onto the bus until it reaches a stop signal.
- Bootstrap SM (SM00) activates on reset
- Reset signal latches the reset vector address into SM00's address register and triggers the 'exec' instruction circuit
- SM00 loads the length word at the reset vector into its counter register and adds the address to it
- SM00 loads the contents of the next address, pushes them direct to the interconnect, and increments the address
- Repeat until address == counter
- Bootstrap SM input FIFO output now enabled
The contents of the reset vector, after the length, contain the raw tokens required to load each PE's initial instructions and data, plus any seed tokens to begin execution. This can be a simple bootstrap program to enable loading of other code, or it can be the primary program itself. Valid programs must not send commands to the bootstrap SM during this process. The input FIFO's output is disabled during exec instructions, but putting any traffic intended for SM00 onto the bus risks interfering with other bus traffic, as SM00 makes no guarantees about the behaviour of its input FIFO during the boot process.
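Here's one possible reading of that sequence as a software model. The ROM layout (length word first, tokens following) matches the description above, but the exact arithmetic is my assumption:

```python
# Toy model of SM00's boot-time exec: stream the token block at the
# reset vector onto the 'bus'. Addresses and layout are assumptions.

def boot_exec(rom, reset_vector):
    """Clock the boot token block at reset_vector out onto the interconnect."""
    address = reset_vector
    counter = rom[address] + address     # length word at the vector, plus address
    bus = []
    while address != counter:
        address += 1
        bus.append(rom[address])         # push token direct to the interconnect
    return bus

# ROM image: [length=3, tok_a, tok_b, tok_c, unrelated...]
rom = {0x00: 3, 0x01: 0xA1, 0x02: 0xB2, 0x03: 0xC3, 0x04: 0xFF}
assert boot_exec(rom, 0x00) == [0xA1, 0xB2, 0xC3]
```

Once the loop terminates, enabling the input FIFO's output corresponds to returning from this function.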
Dynamic scheduling#
The OR-1 is, for a number of reasons, a mostly static dataflow machine: the way instructions are routed is built into the instructions and tokens themselves. There's no dynamic load balancing inherent to it, because there is no control unit, and adding one would cost a nontrivial amount of additional logic. There are, of course, a few escape hatches. One is exec, which reads memory out directly as tokens, and the closely related iram_write, which replaces the contents of a word in instruction memory. ctx_override options on existing compatible instructions can jump between slots. A planned change_tag instruction would allow explicit token creation based on a data value, with extract_tag capturing the executing token's context info. Call descriptor tables in SM, plus an SM cell serving as a ref counter, can dynamically schedule arbitrary code, albeit with overhead. The main barrier to going fully dynamic this way is simply the, um, challenges of doing runtime code generation on this small a machine, though it's certainly possible to write a small runtime capable of it, potentially enough to implement something like BASIC. If it's possible to port a version of the dfasm assembler to run on it, then I will have succeeded beyond my wildest dreams.
That being said, loading code dynamically goes a long, long way, and staging it in structure memory allows rewriting of specific fields as necessary; it's just a high-overhead operation you must perform explicitly. This is one of the major constraints of the architecture. Executing directly from ROM just does not work well when each PE requires its own miniature address bus: it either introduces a severe bottleneck, requires significant fanout and arbitration (and likely caching!) logic, or all of the above. The benefit is that, outside of that, you get a wildly efficient machine.
Function calls#
While obviously the exec instruction can effectively "call" a function, that is, as described, a high-overhead operation. Optimized code interleaves IRAM writes or judicious exec calls following free_ctx operations, or preloads and uses bank switching, if that hardware makes it in.
Passing arguments to a function requires setting the destination context to match the correct context for the function call. If there is only one call site, then the return instruction is effectively baked into the code already, setting the context appropriately. For multiple call sites, though, this gets more complex. change_tag simplifies a lot here; without it, a return trampoline (essentially a pass instruction with a ctx_override) can be placed in IRAM for each call site. Furthermore, context slots cap the number of simultaneous calls in flight (e.g. recursion depth). Clever multiplexing of context slots within a code section for repeated operations can greatly reduce IRAM usage if that is the primary constraint.
; Function definition:
$add_pair |> {
add &a, &b |> @ret
}
; Static call: named args wire to internal labels:
$add_pair a=&x, b=&y |> @output
The OR-1's dfasm assembler handles some but not all of this. It will override context as needed, wire up arguments to function or macro parameters as shown above, add trampolines, and route returns to the correct nodes. However, it will not preload code. Code beyond what can be loaded at bootstrap must be loaded explicitly.
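For intuition, the per-call-site trampoline bookkeeping can be sketched in software like this; the slot layout and field names are entirely my invention:

```python
# Hypothetical sketch of per-call-site return trampolines: each caller
# gets a pass instruction whose ctx_override redirects the function's
# single return arc back to that caller's context.

def build_trampolines(call_sites):
    """call_sites: list of (caller_ctx, return_dest). Returns an IRAM slot map."""
    iram = {}
    for i, (caller_ctx, return_dest) in enumerate(call_sites):
        slot = 0x40 + i          # arbitrary trampoline slots in this model
        iram[slot] = {"op": "pass", "ctx_override": caller_ctx, "dest": return_dest}
    return iram

def run_trampoline(iram, slot, value):
    """The function's @ret token lands here; out comes a retagged token."""
    t = iram[slot]
    return (t["ctx_override"], t["dest"], value)

# Two call sites share one function body; each returns via its own slot.
iram = build_trampolines([(1, 0x10), (2, 0x20)])
assert run_trampoline(iram, 0x40, 99) == (1, 0x10, 99)
assert run_trampoline(iram, 0x41, 99) == (2, 0x20, 99)
```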
Assuming the extract_tag and change_tag instructions are available, here is what you need to set up a dynamic recursive call.
- A call stub in IRAM for the function. You need more or less $2 + 2n$ IRAM slots, where $n$ is the number of parameters.
- In ROM or elsewhere in structure memory, $n$ read_c tokens providing argument and return tag templates for change_tag
- Per call site, instructions to allocate the callee context, exec the call sequence, and capture the tag for return
Call stubs can be parameterized by exec target and descriptor shape, with the addresses loaded from structure memory, effectively creating a vtable. Functions with many arguments, or which take larger data structures as arguments, can have their arguments passed via structure memory. Args get written to allocated cells for the call frame, and then the caller runs exec on the call sequence, which can load the entire function, including the saved arguments. If the function is infrequently called, or not latency-sensitive on initial invocation, and there isn't substantial bus contention during the first call, the exec sequence can handle all of the required setup.
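A rough helper for the stub budget, plus a toy version of the vtable idea: if the stub's exec target is read out of a structure-memory cell, rebinding that cell rebinds the call. All names and addresses here are my invention:

```python
# Back-of-envelope stub budget and a toy 'vtable' dispatch model.

def stub_iram_slots(n_params):
    """IRAM slots for a call stub, per the 2 + 2n estimate above."""
    return 2 + 2 * n_params

vtable = {0: 0x100, 1: 0x180}    # method index -> exec target held in SM

def dispatch(method, args):
    target = vtable[method]       # in hardware: a structure-memory read
    return ("exec", target, tuple(args))

assert stub_iram_slots(2) == 6
assert dispatch(1, [3, 4]) == ("exec", 0x180, (3, 4))
```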
Built-in macros will ease this as they are developed.
Loops and flow control#
Dataflow machines require thinking about iteration in odd ways, and the OR-1 is no exception. Perhaps the strangest feature is that, depending on the data dependencies involved, multiple parts of an iterative process can be partially complete at the same time, travelling through the pipeline extremely tightly packed. This holds even on a single PE with nothing but a very basic pipeline, no strongly-connected blocks, and full token overhead. A "for each" or "map" where no accumulation happens can proceed through the pipeline essentially as fast as the amount of parallel context allocated will allow, and as much as the code required is replicated across PEs (or designed in such a way as to spread the work across them correctly).
The initial iterator, if it completes faster than the body, will leave multiple iterations in progress, waiting on their data dependencies. To avoid deadlocks and context-slot exhaustion, some means of flow control is required. The typical method is to use permit tokens, which circulate through the system after loop initialization. Each permit is generated by a const instruction resident in IRAM, and any token directed to its address will trigger its emission.
; Permit-gated dispatch
&gate <| gate ; dyadic: L=permit, R=dispatch data
&loop_output |> &gate:R ; loop control feeds data to gate
; Body completion recycles the permit
&body_done |> &gate:L ; body's final instruction returns permit
![[flow-control.excalidraw]]
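A toy simulation (mine, not generated from the dfasm above) demonstrates the property that matters: with P permits circulating, at most P loop bodies are ever in flight, so context slots can't be exhausted:

```python
# Permit-token flow control, modelled in software. The gate is dyadic:
# it fires only when it holds both a permit and a data token; each
# completing body recycles its permit back to the gate.

from collections import deque

def run_loop(n_iters, n_permits):
    permits = deque([True] * n_permits)    # seeded at loop initialization
    pending = deque(range(n_iters))        # data tokens from the loop driver
    in_flight, peak, completed = 0, 0, 0
    while pending or in_flight:
        if pending and permits:            # gate fires: permit + data present
            permits.popleft()
            pending.popleft()
            in_flight += 1
            peak = max(peak, in_flight)
        else:                              # a body completes, recycling its permit
            in_flight -= 1
            completed += 1
            permits.append(True)
    return peak, completed

peak, completed = run_loop(n_iters=10, n_permits=3)
assert peak == 3 and completed == 10       # never more than 3 bodies in flight
```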
A binary reduction tree allows operations at the same level of the tree to proceed concurrently. An iterative sum can easily be made concurrent in this fashion.
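As a sketch of why the tree helps: every add at a given level is independent, so an n-element sum needs only about log2(n) sequential levels rather than a chain of n-1 dependent adds:

```python
# Binary tree reduction: pairs at each level can fire concurrently,
# so sequential depth is ceil(log2 n) rather than n - 1.

from math import ceil, log2

def tree_sum(values):
    """Returns (total, number_of_sequential_levels)."""
    level, levels = list(values), 0
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [sum(p) for p in pairs]    # every pair is independent
        levels += 1
    return level[0], levels

total, depth = tree_sum(range(16))
assert total == 120
assert depth == ceil(log2(16))             # 4 levels instead of 15 chained adds
```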