# OR-1
The OR-1 is an experiment in applying all the lessons we've learned since the 1980s to a computing concept that was largely abandoned after the 1980s in favour of doubling down on Turing machines and modified Harvard architectures: the dynamic dataflow machine.
## Background
So what *am* I interested in?
Building a dataflow CPU that actually makes sense, that could reasonably have been manufactured contemporaneously with the early 16-bit systems or the Motorola 68000, that could have been used in a system to run normal software written for it, and that is at minimum not embarrassingly behind them in performance at a given clock speed.
Why?
- Comparable to a 68000 or a couple of Z80s in logic complexity
- Must be able to load and execute a binary over serial without a substantial conventional CPU-based control core
- Architecture must not rule out future evolution: specifically, must preserve design space for asynchronous operation, network topology changes, and runtime reprogramming
## Architecture
The OR-1 is a 16-bit machine with a 16-bit core data path. Tokens and instruction words are 32-bit by default, serialized into two flits on the external bus. How much memory it can address is a somewhat complex question due to the memory model, but the primary memory component, called the structure memory element, or SM, can potentially have a 16-bit address space: partially in overlapping raw memory (ROM, memory-mapped IO, or RAM), and partially in dedicated banks that provide "I-structure"-like memory semantics.
> All main memory and memory-mapped IO is addressed asynchronously in request/response fashion, regardless of its support for the structure memory guarantees. Structure memory cells have extra metadata fields to determine whether a cell is full (a read will return immediately), waiting (for a write to fulfill a pending read), reserved, or empty.
If you are familiar with futures, and in particular the 'completion-based' futures of systems like the Linux kernel's io_uring and JavaScript promises (as opposed to the poll-driven futures of Rust), or coroutines in languages like Python, Kotlin, or Go, you will have some of the right intuitions already. Rust's poll and waker primitives *do* provide a good intuition for how two-operand instructions are triggered, though. When the first operand arrives, the instruction returns `Poll::Pending` and the operand is saved into the matching store, along with some metadata. When the second shows up, it returns `Poll::Ready` with the result.
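To make the analogy concrete, here is a toy sketch of first-operand capture and second-operand firing. This is purely illustrative Python with invented names (`MatchingStore`, `deliver`, `PENDING`), not the OR-1's actual hardware structures or keying scheme:

```python
# Hypothetical model of two-operand matching, mirroring the Poll analogy
# above. All names here are illustrative, not actual OR-1 structures.
PENDING = object()  # stands in for Poll::Pending

class MatchingStore:
    """Parks the first operand of a two-operand instruction, keyed by
    (instruction address, context tag), until its partner arrives."""

    def __init__(self):
        self.slots = {}

    def deliver(self, addr, tag, port, value, op):
        key = (addr, tag)
        if key not in self.slots:
            # First operand: save it with its port metadata, report pending.
            self.slots[key] = (port, value)
            return PENDING
        # Second operand: match succeeds, the slot is freed, the node fires.
        stored_port, stored_value = self.slots.pop(key)
        operands = {port: value, stored_port: stored_value}
        return op(operands["a"], operands["b"])

store = MatchingStore()
first = store.deliver(0x10, tag=7, port="a", value=2, op=lambda a, b: a + b)
second = store.deliver(0x10, tag=7, port="b", value=3, op=lambda a, b: a + b)
# first is PENDING; second == 5, and the slot has been freed
```

The `(addr, tag)` key is the important part: it is what lets many contexts share one instruction without their operands cross-matching.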
## Inspirations and partial analogs

- Amamiya parallel dataflow LISP machine
  - Logical splitting of CM and SM, function-instance-based addressing
  - Instructions in structure memory
- EM-4
  - Direct addressing for instruction matching, compiler-assigned
  - Strongly-connected arcs to help speed up sequential operation groups
- Monsoon ETS
  - Frame slots and compiler-assigned addresses
  - Presence bits
## Things the OR-1 does differently
- **Very** small instruction and operand storage in the CM (think register file or L1 cache, not RAM), at least relative to other dataflow computers
  - This means that instructions must be fetched while running.
- Instructions don't travel. Tokens consist, with the exception of SM tokens, entirely of addressing (and other metadata) plus an operand.
- There is still no program counter or similar; loads are explicit. The compiler/assembler inserts loads as best it can.
  - The `exec` SM instruction offers a straightforward way to load a coherent block of code into the instruction cache at runtime and optionally trigger its execution.
- Structure memory is a hybrid of owned I-structure-esque memory and a standard shared address space with more typical guarantees.
  - ROM and memory-mapped IO devices which do not need I-structure guarantees are generally mapped into the shared address space.
  - Stronger guarantees over a block of raw memory space can be obtained in the typical way, using synchronization primitives located in I-structure memory.
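The I-structure side of the hybrid can be sketched as a minimal Python model of a single cell, using the full/waiting/empty presence states described earlier. The reserved state and the hardware request/response bus framing are omitted, and all names are invented:

```python
# Minimal model of one structure-memory cell with presence states. The
# "reserved" state and the request/response framing are left out; names
# are illustrative, not the OR-1's actual field layout.
EMPTY, WAITING, FULL = "empty", "waiting", "full"

class ICell:
    def __init__(self):
        self.state = EMPTY
        self.value = None
        self.deferred = []  # reads that arrived before the write

    def read(self, respond):
        if self.state == FULL:
            respond(self.value)        # full: the read returns immediately
        else:
            self.state = WAITING       # park the read until the write lands
            self.deferred.append(respond)

    def write(self, value):
        assert self.state != FULL, "structure cells are single-assignment"
        self.state = FULL
        self.value = value
        for respond in self.deferred:  # fulfill every deferred read
            respond(value)
        self.deferred.clear()

cell = ICell()
out = []
cell.read(out.append)  # too early: deferred
cell.write(42)         # fulfills the pending read
cell.read(out.append)  # cell is now full: answered immediately
# out == [42, 42]
```

The key property is that a read can never observe an uninitialized value: it either returns a written value or waits for one.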
```dfasm
$add_pair a=&x, b=&y |> @output
```
The OR-1's dfasm assembler handles some, but not all, of this. It will override context as needed and wire up arguments to function or macro parameters as shown above, add trampolines, and route returns to the correct nodes. However, it will *not* preload code. Code beyond what can be loaded at bootstrap must be loaded explicitly.
Assuming the `extract_tag` and `change_tag` instructions are available, here is what you need to set up a dynamic recursive call.
2. In ROM or elsewhere in structure memory, $n$ `read_c` tokens providing argument and return tag templates for `change_tag`
3. Per call site, instructions to allocate the callee context, `exec` the call sequence, and capture the tag for return
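In tagged-token terms, the per-call-site step amounts to something like the following sketch. This is illustrative Python, not dfasm: context allocation and `change_tag` are modeled as plain data manipulation on `(address, context, value)` triples, and all names are invented:

```python
# Illustrative model of the call-site sequence: allocate a callee context,
# retag the argument tokens into it, and capture a return tag so the
# callee's result can be routed back. A token is (address, context, value).
import itertools

_next_context = itertools.count(1)  # stand-in for context allocation

def call_site(arg_tokens, return_addr, caller_ctx):
    callee_ctx = next(_next_context)           # allocate the callee context
    retagged = [(addr, callee_ctx, value)      # "change_tag" each argument
                for addr, _old_ctx, value in arg_tokens]
    return_tag = (return_addr, caller_ctx)     # captured for the return
    return retagged, return_tag

args = [(0x20, 0, "x"), (0x21, 0, "y")]
retagged, ret = call_site(args, return_addr=0x30, caller_ctx=0)
# retagged == [(0x20, 1, "x"), (0x21, 1, "y")]; ret == (0x30, 0)
```

Because the return tag is ordinary data, recursion falls out naturally: each recursive invocation gets a fresh context, and each result is retagged back to whichever context called it.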
Call stubs can be parameterized by `exec` target and descriptor shape, with the addresses loaded from structure memory, effectively creating a vtable. Functions with many arguments, or which take larger data structures as arguments, can have their arguments passed via structure memory: the arguments are written to allocated cells for the call frame, and the caller then runs `exec` on the call sequence, which can load the entire function, including the saved arguments. If the function is infrequently called, or is not latency-sensitive on initial invocation and there isn't substantial bus contention during the first call, the `exec` sequence can handle all of the required setup.
Built-in macros will ease this as they are developed.
#### Loops and flow control
Dataflow machines require thinking about iteration in odd ways, and the OR-1 is no exception. Perhaps the *strangest* feature is how, depending on the data dependencies involved, multiple parts of an iterative process can be partially complete at the same time, traversing the pipeline *extremely* tightly packed, even on a single PE with nothing more than a very basic pipeline, no strongly-connected blocks, and full token overhead. A "for each" or "map" where no accumulation happens can proceed through the pipeline essentially as fast as the amount of parallel context allocated will allow, and to the degree that the required code is replicated across PEs (or designed in such a way as to spread the work across them correctly).
The initial iterator, if it completes faster than the body, will leave multiple iterations in progress, waiting on their data dependencies. To avoid deadlocks and context-slot exhaustion, some means of flow control is required. The typical method is to use **permit tokens**, which circulate through the system after loop initialization. Each permit is generated by a `const` instruction resident in IRAM, and any token directed to its address will trigger its emission.
```dfasm
; Permit-gated dispatch
```

![[flow-control.excalidraw]]
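The scheduling behaviour of the permit scheme can be simulated in a few lines. This is a host-level Python model of the flow control, not a translation of the dfasm above; `run_loop` and its parameters are invented names:

```python
# Toy simulation of permit-gated dispatch: the iterator may only inject a
# new iteration while it holds a permit, and each retired iteration emits
# its permit back into circulation, bounding the in-flight iterations.
from collections import deque

def run_loop(items, permits, body):
    assert permits > 0, "at least one permit must circulate"
    pool = permits                 # permits issued at loop initialization
    pending, in_flight = deque(items), deque()
    results, peak = [], 0
    while pending or in_flight:
        while pending and pool:    # dispatch is gated on holding a permit
            pool -= 1
            in_flight.append(pending.popleft())
            peak = max(peak, len(in_flight))
        results.append(body(in_flight.popleft()))
        pool += 1                  # the permit token recirculates
    return results, peak

results, peak = run_loop(range(6), permits=3, body=lambda x: x * x)
# results == [0, 1, 4, 9, 16, 25]; peak == 3
```

The permit count is the knob: it trades pipeline occupancy against context slots, and with one permit the loop degrades to fully sequential operation.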
A binary reduction tree will allow operations at the same level in the tree to proceed concurrently. An iterative sum can easily be made concurrent in this fashion.
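A host-level sketch of that shape, with `tree_reduce` as an invented name: each level's `combine` calls are mutually independent, so on the OR-1 a whole level's worth could be in flight simultaneously:

```python
# Binary reduction tree: every pair at a given level is data-independent,
# so a whole level can proceed concurrently; the dependency depth is
# O(log n) instead of the O(n) chain of a naive iterative sum.
def tree_reduce(values, combine):
    level = list(values)
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1])     # independent pairs
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                         # odd element passes up
            nxt.append(level[-1])
        level = nxt
    return level[0]

total = tree_reduce(range(1, 9), lambda a, b: a + b)
# total == 36, computed in 3 levels of 4, 2, and 1 additions
```

The same shape works for any associative `combine` (max, bitwise-or, and so on), which is what makes it a natural target for the replicated-across-PEs layout described above.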