OR-1 dataflow CPU sketch

Extended SM Design Discussion & Token Format Rework#

Covers SM enhancements, token format rearrangement, DRAM latency context, wide pointers, bulk SM operations, bootstrap/EXEC unification, bootstrap SM ownership, and presence metadata design.

See sm-design.md for the base SM design this builds on. See architecture-overview.md for module taxonomy and existing token format.


DRAM vs SRAM Latency in the Target Era#

Historical context for SM backing store decisions.

4116 DRAM (the workhorse of 1979)#

The MK4116 (16Kx1) was ubiquitous: ZX Spectrum, Apple II, IBM PC.

Speed grade   Access time (RAS)   Cycle time   Page mode access
─────────────────────────────────────────────────────────────────
4116-2        150 ns              320 ns       100 ns
4116-3        200 ns              375 ns       135 ns
4116-4        250 ns              410 ns       165 ns

Access time = when valid data appears after RAS goes low. Cycle time = minimum time between the start of one access and the start of the next, including RAS precharge recovery. A new random access cannot begin until the full cycle time elapses.

Comparison: 2114 SRAM at 200ns has access time AND cycle time of ~200ns. No precharge, no refresh. At comparable access times, DRAM delivers roughly half the random-access throughput of SRAM because of the cycle-time overhead.

The 4116 also needs 128 refresh cycles every 2ms, stealing ~2-3% of available bandwidth.
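The ~2-3% figure checks out arithmetically. A quick sketch, assuming each refresh steals one full cycle time:

```python
# 128 row refreshes every 2 ms, each assumed to steal one full cycle time.
REFRESH_ROWS = 128
REFRESH_INTERVAL_NS = 2_000_000          # 2 ms

def refresh_overhead(cycle_ns: float) -> float:
    """Fraction of bus bandwidth consumed by refresh."""
    return (REFRESH_ROWS * cycle_ns) / REFRESH_INTERVAL_NS

for grade, cycle in (("4116-2", 320), ("4116-3", 375), ("4116-4", 410)):
    print(f"{grade}: {refresh_overhead(cycle):.1%}")   # 2.0% / 2.4% / 2.6%
```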

Page mode is noteworthy: when accessing multiple locations in the same 128-bit row, page mode on the 4116-2 drops to 100ns — faster than the 2114. The row stays open; the controller just strobes new column addresses. This is how the C64 and BBC Micro shared DRAM between CPU and video on alternate half-cycles.

The SRAM/DRAM Ratio Over Time#

Era      SRAM access   DRAM access   DRAM cycle   Ratio (DRAM/SRAM)
─────────────────────────────────────────────────────────────────────
1979     200 ns        200 ns        375 ns       ~1-2x
1985     70 ns         100 ns        230 ns       ~1.5-3x
1995     15 ns         60 ns         ~120 ns      ~4-8x
2005     2 ns          ~50 ns (CAS)  ~100 ns      ~25-50x
2024     0.5 ns        ~15 ns (CAS)  ~50 ns       ~30-100x

In 1979, DRAM was at most 2x slower than SRAM for random access. Today it's 50-100x slower relative to on-chip SRAM. The memory wall barely existed in 1979.

Implication for SM: SM backed by DRAM in 1979 incurs only a 2x latency penalty vs SRAM — viable without caching. In a modern version, the same architecture is better positioned for the memory wall than conventional designs, because dataflow provides natural latency tolerance: the PE processes other tokens while waiting for SM responses.

C64 Memory Architecture Note#

Main 64K: 8× 4116 DRAM. Colour RAM: single 2114 SRAM (1K×4, low nybble only, 16 colours). The colour RAM was SRAM specifically because it needed independent access without fighting the VIC-II for DRAM bus cycles.


74LS610 Memory Mapper for SM Banking#

The 74LS610 is a TI memory mapper chip (originally for TMS9900 family):

  • 16 mapping registers, each 12 bits wide
  • 4-bit logical address input (MA0-MA3) selects register
  • 12-bit physical address output (MO0-MO11)
  • The '610 has a latch control pin (pin 28) absent from the '612 — outputs can be frozen while register contents change
  • Appeared in PC-AT boards and Nintendo cartridges
  • Internally a multiplexed register file, not dual-port

SM Address Translation Path#

  1. Token arrives with structure address
  2. High bits go to '610, select mapping register → 12-bit physical bank (~40-50ns propagation delay, LS family)
  3. Bank bits + low address bits drive DRAM row/column
  4. DRAM access (~200ns access + 375ns cycle for 4116)

The '610 propagation delay is pipelineable — it's a combinational lookup that overlaps with DRAM RAS setup. The real cost comes from changing a mapping register mid-operation (a write cycle to the '610's internal registers via the data bus). That is the bank-switch overhead.
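The translation path can be modelled as a plain register-file lookup. A behavioural sketch; the 8-bit low/high address split is an assumption for illustration (the real split depends on the DRAM organisation), and the class and function names are not from the design:

```python
# Behavioural model of the '610 in the SM address path: 16 mapping
# registers of 12 bits, combinational lookup on the high address bits.
class MemoryMapper610:
    def __init__(self):
        self.regs = [0] * 16                     # 16 mapping registers

    def load(self, index: int, value: int):
        # Writing a register via the data bus is the bank-switch cost.
        self.regs[index & 0xF] = value & 0xFFF

    def translate(self, logical_high4: int) -> int:
        # MA0-MA3 select a register; MO0-MO11 appear ~40-50ns later,
        # overlapping DRAM RAS setup in the real pipeline.
        return self.regs[logical_high4 & 0xF]

def sm_physical_address(mapper: MemoryMapper610, structure_addr: int) -> int:
    LOW_BITS = 8                                 # assumed split
    bank = mapper.translate(structure_addr >> LOW_BITS)
    return (bank << LOW_BITS) | (structure_addr & ((1 << LOW_BITS) - 1))
```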

SM Usage Model#

Mapping registers are set at load time and mostly left alone. SM addresses are global — every PE sees the same SM address space. The mapping registers expand physical capacity rather than providing per-context isolation.

Bank switching hurts when more live structures exist than fit in the directly-mapped physical space. The compiler can hint which structures are hot vs cold, and the SM controller can manage the mapping registers accordingly: fast path (bank already mapped → straight through) vs slow path (bank miss → remap → re-access).


SM Return Routing: Pre-Formed Token Templates#

The return routing field in SM READ requests (flit 2) is conceptualized as a pre-formed CM token template. The SM's result formatter simply latches the template, concatenates the read data, and pushes the result to the output. No bit-shuffling, no field packing — the requesting CM does all that work upfront when constructing the request.

Because the template is a full 16-bit flit carried in flit 2 (or flit 3 for extended formats), it has enough room to encode any token format whose routing information fits in a single flit. In practice this means any format OTHER than dyadic narrow can serve as a return route: dyadic wide, monadic normal, and monadic inline all fit their routing fields into flit 1. The SM does not need to understand the format — it just prepends the template as flit 1 and appends the read data as flit 2.

This means SM read results can land directly in a matching store slot as one operand of a dyadic instruction. The result does not need to pass through an intermediate monadic forwarding step — it can match against another token that arrived independently, enabling patterns like "fetch value from SM, combine with a locally-computed operand, produce result" in a single matching store cycle.

Implication: bits cannot be stolen from the return routing field for page register selection or other SM-side metadata, because the SM would need to parse its own return routing — defeating the purpose of making it an opaque blob.
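A sketch of the result formatter under this scheme; the function name and flit-list representation are illustrative, not part of the design:

```python
# SM result formatter under the template scheme: latch, concatenate, done.
# No bit-shuffling, no parsing of the template's fields.
def format_read_result(request_flits: list[int], cell_data: int) -> list[int]:
    template = request_flits[1]          # flit 2 of the READ request
    return [template & 0xFFFF,           # result flit 1: opaque CM routing
            cell_data & 0xFFFF]          # result flit 2: the read data
```

The SM never inspects the template's bits, which is exactly why none of them can double as SM-side metadata.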


Token Format Rework: 1-Bit SM/CM Split#

Motivation#

Eliminating type-11 (system) tokens and moving to a 1-bit SM/CM discriminator reclaims 1 bit in the SM flit. This bit can be spent on addressing, opcodes, or SM_id width.

Why Type-11 Can Be Eliminated#

Each type-11 subtype can be absorbed into existing types:

  • 11+00 (IO): IO module becomes a specialised SM or memory-mapped region within SM address space
  • 11+01 (config/IRAM load): becomes a CM opcode (see IRAM Write below)
  • 11+10 (debug/trace): reserved SM address range or special monadic opcode
  • 11+11 (reserved): not committed

Bootstrap/initial loading uses a dedicated hardware path (see Bootstrap section below), not runtime tokens.

IO as Memory-Mapped SM#

The IO module presents itself as an SM: reads and writes use the standard SM token format. I-structure semantics fit naturally — a READ from an IO device that has no data ready simply defers, and the response arrives when the device is ready. This provides interrupt-driven IO without interrupts.
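A behavioural sketch of that deferral, with hypothetical names (`IOCell`, `device_write`); the real mechanism is the SM presence-bit FSM in hardware, not software:

```python
# An IO device register behaving as an I-structure cell: a READ against
# an EMPTY cell defers, and the device's later write releases it.
class IOCell:
    def __init__(self):
        self.state = "EMPTY"
        self.value = None
        self.deferred = []               # return-routing templates, waiting

    def read(self, return_route):
        if self.state == "FULL":
            return [(return_route, self.value)]   # immediate response token
        self.deferred.append(return_route)        # defer; no interrupt needed
        self.state = "WAITING"
        return []

    def device_write(self, value):                # device becomes ready
        self.value, self.state = value, "FULL"
        released = [(r, value) for r in self.deferred]
        self.deferred.clear()
        return released                           # deferred responses fire now
```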

At v0 scale, dedicating one SM_id to IO leaves 3 SMs × 512 = 1536 structure cells. See "Bootstrap SM Ownership" below for how this interacts with the bootstrap path.

IRAM Write as CM Opcode#

From the CM's perspective, an IRAM write is "put this instruction word at this IRAM address." It needs no ctx, port, gen, or matching store access. Those bits become extra address or data bits.

The CM is also in the best position to know whether it can safely swap out a given instruction or whether tokens are in flight for it, enabling finer-grained hot-reload at runtime.

New Token Encoding#

═══════════════════════════════════════════════════════════════════
BIT[15] = 1: SM TOKEN
═══════════════════════════════════════════════════════════════════

Standard (2 flit):
  flit 1: [1][SM_id:2][op:3-5][addr:8-10]                     = 16
  flit 2: [data:16] or [return_routing:16]

  15 bits available. See "SM Opcode Width Options" below.

═══════════════════════════════════════════════════════════════════
BIT[15:14] = 00: DYADIC WIDE (hot path)
═══════════════════════════════════════════════════════════════════

  flit 1: [0][0][PE:2][offset:5][ctx:4][port:1][gen:2]        = 16
  flit 2: [data:16]

  offset:5 = 32 dyadic slots per context (doubled from 16)
  matching store addr = [ctx:4][offset:5] = 9 bits = 512 cells
  decode: bit[15]=0 AND bit[14]=0 → two gates

═══════════════════════════════════════════════════════════════════
BIT[15:13] = 010: MONADIC NORMAL (2 flit)
═══════════════════════════════════════════════════════════════════

  flit 1: [0][1][0][PE:2][offset:7][ctx:4]                    = 16
  flit 2: [data:16]

  offset:7 = 128 IRAM slots, unchanged
  No port, no gen

═══════════════════════════════════════════════════════════════════
BIT[15:13] = 011: MISC BUCKET (infrequent formats)
═══════════════════════════════════════════════════════════════════

  flit 1: [0][1][1][PE:2][sub:2][...9 bits...]                = 16

  sub=00: DYADIC NARROW (2 flit, 8-bit data)
    flit 1: [011][PE:2][00][offset:5][ctx:4]                   = 16
    flit 2: [data:8][port:1][gen:2][spare:5]                   = 16

  sub=01: IRAM WRITE (2+ flit)
    flit 1: [011][PE:2][01][iram_addr:7][flags:2]             = 16
    flit 2: [instruction_word_low:16]                          = 16
    (flit 3: [instruction_word_high:8][spare:8] if needed)
    7-bit addr = 128 IRAM slots, full coverage.
    No ctx/port/gen needed.

  sub=10: MONADIC INLINE (1 flit, trigger)
    flit 1: [011][PE:2][10][offset:4][ctx:4][spare:1]         = 16
    No flit 2.

  sub=11: SPARE
    Reserved. Candidates: extended monadic with wider offset,
    broadcast/multicast, debug/trace injection.

Summary Table#

prefix   format            flits  offset  ctx  port  gen   vs current
──────────────────────────────────────────────────────────────────────
1        SM standard       2      9-10    —    —     —     +1 bit (addr or op)
00       dyadic wide       2      5 (32)  4    1     2     +1 offset (was 4)
010      monadic normal    2      7 (128) 4    —     —     unchanged
011+00   dyadic narrow     2      5 (32)  4    1     2     unchanged
011+01   IRAM write        2-3    7 (128) —    —     —     NEW
011+10   monadic inline    1      4 (16)  4    —     —     unchanged
011+11   (spare)           ?      ?       ?    ?     ?     reserved

Hot Path Decode#

Two bits determine the fast-path pipeline:

  • bit[15] splits SM/CM: one gate
  • bit[14] splits dyadic-wide from everything else: one gate
  • the PE can start matching store SRAM read on [ctx:4][offset:5] the instant flit 1 is latched for dyadic-wide tokens (the dominant format)
  • monadic decode adds one more gate at bit[13]
  • the misc bucket is three gates deep, but nothing there is latency-critical
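The decode order falls straight out of the prefix table. A sketch; the helper names are illustrative, and the bit positions follow the field layouts above:

```python
# Prefix decode mirroring the gate-depth ordering: bit 15, then 14,
# then 13, then the 2-bit sub field.
def classify(flit1: int) -> str:
    if flit1 & 0x8000:                    # bit[15]=1: SM token, one gate
        return "sm"
    if not (flit1 & 0x4000):              # bits[15:14]=00: two gates
        return "dyadic_wide"              # hot path: SRAM read can start now
    if not (flit1 & 0x2000):              # bits[15:13]=010
        return "monadic_normal"
    sub = (flit1 >> 9) & 0x3              # misc: [011][PE:2][sub:2][...]
    return ("dyadic_narrow", "iram_write", "monadic_inline", "spare")[sub]

def match_addr(flit1: int) -> int:
    # dyadic wide layout: [0][0][PE:2][offset:5][ctx:4][port:1][gen:2]
    offset = (flit1 >> 7) & 0x1F
    ctx = (flit1 >> 3) & 0xF
    return (ctx << 5) | offset            # 9-bit matching store address
```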

Dyadic Narrow as Demoted Format#

Dyadic narrow carries 8-bit data but still costs two flits. Its main advantage is a wider offset field plus spare bits in flit 2. It is less broadly useful than dyadic wide (16-bit data, the common case) and comfortably shares the misc bucket with IRAM write and monadic inline — all three are infrequent relative to the two hot-path formats.

Monadic Offset Relative Addressing#

Dyadic instructions pack into the lowest IRAM offsets (0-15 wide, 0-31 narrow). Monadic tokens never target those slots. Making monadic offsets relative to the dyadic ceiling avoids wasting encodings:

wide mode:   16 dyadic slots → monadic base = 16
  6-bit relative offset → addresses 16-79 (64 monadic slots, all valid)

narrow mode: 32 dyadic slots → monadic base = 32
  6-bit relative offset → addresses 32-95 (64 monadic slots, all valid)

Hardware cost: since the dyadic ceiling is always a power of 2 (16 or 32), the base can be OR'd onto the high address bits. One gate plus a config bit.

SC block interaction: SC blocks need contiguous IRAM space that does not respect the dyadic-below-monadic packing rule. The compiler packs SC blocks into a separate IRAM region above the monadic ceiling, addressed via a base register set on entering SC mode.

SM Opcode Width Options#

With 15 bits available after the SM discriminator bit, the SM_id (2 bits) leaves 13 bits split between opcode and address. Three alternatives:

3-bit fixed: 8 ops, 10-bit addr (1024 cells). READ, WRITE, ALLOC, FREE, CLEAR, READ_INC, READ_DEC, CAS fills it exactly. No headroom, no decode complexity.

4-bit fixed: 16 ops, 9-bit addr (512 cells). Room for EXT, RAW_READ, EXEC, and 5 spare slots. Trivial decode. 11 defined operations with 5 spare gives comfortable expansion room.

Variable 3/5: one decode gate. Common ops get 3-bit opcode + 10-bit addr (1024 cells). Rare/special ops get 5-bit opcode + 8-bit payload (256 addresses or inline data).

op[2:1] != 11:  6 opcodes × 10-bit addr (1024 cells)
  READ, WRITE, ALLOC, FREE, CLEAR, EXT

op[2:1] == 11:  extends to 5-bit → 8 opcodes × 8-bit payload
  READ_INC, READ_DEC, CAS, RAW_READ, EXEC, SET_PAGE, WRITE_IMM, (spare)

Key insight for variable-width encoding: not all SM ops are op(address). Some are op(config_value) or op() with no cell operand. The 8-bit payload in the restricted tier can be inline data, config values, or range counts depending on the opcode:

  • EXEC: payload = length/count (base addr in config register)
  • SET_PAGE: payload = page register value
  • WRITE_IMM: 8-bit addr + 8-bit immediate (single-flit small constant writes, feasible if flit 2 is repurposed or omitted)
  • CLEAR_RANGE: base addr + count
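A decode sketch of the 3/5 tier split; the assignment of mnemonics to opcode slots within each tier is an assumption for illustration:

```python
# Flit 1 = [1][SM_id:2][13 bits of op + addr/payload]; op[2:1] != 11
# selects the 3-bit-opcode tier, per the tables above.
COMMON = ["READ", "WRITE", "ALLOC", "FREE", "CLEAR", "EXT"]
RARE = ["READ_INC", "READ_DEC", "CAS", "RAW_READ",
        "EXEC", "SET_PAGE", "WRITE_IMM", "SPARE"]

def decode_sm_flit(flit1: int):
    assert flit1 & 0x8000, "not an SM token"
    sm_id = (flit1 >> 13) & 0x3
    field = flit1 & 0x1FFF                 # 13 bits below SM_id
    op3 = field >> 10                      # top 3 bits
    if (op3 >> 1) != 0b11:                 # one decode gate on op[2:1]
        return sm_id, COMMON[op3], field & 0x3FF   # 10-bit cell address
    op5 = field >> 8                       # op[2:1]==11: extend to 5 bits
    return sm_id, RARE[op5 & 0x7], field & 0xFF    # 8-bit payload
```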

Bootstrap and EXEC Unification#

The EXEC Concept#

An SM operation that reads from a region of SM address space, bypasses the result token formatter, and pushes raw data onto the token bus as pre-formed tokens.

Bootstrap via EXEC#

At power-on, a small state machine (counter + comparator + a few gates) begins reading from a ROM base address. The ROM is mapped into the SM's extended address space (directly wired or via '610 mapper). Its contents are pre-assembled tokens: IRAM writes, routing config, and everything else needed to bring the system alive.

Bootstrap sequence:
  1. Power on, reset
  2. Bootstrap state machine starts clocking reads from ROM base
  3. SM reads ROM, pushes raw bytes onto token bus
  4. Those bytes ARE tokens — CM IRAM writes, SM config, routing setup
  5. CMs and SMs see valid tokens on the bus, process normally
  6. Bootstrap hits stop sentinel, halts
  7. System is loaded, bootstrap state machine goes idle
  8. Final token in ROM sequence could be a "go" trigger

Bootstrap hardware: ~5-8 chips (12-bit counter, stop comparator, bus driver, trigger logic, mux between bootstrap and runtime address sources).
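The sequence reduces to a counter and a comparator. A behavioural sketch; the sentinel value is an assumed placeholder:

```python
# Bootstrap sequencer: a counter clocking reads from the ROM base,
# pushing raw words onto the bus until the stop sentinel.
STOP_SENTINEL = 0xFFFF                   # assumed value

def bootstrap(rom: list[int], base: int = 0):
    addr = base                          # 12-bit counter in hardware
    while True:
        word = rom[addr]                 # SM read, result formatter bypassed
        if word == STOP_SENTINEL:        # comparator: halt, go idle
            break
        yield word                       # the raw word IS a token on the bus
        addr += 1                        # increment logic
```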

Bootstrap Bus Arbitration#

During bootstrap, nothing else transmits (nothing is loaded yet). The bootstrap SM gets unconditional bus access until completion. A single flip-flop ("bootstrap complete," active-low) gates other bus requesters. Normal arbitration activates after bootstrap signals done.

ROM Image Format#

The toolchain compiles the program, generates a token stream for loading, and packs it into a ROM image. No special bootstrap format or separate loader protocol — the ROM contains the same tokens that would flow on the bus during normal operation, pre-baked.

Token ordering in the ROM is critical: routing config must precede compute tokens (so the network knows where to route), IRAM loads must precede trigger tokens (so instructions exist before anything fires). This is a constraint on the assembler/linker output ordering.
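The ordering constraint is checkable at link time. A sketch using a single assumed total order (config before IRAM before data before triggers), which is stricter than the two pairwise rules but implies both; the kind labels are illustrative:

```python
# Link-time check of token ordering in the ROM image.
ORDER = {"routing_config": 0, "iram_write": 1, "data": 2, "trigger": 3}

def check_rom_order(token_kinds: list[str]) -> bool:
    ranks = [ORDER[k] for k in token_kinds]
    return ranks == sorted(ranks)        # non-decreasing = valid ordering
```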

Runtime EXEC#

At runtime, EXEC reuses the same address counter, bus output path, and sequencing logic as bootstrap. The differences:

  • Triggered by a token instead of a power-on reset signal
  • Address + length come from the token (or from a wide pointer) rather than "start at 0, go until sentinel"
  • Additional hardware: trigger latch, length register (~2 chips)

Runtime EXEC enables:

  • Checkpoint/restore: snapshot tokens into SM region, reload via EXEC
  • Code migration: EXEC a sequence that writes new IRAM to a target PE, updates routing, and re-injects pending tokens
  • Computed token streams: a CM builds a token sequence in SM via WRITEs, then triggers EXEC to emit them all at once
  • Bulk scatter: pre-formed token templates in SM, EXEC distributes data to PEs (faster than a CM emit loop)

Bus Bandwidth During EXEC#

EXEC monopolises the bus while clocking out tokens. At v0 scale this is acceptable. At scale, EXEC should be backpressure-aware — the SM only emits when the bus grants access, naturally interleaving with other traffic.

Security Consideration#

EXEC allows an SM to inject tokens that look like they came from any source, targeting any PE, with any opcode. At v0 this is a non-issue (same threat model as a 6502 — all code is trusted). At scale, EXEC should be gated behind privilege levels or a config bit controlling which SMs can use it.


Bootstrap SM Ownership#

Dedicated Bootstrap SM#

One SM (likely SM00) is wired to the system reset signal. On coming out of reset, it calls EXEC on a predetermined address in program storage (ROM or flash mapped into its extended address space).

The bootstrap program at the reset vector is responsible for:

  1. Commanding all other CMs and SMs to clear existing state
  2. Loading the program (IRAM writes, routing config, initial data)
  3. Emitting the initial trigger tokens to start execution

Any program loaded by the bootstrap must not issue commands to the bootstrap SM as part of its own loading sequence — doing so could interfere with the bootstrap process itself.

At runtime, the same EXEC behaviour can be triggered on any SM targeting any part of memory. The only thing special about SM00 is the reset-vector wiring.

Should the Bootstrap SM Be Further Specialised?#

This is an open question with trade-offs in both directions.

Arguments for specialisation (SM00 as "IO SM"):

Some SM opcodes (atomics, alloc/free) don't have meaningful semantics on memory-mapped IO addresses. On SM00 (aliased io in the assembler), the atomic and allocation opcodes could be repurposed for specialised IO operations. This avoids wasting opcode space on operations that don't apply to IO, and gives the IO subsystem its own instruction vocabulary within the existing token format.

SM00 would also be the natural home for program storage (ROM/flash) and the serial port, since the bootstrap hardware already requires access to external storage.

Arguments against specialisation:

Making SM00 special outside the boot process creates a bottleneck. If SM00 is the only path to program storage, serial IO, and other peripherals, it concentrates traffic. In a system with complex IO requirements or significant address space demand, a single special-cased SM limits scalability.

The more devices or address space mapped through SM00, the worse the contention. Other SMs cannot help share the load because they lack the specialised decode logic.

Middle ground (recommended for v0):

SM00 is special only at boot (reset vector wiring). At runtime, IO is memory-mapped into SM00's address space using the standard SM opcode set. Opcode reuse for IO-specific operations is deferred until profiling shows the standard opcodes are insufficient. Any SM can perform EXEC at runtime, so program loading and code migration are not locked to SM00.

This avoids committing to a specialised instruction set before knowing what IO operations actually need hardware support vs what the standard read/write/presence-bit model can handle.


Memory Tier Model#

Not all addressable storage needs synchronising memory semantics. Treating every cell as an I-structure is expensive (presence SRAM, FSM complexity, deferred read logic) and unnecessary for many use cases. The SM address space should support regions with different semantics tiers, selected by address range.

Tier Definitions#

Tier 0: Raw memory. No presence bits, no metadata, no I-structure semantics. Reads always return whatever is stored; writes always go through. No per-cell state means no consistency concerns — this tier can be trivially shared across SMs because there is nothing to keep synchronised. The SM hardware path for tier 0 is minimal: address decode, SRAM/DRAM read/write, done. No FSM, no deferred read register, no presence SRAM access.

Use cases: framebuffers, ROM, DMA buffers, lookup tables, program storage, any memory region where the programmer accepts "last writer wins" semantics or read-only access.

Tier 1: I-structure memory. The full presence-bit model. EMPTY/RESERVED/FULL/WAITING state per cell, deferred reads, write-once semantics with diagnostic on overwrite. Each cell has metadata in the presence SRAM. This is the synchronising memory that makes the dataflow architecture work — producer-consumer coordination without locks. Must be owned by a single SM because the metadata is authoritative.

Tier 2: Wide/bulk memory. Tier 1 plus the is_wide tag, sequencer support for ITERATE/COPY_RANGE, and bounds checking via wide pointer metadata. A superset of tier 1 that enables SM-local bulk operations. Same ownership constraints as tier 1.

Tier Selection by Address Range#

For v0, the tier boundary is a fixed address convention. Addresses below a threshold are tier 1 (I-structure); addresses above it are tier 0 (raw). The boundary may be a hardwired constant or a config register. The decode logic is one comparator.

The compiler and assembler know the layout and enforce placement: I-structure cells go in the tier 1 range, framebuffers and lookup tables go in tier 0. The ROM base address (and therefore the EXEC reset vector) is always in tier 0 at a fixed location, because the bootstrap hardware must know where to start reading without any configuration having occurred.

Future upgrade path: per-page tier tags. If the '610 mapper is present, each mapping register can carry 2 extra bits indicating the tier for that page. The mapper already translates logical→physical addresses; it additionally outputs "this page is tier 0/1/2" alongside the physical address. The SM FSM uses this signal to skip the presence check and deferred-read logic for tier 0 pages. Hardware cost: 2 extra bits per mapping register and one mux on the presence SRAM read-enable line.
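Both the v0 comparator and the per-page upgrade are one-liners in behavioural form. A sketch with an assumed boundary value and assumed 256-cell pages:

```python
# v0: one comparator against a fixed boundary. Upgrade path: 2 tier
# bits carried per '610 mapping register.
TIER_BOUNDARY = 0x100         # assumed: below = tier 1, at/above = tier 0

def tier_v0(addr: int) -> int:
    return 1 if addr < TIER_BOUNDARY else 0      # single comparator

def tier_paged(addr: int, page_tier_bits: list[int]) -> int:
    return page_tier_bits[addr >> 8]             # tag output by the mapper
```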

Implications for Shared Access#

Tier 0 regions do not need to be owned by a single SM. If there are no presence bits, there is no per-cell state to be inconsistent. Multiple SMs can map the same physical DRAM or ROM as tier 0, and concurrent access works without coordination — reads are instantaneous, writes are last-writer-wins. The hardware cost per SM for tier 0 access is essentially zero beyond address decode and the bus interface.

This directly solves the "program storage bottleneck on SM00" problem from the bootstrap discussion. If ROM is tier 0 and any SM can map it, then any SM can EXEC from it or issue READs to it. SM00 has the reset vector wiring for initial bootstrap, but at runtime program storage is accessible from any SM. No bottleneck.

For ROM specifically, the path is even simpler: tier 0 read-only. The SM does not need write logic for the ROM region. EXEC on ROM is the bootstrap path. Regular READs on ROM are just reads — no presence check, no state transition, the fastest possible SM path.


Address Space Distribution Across SMs#

The Core Tension#

Each SM manages presence bits and metadata for its own cells. This requires per-SM ownership of tier 1/2 cells, which inherently segments the memory space. The question is how to present this to the programmer and how to handle structures that don't fit neatly into one SM's address range.

Option Analysis#

Hard partitioned (current design). SM_id in the token selects the SM; address bits are local to that SM. Each SM owns its cells, its presence metadata, everything. The compiler/programmer must know which SM holds what.

Advantages: zero contention on metadata, SMs are completely independent, simplest hardware. Four SMs running in parallel can service four requests simultaneously. Wide pointers and bulk ops work trivially because all cells in a structure are SM-local.

Disadvantages: the programmer sees 4 small address spaces instead of one large one. A 200-element array must either fit in one SM (consuming nearly half its capacity at 512 cells) or be explicitly split across SMs by the compiler. No parallelism on accesses within one SM's address range.

Interleaved. Low address bits select the SM; high bits select the cell within it. The programmer sees a flat address space.

Advantages: flat addressing, automatic load balancing for sequential access patterns.

Disadvantages: locality is destroyed. Consecutive elements are on different SMs, which breaks wide pointers (cell[N] on SM0, cell[N+1] on SM1) and prevents SM-local bulk ops (ITERATE, COPY_RANGE). Wide pointers effectively rule out pure interleaving.

Block interleaved. Blocks of N cells (16-32) assigned to SMs round-robin. Consecutive cells within a block are on the same SM.

Advantages: flat-ish addressing, wide pointers work within blocks, some load balancing for large structures. Bulk ops work within a block.

Disadvantages: block boundaries create edge cases for structures that straddle two SMs. The compiler must be block-aware for optimal placement.

Segmented with toolchain abstraction (recommended for v0). Keep hard partitioning in hardware, but provide a toolchain abstraction layer. The assembler provides named memory regions mapping to SM_id + local address. The compiler handles placement. At runtime, the SM_id bits in the token explicitly select the SM, but the programmer writes STORE array[i] and the toolchain resolves the target.

Advantages: hardware stays simple, programmer does not manually manage SM placement, wide pointers and bulk ops work trivially (everything within a structure is on one SM). The "address space" is as large as the toolchain can represent.

Disadvantages: parallelism requires the compiler to deliberately distribute structures across SMs. A hot array on one SM cannot benefit from idle SMs. Essentially hard partitioning with better ergonomics.

v0: segmented with toolchain abstraction. Hard partitioning is the correct hardware choice because independent metadata, no contention, and no multi-port reads are prerequisites for the wide pointer and bulk operation features. The programmer experience problem is real but solvable in the toolchain. The compiler can be smart about placement: arrays that will be iterated go on one SM (for ITERATE), independent structures go on different SMs (for parallelism), hot accumulators go on SMs not busy with bulk ops.

Scale: block interleaving. At 8+ SMs with a dedicated CM×SM fabric, the network can route based on address bits without needing an explicit SM_id in the token. The SM_id effectively becomes the high address bits. This is a clean evolution from "explicit SM_id in the token" to "implicit SM selection from address bits" without changing the token format — the 2-3 SM_id bits are simply reinterpreted.
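The schemes above differ only in the address-to-(SM, local cell) map. A sketch of three of them, with assumed v0 parameters (4 SMs, 32-cell blocks):

```python
# Address maps for the SM-selection schemes. Parameters are assumptions.
NUM_SMS, BLOCK = 4, 32

def partitioned(sm_id: int, local: int):
    return sm_id, local                      # explicit SM_id in the token

def interleaved(flat: int):
    return flat % NUM_SMS, flat // NUM_SMS   # neighbours on different SMs

def block_interleaved(flat: int):
    blk, off = divmod(flat, BLOCK)           # 32 consecutive cells per SM
    return blk % NUM_SMS, (blk // NUM_SMS) * BLOCK + off
```

Note that `interleaved(n)` and `interleaved(n+1)` always land on different SMs, which is exactly the locality destruction that rules out pure interleaving for wide pointers.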

Cross-SM Wide Pointers#

A wide pointer can reference cells on a different SM if the base address includes an SM_id (or maps to a different SM through the address translation layer). When an SM processes an ITERATE on such a pointer, it cannot read the target cells locally. Instead, it emits READ tokens to the target SM for each element, acting as a "list controller" that orchestrates access across the system.

This is slower than SM-local iteration (bus traffic per element instead of internal reads) but papers over the distribution from the programmer's perspective. The programmer writes ITERATE; the hardware determines whether it can run locally or must go remote. Amamiya's SM list operations worked on a similar principle.

The resulting access patterns:

Local ITERATE (target cells on same SM):
  1 ITERATE token + N internal reads + N response tokens  = 2 + 2N flits

Remote ITERATE (target cells on different SM):
  1 ITERATE token + N remote READs + N responses back     = 2 + 4N flits

Remote iteration is 2× the bus traffic of local iteration, but still better than the CM orchestrating each access individually (which would also be 4N flits, plus the CM compute overhead per element).
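The flit counts as functions of element count N (all tokens are 2 flits):

```python
def local_iterate_flits(n: int) -> int:
    return 2 + 2 * n          # 1 ITERATE token + N response tokens

def remote_iterate_flits(n: int) -> int:
    return 2 + 4 * n          # plus N remote READ tokens in between
```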

Tier 0 Regions and Distribution#

Tier 0 (raw) memory sidesteps the distribution problem entirely. Because there are no presence bits or metadata, tier 0 regions can be shared across SMs without ownership constraints. A framebuffer mapped as tier 0 can be written by any SM, read by any SM, with no coordination overhead. The only concern is last-writer-wins semantics, which is acceptable for use cases like display buffers, shared lookup tables, and ROM.

This creates a natural split: tier 1/2 memory is partitioned across SMs for correctness (metadata ownership), while tier 0 memory is shared for convenience and throughput.


Wide Pointers and Bulk SM Operations#

Motivation#

Historical SM designs (including Amamiya's) treated SM as dumb storage: cells with presence bits, every operation one-cell-at-a-time, CM doing all orchestration. Amamiya's SM had CAR/CDR fields per cell for LISP cons cells.

The approach here is analogous to Rust's wide pointers: carry length + address metadata alongside the cell data. The SM becomes a smart memory controller that can operate on structures without per-element CM round-trips.

Inspired in part by near-data-processing research (e.g. SFU 2022 paper on smarter caches, DOI: 10.1145/3470496.3527380).

Wide Pointer = (address, length)#

An SM cell tagged as a "wide pointer" implicitly pairs with the next cell to form a (base_address, length) tuple. The SM can act on this metadata without CM involvement.

Capabilities Enabled#

Bounded slices: the SM knows the extent of a structure. Hardware bounds checking on READ rejects out-of-range accesses without a CM round-trip.

SM-local iteration (READ_NEXT / ITERATE): the SM maintains an internal cursor. The CM sends "give me the next element," the SM increments its pointer, and returns the value. One token in, one token out, repeated until exhaustion. The CM does not track the index.

Bus traffic for 64-element array iteration:
  without bulk ops:  64 READs + 64 responses                = 256 flits
  with ITERATE:      1 ITERATE + 64 responses                = 130 flits

SM-local memcpy/memmove (COPY_RANGE): source and destination slices provided, SM handles the copy internally. No per-element tokens on the bus.

  without:  64 READs + 64 responses + 64 WRITEs             = 384 flits
  with:     1 COPY_RANGE + 1 completion token                = 4 flits

SM-local string ops: a length-aware SM can perform compare, search, and concat on packed byte data (2 chars per 16-bit cell) without per-byte CM involvement.

Cell Type Tag Implementation#

Recommended: tagged cells. Widen per-cell metadata from 2-bit presence to 3+ bits: presence:2 + is_wide:1.

  • Cells that are not wide pointers pay zero overhead (is_wide = 0)
  • Wide pointer cells consume 2 cells (the pointer cell + the length/ metadata cell)
  • The SM knows to treat them as a unit because of the tag
  • Presence SRAM goes from 2 bits/cell to 3 bits/cell, still trivially fits the same small SRAM chip

Alternatives considered:

  • Paired cells: even addresses hold data, odd hold metadata. Simple, but halves effective cell count for all cells regardless of use.
  • Separate metadata SRAM: a third SRAM plane alongside data and presence. Does not eat cell count but costs additional chips.

Tagged cells are the most flexible option — only wide-pointer cells pay the cost of consuming two cells.
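A behavioural sketch of the 3-bit metadata and the implied cell pairing; class and helper names are illustrative:

```python
# Per-cell metadata: presence:2 + is_wide:1. A tagged cell pairs with
# the next cell to form a (base_address, length) tuple.
EMPTY, RESERVED, FULL, WAITING = range(4)

class PresenceSram:
    def __init__(self, cells: int):
        self.meta = [0] * cells                  # 3 bits per cell

    def set(self, addr: int, presence: int, is_wide: bool = False):
        self.meta[addr] = (presence & 0x3) | (int(is_wide) << 2)

    def presence(self, addr: int) -> int:
        return self.meta[addr] & 0x3

    def is_wide(self, addr: int) -> bool:
        return bool(self.meta[addr] >> 2)

def wide_pointer(sram: PresenceSram, cells: list[int], addr: int):
    assert sram.is_wide(addr), "not a wide-pointer cell"
    return cells[addr], cells[addr + 1]          # (base_address, length)
```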

Hardware Reuse: EXEC Sequencer = Bulk Op Engine#

The bootstrap/EXEC hardware and the bulk operation engine share nearly all their components:

Already present for EXEC:
  - Address counter
  - Limit comparator (stop sentinel)
  - Increment logic (READ_INC/DEC already requires 16-bit incrementer)
  - Output path to bus (result formatter bypass)
  - Internal read/write sequencing (FSM)

Additional for bulk ops:
  - Second address register (destination for COPY_RANGE)
  - Mode selector (determines per-element action)
  ~5-6 extra chips on top of the EXEC sequencer

All bulk operations are modes of the same sequencer:

  • EXEC: read cells, push raw to bus
  • ITERATE: read cells, push as formatted response tokens
  • COPY_RANGE: read cells, write to other cells internally
  • CLEAR_RANGE: write zeros across a cell range

Wide Pointer as Sequencer Parameter Block#

Without wide pointers, every bulk op requires the CM to send base+length as part of the command (3-flit tokens minimum; SM latches parameters from the token).

With wide pointers, the command is just "ITERATE cell[N]." The SM reads cell[N], sees the is_wide tag, reads cell[N+1] for the length, loads the sequencer, and runs. A single 2-flit token kicks off an arbitrarily long bulk operation.

The limit register IS the length from the wide pointer. The base address IS the address from the wide pointer. The wide pointer serves directly as the parameter block for SM-internal bulk operations.
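The parameter-block load in miniature — the cell indices and the tag bit position are illustrative assumptions:

```python
# "ITERATE cell[N]": the SM dereferences the wide pointer itself and
# loads the sequencer's base and limit registers from the two cells.
def load_params(data, meta, n):
    assert (meta[n] >> 2) & 1, "cell[N] must carry the is_wide tag"
    base = data[n]          # base register  := pointer cell
    length = data[n + 1]    # limit register := base + length cell
    return base, base + length

data = [0] * 16
meta = [0] * 16
data[4], data[5] = 8, 3     # wide pointer at cell 4: base=8, length=3
meta[4] = 0b110             # FULL + is_wide (example encoding)
base, limit = load_params(data, meta, 4)
# the sequencer now runs from base to limit with no further CM involvement
```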

Bus Contention During Bulk Ops#

Backpressure-based (recommended for v0): the SM only emits when it receives bus access, naturally interleaving with other traffic. The sequencer pauses when it cannot transmit. This falls out of normal bus arbitration and degrades throughput gracefully rather than starving other devices.

An alternative is interruptible sequencing with burst length limits (the SM processes N elements, yields the bus, resumes later). This adds a resume-state register but is probably unnecessary at v0 scale.
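A toy model of the backpressure scheme, assuming one arbitration grant per bus cycle (the grant pattern below is arbitrary, chosen to show interleaving):

```python
# Backpressure: the sequencer emits only on cycles where it holds the
# bus grant and simply holds state otherwise. No burst-length register,
# no resume-state register -- pausing falls out of the grant signal.
def run_with_backpressure(elements, grants):
    emitted, idx = [], 0
    for granted in grants:          # one bus-arbitration cycle per entry
        if idx >= len(elements):
            break
        if granted:
            emitted.append(elements[idx])
            idx += 1                # advance only when we could transmit
        # else: sequencer pauses; address counter unchanged
    return emitted

# Granted only every other cycle: the array still drains, at half rate,
# leaving alternate cycles free for other devices.
out = run_with_backpressure(list(range(4)), [1, 0, 1, 0, 1, 0, 1, 0])
```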

Complexity Boundary#

The SM should perform address arithmetic and counting, not data-dependent branching on cell values. Once the SM starts evaluating data content (e.g. "if value > threshold, do X"), it has crossed from smart memory controller into coprocessor territory and gate count escalates.

Acceptable SM-internal decisions:

  • Presence state checks (metadata the SM owns)
  • Wide pointer length/bounds checks (structural metadata)
  • Counter increment/compare (sequencer logic)

Belongs in the CM / dataflow graph, not the SM:

  • Conditional operations based on data values
  • Arbitrary arithmetic on cell contents
  • Data-dependent control flow

The existing atomic ops (READ_INC, READ_DEC, CAS) do modify data, but they are fixed-function single-cell operations, not sequenced bulk compute. They remain below the complexity line.
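What keeps the atomics below the line is that each is a single-address read-modify-write with no counting or sequencing — a sketch of their semantics:

```python
# READ_INC / CAS as fixed-function single-cell ops: one address, one
# internal RMW cycle, no counter or limit comparator involved.
def read_inc(cells, addr):
    old = cells[addr]
    cells[addr] = (old + 1) & 0xFFFF  # reuses the existing 16-bit incrementer
    return old

def cas(cells, addr, expect, new):
    old = cells[addr]
    if old == expect:
        cells[addr] = new
    return old                        # caller compares old to detect success

cells = [41, 7]
assert read_inc(cells, 0) == 41 and cells[0] == 42
assert cas(cells, 1, 7, 9) == 7 and cells[1] == 9
assert cas(cells, 1, 7, 0) == 9 and cells[1] == 9  # failed CAS: cell untouched
```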

Road Not Taken: SM Evolution Trajectory#

Hypothetical trajectory if someone had cracked the compiler problem circa 1975:

1975-78:  SM = dumb cells + presence bits (Amamiya baseline)
1979-82:  SM gains atomic ops, wide pointers (structure-aware memory)
1983-86:  SM gains bulk sequencer (EXEC/ITERATE/COPY_RANGE)
1987-90:  SM gets microcode ROM for near-data programs (filter, reduce)
1992+:    SM becomes a specialised vector unit next to memory
          (essentially what GPUs independently reinvented)

Compare to the actual historical trajectory: make the CPU faster, add cache layers to hide memory latency, add OoO to hide cache miss latency, add prefetchers to hide OoO limits — each layer compensating for the previous layer's failure to solve data-far-from-compute.


Presence Metadata SRAM Design#

Constraint#

Presence metadata must have single-cycle immediate access. In the historical scenario this means fast SRAM regardless of whether the data backing store is DRAM. The presence check determines the FSM's action (defer, satisfy, error) and cannot wait for DRAM latency.

Parallel Access#

Presence SRAM and data SRAM share address lines (same address driven to both chips simultaneously). The presence result feeds the FSM decision logic while the data result waits in a latch. If the FSM decides the data is needed (cell is FULL, op is READ), it is already available. If not (cell is EMPTY, defer), the data latch is simply ignored.

For historical DRAM backing: the DRAM data read can be speculative — kicked off in parallel with the presence check and cancelled or ignored if the cell turns out to be EMPTY.
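The parallel lookup in miniature — presence state names are from the base SM design; the defer path is stubbed out:

```python
# Presence SRAM and data store driven with the same address in the same
# cycle: the FSM decision uses the presence result, and the latched data
# word is either consumed (FULL + READ) or simply ignored (EMPTY -> defer).
EMPTY, RESERVED, FULL, WAITING = range(4)

def sm_read(presence, data, addr):
    state = presence[addr]     # fast SRAM, single cycle
    latched = data[addr]       # speculative read, same cycle (DRAM ok)
    if state == FULL:
        return ("SATISFY", latched)
    return ("DEFER", None)     # latched value discarded

presence = [FULL, EMPTY]
data = [0xBEEF, 0xDEAD]        # cell 1 holds stale garbage; never exposed
```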

Per-Cell Metadata Candidates#

Checked on every operation (fast path):

  • Presence state: 2 bits (EMPTY/RESERVED/FULL/WAITING)
  • is_wide tag: 1 bit (SM needs this before deciding whether to also read the next cell)

Checked on most operations (likely fast path):

  • Type tag: 1-2 bits (scalar, wide pointer, packed bytes — avoids wasted reads or misinterpretation of cell contents)
  • Write-once flag: 1 bit (hard error on overwrite, distinct from FULL, which permits overwrite with a diagnostic indicator)

Checked selectively (possibly fast path):

  • Owner/source: 2 bits (which CM "owns" this cell — for future access control)
  • Refcount indicator: 1 bit (flag that this cell uses atomic semantics, allowing ref-counted and non-ref-counted cells to coexist)

Recommended v0 commitment: presence:2 + is_wide:1 + spare:1

At 512 cells this is 256 bytes — trivially fits any small SRAM chip. Using a byte-wide SRAM chip means bits 4-7 are physically present whether used or not. Committing 4 bits with 4 spare avoids needing to change the presence SRAM layout when write-once enforcement or type tags are added during testing.
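One byte-wide SRAM entry per cell, with the committed low nibble and the spare high nibble — the exact bit assignments below are an assumption, not a committed layout:

```python
# Byte-wide presence-SRAM entry: 4 committed bits, 4 physically-present
# spares. Illustrative positions: presence in bits 0-1, is_wide in bit 2,
# the committed spare plus the free upper nibble folded into "spare".
def pack_meta(presence, is_wide, spare=0):
    assert presence < 4 and is_wide < 2 and spare < 32
    return (spare << 3) | (is_wide << 2) | presence

def unpack_meta(byte):
    return byte & 0b11, (byte >> 2) & 1, byte >> 3

# The spare bits can later take write-once or type tags without
# changing the SRAM layout -- only the decode of already-present bits.
m = pack_meta(presence=2, is_wide=1)
assert unpack_meta(m) == (2, 1, 0)
```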


Open Design Questions#

  1. SM opcode width — 3-bit fixed (maximum address range), 4-bit fixed (maximum opcodes, simplest decode), or variable 3/5 (both benefits, one extra decode gate)? Depends on how much the 512 vs 1024 cell difference matters in practice.

  2. Wide pointer cell format — is (base_addr, length) in two consecutive cells sufficient, or should the metadata cell also carry a type/tag byte? e.g. cell[N] = base_addr, cell[N+1] = length | element_type.

  3. EXEC stop condition — sentinel value in the data stream, or length register preloaded from a wide pointer or command? Sentinel is simpler for bootstrap (no length needed); a length register is safer for runtime EXEC (no risk of data accidentally matching the sentinel).

  4. Bulk op completion signalling — when ITERATE or COPY_RANGE finishes, does the SM emit a completion token? Could reuse the return routing from the original request to send a "done" signal that fires the next node in the dataflow graph.

  5. Bulk ops on non-FULL cells — if ITERATE reads a cell in WAITING state (an I-structure not yet fulfilled), should the sequencer stall per-cell, skip the cell, or error? Per-cell stalling risks blocking the sequencer indefinitely.

  6. IRAM write multi-flit handling — if instruction words are wider than 16 bits, IRAM write needs 3 flits. The misc bucket sub=01 format has 2-bit flags that could signal 2-flit vs 3-flit mode. Depends on final instruction word width.

  7. IO address space allocation — if IO becomes memory-mapped SM, which SM_id is reserved? Is IO mapped into a specific address range within SM00, or distributed across SMs?

  8. Relative monadic offset + SC blocks — SC blocks need their own IRAM region. Does the relative offset mechanism require a third base register for SC mode, or does SC bypass offset translation entirely (using raw IRAM addresses from a sequential counter)?

  9. Bootstrap SM specialisation — should SM00 repurpose unused opcodes (atomics, alloc/free) for IO-specific operations, or should it use the standard SM opcode set with IO handled purely through read/write and presence-bit semantics? Specialisation provides richer IO control; uniformity avoids creating a traffic bottleneck and simplifies the hardware.

  10. Tier boundary configuration — hardwired address split between tier 0 (raw) and tier 1 (I-structure), or config register settable at load time? Hardwired is simplest; a config register allows different programs to allocate different amounts of raw vs synchronised memory. The ROM base / reset vector address is always fixed regardless.

  11. Tier 0 write semantics — should tier 0 cells support writes from any SM (true shared memory), or only from the "home" SM with other SMs issuing remote write requests? True sharing is simpler but last-writer-wins can cause subtle bugs if two CMs write the same cell concurrently. Remote writes add a serialisation point.

  12. Cross-SM ITERATE — remote iteration is 2× the bus traffic of local iteration. Is this acceptable, or should the compiler be required to keep iterable structures SM-local? If required, the cross-SM path could be omitted from v0 and treated as an error.

  13. Per-page tier tags with '610 — the '610 mapper has 12-bit output registers. Are spare bits available for tier tags, or does the full 12-bit output go to physical addressing? If the physical address space does not need all 12 bits, the high bits could encode tier.