[libre-riscv-dev] GPU design

Sun Dec 9 12:38:33 GMT 2018

On Sun, Dec 9, 2018 at 12:09 PM Jacob Lifshay <programmerjake at gmail.com> wrote:

> Ok, to save time and prevent putting off the decision for too long, we can
> just go with the split reg file.

 my stress-indicators can climb back down a bit, now :)

> Note that AMDGPU doesn't split between int and fp registers and it is
> supported by both gcc and llvm, so it isn't a problem for compilers.

 that slightly (only slightly!) reassures me: it doesn't automatically
allocate us the time / resources to modify the compilers.

> We need to definitely implement proper sv context switching or prevent more
> than 1 process using the sv registers at a time otherwise it's a security
> hole.

 true.  as is the shared (tile) memory area (they always are).
normally, the security issue is "fixed" by leveraging the L1 cache
address mechanism (or, the TLB, more to the point), however we need to
look at this very carefully, to make sure that doing so doesn't result
in a large power hit.

> I still think it's a good idea to build the prototype register allocator,
> but targeting the split reg file, to get the algorithms worked out before
> getting into the complexity of existing code in llvm.

 do you mean the "renaming" mechanism? or the multiplexor?  or the
Reservation Stations... or all of the above? :)

 it's an absolutely critical part of the whole project.  i'm now
convinced that a *proper* (modernised) 6600 design is how we should
progress, thanks to insights from mitch
https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ

not even the Reorder Buffer is needed, as the combination of
Dependency Matrices *and* scoreboard - including features that are
totally left out of the academic literature - give everything we need.

even operand forwarding is part of the 6600 design!  that's thanks to
the same-cycle "write-through" capability of the 6600 Register File.
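
a rough sketch of what same-cycle "write-through" means for operand
forwarding, as a toy behavioural model only.  the class name, the
`cycle` method and the `write_through` flag are made up for
illustration and are not taken from the 6600 or from any existing
code: a value written in a given cycle is visible to reads in that
same cycle, so a waiting Functional Unit picks it up without a
separate bypass path.

```python
class WriteThroughRegFile:
    """Toy 3R1W register file (hypothetical model, not the real design)."""

    def __init__(self, num_regs=32):
        self.regs = [0] * num_regs

    def cycle(self, write=None, reads=(), write_through=True):
        """Perform one clock cycle: at most one write, up to three reads.

        write: (reg, value) tuple or None.
        reads: register numbers to read this cycle.
        """
        assert len(reads) <= 3, "only three read ports in this sketch"
        before = list(self.regs)          # contents at the start of the cycle
        if write is not None:
            reg, value = write
            self.regs[reg] = value
        # With write-through, readers see the value written this cycle;
        # without it, they would see the stale start-of-cycle contents.
        source = self.regs if write_through else before
        return [source[r] for r in reads]


rf = WriteThroughRegFile()
forwarded = rf.cycle(write=(5, 42), reads=[5, 6])

rf_stale = WriteThroughRegFile()
stale = rf_stale.cycle(write=(5, 42), reads=[5, 6], write_through=False)
```

with write-through the reader of r5 gets the new value in the same
cycle; without it, the reader would need either an extra cycle or a
dedicated forwarding path.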

the current design layout that i am favouring is the 3R1W
4-way-multiplexed diagram you drew, with *four* sets of Functional
Units for the majority of int/fp operations, where the instruction
issue phase typically allocates modulo 4 SV-expanded operations in a
stripe across those 4 sets.
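
the striping above can be sketched roughly as follows.  this is a
minimal illustration under assumptions: `ElementOp`, `sv_expand` and
`stripe` are hypothetical names, SV expansion is modelled as simply
incrementing each register number per element, and the FU set is
picked from the destination register number modulo 4 (the text only
says "modulo 4", so the exact selector is a guess).

```python
from collections import namedtuple

# One element-level operation produced by expanding a vector instruction.
ElementOp = namedtuple("ElementOp", "opcode rd rs1 rs2")

NUM_FU_SETS = 4  # four sets of Functional Units, one per regfile bank


def sv_expand(opcode, rd, rs1, rs2, vl):
    """Expand one vectorised op into `vl` scalar element ops (assumed scheme)."""
    return [ElementOp(opcode, rd + i, rs1 + i, rs2 + i) for i in range(vl)]


def stripe(ops, num_sets=NUM_FU_SETS):
    """Allocate element ops across the FU sets, modulo `num_sets`."""
    sets = [[] for _ in range(num_sets)]
    for op in ops:
        sets[op.rd % num_sets].append(op)
    return sets


fu_sets = stripe(sv_expand("add", rd=8, rs1=16, rs2=24, vl=6))
dest_regs = [[op.rd for op in s] for s in fu_sets]
```

with a vector length of 6 starting at r8, the first two FU sets each
receive two element operations and the other two receive one each.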

*some* of those 4 FUs will actually be front-ends to the exact same
ALU.  the technique for doing so (identifying and reducing / merging
some of the gates) is outlined in the 2nd chapter of mitch's unpublished
book.  he calls these "concurrent computation units" and they're
typically pipelined units, so are quite capable of absorbing 4 sets of
incoming operands.  a bank of 8 output-latched stores can keep the
results of the computations without having to propagate a stall back
down the instruction chain.

one of the really really important things that's completely missing
from academic literature is the difference between the "Go Read"
signals (which are mentioned prominently in 6600 academic literature)
and the "Go Write" signals (which are NOT properly mentioned).

the "Go Write" signals basically go directly to the Register File...
however they go via the *dependency matrix*, where *other functional
units* have the opportunity to stop them from occurring.

*this* is how things like speculative branch execution, precise
exceptions, and memory hazards are all handled.
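
a toy model of the "Go Write" gating described above, as a sketch
only: `WriteGate` and its methods are hypothetical names, and the
matrix here is a plain boolean table rather than the actual 6600 gate
logic.  each row is a Functional Unit that wants to write; each column
is another unit that may hold that write back (for example an
unresolved branch, or a unit that has not yet read the old value).

```python
class WriteGate:
    """Toy FU-to-FU dependency matrix gating Go Write (illustrative only)."""

    def __init__(self, num_units):
        # blocks[w][b] is True while unit b is holding back unit w's write.
        self.blocks = [[False] * num_units for _ in range(num_units)]
        # done[w] is True once unit w has a result latched and ready.
        self.done = [False] * num_units

    def block(self, writer, blocker):
        self.blocks[writer][blocker] = True

    def unblock(self, writer, blocker):
        self.blocks[writer][blocker] = False

    def finish(self, unit):
        self.done[unit] = True

    def go_write(self, unit):
        """Go Write fires only when the result is ready and nobody objects."""
        return self.done[unit] and not any(self.blocks[unit])


BRANCH, ADD = 0, 1
gate = WriteGate(num_units=2)

gate.block(ADD, BRANCH)        # add was issued speculatively under the branch
gate.finish(ADD)               # add result is computed and latched
held = gate.go_write(ADD)      # still held back: branch is unresolved

gate.unblock(ADD, BRANCH)      # branch resolves in favour of the add
released = gate.go_write(ADD)  # now the write may reach the register file
```

in this sketch the speculative result sits in its output latch until
the branch unit drops its objection; had the branch gone the other
way, the result could simply be discarded without the register file
ever having been touched.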

it's frickin awesome.  and not a single CAM or Reorder Buffer or
operand forwarding special-case or any kind of pipeline bypass
mechanism in sight!

in short i'm absolutely astounded and deeply impressed.

anyway: 4-level register-file stratification plus 4 sets of
"Functional Units" will provide the parallelism we're looking for, yet
without the power-hungry disadvantages which would come with a Reorder
Buffer.

l.