[libre-riscv-dev] GPU design

Luke Kenneth Casson Leighton lkcl at lkcl.net
Sun Dec 9 17:16:20 GMT 2018

ok so i believe i have a handle on a scoreboard-style
reorder-buffer-less register renaming design:

* a "register alias table" (aka RAT) has a DIRECT correspondence: the
physical register file **EQUALS** the architectural register file
* the RAT entry points to a reservation station *row* (corresponding
to a functional unit "line", see below)
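a minimal sketch of such a RAT, in python, purely for illustration.  the
class and method names, the 32-entry size, and the "only clear the entry
if this row is still the latest producer" rule are all my assumptions
(borrowed from standard Tomasulo practice), not something settled above:

```python
from typing import List, Optional

NUM_REGS = 32  # assumption: one RAT entry per architectural register


class RegisterAliasTable:
    """One entry per architectural register (no separate physical file).

    An entry is None when the register file already holds the current value,
    otherwise it is the reservation station row that will produce it.
    """

    def __init__(self, num_regs: int = NUM_REGS) -> None:
        self.rows: List[Optional[int]] = [None] * num_regs

    def rename_dest(self, reg: int, rs_row: int) -> None:
        # A newly issued instruction becomes the latest producer of `reg`.
        self.rows[reg] = rs_row

    def lookup_src(self, reg: int) -> Optional[int]:
        # None: read the register file. Otherwise: wait on that row.
        return self.rows[reg]

    def writeback(self, reg: int, rs_row: int) -> None:
        # Assumption: only the latest producer may clear the entry, so an
        # older in-flight write cannot hide a newer pending one.
        if self.rows[reg] == rs_row:
            self.rows[reg] = None
```

usage would be along the lines of: `rename_dest(1, 3)` at issue,
`lookup_src(1)` for a later reader of r1, `writeback(1, 3)` on completion.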


* each functional unit has a one-buffer-deep reservation station and
there are at least 2 of them OR
* each functional unit has a two-deep reservation station and they
have *separate* src1 and src2 "ready" lines

these two being effectively the same thing.
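to make the "separate src1 and src2 ready lines" idea concrete, here is a
rough sketch of one reservation station row.  the field and property
names are hypothetical, and a real row would also carry the opcode and
destination:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationRow:
    # None means "operand not captured yet" (hypothetical encoding).
    src1: Optional[int] = None
    src2: Optional[int] = None

    @property
    def src1_ready(self) -> bool:
        return self.src1 is not None

    @property
    def src2_ready(self) -> bool:
        return self.src2 is not None

    @property
    def can_issue(self) -> bool:
        # The functional unit may start only once both operands are in.
        return self.src1_ready and self.src2_ready
```

each operand has its own ready line, so src1 and src2 can be filled in
on different cycles and from different producers.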

youtube video p4SdrUhZrBM is important to watch.  at time 8:14 the
author points out that the two FUs, adder and mul unit, *both*
produce an "r1 dest".  he says "and there is some logic which chooses
which of these shall be first".

in a Reorder Buffer, that logic is, "the instruction which is at the
head of the queue".  so, that's what we need: an instruction sequence
number, where the lowest sequence number (indicating the oldest
instruction) "wins".

the "winner" will get priority to write its result (CDC 6600
"Go_Write" signal) to the register file.  this is directly equivalent
to a Reorder Buffer "head of queue commit".
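the "lowest sequence number wins" rule can be sketched as below.  the
`WriteRequest` type and `pick_go_write` function are hypothetical names,
and the sketch ignores sequence-number wrap-around, which real hardware
with a finite-width counter would have to handle:

```python
from typing import NamedTuple, Optional, Sequence


class WriteRequest(NamedTuple):
    seq: int    # instruction sequence number, lower means older
    fu: str     # name of the functional unit asking to write
    dest: int   # destination register number
    value: int  # result to be written


def pick_go_write(requests: Sequence[WriteRequest]) -> Optional[WriteRequest]:
    """Return the request that should receive Go_Write this cycle."""
    if not requests:
        return None
    # Oldest instruction wins, like the head of a reorder buffer queue.
    return min(requests, key=lambda r: r.seq)
```

so if the adder (say seq 7) and the mul unit (say seq 5) both want to
write r1, the mul unit's request is chosen first.  those two sequence
numbers are made up for the example.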

now, what i think will also need to be transmitted into (and through)
the Register File is an encoding of which Functional Unit / Reservation
Station row (and which of src1 or src2) is waiting on the value.  i.e.
the single-bit "ready" lines of a Scoreboard need to be packed down to a
unique index, dropped through the Register File, passed *out* the other
side, and "unpacked" again.

the reason for this i believe is down to the fact that whilst the 6600
Register File is capable of passing through write values out the other
side onto "reads", what we *don't* know is: which *Reservation*
Station is supposed to pick that up (on both src1 and src2).

so, rather than pass through 50 bits of src1/src2 "ready" bits, create
a much smaller (7 bit or so) offset/index, pass that through, and
de-multiplex it on the other side.
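a sketch of that pack / unpack step, modelling the "ready" lines as a
python int used as a bit-vector.  the function names are hypothetical.
note that 50 one-hot lines need only 6 bits for the index itself; a 7th
bit could, for instance, act as a "valid" flag, though that is my guess
and not something decided here:

```python
def pack_one_hot(ready_bits: int) -> int:
    """Encode a one-hot bit-vector as the index of its single set bit."""
    if ready_bits <= 0 or ready_bits & (ready_bits - 1):
        raise ValueError("expected exactly one bit set")
    return ready_bits.bit_length() - 1


def unpack_index(index: int, width: int) -> int:
    """De-multiplex an index back into a one-hot vector of `width` lines."""
    if not 0 <= index < width:
        raise ValueError("index out of range")
    return 1 << index


def index_bits(width: int) -> int:
    """Number of bits needed to index `width` one-hot lines."""
    return max(1, (width - 1).bit_length())
```

the round trip `unpack_index(pack_one_hot(x), 50) == x` holds for any
one-hot `x` below `1 << 50`.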

in effect, this "packed" representation is directly equivalent to the
Tomasulo "Reservation Station Number".  the "packed" representation
would normally be sent on a Common Data Bus, however the CDB in the
6600 design has been replaced with the pass-through register file.

the 4x multiplexing may still be applied on top of this, to give 4
simultaneous instruction issue, striped register files and so on.  the
only thing that we would have to watch out for is: with 4 simultaneous
instructions and far more results being generated, the instruction
sequence order now needs to be preserved *at the register file
multiplexers* as well.

so, the instruction sequence number needs to be the means of
prioritising the regfile 4-way multiplexers.  lowest (oldest)
sequential number always wins, and in this way i believe we may ensure
that even with up to 4 instructions completing at once, the commit
order is always preserved.
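a sketch of per-bank arbitration, assuming (my assumption, not decided
above) that the register file is striped 4 ways by register number
modulo 4 and that each bank accepts one write per cycle.
`arbitrate_banks` is a hypothetical name:

```python
from typing import Dict, Iterable, Tuple

NUM_BANKS = 4  # assumption: 4-way striping by register number

# (sequence number, destination register, value)
Request = Tuple[int, int, int]


def arbitrate_banks(requests: Iterable[Request]) -> Dict[int, Request]:
    """Pick, for each bank, the oldest pending write request."""
    winners: Dict[int, Request] = {}
    for req in requests:
        seq, dest, _value = req
        bank = dest % NUM_BANKS
        # Lowest (oldest) sequence number wins within a bank.
        if bank not in winners or seq < winners[bank][0]:
            winners[bank] = req
    return winners
```

this only orders writes that compete for the *same* bank.  it says
nothing about ordering across different banks, which a full design
would still have to address.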

it may be more complex than that, now that i think about it.

