[libre-riscv-dev] GPU design

Fri Dec 7 10:27:54 GMT 2018

On Fri, Dec 7, 2018 at 1:19 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Mon, Dec 3, 2018 at 11:02 PM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >
> > I created a simple diagram of what I think would work for the ALUs and
> > register file for the GPU design. The diagram doesn't include forwarding
> > or pipeline registers.
> >
> >
> > https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg
>
>  so, coming back to this diagram, i think if we stratify the
> Functional Units into lanes as well, we may get a multi-issue
> architecture.
>
>  the 6600 scoreboard rules - which are awesomely simple and actually
> involve D-Latches (3 gates) *not* flip-flops (10 gates) can be
> executed in parallel because there will be no overlap between
> stratified registers.
>
Yeah, I was a little surprised when I heard that. I do think, however, that
we should use flip-flops instead of latches, since that makes the design much
easier (not having to worry about glitches and stuff) and doesn't use many
more resources.

>
>  if using that odd-even / msw-lsw division (instead of modulo 4 on the
> register number) it will be more like a 2-issue for standard RV
> instructions and a 4-issue for when SV 32-bit ops are loop-generated.
>
>  by subdividing the registers into odd-even banks we will need a
> _pair_ of (completely independent) register-renaming tables:
>   https://libre-riscv.org/3d_gpu/rat_table.png
>
>  for SIMD'd operations, if we have the same type of reservation
> station queue as with Tomasulo, it can be augmented with the
> byte-mask: if the byte-masks in the queue of both the src and dest
> registers do not overlap, the operations may be done in parallel.
>
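To make the byte-mask idea above concrete, here is a minimal sketch in Python.
It is only an illustration, not the proposed hardware: the names QueuedOp and
may_run_in_parallel are made up for this example, and an 8-bit mask per 64-bit
register (one bit per byte) is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (register number, byte-mask); one mask bit per byte is an assumption.
RegBytes = Tuple[int, int]


@dataclass
class QueuedOp:
    """Hypothetical reservation-station entry carrying byte-masks."""
    dest: RegBytes
    srcs: List[RegBytes] = field(default_factory=list)


def touches_same_bytes(a: RegBytes, b: RegBytes) -> bool:
    # Two accesses conflict only if they name the same register
    # AND their byte-masks share at least one set bit.
    return a[0] == b[0] and (a[1] & b[1]) != 0


def may_run_in_parallel(first: QueuedOp, second: QueuedOp) -> bool:
    if touches_same_bytes(first.dest, second.dest):          # write-after-write
        return False
    if any(touches_same_bytes(first.dest, s) for s in second.srcs):   # read-after-write
        return False
    if any(touches_same_bytes(second.dest, s) for s in first.srcs):   # write-after-read
        return False
    return True


# Two 32-bit ops writing the low and high halves of r5 do not overlap.
low_half = QueuedOp(dest=(5, 0x0F), srcs=[(6, 0x0F)])
high_half = QueuedOp(dest=(5, 0xF0), srcs=[(6, 0xF0)])
# An op that reads the bytes low_half writes must wait.
reader = QueuedOp(dest=(7, 0xFF), srcs=[(5, 0x03)])
```

With these definitions, low_half and high_half may be issued together, while
reader has to wait for low_half.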
>  i still have not yet thought through how the Reorder Buffer would
> work: here, again, i am tempted to recommend that, again, we
> "stratify" the ROB into odd-even (modulo 2) or perhaps modulo 4, with
> 32 entries, however the CAM is only 4-bit or 3-bit wide.
>
>  if an instruction's destination register does not meet the modulo
> requirements, that ROB entry is *left empty*.  this does mean that,
> for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo
> 4), and there are 4 sequential instructions that happen e.g. to have a
> destination of r4 for insn1, r24 for insn2, r16 for insn3.... etc.
> etc.... the ROB will only hold 8 such instructions
>
> and that i think is perfectly fine, because, statistically, it'll
> balance out, and SV generates sequentially-incrementing instruction
> registers, so *that* is fine, too.
>
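A small Python model of the stratified ROB described above, as a rough sketch.
It assumes (this is not stated in the quoted text) that each bank simply gets
total_entries / banks slots and that allocation stops in order as soon as an
instruction's bank is full. The function name is hypothetical.

```python
from typing import Iterable


def rob_instructions_held(dest_regs: Iterable[int],
                          banks: int = 4,
                          total_entries: int = 32) -> int:
    """Count how many in-order instructions fit before a bank fills up."""
    per_bank = total_entries // banks   # 8 entries per bank for 32 / 4
    used = [0] * banks
    held = 0
    for reg in dest_regs:
        bank = reg % banks              # stratification by register number
        if used[bank] == per_bank:
            break                       # in-order allocation stalls here
        used[bank] += 1
        held += 1
    return held


# Worst case: every destination lands in bank 0 (r4, r24, r16, ...).
worst = rob_instructions_held([4, 24, 16, 8, 12, 20, 28, 0, 4, 8])
# SV-style sequentially-incrementing destinations spread evenly.
best = rob_instructions_held(range(40))
```

In this model the all-multiples-of-4 sequence holds only 8 instructions, while
sequential destination registers use all 32 entries.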

I think we will need enough entries in the ROB that we have at least a few
more clocks than the latency of the divide unit when it's processing 32-bit
numbers (int or fp), so we'll probably need more than 8.
For a pipelined divider, if we give the divide unit 3-4 multipliers, then
we can shrink the latency to around 12 cycles by using the Newton-Raphson
method. Alternatively, we could implement a radix-4 pipelined divider that
would shrink the latency to around 16 cycles.
If we want to go with non-pipelined dividers, we will need more than one,
since we need a division per pixel and 30 cycles per pixel will eat
up all our performance.
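
As a rough illustration of the Newton-Raphson approach, here is a floating-point
sketch in Python. It is not a hardware design and says nothing about cycle
counts; the function names, the linear starting estimate, and the choice of 3
iterations are assumptions made for the example. It only handles d > 0.

```python
import math


def nr_reciprocal(m: float, iterations: int = 3) -> float:
    """Approximate 1/m for m in [0.5, 1) by Newton-Raphson iteration."""
    # Linear starting estimate; its relative error on [0.5, 1) is at most 1/17.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        # Each step costs two multiplies and one subtract,
        # and roughly squares the relative error.
        x = x * (2.0 - m * x)
    return x


def nr_divide(n: float, d: float, iterations: int = 3) -> float:
    """Compute n / d (d > 0 assumed) via a scaled reciprocal."""
    m, e = math.frexp(d)                 # d = m * 2**e with 0.5 <= m < 1
    return math.ldexp(n * nr_reciprocal(m, iterations), -e)


q = nr_divide(1.0, 3.0)
```

Starting from an error of at most 1/17, three iterations bring the relative
error down to roughly 1e-10, which is already below 32-bit float precision.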

If we do decide to use a pipelined divider, we could share 1 divider
between 2 cores, since that would be more than enough performance and would
increase the average latency by 1 cycle at most. If we do decide to share
the divider, we will need to take care that the division latency doesn't
become a side-channel that speculated instructions can leak info through.

Note that it should be somewhat easy to add sqrt and recip-sqrt operations
to most divider designs. Recip-sqrt is particularly useful for normalizing
vectors, which is a common graphics operation.
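
A sketch of why recip-sqrt fits vector normalization, again in Python and again
only as an illustration. The function names, the crude starting guess of 1.0,
and the default of 6 iterations are assumptions; real hardware would more
likely start from a lookup table and need fewer steps. It assumes x > 0.

```python
import math
from typing import List, Sequence


def rsqrt_newton(x: float, iterations: int = 6) -> float:
    """Approximate 1/sqrt(x) for x > 0 using Newton-Raphson."""
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    if e % 2:                     # make the exponent even so it halves exactly
        m *= 2.0
        e -= 1
    y = 1.0                       # crude initial guess (assumption)
    for _ in range(iterations):
        # Multiplies and one subtract only: no division needed.
        y = y * (1.5 - 0.5 * m * y * y)
    return math.ldexp(y, -e // 2)


def normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit length with one recip-sqrt and one multiply per lane."""
    inv_len = rsqrt_newton(sum(c * c for c in v))
    return [c * inv_len for c in v]


unit = normalize([3.0, 4.0, 0.0])
```

Normalizing this way needs one recip-sqrt per vector followed by multiplies,
rather than a sqrt followed by a division per component.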

> i'll keep working on diagrams, and also reading mitch alsup's chapters
> on the 6600.  they're frickin awesome.  the 6600 could do multi-issue
> LD and ST by way of having dedicated registers to LD and ST.  X1-X5
> were for ST, X6 and X7 for LD.
>
Have fun!

Jacob Lifshay