[libre-riscv-dev] GPU design

Fri Dec 7 11:37:41 GMT 2018

On Fri, Dec 7, 2018 at 10:28 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Fri, Dec 7, 2018 at 1:19 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:

> >  the 6600 scoreboard rules - which are awesomely simple and actually
> > involve D-Latches (3 gates) *not* flip-flops (10 gates) can be
> > executed in parallel because there will be no overlap between
> > stratified registers.
> >
> Yeah, I was a little surprised when I heard that. I do think, however, that
> we should use flip-flops instead of latches since it makes it much easier
> to design (not having to worry about glitches and stuff) and doesn't use
> much more resources.

 well, yosys has an option to disable generation of d-latches
(substituting flip-flops instead).  so if the source code is designed
to do d-latches and they do turn out to be problematic, they can be
eliminated.

 of course, that's if migen (or whatever we decide on) actually allows
verilog that can *be* turned into d-latches.

 we may have to be careful on this one (resource-wise), as we may end
up with O(N^2) in several places: FU-to-FU dependency matrices for
example.  just have to see.

> I think we will need enough entries in the ROB that we have at least a few
> more clocks than the latency of the divide unit when it's processing 32-bit
> numbers (int or fp), so we'll probably need more than 8.

the intel processors have 32 (and a separate Reservation Station table
with the same order of size)

if we also have 32, divided down modulo 4 (such that the first 2 bits
of the ROB# *must* be equal to those of the Dest Reg#), we not only have a
cleaner way to do 4-wide instruction issue, the bit-width of the ROB
CAM is reduced from 8 bit (1 for INT/FP, 7 for Reg#) down to 6.
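
 a rough sketch of the tag arithmetic, in python.  the field widths (1
bit INT/FP, 7 bits Reg#) are the ones given above; the function names
are made up for illustration, and it assumes "modulo 4" means the low 2
bits of the Reg# pick the bank.  a sketch under those assumptions, not
a design.

```python
# Sketch only: widths follow the mail (1 bit INT/FP + 7 bits Reg#).
# All names here are hypothetical, not taken from any existing code.

NUM_BANKS = 4   # "divided down modulo 4"
BANK_BITS = 2   # log2(NUM_BANKS)
REG_BITS = 7    # Reg# width
TYPE_BITS = 1   # INT/FP selector

FULL_TAG_BITS = TYPE_BITS + REG_BITS                # unbanked CAM tag
BANKED_TAG_BITS = TYPE_BITS + REG_BITS - BANK_BITS  # per-bank CAM tag


def full_tag(is_fp: bool, reg: int) -> int:
    """Tag an unbanked CAM would compare: INT/FP bit plus the full Reg#."""
    return (int(is_fp) << REG_BITS) | reg


def bank_of(reg: int) -> int:
    """Bank (lane) a destination register falls into: Reg# modulo 4."""
    return reg % NUM_BANKS


def banked_tag(is_fp: bool, reg: int) -> int:
    """Tag compared inside one bank; the low 2 Reg# bits are implied by the bank."""
    return (int(is_fp) << (REG_BITS - BANK_BITS)) | (reg >> BANK_BITS)
```

 with these widths FULL_TAG_BITS is 8 and BANKED_TAG_BITS is 6, and the
pair (bank_of(reg), banked_tag(is_fp, reg)) still identifies every
register uniquely.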

> For a pipelined divider, if we give the divide unit 3-4 multipliers, then
> we can shrink the latency to around 12 cycles by using the newton-raphson
> method. Alternatively, we could implement a radix-4 pipelined divider that
> would shrink the latency to around 16 cycles.

 well, the nice thing is: whatever the time (and even if there's no
pipelining) there's no knock-on design impact, as long as, yes, it's
kept below the ROB size.
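
 for reference, a small python sketch of the newton-raphson reciprocal
iteration mentioned in the quote.  it works in ordinary floating point
on a divisor already scaled into [0.5, 1), and uses the textbook linear
first guess 48/17 - 32/17*d; none of that is taken from an actual
divider design, it only shows why so few iterations are needed (the
relative error squares on every step).

```python
# Sketch only: floating-point model of Newton-Raphson reciprocal.
# The starting estimate and the function names are assumptions made
# for illustration, not part of any proposed hardware.


def nr_reciprocal(d: float, iterations: int) -> float:
    """Approximate 1/d for d in [0.5, 1) using Newton-Raphson steps."""
    if not 0.5 <= d < 1.0:
        raise ValueError("divisor must be pre-scaled into [0.5, 1)")
    x = 48.0 / 17.0 - (32.0 / 17.0) * d  # linear first guess, error <= 1/17
    for _ in range(iterations):
        x = x * (2.0 - d * x)            # two multiplies per step
    return x


def rel_error(d: float, iterations: int) -> float:
    """Relative error |d * x - 1| of the estimate after the given steps."""
    return abs(nr_reciprocal(d, iterations) * d - 1.0)
```

 starting from an error of at most 1/17, three steps are enough to get
below 2**-24 in this model.  how that maps onto clock cycles depends
entirely on the multiplier latency, which this sketch does not model.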

 where things might go a little bit astray is here: if a chain of
divide operations gets issued on the same stratification destination
register (i.e. modulo 4 the dest regs come up with the same
bank/lane).

 in the scheme i propose, that *will* result in a lot of empty ROB
slots.  is that ok? i honestly have no idea.  is it a likely scenario?
i have no idea... however, hmmm, it should be easy enough to check,
using the spike instruction trace analyser written by an IIT Madras
student, known as "RiTA".
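
 the kind of check meant here could look something like the python
sketch below.  it does *not* use RiTA or parse real spike output: the
trace is assumed to be already reduced to (mnemonic, dest reg number)
pairs, and that format plus the function name are hypothetical.  it
counts how the divides spread across the 4 banks and the longest run of
successive divides that land in the same bank.

```python
from collections import Counter

NUM_BANKS = 4
# RISC-V integer and floating-point divide/remainder mnemonics.
DIV_OPS = {"div", "divu", "rem", "remu", "divw", "divuw", "remw", "remuw",
           "fdiv.s", "fdiv.d"}


def divide_bank_stats(trace):
    """trace: iterable of (mnemonic, dest_reg_number) pairs (hypothetical format).

    Returns (per-bank divide counts, longest run of successive divides
    whose destination registers fall in the same bank).
    """
    per_bank = Counter()
    longest = current = 0
    last_bank = None
    for mnemonic, rd in trace:
        if mnemonic not in DIV_OPS:
            continue                      # only divides are of interest here
        bank = rd % NUM_BANKS
        per_bank[bank] += 1
        current = current + 1 if bank == last_bank else 1
        longest = max(longest, current)
        last_bank = bank
    return per_bank, longest


# Hand-made example trace, not real program output.
example = [("div", 4), ("add", 5), ("div", 8), ("fdiv.s", 12), ("div", 9)]
counts, run = divide_bank_stats(example)
```

 in the hand-made example three divides in a row hit bank 0, which is
the case that would leave ROB slots empty in the other banks.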

> If we want to go with non-pipelined dividers, we will at least need more
> than 1 since we need a division per pixel and 30-cycles per pixel will eat
> up all our performance.

 and it may be a good idea to do so, because we definitely want 1 per
stratification layer.

> If we do decide to use a pipelined divider, we could share 1 divider
> between 2 cores, since that would be more than enough performance and would
> increase the average latency by 1 cycle at most. If we do decide to share
> the divider, we will need to take care that the division latency doesn't
> become a side-channel that speculated instructions can leak info through.

 interesting.

 well, with the stratification proposal, the divider could
hypothetically be shared across banks, instead.

> Note that it should be somewhat easy to add sqrt and recip-sqrt operations
> to most divider designs. Recip-sqrt is particularly useful for normalizing
> vectors, which is a common graphics operation.

 i heard that, yes.  have you seen that hilarious approximation where
you subtract from the magic number 0x5f3759df?
   https://en.wikipedia.org/wiki/Fast_inverse_square_root

what it does is alias a float to an int as a way to approximate
log2(x).  that's then used as a way to approximate log2(1/sqrt(x)).
aliased back to float you get an approximation of pow(x,-0.5), and
it's accurate to around 3.5%.

which is really funny.
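
 here is the trick as a python sketch, using struct to do the float/int
aliasing on 32-bit values.  the magic constant is the one from the
wikipedia page; the helper names and the optional newton step count are
my own additions for illustration, and it only makes sense for
positive, normal inputs.

```python
import struct

MAGIC = 0x5F3759DF  # constant from the Wikipedia article linked above


def float_to_bits(x: float) -> int:
    """Reinterpret a value, rounded to float32, as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(i: int) -> float:
    """Reinterpret an unsigned 32-bit int as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", i))[0]


def fast_rsqrt(x: float, newton_steps: int = 0) -> float:
    """Approximate 1/sqrt(x) for positive, normal x."""
    # Halving and negating the bit pattern approximates -0.5 * log2(x).
    y = bits_to_float(MAGIC - (float_to_bits(x) >> 1))
    for _ in range(newton_steps):
        y = y * (1.5 - 0.5 * x * y * y)  # optional Newton-Raphson refinement
    return y
```

 with newton_steps=0 this is the raw bit trick, which is where the
roughly 3.5% figure comes from; each newton step tightens it
considerably.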

>
> > i'll keep working on diagrams, and also reading mitch alsup's chapters
> > on the 6600.  they're frickin awesome.  the 6600 could do multi-issue
> > LD and ST by way of having dedicated registers to LD and ST.  X1-X5
> > were for LD, X6 and X7 for ST.
> >
> Have fun!

 i spilled coffee on them already :)

l.