[libre-riscv-dev] GPU design
programmerjake at gmail.com
Fri Dec 7 10:27:54 GMT 2018
On Fri, Dec 7, 2018 at 1:19 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
> On Mon, Dec 3, 2018 at 11:02 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> > I created a simple diagram of what I think would work for the ALUs and
> > register file for the GPU design. The diagram doesn't include forwarding
> > pipeline registers.
> so, coming back to this diagram, i think if we stratify the
> Functional Units into lanes as well, we may get a multi-issue
> the 6600 scoreboard rules - which are awesomely simple and actually
> involve D-Latches (3 gates) *not* flip-flops (10 gates) can be
> executed in parallel because there will be no overlap between
> stratified registers.
Yeah, I was a little surprised when I heard that. I do think, however, that
we should use flip-flops instead of latches, since they make the design much
easier (no need to worry about glitches and the like) and don't cost many
more resources.
> if using that odd-even / msw-lsw division (instead of modulo 4 on the
> register number) it will be more like a 2-issue for standard RV
> instructions and a 4-issue for when SV 32-bit ops are loop-generated.
> by subdividing the registers into odd-even banks we will need a
> _pair_ of (completely independent) register-renaming tables:
> for SIMD'd operations, if we have the same type of reservation
> station queue as with Tomasulo, it can be augmented with the
> byte-mask: if the byte-masks in the queue of both the src and dest
> registers do not overlap, the operations may be done in parallel.
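The byte-mask check described above can be sketched roughly as follows. This is only an illustrative software model, not a hardware design: the names RegAccess, QueuedOp and can_run_in_parallel are made up for this example, and it assumes one mask bit per register byte and that RAW, WAW and WAR overlaps all block parallel execution.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RegAccess:
    reg: int        # register number
    byte_mask: int  # bit i set => byte i of the register is touched


@dataclass(frozen=True)
class QueuedOp:
    dest: RegAccess
    srcs: Tuple[RegAccess, ...]


def overlaps(a: RegAccess, b: RegAccess) -> bool:
    """Two accesses collide only if they hit the same register AND share bytes."""
    return a.reg == b.reg and (a.byte_mask & b.byte_mask) != 0


def can_run_in_parallel(first: QueuedOp, second: QueuedOp) -> bool:
    """True if `second` has no byte-level hazard against the earlier `first`."""
    # RAW: second reads bytes that first writes
    if any(overlaps(first.dest, s) for s in second.srcs):
        return False
    # WAW: both write some of the same bytes
    if overlaps(first.dest, second.dest):
        return False
    # WAR: second overwrites bytes that first still has to read
    if any(overlaps(s, second.dest) for s in first.srcs):
        return False
    return True


# Two 32-bit SIMD halves of the same 64-bit registers: low bytes vs high bytes.
lo_half = QueuedOp(RegAccess(5, 0x0F), (RegAccess(6, 0x0F), RegAccess(7, 0x0F)))
hi_half = QueuedOp(RegAccess(5, 0xF0), (RegAccess(6, 0xF0), RegAccess(7, 0xF0)))
# An op that reads the low bytes of r5, so it depends on lo_half.
consumer = QueuedOp(RegAccess(8, 0x0F), (RegAccess(5, 0x0F),))
```

With these definitions, lo_half and hi_half can issue together even though they name the same registers, while consumer has to wait for lo_half.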
> i still have not yet thought through how the Reorder Buffer would
> work: here, again, i am tempted to recommend that we
> "stratify" the ROB into odd-even (modulo 2) or perhaps modulo 4, with
> 32 entries, however the CAM is only 4-bit or 3-bit wide.
> if an instruction's destination register does not meet the modulo
> requirements, that ROB entry is *left empty*. this does mean that,
> for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo
> 4), and there are 4 sequential instructions that happen e.g. to have a
> destination of r4 for insn1, r24 for insn2, r16 for insn3.... etc.
> etc.... the ROB will only hold 8 such instructions
> and that i think is perfectly fine, because, statistically, it'll
> balance out, and SV generates sequentially-incrementing instruction
> registers, so *that* is fine, too.
I think we will need enough entries in the ROB that we have at least a few
more clocks than the latency of the divide unit when it's processing 32-bit
numbers (int or fp), so we'll probably need more than 8.
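To put numbers on the "only 8" case, here is a rough model of the stratified ROB. It assumes (my reading of the proposal, not a settled design) that slot i only accepts a destination register with the same value modulo the lane count, that slots are allocated strictly in program order, and that a non-matching slot is simply skipped and left empty. The function name rob_fill is made up for this sketch.

```python
def rob_fill(dest_regs, rob_entries=32, lanes=4):
    """Count how many instructions fit before the stratified ROB is full.

    dest_regs: destination register numbers in program order.
    Slot i may only hold an instruction whose dest satisfies
    dest % lanes == i % lanes; non-matching slots are left empty.
    """
    slot = 0
    held = 0
    for rd in dest_regs:
        # Skip (waste) slots that belong to a different lane.
        while slot < rob_entries and slot % lanes != rd % lanes:
            slot += 1
        if slot >= rob_entries:
            break  # ROB is full: this instruction would have to stall
        slot += 1
        held += 1
    return held


# Worst case: every destination falls in the same lane (all multiples of 4).
worst = rob_fill([4, 24, 16, 8, 12, 20, 28, 0, 4, 8])
# SV-style sequentially incrementing destinations fill every slot.
best = rob_fill([i % 32 for i in range(40)])
```

In this model the worst case holds 8 instructions and the sequential case holds all 32, so whether 8 is enough depends on how many cycles the longest-latency unit keeps entries occupied.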
For a pipelined divider, if we give the divide unit 3-4 multipliers, then
we can shrink the latency to around 12 cycles by using the Newton-Raphson
method. Alternatively, we could implement a radix-4 pipelined divider that
would shrink the latency to around 16 cycles.
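For reference, the Newton-Raphson approach looks roughly like this in software. It is a floating-point sketch only, not the hardware datapath: the function name nr_divide is made up, the linear initial estimate 48/17 - 32/17 * m is the usual textbook choice rather than anything decided here, and cycle counts are not modelled. The point is that each refinement step is two dependent multiplies, which is why several multipliers are needed to pipeline it.

```python
import math


def nr_divide(n, d, iterations=3):
    """Approximate n / d using Newton-Raphson reciprocal refinement."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    sign = -1.0 if (n < 0) != (d < 0) else 1.0
    n, d = abs(n), abs(d)

    # Scale the divisor: d = m * 2**e with 0.5 <= m < 1.
    m, e = math.frexp(d)

    # Linear initial estimate of 1/m (textbook choice, error at most 1/17).
    x = 48.0 / 17.0 - (32.0 / 17.0) * m

    # Each step squares the relative error: x <- x * (2 - m * x).
    for _ in range(iterations):
        x = x * (2.0 - m * x)

    # n / d = n * (1/m) * 2**-e
    return sign * math.ldexp(n * x, -e)
```

With the initial error bounded by 1/17, three iterations bring the relative error below about 2e-10, which is enough for a 32-bit result.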
If we want to go with non-pipelined dividers, we will need more than one,
since we need a division per pixel and 30 cycles per pixel will eat
up all our performance.
If we do decide to use a pipelined divider, we could share 1 divider
between 2 cores, since that would be more than enough performance and would
increase the average latency by 1 cycle at most. If we do decide to share
the divider, we will need to take care that the division latency doesn't
become a side-channel that speculated instructions can leak info through.
Note that it should be somewhat easy to add sqrt and recip-sqrt operations
to most divider designs. Recip-sqrt is particularly useful for normalizing
vectors, which is a common graphics operation.
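As an illustration of why recip-sqrt fits so well, here is a small software sketch of vector normalization built on a Newton-Raphson recip-sqrt. The names rsqrt and normalize, the crude linear initial guess 1.6 - 0.5 * m and the iteration count are all assumptions for this example, not a proposed hardware design. Note that the refinement step uses only multiplies and subtracts, the same kind of hardware a Newton-Raphson divider already needs.

```python
import math


def rsqrt(a, iterations=4):
    """Approximate 1 / sqrt(a) for a > 0 using Newton-Raphson."""
    if a <= 0:
        raise ValueError("rsqrt needs a positive input")

    # a = m * 2**e; force e even so that 2**(-e/2) is an exact shift.
    m, e = math.frexp(a)      # 0.5 <= m < 1
    if e % 2 != 0:
        m *= 2.0              # now 1 <= m < 2
        e -= 1

    # Crude linear initial guess for 1/sqrt(m) on [0.5, 2) (assumption).
    y = 1.6 - 0.5 * m

    # Refinement step: y <- y * (1.5 - 0.5 * m * y * y)
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * m * y * y)

    return math.ldexp(y, -(e // 2))


def normalize(v):
    """Scale vector v to unit length with one rsqrt and no division."""
    r = rsqrt(sum(c * c for c in v))
    return tuple(c * r for c in v)
```

For example, normalize((3.0, 4.0, 0.0)) comes out very close to (0.6, 0.8, 0.0): one recip-sqrt plus three multiplies, instead of a sqrt followed by three divisions.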
> i'll keep working on diagrams, and also reading mitch alsup's chapters
> on the 6600. they're frickin awesome. the 6600 could do multi-issue
> LD and ST by way of having dedicated registers to LD and ST. X1-X5
> were for ST, X6 and X7 for LD.