[libre-riscv-dev] GPU design

Tue Dec 4 07:58:16 GMT 2018

On Tue, Dec 4, 2018 at 6:42 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Mon, Dec 3, 2018, 22:14 lkcl <lkcl at libre-riscv.org> wrote:
>
> > On Tue, Dec 4, 2018 at 3:29 AM Jacob Lifshay <programmerjake at gmail.com>
> > wrote:
> > >
> > > On Mon, Dec 3, 2018, 18:04 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > > wrote:
> > >
> > > > https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg
> > > >
> > > >  nice. very clear.  thoughts: those would need to be 64-bit wide (in
> > > > order to handle up to the 64-bit FP and also SIMD), so those muxes (2
> > > > each per lane) are taking in 256 bits each, that's 512 input wires per
> > > > lane, 4 lanes is 2048 wires, which seems like an awful lot.  oh, darn:
> > > > two register files (one int, one FP), so 4096 wires.
> > > >
> > > I was thinking that we could have the architecturally-visible fp registers
> > > be some of the higher numbered architecturally-visible integer registers
> > > allowing us to have only 128 architecturally visible registers, since most
> > > of the registers will be used for fp. example (without SV renaming):
> > > fadd f1, f2, f3 ; really fadd x33, x34, x35
> > > add r4, r5, r6 ; really add x4, x5, x6
> >
> >  i'm slightly confused, possibly by the prefix "x" being the same.
>
> Sorry, I had swapped r# and x#.
>
> Register index (SV's register renaming target):
> 0: zero
> 1-31: "integer" registers x1-x31
> 32-63: "fp" registers f0-f31
> 64-127: SV-only registers
>
> so, if a1 is set to reg 30 with vl=4:
> shl a1, a1, a1
> does:
> shl x30, x30, x30
> shl x31, x31, x31
> shl f0, f0, f0 ; really shl x32, x32, x32
> shl f1, f1, f1 ; really shl x33, x33, x33
>
>
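 a rough sketch of the unified index table quoted above, in python, just
to check the shl expansion. the names reg_name and expand are made up
for illustration, as is the "sv<N>" label for indices 64-127 (the table
only calls those "SV-only registers"). it assumes the element index is
simply added to the base register index.

```python
def reg_name(index):
    """Map a unified register index (0-127) to a display name."""
    if not 0 <= index < 128:
        raise ValueError(f"register index out of range: {index}")
    if index == 0:
        return "zero"
    if index < 32:
        return f"x{index}"          # "integer" registers x1-x31
    if index < 64:
        return f"f{index - 32}"     # "fp" registers f0-f31
    return f"sv{index}"             # hypothetical label for SV-only registers


def expand(op, base, vl):
    """Expand one vectorised 3-operand op into vl scalar ops.

    All three operands are assumed to be renamed to the same base index,
    as in the "shl a1, a1, a1" example.
    """
    names = [reg_name(base + i) for i in range(vl)]
    return [f"{op} {n}, {n}, {n}" for n in names]


for line in expand("shl", 30, 4):
    print(line)
```

 with base 30 and vl=4 the expansion crosses from x31 into f0 and f1,
matching the listing above.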
> > > >  estimated number of gates in a 4-in priority mux: abouuut... 20?  so
> > > > it would be somewhere around 80,000 gates for the lane routing.
> > > > https://www.electronics-tutorials.ws/combination/comb_2.html
> > >
> > > we only need regular 4 to 1 muxes, since the select input to the mux is
> > > just the high bits of the register number, so, sharing decoding inverters,
> > > 1x4-in nand gate, and 4x3-in nand gate; approx 32 transistors per 1-bit
> > > mux, 64 bits x 8 alu inputs = 16k transistors total (plus a few hundred for
> > > the decoding inverters and buffers). equivalent of 4k 2-in nand gates.
> >
> >  ah yeah: i'd multiplied by the number of incoming wires, rather than
> > the number of outgoing (which is 4x less)
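 the arithmetic from both estimates, written out as a small python
sketch. it assumes static CMOS NAND gates at 2 transistors per input,
and ignores the shared decoding inverters and buffers; the function
names are made up for illustration.

```python
def nand_transistors(inputs):
    # static CMOS NAND: one NMOS and one PMOS per input (assumption)
    return 2 * inputs


def mux4_transistors():
    # 1-bit 4-to-1 mux: four 3-input NANDs (data bit plus two decoded
    # select lines) feeding one 4-input NAND
    return nand_transistors(4) + 4 * nand_transistors(3)


def lane_routing_transistors(bits=64, alu_inputs=8):
    # one 1-bit mux per bit per ALU input (4 lanes x 2 operands = 8)
    return mux4_transistors() * bits * alu_inputs


total = lane_routing_transistors()
print(mux4_transistors())              # 32 transistors per 1-bit mux
print(total)                           # 16384, i.e. "16k"
print(total // nand_transistors(2))    # 4096 two-input NAND equivalents

# the earlier estimate counted incoming wires instead:
# 4 sources x 64 bits x 2 muxes x 4 lanes x 2 regfiles, at ~20 gates each
input_wires = 4 * 64 * 2 * 4 * 2
print(input_wires, input_wires * 20)   # 4096 wires, 81920 gates
```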
> >
> > > >  the other alternative that mitch alsup suggested, i recorded his
> > > > advice on the microarchitecture page: you just lengthen out the
> > > > pipeline by as many stages as is required to read the source operands.
> > > > really really simple.
> > > >
> > > the problem is that you need a read port on the register file for each
> > > stage, so you take longer and still need a lot of read ports.
> >
> >  ... because the operands are shuffling down stages of the pipeline...
> >
> >  ... which is another reason why i like the tomasulo algorithm, as the
> > reservation stations are on the CDB.  the operation stays at "ALU
> > pipeline stage 1" until all operands are available.
> >
> > > >  now, could we use a hybrid approach? possibly!  we'll find out :)
> > > >
> > >
> > > We could fall back on a barrel processor, similar to the sun t1 (note that
> > > the t1 has a single fpu per 8-core chip, so its fp stats are junk), that
> > > lets us keep the pipeline full, but each individual thread runs slowly.
> >
> >  interesting.
> >
> > > > > etc.) just stall the rest of the processor when the instructions finish in
> > > > > order to create a free slot to write, though we could add another write
> > > > > port if long instructions are too slow.
> > > >
> > > >  i'm... not totally enamoured with something that relies on stalling
> > > > the entire core to deal with a bottleneck.
> > > >
> > > If we have 3 write ports, we don't need to stall.
> >
> >  ok.  and predicated FP uses the INT regfile to source the predicate...
> >
> note that if we're using register renaming or tomasulo's algorithm, then
> for fmadd we need to read from src1, src2, src3, pred, and dest and write
> to the new dest.

 if following through with the SIMD idea, and using the xBitManip pre-
and post- to shuffle bytes/words into the correct SIMD ALU slot, that
could easily go up by a factor of up to 8 (8 bytes per 64-bit-wide
SIMD ALU).  and y'know what? i don't have a problem with that :)  a
very difficult technical problem is solved, data stays in-place
without dropping down through L1/L2 cache.

 on 32-bit SIMD it would be just under double, as the same predicate is used.

> I think that's the worst case except maybe for texturing
> instructions (which we haven't added yet).

 yikes! :)

 oh: we also need to deal with LOAD/STOREs, particularly overlapping
memory addressing.  i've seen some modified variants of the original
tomasulo, which merge the STORE buffer into the reorder algorithm, and
obey a simple set of rules.  i found an online lecture from IIT
Kharagpur, lecture 19, https://www.youtube.com/watch?v=OU3jI8j7Ozw
i'll type out the slide from 42 minutes into the video:

Avoiding Memory Hazards

* WAW and WAR hazards through memory are eliminated with speculation
because actual updating of memory occurs in order, when a store is at
the head of the ROB, and hence, no earlier loads or stores can still
be pending
* RAW hazards are maintained by two restrictions: (1) not allowing a
load to initiate the second step of its execution if any active ROB
entry occupied by a store has a destination field that matches the
value of the A field of the load and (2) maintaining the program order
for the computation of an effective address of a load with respect to
all earlier stores
* These restrictions ensure that any load that accesses a memory
location written to by an earlier store cannot perform the memory
access until the store has written the data.
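
 a minimal python sketch of the RAW check described on the slide, under
some stated assumptions: the ROB is modelled as a list in program order,
a store leaves the list when it commits at the head, and restriction (2)
is interpreted conservatively as "a load waits while any earlier store's
address is still unknown". RobEntry and load_may_access_memory are
hypothetical names, and only exact address matches are compared, so
partially overlapping accesses are not handled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RobEntry:
    kind: str                   # "load", "store" or "alu"
    addr: Optional[int] = None  # effective address, None until computed


def load_may_access_memory(rob: List[RobEntry], load_pos: int) -> bool:
    """Return True if the load at rob[load_pos] may start its memory access."""
    load = rob[load_pos]
    if load.kind != "load":
        raise ValueError("entry is not a load")
    if load.addr is None:
        return False            # the load's own address is not computed yet
    for entry in rob[:load_pos]:        # only earlier entries matter
        if entry.kind != "store":
            continue
        if entry.addr is None:
            return False        # restriction (2): earlier store address unknown
        if entry.addr == load.addr:
            return False        # restriction (1): earlier store to same address
    return True


rob = [
    RobEntry("store", 0x100),
    RobEntry("alu"),
    RobEntry("load", 0x100),
    RobEntry("load", 0x200),
]
print(load_may_access_memory(rob, 2))   # blocked by the store to 0x100
print(load_may_access_memory(rob, 3))   # no earlier store to 0x200
rob.pop(0)                              # the store commits at the head of the ROB
print(load_may_access_memory(rob, 1))   # the load to 0x100 may now proceed
```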

then also, whilst it's not necessary, it would be nice to respect FENCE.

48:43 into the video professor pal goes into an example of a LD/ST
loop (non-speculative).  the BNE blocks the progress.

note at 50:30 he explains how the values get broadcast (copies sent)
over the CDB.

l.