[libre-riscv-dev] GPU design

Tue Dec 4 06:42:06 GMT 2018

On Mon, Dec 3, 2018, 22:14 lkcl <lkcl at libre-riscv.org> wrote:

> On Tue, Dec 4, 2018 at 3:29 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >
> > On Mon, Dec 3, 2018, 18:04 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> > >
> > > https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg
> > >
> > >  nice. very clear.  thoughts: those would need to be 64-bit wide (in
> > > order to handle up to the 64-bit FP and also SIMD), so those muxes (2
> > > each per lane) are taking in 256 bits each, that's 512 input wires per
> > > lane, 4 lanes is 2048 wires, which seems like an awful lot.  oh, darn:
> > > two register files (one int, one FP), so 4096 wires.
> > >
> > I was thinking that we could have the architecturally-visible fp registers
> > be some of the higher numbered architecturally-visible integer registers
> > allowing us to have only 128 architecturally visible registers, since most
> > of the registers will be used for fp. example (without SV renaming):
> > fadd f1, f2, f3 ; really fadd x33, x34, x35
> > add r4, r5, r6 ; really add x4, x5, x6
>
>  i'm slightly confused, possibly by the prefix "x" being the same.

Sorry, I had swapped r# and x#.

Register index (SV's register renaming target):
0: zero
1-31: "integer" registers x1-x31
32-63: "fp" registers f0-f31
64-127: SV-only registers

so, if a1 is set to reg 30 with vl=4:
shl a1, a1, a1
does:
shl x30, x30, x30
shl x31, x31, x31
shl f0, f0, f0 ; really shl x32, x32, x32
shl f1, f1, f1 ; really shl x33, x33, x33
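
The expansion above can be sketched in Python. This is only an illustration of the index table in this message, not an implementation of SV: the function names and the "sv" prefix for indices 64-127 are made up for the example.

```python
def reg_name(index):
    """Name a unified register index using the table in this message."""
    if not 0 <= index <= 127:
        raise ValueError("register index out of range")
    if index == 0:
        return "zero"
    if index < 32:
        return f"x{index}"       # "integer" registers x1-x31
    if index < 64:
        return f"f{index - 32}"  # "fp" registers f0-f31
    return f"sv{index}"          # hypothetical name for SV-only registers


def expand(op, base, vl):
    """Expand `op reg, reg, reg` over vl consecutive register indices."""
    lines = []
    for i in range(vl):
        name = reg_name(base + i)
        lines.append(f"{op} {name}, {name}, {name}")
    return lines


for line in expand("shl", base=30, vl=4):
    print(line)
```

With base=30 and vl=4 the loop crosses from index 31 (x31) to index 32 (f0), which is the point of the example: the vector simply walks the unified index space.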

> > >  estimated number of gates in a 4-in priority mux: abouuut... 20?  so
> > > it would be somewhere around 80,000 gates for the lane routing.
> > > https://www.electronics-tutorials.ws/combination/comb_2.html
> >
> > we only need regular 4 to 1 muxes, since the select input to the mux is
> > just the high bits of the register number, so, sharing decoding inverters,
> > 1x4-in nand gate, and 4x3-in nand gate; approx 32 transistors per 1-bit
> > mux, 64 bits x 8 alu inputs = 16k transistors total (plus a few hundred for
> > the decoding inverters and buffers). equivalent of 4k 2-in nand gates.
>
>  ah yeah: i'd multiplied by the number of incoming wires, rather than
> the number of outgoing (which is 4x less)
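
The two estimates in this exchange can be checked with a few lines of Python. The figures are the ones quoted above; the only added assumption is static CMOS, where an n-input NAND costs 2n transistors (so a 2-input NAND is 4 transistors). The variable names are made up for the sketch.

```python
BITS = 64            # operand width
MUX_INPUTS = 4       # each mux selects 1 of 4 sources
MUXES_PER_LANE = 2   # two source operands per lane
LANES = 4
REGFILES = 2         # separate int and FP register files

# Incoming wire count: 256 per mux, 512 per lane, 2048 per file, 4096 total.
input_wires = BITS * MUX_INPUTS * MUXES_PER_LANE * LANES * REGFILES

# First estimate: ~20 gates per 4-in priority mux, multiplied by incoming wires.
first_estimate_gates = input_wires * 20

# Transistor estimate for a plain 4-to-1 mux bit:
# one 4-input NAND plus four 3-input NANDs, at 2n transistors each (assumed CMOS).
transistors_per_mux_bit = 2 * 4 + 4 * (2 * 3)
ALU_INPUTS = 8       # 4 lanes x 2 operands, single unified register file
transistors = transistors_per_mux_bit * BITS * ALU_INPUTS
nand2_equivalent = transistors // 4

print(input_wires, first_estimate_gates)
print(transistors_per_mux_bit, transistors, nand2_equivalent)
```

The first line of output reproduces the 4096 wires and the roughly 80,000-gate figure; the second reproduces 32 transistors per mux bit, 16k transistors, and 4k NAND2 equivalents.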
>
> > >  the other alternative that mitch alsup suggested, i recorded his
> > > advice on the microarchitecture page: you just lengthen out the
> > > pipeline by as many stages as is required to read the source operands.
> > > really really simple.
> > >
> > the problem is that you need a read port on the register file for each
> > stage, so you take longer and still need a lot of read ports.
>
>  ... because the operands are shuffling down stages of the pipeline...
>
>  ... which is another reason why i like the tomasulo algorithm, as the
> reservation stations are on the CDB.  the operation stays at "ALU
> pipeline stage 1" until all operands are available.
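
A minimal sketch of that waiting behaviour, assuming a simplified reservation station that holds either a value or a producer tag for each operand and snoops broadcasts on the CDB. The class and method names are hypothetical and this is not taken from any existing design.

```python
class ReservationStation:
    """Holds one operation until every source operand has arrived."""

    def __init__(self, op, sources):
        # sources: list of ("val", value) or ("tag", producer_tag) pairs
        self.op = op
        self.values = [v if kind == "val" else None for kind, v in sources]
        self.tags = [v if kind == "tag" else None for kind, v in sources]

    def snoop(self, tag, value):
        """Capture a result broadcast on the CDB if we are waiting for it."""
        for i, waiting_on in enumerate(self.tags):
            if waiting_on is not None and waiting_on == tag:
                self.values[i] = value
                self.tags[i] = None

    def ready(self):
        """The operation may leave the station once no tags remain."""
        return all(t is None for t in self.tags)


rs = ReservationStation("fadd", [("val", 1.5), ("tag", 7)])
rs.snoop(3, 9.0)   # unrelated broadcast: ignored
rs.snoop(7, 2.5)   # the awaited result arrives
print(rs.ready(), rs.values)
```

The operands come to the station over the bus, so the operation itself needs no extra register-file read ports while it waits.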
>
> > >  now, could we use a hybrid approach? possibly!  we'll find out :)
> > >
> >
> > We could fall back on a barrel processor, similar to the sun t1 (note that
> > the t1 has a single fpu per 8-core chip, so its fp stats are junk), that
> > lets us keep the pipeline full, but each individual thread runs slowly.
>
>  interesting.
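
For reference, the barrel trade-off can be sketched as a strict round-robin issue schedule. The thread count of 4 is an arbitrary number for the example, not a claim about any particular chip, and the function names are made up.

```python
def barrel_schedule(num_threads, cycles):
    """Return which hardware thread issues on each cycle (strict round-robin)."""
    return [cycle % num_threads for cycle in range(cycles)]


def per_thread_issue_rate(num_threads):
    """Fraction of cycles in which any single thread gets to issue."""
    return 1 / num_threads


print(barrel_schedule(4, 8))
print(per_thread_issue_rate(4))
```

Every cycle has an issuing thread, so the pipeline stays full, but each thread only advances once every num_threads cycles.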
>
> > > > etc.) just stall the rest of the processor when the instructions finish
> > > > in order to create a free slot to write, though we could add another write
> > > > port if long instructions are too slow.
> > >
> > >  i'm... not totally enamoured with something that relies on stalling
> > > the entire core to deal with a bottleneck.
> > >
> > If we have 3 write ports, we don't need to stall.
>
>  ok.  and predicated FP uses the INT regfile to source the predicate...
>
note that if we're using register renaming or tomasulo's algorithm, then
for fmadd we need to read from src1, src2, src3, pred, and dest and write
to the new dest. I think that's the worst case except maybe for texturing
instructions (which we haven't added yet).
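
A rough way to count the accesses, as a sketch. The reason given in the comment for reading the old dest (masked-off elements must carry the old value into the newly allocated destination) is my reading of the case above, and the function name is made up.

```python
def regfile_accesses(num_sources, predicated, renamed):
    """Return (reads, writes) for one instruction under the stated assumptions."""
    reads = num_sources
    if predicated:
        reads += 1      # the predicate register itself
        if renamed:
            # Assumption: with renaming, elements that are masked off must be
            # copied from the old dest into the newly allocated dest.
            reads += 1
    return reads, 1     # one write, to the new dest


print(regfile_accesses(3, predicated=True, renamed=True))   # fmadd: src1, src2, src3, pred, dest
print(regfile_accesses(2, predicated=False, renamed=True))  # plain two-source op
```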

>
> > >  plus, assuming a 100% pipeline fill (unrealistic but ok for
> > > illustrative purposes) you would also need a 4-wide Common Data Bus
> > > (64-bit x 4) meaning, there's no point issuing 4 instructions if the
> > > results are bottlenecked.
> > >
> > you would need a 4-wide cdb anyway, since that's the performance we're
> > trying for.
>
>  if the 32-bit ops can be grouped as 2x SIMD to a 64-bit-wide ALU,
> then only 2 such ALUs would be needed to give 4x 32-bit FP per cycle
> per core, which means only a 2-wide CDB, a heck of a lot better than
> 4.
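
The ALU and CDB counts follow from simple division, sketched below together with one way of packing two 32-bit floats into a single 64-bit word. It assumes each ALU puts one 64-bit result on the CDB per cycle; the function names are made up for the example.

```python
import struct


def alus_needed(ops_per_cycle, op_bits, alu_bits=64):
    """ALUs (and CDB lanes, at one 64-bit result per ALU) for a target rate."""
    ops_per_alu = alu_bits // op_bits
    return -(-ops_per_cycle // ops_per_alu)  # ceiling division


def pack_2x32(a, b):
    """Pack two FP32 values into one 64-bit integer (little-endian layout)."""
    return struct.unpack("<Q", struct.pack("<ff", a, b))[0]


def unpack_2x32(word):
    """Split a 64-bit integer back into its two FP32 values."""
    return struct.unpack("<ff", struct.pack("<Q", word))


print(alus_needed(4, 32))   # 32-bit ops grouped 2x per 64-bit ALU
print(alus_needed(4, 64))   # 64-bit ops: no grouping possible
print(unpack_2x32(pack_2x32(1.5, 2.25)))
```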
>
>  oh: i thought of another way to cut the power-impact of the Reorder
> Buffer CAMs: a simple bit-field (a single-bit 2RWW memory, of address
> length equal to the number of registers, 2 is because of 2-issue).
>
>  the CAM of a ROB is on the instruction destination register.  key:
> ROBnum, value: instr-dest-reg.  if you have a bitfield that says "this
> destreg has no ROB tag", it's dead-easy to check that bitfield, first.
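
A behavioural sketch of that filter, assuming a software model where the CAM is a dictionary searched by value and the bitfield is a list of booleans. All names are hypothetical, ROB numbers are taken as monotonically increasing (no wraparound), and clearing the bit at retire time is done here by a scan that real hardware would need some other mechanism for.

```python
class ReorderBuffer:
    def __init__(self, num_regs=128):
        self.has_tag = [False] * num_regs  # one bit per register
        self.entries = {}                  # ROB number -> destination register
        self.next_rob = 0
        self.cam_searches = 0              # how often the expensive path ran

    def allocate(self, dest_reg):
        rob = self.next_rob
        self.next_rob += 1
        self.entries[rob] = dest_reg
        self.has_tag[dest_reg] = True
        return rob

    def lookup(self, reg):
        """Return the newest ROB number writing reg, or None."""
        if not self.has_tag[reg]:
            return None                    # cheap check: CAM never activated
        self.cam_searches += 1
        return max(rob for rob, dest in self.entries.items() if dest == reg)

    def retire(self, rob):
        dest = self.entries.pop(rob)
        # Simplification: clear the bit only if no in-flight entry still targets dest.
        self.has_tag[dest] = dest in self.entries.values()


rob = ReorderBuffer()
print(rob.lookup(5), rob.cam_searches)   # no tag: CAM untouched
tag = rob.allocate(5)
print(rob.lookup(5) == tag, rob.cam_searches)
```

Source registers with no in-flight writer are the common case the bitfield is meant to catch, and those lookups never reach the CAM.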
>