[libre-riscv-dev] GPU design

lkcl lkcl at libre-riscv.org
Tue Dec 4 06:14:10 GMT 2018

On Tue, Dec 4, 2018 at 3:29 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> On Mon, Dec 3, 2018, 18:04 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:

> > https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg
> >
> >  nice. very clear.  thoughts: those would need to be 64-bit wide (in
> > order to handle up to the 64-bit FP and also SIMD), so those muxes (2
> > each per lane) are taking in 256 bits each, that's 512 input wires per
> > lane, 4 lanes is 2048 wires, which seems like an awful lot.  oh, darn:
> > two register files (one int, one FP), so 4096 wires.
> >
> I was thinking that we could have the architecturally-visible fp registers
> be some of the higher numbered architecturally-visible integer registers
> allowing us to have only 128 architecturally visible registers, since most
> of the registers will be used for fp. example (without SV renaming):
> fadd f1, f2, f3 ; really fadd x33, x34, x35
> add r4, r5, r6 ; really add x4, x5, x6

 i'm slightly confused, possibly by the prefix "x" being the same.
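
 a minimal sketch of the mapping in the quoted example, to make the
naming explicit. assumptions: "r" names index the low 32 entries of one
unified 128-entry register file, "f" names are offset by 32 (inferred
only from f1 -> x33 in the example), and SV renaming is ignored.
unified_index is a hypothetical helper, not anything from Kazan.

```python
# Sketch only: the offset of 32 is inferred from the quoted example
# (fadd f1, f2, f3 -> fadd x33, x34, x35), not from any specification.
NUM_INT_REGS = 32    # assumed number of architectural integer registers
UNIFIED_REGS = 128   # total architecturally visible registers, per the quote


def unified_index(name: str) -> int:
    """Map 'rN' or 'fN' to an index xN in the unified register file."""
    prefix, num = name[0], int(name[1:])
    if num < 0:
        raise ValueError(f"bad register number in {name!r}")
    if prefix == "r":
        if num >= NUM_INT_REGS:
            raise ValueError(f"integer register out of range: {name!r}")
        return num
    if prefix == "f":
        idx = NUM_INT_REGS + num
        if idx >= UNIFIED_REGS:
            raise ValueError(f"fp register out of range: {name!r}")
        return idx
    raise ValueError(f"unknown register prefix in {name!r}")


if __name__ == "__main__":
    for reg in ("f1", "f2", "f3", "r4", "r5", "r6"):
        print(f"{reg} -> x{unified_index(reg)}")
```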

> >  estimated number of gates in a 4-in priority mux: abouuut... 20?  so
> > it would be somewhere around 80,000 gates for the lane routing.
> > https://www.electronics-tutorials.ws/combination/comb_2.html
> we only need regular 4 to 1 muxes, since the select input to the mux is
> just the high bits of the register number, so, sharing decoding inverters,
> 1x4-in nand gate, and 4x3-in nand gate; approx 32 transistors per 1-bit
> mux, 64 bits x 8 alu inputs = 16k transistors total (plus a few hundred for
> the decoding inverters and buffers). equivalent of 4k 2-in nand gates.

 ah yeah: i'd multiplied by the number of incoming wires, rather than
the number of outgoing (which is 4x less)
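
 the two estimates can be redone as plain arithmetic. this is only a
sketch of the numbers quoted above. the constant names are made up for
illustration, the "4 transistors per 2-in nand" figure assumes standard
static CMOS, and the "by outputs" figure is my reading of the
4x correction, not something stated in the thread.

```python
# Figures taken from the thread; names are illustrative only.
BITS = 64               # datapath width per operand
LANES = 4               # ALU lanes
MUXES_PER_LANE = 2      # one mux per source operand
MUX_INPUTS = 4          # 4-to-1 muxes
REGFILES = 2            # one int, one FP

# First estimate: count input wires.
wires_per_mux = BITS * MUX_INPUTS                 # 256
wires_per_lane = wires_per_mux * MUXES_PER_LANE   # 512
wires_one_regfile = wires_per_lane * LANES        # 2048
wires_total = wires_one_regfile * REGFILES        # 4096

GATES_PER_PRIORITY_MUX = 20                       # rough guess from the thread
gates_by_inputs = wires_total * GATES_PER_PRIORITY_MUX   # "around 80,000"
gates_by_outputs = gates_by_inputs // MUX_INPUTS         # 4x fewer (assumed reading)

# Second estimate: plain 4-to-1 muxes, counted per output bit.
TRANSISTORS_PER_1BIT_MUX = 32
ALU_INPUTS = 8                                    # 4 lanes x 2 operands
transistors = TRANSISTORS_PER_1BIT_MUX * BITS * ALU_INPUTS
TRANSISTORS_PER_NAND2 = 4                         # static CMOS assumption
nand2_equivalent = transistors // TRANSISTORS_PER_NAND2

if __name__ == "__main__":
    print("input wires:", wires_total)
    print("gates (by inputs):", gates_by_inputs)
    print("gates (by outputs):", gates_by_outputs)
    print("transistors:", transistors)
    print("2-in nand equivalent:", nand2_equivalent)
```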

> >  the other alternative that mitch alsup suggested, i recorded his
> > advice on the microarchitecture page: you just lengthen out the
> > pipeline by as many stages as is required to read the source operands.
> > really really simple.
> >
> the problem is that you need a read port on the register file for each
> stage, so you take longer and still need a lot of read ports.

 ... because the operands are shuffling down stages of the pipeline...

 ... which is another reason why i like the tomasulo algorithm, as the
reservation stations are on the CDB.  the operation stays at "ALU
pipeline stage 1" until all operands are available.
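
 a rough sketch of that behaviour, assuming a textbook-style
reservation station with two source operands that snoops a single
broadcast at a time. the class and method names (Operand,
ReservationStation, snoop, can_issue) are hypothetical and this is not
a description of any actual design discussed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    # tag of the producer we are still waiting on, or None if the value is ready
    tag: Optional[int] = None
    value: Optional[int] = None

    @property
    def ready(self) -> bool:
        return self.tag is None


@dataclass
class ReservationStation:
    op: str
    src1: Operand
    src2: Operand

    def snoop(self, tag: int, value: int) -> None:
        """Capture a result broadcast on the CDB if we were waiting for it."""
        for src in (self.src1, self.src2):
            if src.tag == tag:
                src.tag = None
                src.value = value

    def can_issue(self) -> bool:
        """The operation stays put until every operand has arrived."""
        return self.src1.ready and self.src2.ready


if __name__ == "__main__":
    rs = ReservationStation("fadd", Operand(value=1), Operand(tag=7))
    print(rs.can_issue())   # still waiting on tag 7
    rs.snoop(7, 41)
    print(rs.can_issue())   # both operands captured
```

 the point of the sketch: no register-file read port is needed per
pipeline stage, because the waiting happens in the station and the
operands arrive by broadcast.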

> >  now, could we use a hybrid approach? possibly!  we'll find out :)
> >
> We could fall back on a barrel processor, similar to the sun t1 (note that
> the t1 has a single fpu per 8-core chip, so its fp stats are junk), that
> lets us keep the pipeline full, but each individual thread runs slowly.


> > > etc.) just stall the rest of the processor when the instructions finish in
> > > order to create a free slot to write, though we could add another write
> > > port if long instructions are too slow.
> >
> >  i'm... not totally enamoured with something that relies on stalling
> > the entire core to deal with a bottleneck.
> >
> If we have 3 write ports, we don't need to stall.

 ok.  and predicated FP uses the INT regfile to source the predicate...

> >  plus, assuming a 100% pipeline fill (unrealistic but ok for
> > illustrative purposes) you would also need a 4-wide Common Data Bus
> > (64-bit x 4) meaning, there's no point issuing 4 instructions if the
> > results are bottlenecked.
> >
> you would need a 4-wide cdb anyway, since that's the performance we're
> trying for.

 if the 32-bit ops can be grouped as 2x SIMD to a 64-bit-wide ALU,
then only 2 such ALUs would be needed to give 4x 32-bit FP per cycle
per core, which means only a 2-wide CDB, a heck of a lot better than

 oh: i thought of another way to cut the power-impact of the Reorder
Buffer CAMs: a simple bit-field (a single-bit 2RWW memory, of address
length equal to the number of registers, 2 is because of 2-issue).

 the CAM of a ROB is on the instruction destination register.  key:
ROBnum, value: instr-dest-reg.  if you have a bitfield that says "this
destreg has no ROB tag", it's dead-easy to check that bitfield, first.
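
 a behavioural sketch of the idea, assuming one bit per register that
is set while any ROB entry targets that register. RobDestTracker and
its methods are hypothetical names, the "CAM" is just a linear search
with a counter standing in for the power cost, and ordering between
multiple in-flight writers of the same register is ignored.

```python
class RobDestTracker:
    """One bit per register, consulted before the (expensive) ROB CAM search."""

    def __init__(self, num_regs: int, rob_size: int) -> None:
        self.has_tag = [False] * num_regs    # the single-bit-wide memory
        self.rob_dest = [None] * rob_size    # key: ROBnum, value: dest reg
        self.cam_searches = 0                # counts how often the CAM fires

    def allocate(self, rob_num: int, dest_reg: int) -> None:
        self.rob_dest[rob_num] = dest_reg
        self.has_tag[dest_reg] = True

    def retire(self, rob_num: int) -> None:
        dest = self.rob_dest[rob_num]
        self.rob_dest[rob_num] = None
        # Clear the bit only when no other in-flight entry writes this register.
        if dest is not None and dest not in self.rob_dest:
            self.has_tag[dest] = False

    def lookup(self, reg: int) -> list:
        """Return the ROB numbers whose destination is reg."""
        if not self.has_tag[reg]:
            return []                        # bit clear: skip the CAM entirely
        self.cam_searches += 1
        return [n for n, d in enumerate(self.rob_dest) if d == reg]


if __name__ == "__main__":
    t = RobDestTracker(num_regs=128, rob_size=8)
    print(t.lookup(5), t.cam_searches)   # no tag, no CAM search
    t.allocate(2, 5)
    print(t.lookup(5), t.cam_searches)   # tag set, one CAM search
```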
