[libre-riscv-dev] GPU design

Jacob Lifshay programmerjake at gmail.com
Tue Dec 4 03:29:40 GMT 2018


On Mon, Dec 3, 2018, 18:04 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Mon, Dec 3, 2018 at 11:02 PM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >
> > I created a simple diagram of what I think would work for the ALUs and
> > register file for the GPU design. The diagram doesn't include forwarding
> or
> > pipeline registers.
> >
> >
> https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg
>
>  nice. very clear.  thoughts: those would need to be 64-bit wide (in
> order to handle up to the 64-bit FP and also SIMD), so those muxes (2
> each per lane) are taking in 256 bits each, that's 512 input wires per
> lane, 4 lanes is 2048 wires, which seems like an awful lot.  oh, darn:
> two register files (one int, one FP), so 4096 wires.
>
I was thinking that we could alias the architecturally-visible fp registers
onto some of the higher-numbered architecturally-visible integer registers,
allowing us to have only 128 architecturally-visible registers in total,
since most of the registers will be used for fp. Example (without SV
renaming):
fadd f1, f2, f3 ; really fadd x33, x34, x35
add r4, r5, r6 ; really add x4, x5, x6
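
Roughly, in Python terms (a minimal sketch: the unified-file layout and the
FP_BASE offset of 32 are assumptions for illustration, not the actual
decoder):

FP_BASE = 32  # assumed offset at which f0 lands in the unified register file

def unified_regnum(is_fp: bool, n: int) -> int:
    """Map an architectural register (xN or fN) to its unified-file index."""
    return n + FP_BASE if is_fp else n

# matches the examples above:
assert unified_regnum(True, 1) == 33   # fadd f1, f2, f3 -> fadd x33, x34, x35
assert unified_regnum(False, 4) == 4   # add  r4, r5, r6 -> add  x4,  x5,  x6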

>
>  estimated number of gates in a 4-in priority mux: abouuut... 20?  so
> it would be somewhere around 80,000 gates for the lane routing.
> https://www.electronics-tutorials.ws/combination/comb_2.html

we only need regular 4-to-1 muxes, since the select input to each mux is
just the high bits of the register number. Sharing the decoding inverters,
a 1-bit mux is one 4-input NAND gate plus four 3-input NAND gates: approx.
32 transistors per 1-bit mux, so 64 bits x 8 ALU inputs = 16k transistors
total (plus a few hundred for the decoding inverters and buffers), the
equivalent of 4k 2-input NAND gates.
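
Back-of-the-envelope check of those numbers (assuming the usual static-CMOS
counts of 6 transistors for a 3-input NAND and 8 for a 4-input NAND):

transistors_per_1bit_mux = 4 * 6 + 8   # four 3-in NANDs + one 4-in NAND = 32
bits = 64                              # operand width
alu_read_inputs = 8                    # 2 source operands x 4 lanes
total = transistors_per_1bit_mux * bits * alu_read_inputs
print(total)     # 16384 transistors, i.e. about 4096 2-input-NAND equivalents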

>
>
>  which, as we've not done any other comparative analysis of other
> options yet, i don't know if this is relatively high or around what
> we'd need regardless of which option is picked.
>
>  the other alternative that mitch alsup suggested, i recorded his
> advice on the microarchitecture page: you just lengthen out the
> pipeline by as many stages as is required to read the source operands.
> really really simple.
>
the problem is that you need a read port on the register file for each of
those extra stages, so the pipeline gets longer and you still need a lot of
read ports.

>
>  now, could we use a hybrid approach? possibly!  we'll find out :)
>

We could fall back on a barrel processor, similar to the Sun T1 (note that
the T1 has a single FPU shared across the 8-core chip, so its FP numbers are
junk). That lets us keep the pipeline full, but each individual thread runs
slowly. A barrel processor also mitigates instruction-timing side channels,
since each instruction always takes 16 cycles (assuming a 16-cycle thread
rotation, or 32 threads at 2 threads/clock). We would still have to worry
about cache timing attacks, though.
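
A minimal sketch of the round-robin selection that gives that constant
latency (the 16-thread rotation matches the numbers above; the function
itself is just illustrative):

NUM_THREADS = 16  # one thread issue slot per cycle, 16-cycle rotation

def barrel_schedule(cycle: int) -> int:
    # strict round-robin: each hardware thread gets exactly one issue slot
    # every NUM_THREADS cycles, independent of what the other threads do,
    # which is what makes per-instruction timing constant.
    return cycle % NUM_THREADS

assert [barrel_schedule(c) for c in range(4)] == [0, 1, 2, 3]
assert barrel_schedule(16) == 0  # thread 0 comes around again after 16 cycles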

>
>
> > I noticed that if we use register renaming, we can allocate the output
> > registers of each of the 4 lanes in such a way that the register file can
> > be split into 4 parts with each part only being written by its associated
> > lane, meaning that we can get away with only a few write ports, 1 for
> each
> > supported instruction latency. I'm planning on supporting single-cycle
> > instructions (integer add, sub, xor, etc.), 3-4 cycle instructions (fadd,
> > fmul, fmadd, load, etc.) and for longer instructions (fdiv, integer div,
> > etc.) just stall the rest of the processor when the instructions finish
> in
> > order to create a free slot to write, though we could add another write
> > port if long instructions are too slow.
>
>  i'm... not totally enamoured with something that relies on stalling
> the entire core to deal with a bottleneck.
>
If we have 3 write ports (one per latency class), we don't need to stall.
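
Sketch of the no-stall argument (the three latency classes are the ones
listed above; the port numbering is just an assumption for illustration):

# one register-file write port per latency class: results from different
# classes never contend for the same port, and within a fixed-latency class
# results complete in issue order, at most one per cycle, so they don't
# contend with each other either; long-latency results get a dedicated port
# instead of stalling the core to free up a slot.
WRITE_PORT = {
    "single_cycle": 0,  # integer add, sub, xor, ...
    "cycle_3_4": 1,     # fadd, fmul, fmadd, load, ...
    "long": 2,          # fdiv, integer div, ...
}

def writeback_port(latency_class: str) -> int:
    return WRITE_PORT[latency_class]

assert writeback_port("long") == 2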

>
>
> > Note that there are 0xC0 hardware registers because we need 0x80 for the
> > architecturally visible registers, and the other 0x40 are used for
> > renaming. 0x40 spare registers should be enough because that's enough
> for 4
> > 16-cycle instructions issued per clock.
> >
> > I'm planning on adding additional forwarding to skip the extra cycle
> needed
> > to read/write the register file.
> >
> > Note that the GPU probably won't be a 4-wide-issue processor, those are
> > just the per-element operations generated from single vectorized
> operations.
>
>  the augmented-tomasulo i'm currently investigating, i also agree
> 4-wide-issue is probably far too much: it means that on every clock
> cycle you need 4 simultaneous instruction-decoders, 4 simultaneous
> entries-into-the-reorder-buffer.
>
>  plus, assuming a 100% pipeline fill (unrealistic but ok for
> illustrative purposes) you would also need a 4-wide Common Data Bus
> (64-bit x 4) meaning, there's no point issuing 4 instructions if the
> results are bottlenecked.
>
You would need a 4-wide CDB anyway, since that's the performance we're
trying for.

>
>  not only that: each "listener" - the other ALUs, the load buffer, the
> reorder buffer - all need 4-wide inputs, and the CAM entries in the
> reorder buffer would also need to be 4-wide triggers.
>
>  although it would be great for a high-performance core, we're doing
> mobile, first :)  so, 2-issue would be much more sane.
>
> l.
>
> _______________________________________________
> libre-riscv-dev mailing list
> libre-riscv-dev at lists.libre-riscv.org
> http://lists.libre-riscv.org/mailman/listinfo/libre-riscv-dev
>

