[libre-riscv-dev] GPU design
programmerjake at gmail.com
Tue Dec 4 06:42:06 GMT 2018
On Mon, Dec 3, 2018, 22:14 lkcl <lkcl at libre-riscv.org> wrote:
> On Tue, Dec 4, 2018 at 3:29 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> > On Mon, Dec 3, 2018, 18:04 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> > >
> > >
> > > nice. very clear. thoughts: those would need to be 64-bit wide (in
> > > order to handle up to the 64-bit FP and also SIMD), so those muxes (2
> > > each per lane) are taking in 256 bits each, that's 512 input wires per
> > > lane, 4 lanes is 2048 wires, which seems like an awful lot. oh, darn:
> > > two register files (one int, one FP), so 4096 wires.
> > >
> > I was thinking that we could have the architecturally-visible fp registers
> > be some of the higher numbered architecturally-visible integer registers
> > allowing us to have only 128 architecturally visible registers, since some
> > of the registers will be used for fp. example (without SV renaming):
> > fadd f1, f2, f3 ; really fadd x33, x34, x35
> > add r4, r5, r6 ; really add x4, x5, x6
> i'm slightly confused, possibly by the prefix "x" being the same.
Sorry, I had swapped r# and x#.
Register index (SV's register renaming target):
1-31: "integer" registers x1-x31
32-63: "fp" registers f0-f31
64-127: SV-only registers
so, if a1 is set to reg 30 with vl=4:
shl a1, a1, a1
shl x30, x30, x30
shl x31, x31, x31
shl f0, f0, f0 ; really shl x32, x32, x32
shl f1, f1, f1 ; really shl x33, x33, x33
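
A rough Python sketch of the mapping above, in case it helps to see the
expansion spelled out. The function names and the "sv" prefix for the SV-only
range are made up for illustration, and I'm assuming index 0 is x0 (the table
above only lists 1-31):

```python
def reg_name(index: int) -> str:
    """Map a unified register index (0-127) to a display name."""
    if not 0 <= index <= 127:
        raise ValueError(f"register index out of range: {index}")
    if index < 32:
        return f"x{index}"        # "integer" registers (index 0 assumed to be x0)
    if index < 64:
        return f"f{index - 32}"   # "fp" registers f0-f31
    return f"sv{index}"           # SV-only registers; the "sv" name is hypothetical


def expand(op: str, base: int, vl: int) -> list[str]:
    """Expand one vectorised op whose three operands are all renamed to `base`."""
    lines = []
    for i in range(vl):
        r = reg_name(base + i)
        lines.append(f"{op} {r}, {r}, {r}")
    return lines


for line in expand("shl", 30, 4):
    print(line)
```

With base 30 and vl=4 this walks x30, x31, f0, f1, i.e. the vector runs
straight off the end of the integer range into the fp range.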
> > > estimated number of gates in a 4-in priority mux: abouuut... 20? so
> > > it would be somewhere around 80,000 gates for the lane routing.
> > > https://www.electronics-tutorials.ws/combination/comb_2.html
> > we only need regular 4 to 1 muxes, since the select input to the mux is
> > just the high bits of the register number, so, sharing decoding:
> > 1x4-in nand gate, and 4x3-in nand gate; approx 32 transistors per 1-bit
> > mux, 64 bits x 8 alu inputs = 16k transistors total (plus a few hundred for
> > the decoding inverters and buffers). equivalent of 4k 2-in nand gates.
> ah yeah: i'd multiplied by the number of incoming wires, rather than
> the number of outgoing (which is 4x less)
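
For reference, here is the arithmetic from this sub-thread written out as a
small Python sketch. The variable names are mine, the "2 transistors per NAND
input" figure assumes static CMOS, and the 20-gates-per-mux figure is the
rough estimate quoted above, not a measured number:

```python
WIDTH = 64            # bits per operand
MUX_INPUTS = 4        # 4-to-1 muxes
MUXES_PER_LANE = 2
LANES = 4
REGFILES = 2          # one int, one FP

# Input-wire count from the first estimate.
wires_per_mux = WIDTH * MUX_INPUTS                 # 256
wires_per_lane = wires_per_mux * MUXES_PER_LANE    # 512
wires_one_regfile = wires_per_lane * LANES         # 2048
wires_total = wires_one_regfile * REGFILES         # 4096

# ~20 gates per mux, multiplied by *incoming* wires (the over-estimate).
gates_by_input = wires_total * 20
# Same figure multiplied by *outgoing* wires, which are 4x fewer.
gates_by_output = (wires_total // MUX_INPUTS) * 20

# Plain 4:1 mux per bit: 1x 4-input NAND + 4x 3-input NAND,
# assuming static CMOS at 2 transistors per NAND input.
transistors_per_bit = (1 * 4 + 4 * 3) * 2          # 32
ALU_INPUTS = 8
transistors = transistors_per_bit * WIDTH * ALU_INPUTS
nand2_equivalent = transistors // 4                # a 2-input NAND is 4 transistors

print(wires_total, gates_by_input, gates_by_output)
print(transistors_per_bit, transistors, nand2_equivalent)
```

The 16k-transistor figure assumes a single merged register file (8 ALU
inputs), which is why it does not double for int and FP.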
> > > the other alternative that mitch alsup suggested, i recorded his
> > > advice on the microarchitecture page: you just lengthen out the
> > > pipeline by as many stages as is required to read the source operands.
> > > really really simple.
> > >
> > the problem is that you need a read port on the register file for each
> > stage, so you take longer and still need a lot of read ports.
> ... because the operands are shuffling down stages of the pipeline...
> ... which is another reason why i like the tomasulo algorithm, as the
> reservation stations are on the CDB. the operation stays at "ALU
> pipeline stage 1" until all operands are available.
> > > now, could we use a hybrid approach? possibly! we'll find out :)
> > >
> > We could fall back on a barrel processor, similar to the sun t1 (note
> > the t1 has a single fpu per 8-core chip, so its fp stats are junk), that
> > lets us keep the pipeline full, but each individual thread runs slowly.
> > > > etc.) just stall the rest of the processor when the instructions in
> > > > order to create a free slot to write, though we could add another
> > > > port if long instructions are too slow.
> > >
> > > i'm... not totally enamoured with something that relies on stalling
> > > the entire core to deal with a bottleneck.
> > >
> > If we have 3 write ports, we don't need to stall.
> ok. and predicated FP uses the INT regfile to source the predicate...
note that if we're using register renaming or tomasulo's algorithm, then
for fmadd we need to read from src1, src2, src3, pred, and dest and write
to the new dest. I think that's the worst case except maybe for texturing
instructions (which we haven't added yet).
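
A tiny sketch of how I'm counting operands for that worst case. The function
and operand labels are hypothetical, and the rule "the old dest is only read
when the op is both predicated and renamed" is my reading of why dest shows up
as a source (masked-off elements would need the old value copied into the new
destination), not something we've decided:

```python
def fmadd_ports(predicated: bool = True, renamed: bool = True) -> tuple[list[str], list[str]]:
    """Return (reads, writes) for one fmadd under the stated assumptions."""
    reads = ["src1", "src2", "src3"]
    if predicated:
        reads.append("pred")
        if renamed:
            # Assumed: masked-off elements keep the old value, so it must be read.
            reads.append("old dest")
    writes = ["new dest" if renamed else "dest"]
    return reads, writes


reads, writes = fmadd_ports()
print(len(reads), "reads:", reads)
print(len(writes), "writes:", writes)
```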
> > > plus, assuming a 100% pipeline fill (unrealistic but ok for
> > > illustrative purposes) you would also need a 4-wide Common Data Bus
> > > (64-bit x 4) meaning, there's no point issuing 4 instructions if the
> > > results are bottlenecked.
> > >
> > you would need a 4-wide cdb anyway, since that's the performance we're
> > trying for.
> if the 32-bit ops can be grouped as 2x SIMD to a 64-bit-wide ALU,
> then only 2 such ALUs would be needed to give 4x 32-bit FP per cycle
> per core, which means only a 2-wide CDB, a heck of a lot better than
> oh: i thought of another way to cut the power-impact of the Reorder
> Buffer CAMs: a simple bit-field (a single-bit 2RWW memory, of address
> length equal to the number of registers, 2 is because of 2-issue).
> the CAM of a ROB is on the instruction destination register. key:
> ROBnum, value: instr-dest-reg. if you have a bitfield that says "this
> destreg has no ROB tag", it's dead-easy to check that bitfield, first.
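
To check I understand the bit-field idea, here is a small behavioural model in
Python. It is only a sketch: the class and method names are invented, a dict
stands in for the CAM, and the port arrangement for 2-issue is not modelled at
all. The point is just that a one-bit-per-register table can answer "no ROB
tag" without touching the CAM:

```python
from typing import Optional


class RobModel:
    """Toy reorder buffer with a per-register 'has a ROB tag' bit."""

    def __init__(self, num_regs: int = 128) -> None:
        self.dest_of: dict[int, int] = {}    # ROB number -> dest register (the "CAM")
        self.has_tag = [False] * num_regs    # one bit per register
        self.cam_searches = 0                # how often the CAM was actually searched
        self._next_rob = 0

    def allocate(self, dest_reg: int) -> int:
        rob_num = self._next_rob
        self._next_rob += 1
        self.dest_of[rob_num] = dest_reg
        self.has_tag[dest_reg] = True
        return rob_num

    def retire(self, rob_num: int) -> None:
        dest_reg = self.dest_of.pop(rob_num)
        # Clear the bit only if no other in-flight entry targets this register.
        self.has_tag[dest_reg] = dest_reg in self.dest_of.values()

    def lookup(self, src_reg: int) -> Optional[int]:
        """Return the newest ROB number producing src_reg, or None."""
        if not self.has_tag[src_reg]:
            return None                      # cheap path: bit says no tag, skip the CAM
        self.cam_searches += 1
        return max(n for n, r in self.dest_of.items() if r == src_reg)


rob = RobModel()
tag = rob.allocate(5)
print(rob.lookup(6), rob.cam_searches)   # register 6 has no tag, CAM untouched
print(rob.lookup(5), rob.cam_searches)   # register 5 needs one CAM search
```

One thing the model makes visible: clearing the bit at retire needs to know
whether another in-flight entry still targets the same register, so a real
design would need a counter or some other trick there.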