[libre-riscv-dev] GPU design
lkcl at libre-riscv.org
Wed Dec 5 04:39:59 GMT 2018
On Tue, Dec 4, 2018 at 8:30 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> If we use renaming and my idea for having the integer and fp register files
> be the same thing, we can get away with 1.5x 128 64-bit or less.
making the int and fp regfiles be the same thing would make the
compilers extremely complex, and/or may require massive
context-thrashing, especially for mixed VPU / GPU workloads. playback
of videos within browsers typically now requires OpenGL (or
heavily-customised patching) as the rendered buffers are passed
to avoid the thrashing we would really need to split the register file
50-50, 50% to int and 50% to FP, or have the code make some sort of
dynamic adjustment. we *can* do that (halve VL), i just... it makes me
nervous, even just thinking it through.
i have a hunch that the muxing scheme you envisage could be applied
to a two-wide Common Data Bus, with some minor variations:
* split banks along alternating register numbers (x5, x7, x9 + x4 x6 x8)
* split along *half* the register file, upper-32, lower-32.
the reason being that the FP workload will be single-precision SIMD
and, assuming vectors sit in contiguous registers, operand reads would
come as 2 pairs of 32-bit values across 2 registers. also, when it
comes to conversion to integer and so on (ARGB), that is again 32-bit
(or possibly 16-bit colour).
now, obviously, the matrix multiply (and other 1D/2D/3D remapping)
will screw that up, causing one bank to potentially become a
bottleneck, so we have to be careful here.
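
to make the bank-conflict worry concrete, here is a rough python sketch of the two splits. it is a toy model only: the function names, the one-access-per-bank-per-cycle rule and the reading of "upper-32, lower-32" as the two 32-bit halves of each 64-bit register are all assumptions, not a design.

```python
# toy model of two register-file banking schemes (all names hypothetical).
# assumption: each bank can serve exactly one access per cycle.

def bank_odd_even(regnum):
    """split by register number: even registers in bank 0, odd in bank 1."""
    return regnum & 1

def bank_hi_lo(word_index):
    """split by 32-bit half, with word_index = regnum * 2 + half.
    lower halves land in bank 0, upper halves in bank 1."""
    return word_index & 1

def conflicts(cycles, bank_fn):
    """count accesses that collide with an earlier access to the same
    bank within the same cycle."""
    total = 0
    for accesses in cycles:
        seen = set()
        for a in accesses:
            bank = bank_fn(a)
            if bank in seen:
                total += 1
            seen.add(bank)
    return total

# contiguous 32-bit vector elements, two per cycle: banks alternate
contiguous = [(0, 1), (2, 3), (4, 5), (6, 7)]
# stride-2 access (e.g. walking a matrix column): always the same bank
strided = [(0, 2), (4, 6), (8, 10), (12, 14)]

contiguous_conflicts = conflicts(contiguous, bank_hi_lo)  # 0
strided_conflicts = conflicts(strided, bank_hi_lo)        # 4
```

in this model the contiguous pattern never collides, while the strided one collides on every cycle, which is the single-bank bottleneck described above. the same holds for `bank_odd_even` when the stride is counted in whole registers.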
> I found a
> pdf that describes how big a 6r3w register file is in 48nm, they have a
> 32x128 register file that takes up about 30,000 um^2 (figure 1, fabmem), so
> we can expect somewhere around 50,000 um^2 for a 64x192 6r3w register file
> in 28nm.
well, there is this concept:
it is a 2-level hierarchy for register caching. honestly, though,
the reservation stations of the Tomasulo algorithm are similar to a
cache, although only of the intermediate results, not of the initial
i have a feeling we should investigate putting a 2-level register
cache in front of a multiplexed SRAM.
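
as a starting point for that investigation, a minimal behavioural sketch in python. it models only one small cache level in front of a backing store standing in for the SRAM. the LRU replacement policy, the write-back-on-eviction behaviour and every name here (`RegCache`, `sram_reads`, ...) are assumptions chosen for illustration.

```python
from collections import OrderedDict

class RegCache:
    """small LRU cache of register values in front of a slower backing
    store. hypothetical model, not a hardware description."""

    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.backing = backing        # dict regnum -> value (the "SRAM")
        self.lines = OrderedDict()    # regnum -> value, most recent last
        self.sram_reads = 0
        self.sram_writes = 0

    def _insert(self, regnum, value):
        self.lines[regnum] = value
        self.lines.move_to_end(regnum)
        if len(self.lines) > self.capacity:
            # simplification: every evicted entry is written back,
            # whether or not it was modified
            old, old_value = self.lines.popitem(last=False)
            self.backing[old] = old_value
            self.sram_writes += 1

    def read(self, regnum):
        if regnum in self.lines:      # hit: no SRAM port needed
            self.lines.move_to_end(regnum)
            return self.lines[regnum]
        self.sram_reads += 1          # miss: fetch from the SRAM
        value = self.backing[regnum]
        self._insert(regnum, value)
        return value

    def write(self, regnum, value):
        # results go to the cache first and reach the SRAM on eviction
        self._insert(regnum, value)
```

comparing `sram_reads` against the total number of `read()` calls for a given `capacity` gives a first feel for how many SRAM ports the cache could save.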
in those loops you referred to, how many 32-bit values are there, and
how many times are they referenced (as registers) more than once, in
between LD and ST? an answer to that question will give us a clear
idea of how large register caches would need to be.
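
one way to get that answer from an instruction trace, sketched in python. the trace format `(op, dest, srcs)` and the example loop body are made up for illustration; a real trace would come from a simulator or from the compiler output.

```python
def reuse_counts(trace):
    """trace: list of (op, dest, srcs), with dest=None for stores.
    returns one entry per value written to a register: the number of
    times that value was read before being overwritten or before the
    trace ended."""
    live = {}      # regnum -> reads so far of the value it holds now
    counts = []
    for op, dest, srcs in trace:
        for r in srcs:
            if r in live:
                live[r] += 1
        if dest is not None:
            if dest in live:
                counts.append(live[dest])   # previous value is now dead
            live[dest] = 0
    counts.extend(live.values())            # values still live at the end
    return counts

# hypothetical loop body: two loads, a multiply, an add, one store
trace = [
    ("LD",   1,    []),
    ("LD",   2,    []),
    ("FMUL", 3,    [1, 2]),
    ("FADD", 4,    [3, 1]),
    ("ST",   None, [4]),
]

counts = reuse_counts(trace)
num_values = len(counts)                        # 4 values between LD and ST
num_reused = sum(1 for c in counts if c > 1)    # 1 (x1 is read twice)
```

`num_values` bounds how many entries are live at once, and `num_reused` shows how many of them would actually benefit from being cached.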