[libre-riscv-dev] GPU design

Wed Dec 5 04:39:59 GMT 2018

On Tue, Dec 4, 2018 at 8:30 PM Jacob Lifshay <programmerjake at gmail.com> wrote:

> If we use renaming and my idea for having the integer and fp register files
> be the same thing, we can get away with 1.5x 128 64-bit or less.

 making the int and fp regfiles be the same thing would make the
compilers extremely complex, and/or may require massive
context-thrashing, especially for mixed VPU / GPU workloads.  playback
of videos within browsers typically now requires OpenGL (or
heavily-customised patching) as the rendered buffers are passed
through OpenGL.

 to avoid the thrashing we would need, really, to allocate 50-50 on
the register file, 50% to int, 50% to FP, or have the code make some
sort of dynamic adjustment.  we *can* do that (halve VL), i just... it
makes me nervous, even just thinking it through.

 i have a hunch that the muxing scheme you envisage could be applied
to a two-wide Common Data Bus, with some minor variations:

 * split banks along alternating register numbers (x5, x7, x9 + x4 x6 x8)
 * split along *half* the register file, upper-32, lower-32.

the reason for that being that the FP workload will be
single-precision SIMD, and, assuming vectors in contiguous registers,
operand reads (2 pairs of 32-bit values, spread over 2 registers)
would then be single-cycle.

also when it comes to conversion to integer and so on (ARGB) that is
again 32-bit (or possibly 16-bit colour).

now, obviously, the matrix multiply (and other 1D/2D/3D remapping)
will screw that up, causing one bank to potentially become a
bottleneck, so we have to be careful here.
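
 a rough sketch of the banking idea above, to make the bottleneck
concrete.  it assumes *both* splits are applied together (odd/even
register number, and upper/lower 32 bits), giving 4 banks, and that
each bank serves one 32-bit read per cycle.  the names bank_of and
read_cycles, the base register x4 and the two access patterns are made
up for illustration, they are not part of any agreed design.

```python
from collections import Counter

def bank_of(elem, base_reg=4):
    """Map a 32-bit element index to a bank: (register parity, 32-bit half).

    Assumption: two 32-bit elements are packed per 64-bit register,
    starting at register x<base_reg>.
    """
    reg = base_reg + elem // 2   # which 64-bit register holds the element
    half = elem % 2              # 0 = lower 32 bits, 1 = upper 32 bits
    return (reg % 2, half)

def read_cycles(elems, base_reg=4):
    """Cycles needed to read the elements, one read per bank per cycle."""
    load = Counter(bank_of(e, base_reg) for e in elems)
    return max(load.values())

contiguous = [0, 1, 2, 3]    # 4 adjacent elements, spread over 2 registers
strided = [0, 4, 8, 12]      # e.g. a column walk from a 2D remap

print(read_cycles(contiguous))  # 1: every element lands in a different bank
print(read_cycles(strided))     # 4: every element lands in the same bank
```

 under those assumptions the contiguous read is single-cycle, and the
strided one serialises on a single bank, which is the case we would
need to be careful about.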

> I found a
> pdf that describes how big a 6r3w register file is in 48nm, they have a
> 32x128 register file that takes up about 30,000 um^2 (figure 1, fabmem), so
> we can expect somewhere around 50,000 um^2 for a 64x192 6r3w register file
> in 28nm.
>
> https://www.researchgate.net/publication/316727584_A_case_for_standard-cell_based_RAMs_in_highly-ported_superscalar_processor_structures

 zowee ok.

 well, there is this concept:
 https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf

 it is a 2-level hierarchy for register caching.  honestly, though,
the reservation stations of the tomasulo algorithm are similar to a
cache, although only of the intermediate results, not of the initial
operands.

 i have a feeling we should investigate putting a 2-level register
cache in front of a multiplexed SRAM.

 in those loops you referred to, how many 32-bit values are there, and
how many times are they referenced (as registers) more than once, in
between LD and ST?  an answer to that question will give us a clear
idea of how large register caches would need to be.
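
 one way that question could be answered from an instruction trace,
as a minimal sketch.  the trace format here, (op, dest, srcs) tuples,
and the register names are hypothetical, as is the function name
reuse_stats: the point is only to count how many values are produced
and how many of them are read more than once before being overwritten.

```python
from collections import Counter

def reuse_stats(trace):
    """Return (values produced, values read more than once).

    trace is a list of (op, dest, srcs) tuples, where dest is a register
    name or None (e.g. for a store) and srcs is a list of register names.
    """
    reads = Counter()   # value id -> number of times it was read
    current = {}        # register name -> id of the value it holds now
    n_values = 0
    for op, dest, srcs in trace:
        for reg in srcs:
            if reg in current:
                reads[current[reg]] += 1
        if dest is not None:
            current[dest] = n_values   # a write creates a fresh value
            n_values += 1
    reused = sum(1 for v in range(n_values) if reads[v] > 1)
    return n_values, reused

# made-up loop body: two loads, a multiply, an add, one store
trace = [
    ("ld",   "f0", []),
    ("ld",   "f1", []),
    ("fmul", "f2", ["f0", "f1"]),
    ("fadd", "f3", ["f2", "f0"]),
    ("st",   None, ["f3"]),
]
print(reuse_stats(trace))  # (4, 1): four values, only f0 is read twice
```

 the second number, taken over the real loops, is roughly what a
register cache would have to hold to be worthwhile.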

l.