[libre-riscv-dev] GPU design

Wed Dec 5 09:52:07 GMT 2018

On Tue, Dec 4, 2018, 20:40 lkcl <lkcl at libre-riscv.org> wrote:

> On Tue, Dec 4, 2018 at 8:30 PM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > If we use renaming and my idea for having the integer and fp register
> > files be the same thing, we can get away with 1.5x 128 64-bit or less.
>
>  making the int and fp regfiles be the same thing would make the
> compilers extremely complex, and/or may require massive
> context-thrashing, especially for mixed VPU / GPU workloads.  playback
> of videos within browsers typically now requires OpenGL (or
> heavily-customised patching) as the rendered buffers are passed
> through OpenGL.
>
It might actually make the compiler simpler, since we wouldn't have as many
different kinds of register to allocate. I don't think you'll have any
context thrashing unless you have too many processes, which would thrash
on any processor anyway. Other than the standard int and fp registers, the
rest could all be caller-saved registers. Most of the actual code that needs
the upper registers will be calling other code in the same shader, which we
can optimize by using a different calling convention in the jit-compiled
code.

>
>  to avoid the thrashing we would need, really, to allocate 50-50 on
> the register file, 50% to int, 50% to FP, or have the code make some
> sort of dynamic adjustment.  we *can* do that (halve VL), i just... it
> makes me nervous, even just thinking it through.
>
We wouldn't need to make any kind of adjustment along those lines.

You could think of it as going from a disk with 2 partitions to a disk with
1 partition: the filesystem can now allocate any block on the whole disk
rather than being limited to half of it. Here the disk is the register file,
the partitions are the rv-base integer and fp register files, and the blocks
are the registers that the compiler's register allocator hands out.
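
To put rough numbers on the partition analogy, here is a minimal sketch in
Python. The function names and the register counts are made up for
illustration, and it only counts how many live values would have to spill;
it is not a model of any real register allocator.

```python
# Hypothetical sketch: spill counts for a split int/fp register file versus
# a unified one. All names and sizes here are illustrative assumptions.

def spills_split(int_live, fp_live, total_regs):
    """Split file: int and fp each own a fixed half of the registers."""
    half = total_regs // 2
    return max(0, int_live - half) + max(0, fp_live - half)


def spills_unified(int_live, fp_live, total_regs):
    """Unified file: any register can hold either kind of value."""
    return max(0, int_live + fp_live - total_regs)


if __name__ == "__main__":
    # An fp-heavy shader: 10 live integers, 100 live floats, 128 registers.
    print(spills_split(10, 100, 128))    # 36 values spill
    print(spills_unified(10, 100, 128))  # 0 values spill
```

The unified file only spills when the total number of live values exceeds
the file size, so it is never worse than the 50-50 split under this simple
counting model.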

>
>
>  i have a hunch that the muxing scheme you envisage could be applied
> to a two-wide Common Data Bus, with some minor variations:
>
>  * split banks along alternating register numbers (x5, x7, x9 + x4 x6 x8)
>  * split along *half* the register file, upper-32, lower-32.
>
> the reason for that being that the FP workload will be
> single-precision SIMD, and, assuming vectors in contiguous registers,
> it would result in operand reads, 2 pairs of 32-bits to 2 registers,
> being single-cycle.
>
> also when it comes to conversion to integer and so on (ARGB) that is
> again 32-bit (or possibly 16-bit colour).
>
> now, obviously, the matrix multiply (and other 1D/2D/3D remapping)
> will screw that up, causing one bank to potentially become a
> bottleneck, so we have to be careful here.
>
> > I found a
> > pdf that describes how big a 6r3w register file is in 48nm, they have a
> > 32x128 register file that takes up about 30,000 um^2 (figure 1, fabmem),
> > so we can expect somewhere around 50,000 um^2 for a 64x192 6r3w register
> > file in 28nm.
> >
> >
> https://www.researchgate.net/publication/316727584_A_case_for_standard-cell_based_RAMs_in_highly-ported_superscalar_processor_structures
>
>  zowee ok.
>
>  well, there is this concept:
>
> https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf
>
>  it is a 2-level hierarchy for register cacheing.  honestly, though,
> the reservation stations of the tomasulo algorithm are similar to a
> cache, although only of the intermediate results, not of the initial
> operands.
>
>  i have a feeling we should investigate putting a 2-level register
> cache in front of a multiplexed SRAM.
>

Whether or not we end up adding caching, I really like combining register
renaming with a scoreboard and reorder buffer, since we could split the
register file so each alu writes to only one portion of it, and allocate
each new register from the portion associated with the alu creating the
value. This would let us greatly reduce the number of write ports required
for each register file portion. I guess it's similar to Tomasulo's
algorithm, except that the part associated with the alu stores the results
instead of the inputs.
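
A rough sketch of the per-alu allocation idea, in Python. The class name
BankedRenamer, its methods and the bank sizes are all hypothetical; it
leaves out the scoreboard, the reorder buffer and freeing registers at
commit, and only shows that a renamed destination always lands in the bank
owned by the alu that produces it.

```python
# Hypothetical sketch: rename each destination register into the bank owned
# by the alu that will write it. Names and sizes are illustrative only.

class BankedRenamer:
    def __init__(self, num_alus, regs_per_bank):
        self.regs_per_bank = regs_per_bank
        # One free list of physical registers per alu-owned bank.
        self.free = [
            list(range(alu * regs_per_bank, (alu + 1) * regs_per_bank))
            for alu in range(num_alus)
        ]
        self.rename_map = {}  # architectural register -> physical register

    def bank_of(self, phys_reg):
        """Return the bank (and therefore the alu) that owns phys_reg."""
        return phys_reg // self.regs_per_bank

    def rename_dest(self, arch_reg, alu):
        """Allocate a physical register from alu's bank, or None to stall."""
        if not self.free[alu]:
            return None
        phys_reg = self.free[alu].pop(0)
        self.rename_map[arch_reg] = phys_reg
        return phys_reg

    def release(self, phys_reg):
        """Return a physical register to the free list of its own bank."""
        self.free[self.bank_of(phys_reg)].append(phys_reg)


if __name__ == "__main__":
    renamer = BankedRenamer(num_alus=4, regs_per_bank=8)
    phys = renamer.rename_dest(arch_reg=5, alu=2)
    print(phys, renamer.bank_of(phys))  # 16 2
```

Since only alu 2 ever writes registers 16 to 23 in this sketch, that bank
needs write ports for one producer only. Reads can still come from any alu,
so the read ports are not reduced by this.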

>
>  in those loops you referred to, how many 32-bit values are there, and
> how many times are they referenced (as registers) more than once, in
> between LD and ST?  an answer to that question will give us a clear
> idea of how large register caches would need to be.
>

I don't recall which loops I was referring to, but I'm designing kazan so
the entire shader is 1 iteration of the inner loop.

Jacob