[libre-riscv-dev] GPU design

Sun Dec 9 07:33:08 GMT 2018

On Sat, Dec 8, 2018 at 9:04 PM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Sun, Dec 9, 2018 at 3:38 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >
> > Luke,
> >
> > What do you think of me building a proof of concept register allocator for
> > my idea for sharing the int and fp register files? I'd estimate that it
> > would take about 2-4 days of work.
>
>  sure... however please note below: modifying a hardware design and
> expecting the compiler to "sort out the mess" is a sure-fire
> guaranteed way to kill a project.  i've learned of *another* project
> that did this:
>
>  https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/BMGz1HoUCAAJ
>
>  others include the Aspex Semiconductors Array-String Processor, which
> had one of the worst productivity rates i've ever encountered: DAYS
> per line of ASSEMBLY code.  i worked for aspex as a FAE for six
> months.
>
>
> > If it works well, we could reuse the architecture in the actual llvm
> > backend when we get around to writing that.
>
>  that's exactly what concerns me.  without the compiler work being
> done *as well* we have absolutely no sure-fire guaranteed way to know
> if the idea will be successful... or not.
>

If we decide to go with my design and we end up with problems, it would be
trivial to convert back to a split design: we just have to double the
register-file size, add an extra fp/int bit to the register index busses,
and change the default location of the fp registers in the SV rename
stage. Nothing else needs to change.
Note that the fp/int bit would need to be generated in the decode stage in
either design anyway since the default mapping for fp registers is to 32-63
and the default mapping for int registers is 0-31.
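To make the default mapping concrete, here is a minimal sketch in Rust. The
names (RegClass, shared_default_index, split_index) are hypothetical, and the
split-design encoding shown is only one assumed way of carrying the extra
fp/int bit; neither is taken from an existing implementation.

```rust
/// Which architectural register class an operand names.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RegClass {
    Int,
    Fp,
}

/// Default index into a shared register file (hypothetical helper).
/// Int registers 0..=31 map to 0..=31, fp registers 0..=31 map to 32..=63.
fn shared_default_index(class: RegClass, num: u8) -> u8 {
    assert!(num < 32, "scalar RV register numbers are 5 bits");
    match class {
        RegClass::Int => num,
        RegClass::Fp => 32 + num,
    }
}

/// Split-design fallback (assumed encoding): the decode stage emits an
/// fp/int bit alongside the index, and each file is indexed from 0.
fn split_index(class: RegClass, num: u8) -> (bool, u8) {
    assert!(num < 32, "scalar RV register numbers are 5 bits");
    (class == RegClass::Fp, num)
}
```

In both cases the decode stage has to know the register class, which is why
the fp/int bit is needed in either design.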

>
>  consequently, it's extremely risky.  and, more than that, there's
> alternative (non-risky) options that are much more "standard".
>
I think the ease of converting the hardware back, should we encounter
problems, mitigates most of the risk. To end up with a sub-optimal but
usable compiler, we would only need to stop the register allocator from
eliminating fp/int bitcasts. I expect that it would take less than a week
to convert the register allocator to take advantage of the newly-split
design, should we end up switching.
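As a rough illustration of the bitcast point, the sketch below shows the one
place where the two designs would differ. It assumes 32-bit values and uses
the RV32F move mnemonics fmv.w.x and fmv.x.w for the split case; the type and
function names are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RegFileDesign {
    Shared,
    Split,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BitcastDir {
    IntToFp,
    FpToInt,
}

/// Returns the move to emit for a 32-bit fp/int bitcast, or None when the
/// bitcast can be eliminated because both values can live in one register.
fn lower_bitcast(design: RegFileDesign, dir: BitcastDir) -> Option<&'static str> {
    match (design, dir) {
        // Shared file: the bits are already in a register either class can read.
        (RegFileDesign::Shared, _) => None,
        // Split files: the value has to cross between the two files.
        (RegFileDesign::Split, BitcastDir::IntToFp) => Some("fmv.w.x"),
        (RegFileDesign::Split, BitcastDir::FpToInt) => Some("fmv.x.w"),
    }
}
```

Switching designs would then amount to changing which arm is taken, at the
cost of the extra moves.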

>
>  also, we cannot just assume that llvm will be the only compiler.  we
> need gcc as well.
>
Since it's designed primarily as a GPU and is compatible with RV32GC-only
code, we can leave GCC for later if we need to. GCC can already generate
usable code, and LLVM is the compiler that I'm planning on using for the
performance-sensitive parts of the 3D and video-decode drivers.

>
>
> > I'm assuming that register allocation is what you think will be most of the
> > compiler problems that you're nervous about.
>

For me, the register allocator is what I envision being the most
problematic on the sw side, which is why I am going to try to implement
that first to see if the software is workable.

>
>  it's a huge number of things:
>
> * the shared workload, between VPU and GPU
>
The way I was envisioning it, independent of how the registers are laid
out, the SV-only registers wouldn't normally be used outside of
shaders/video-codecs (henceforth "shader"), where they are actually
needed, and shaders wouldn't need all the integer and all the fp registers
simultaneously on a single core. Either they would be on the same thread,
and one shader would finish (making all the extended registers dead, so
they don't need to be saved/restored) before the next starts, or they
would be in separate threads that switch at around the 1kHz Linux system
tick rate. You wouldn't have to worry much about inter-thread
dependencies, as you would only have one once every frame.

> * the needs of the GPU for using the integer file (now reduced in
> size) for storing pixels
>
I was never intending to store more than a few pixels in the registers at
any one time. I thought we had planned on the pixels being stored in an
on-chip tile buffer, basically memory-mapped sram, probably shared between
all GPU cores. One advantage of having an on-chip tile buffer is that it
can be used as memory before DRAM is initialized during boot.

> * that scalar RV hasn't done it
>
I presume that this is mostly because they wanted to save the bits in the
instructions for specifying registers so they could use more than 24 or so
fp registers at a time. We shouldn't have that problem, as 128 is highly
likely to be enough.

One other advantage of sharing int and fp register space is that, if 128
turns out to not be enough, we can implement 256 registers, and fp-heavy
code can use around 220 of them, and int-heavy code likewise. That is far
more than either could use in the alternative split design.
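The arithmetic behind that comparison can be sketched as follows. Two things
here are assumptions for illustration only: that the split design divides the
same total storage evenly between int and fp, and the figure of 36 registers
left for the other class, which is simply what makes 256 come out at 220.

```rust
/// Registers available to one class when int and fp share a single file.
/// `other_class_live` is how many registers the other class still needs
/// (an illustrative input, not a measured number).
fn budget_shared(total_regs: u32, other_class_live: u32) -> u32 {
    total_regs.saturating_sub(other_class_live)
}

/// Registers available to one class when the same storage is split evenly
/// between int and fp (assumed split).
fn budget_split(total_regs: u32) -> u32 {
    total_regs / 2
}
```

With 256 registers, budget_shared(256, 36) gives 220 for the dominant class,
while budget_split(256) caps each class at 128.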

> * that the compiler will need to generate "if else" blocks and/or
> function calls on critical loop setup/teardown to dynamically cope
> with runtime register allocation
>
The registers are allocated at compile time, not dynamically at runtime.

> * several other issues which i suspect will crop up as well
>
> honestly, i feel it's one of those nightmare areas that will take
> several months to _retrospectively_ have worked out that it wasn't a
> good idea.
>
> and, with the exploration of the CDC 6600 with mitch alsup's help,
> register renaming can be done by adding one- two- or three- entry
> register queues at the front of each functional unit.
>
> basically, the need for a merged register file is moot, it's hugely
> problematic, it'll take a *long* time to work out *that* it was
> problematic.
>
> please please think about that before committing the time.  can you do
> the hardware *and* the compiler (both gcc and llvm) in the same 2-4
> days?
>
I was planning on implementing the proof-of-concept in Rust, as a
completely separate program, since I'm not familiar enough with either LLVM
or GCC's register allocators to be able to do anything with them right now
and it would take a while to get up-to-speed on them. I'm planning on using
a similar algorithm to what LLVM currently uses, so translating it to
LLVM's code shouldn't be too difficult.
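To give an idea of the shape of such a proof of concept, here is a minimal
sketch of an allocator that draws int and fp values from one shared pool.
It is a plain linear scan over live intervals with no eviction or splitting,
which is much simpler than LLVM's greedy allocator and is not the planned
algorithm. All names are hypothetical, and it assumes each interval's vreg
is its index into the input slice.

```rust
/// A virtual register's live range as half-open instruction indices.
#[derive(Clone, Copy, Debug)]
struct Interval {
    vreg: usize, // assumed to be in 0..intervals.len()
    start: u32,
    end: u32,
}

/// Where a virtual register ends up.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Loc {
    Reg(u8),
    Spill,
}

/// Linear-scan allocation over a single pool shared by int and fp values.
fn allocate(intervals: &[Interval], num_regs: u8) -> Vec<Loc> {
    let mut order: Vec<Interval> = intervals.to_vec();
    order.sort_by_key(|iv| iv.start);

    let mut result = vec![Loc::Spill; intervals.len()];
    // Reversed so that pop() hands out the lowest-numbered register first.
    let mut free: Vec<u8> = (0..num_regs).rev().collect();
    // Currently live values as (end, physical register).
    let mut active: Vec<(u32, u8)> = Vec::new();

    for iv in order {
        // Release registers whose value died at or before this start.
        active.retain(|&(end, reg)| {
            if end <= iv.start {
                free.push(reg);
                false
            } else {
                true
            }
        });
        // Take a free register if there is one; otherwise leave it spilled.
        if let Some(reg) = free.pop() {
            active.push((iv.end, reg));
            result[iv.vreg] = Loc::Reg(reg);
        }
    }
    result
}
```

Because there is only one pool, an fp value can take over a register that an
int value has just released, which is the property the shared register file
is meant to exploit.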

I don't know for sure, but I anticipate that implementing either option in
LLVM will take a similar amount of time, so the difference would be
approximately zero days.

I am not at all familiar with GCC's internals, so I can't say for GCC. I
think GCC will require much more extensive modification regardless of
whether we use my register file modifications, since it (at least the
C/C++ frontend) only supports power-of-2 vector lengths.

Note that if we use my suggested register file modifications, it won't
affect non-SV code since the registers are mapped to disjoint ranges by
default.

If we keep both designs in mind when we design the hardware, it should be
easy to use either design. Outside of the code for the actual register
file, I think the only change will be maybe needing one more bit in the
register number busses if we go with the split design. That assumes we
don't have the hardware split into an fp half and an integer half, which I
would recommend against anyway, as some functional units can be trivially
interconverted between fp and int ops (such as int or fp compare).

> is it possible to reduce the feedback loop latency in *any* way?
>
I'm not sure what you mean.

Jacob