[libre-riscv-dev] GPU design

Mon Dec 10 08:21:10 GMT 2018

On Sun, Dec 9, 2018 at 9:16 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Sun, Dec 9, 2018, 05:34 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > On Sun, Dec 9, 2018 at 12:09 PM Jacob Lifshay <programmerjake at gmail.com>
> > wrote:
> >
> > > I still think it's a good idea to build the prototype register allocator,
> > > but targeting the split reg file, to get the algorithms worked out before
> > > getting into the complexity of existing code in llvm.
> >
> >  do you mean the "renaming" mechanism? or the multiplexor?  or the
> > Reservation Stations... or all of the above? :)
> >
> I don't mean any of the hw used to implement it, i mean the part of a
> compiler that allocates architecturally-visible registers for variables.

 ah ok.  yes.  in discussions dating back to the llvm AMDGPU RFC,
 the idea occurred to me that a higher-order-function will be needed,
 to select "free registers".

 pass in the current allocation state as an array of
register+offset+elwidths (ultimately, a bitfield indicating which
*bytes* of the regfile are in use), plus the request as an array of
register+elwidths and the "preferred vector length".  the function
will, in a similar way to malloc, find the best match for the requested
register-allocation and return a new "allocated" array/bitfield.
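
 a minimal sketch of that malloc-like search, to make the idea
concrete.  everything here is an assumption for illustration: the
names (alloc_bytes, REG_BYTES, NUM_REGS) are hypothetical, registers
are assumed to be 64-bit, and it does plain first-fit rather than a
true "best match".

```python
REG_BYTES = 8    # assumption: 64-bit registers
NUM_REGS = 128   # the 128 SV rename target registers from the thread


def alloc_bytes(in_use, elwidth, vl):
    """First-fit search for room for vl elements of elwidth bytes each.

    in_use is an int used as a bitfield: bit n set means byte n of the
    regfile is occupied.  Returns (start_byte, new_in_use), or None if
    no free run is large enough.
    """
    size = elwidth * vl
    total = REG_BYTES * NUM_REGS
    mask = (1 << size) - 1
    # only try start offsets aligned to the element width
    for start in range(0, total - size + 1, elwidth):
        if in_use & (mask << start) == 0:
            return start, in_use | (mask << start)
    return None


# allocate 4 x 32-bit elements, then 3 x 16-bit elements after them
start_a, used = alloc_bytes(0, elwidth=4, vl=4)
start_b, used = alloc_bytes(used, elwidth=2, vl=3)
reg, offset = divmod(start_b, REG_BYTES)  # register number + byte offset
```

 divmod(start, REG_BYTES) turns the returned byte position back into
the register+offset form.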

> SV
> is unusual in that it will need a 2-pass register allocation algorithm:
> allocating each variable a range of successive registers from the 128 SV
> rename target registers, then allocating the registers used in the actual
> instruction encodings and adding csr writes wherever they are needed.

 it'll be more complex than that, due to the element widths and
offsets.  an optimiser would also need to take into account the current
state (the CSR rename "stack"), using the number of CSR writes needed
to get from the current state to the desired state (and back) as a
cost guide.
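
 as a rough sketch of that cost guide: assume the rename "stack" is a
list of 16-bit entries and that each changed entry costs exactly one
CSR write (both are assumptions, and csr_write_cost is a hypothetical
name).  the cost is then the number of differing entries, doubled for
the trip back.

```python
def csr_write_cost(current, desired):
    """Estimated CSR writes to go from current to desired and back.

    Assumes one CSR write per entry that differs, and that restoring
    the original state costs the same as leaving it.
    """
    if len(current) != len(desired):
        raise ValueError("tables must have the same number of entries")
    one_way = sum(1 for c, d in zip(current, desired) if c != d)
    return 2 * one_way


current = [0] * 16
desired = list(current)
desired[1] = 0x1234
desired[5] = 0x0001
desired[9] = 0x0002
cost = csr_write_cost(current, desired)  # 3 entries differ
```

 an allocator could compare this figure across candidate allocations
and prefer the one needing the fewest writes.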

> Note that I think it will be very important that we are able to write to a
> csr to change the rename cam and be able to use the new rename entry with
> the vectorized instruction able to start in the clock cycle immediately
> after the previous vectorized instruction starts, either through macro-op
> fusion or dual-issue or some other means.

 yyeah that's going to be hell when the 1D/2D/3D REMAP is active.
it'll be challenging enough as it is, because there's 28 (or so) bits
worth of state:

 * number of CSRs marked "active" (4 bits)
 * MAXVL (6 bits)
 * VL (6 bits)
 * src element offset (6 bits)
 * twin-predicated element offset (6 bits)
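
 the field widths above do add up to 28 bits.  a small sketch of
packing them into one word per instruction follows; the field order
and the names (FIELDS, pack_state, unpack_state) are assumptions, only
the widths come from the list.

```python
# (name, width in bits); widths are from the list, order is assumed
FIELDS = (
    ("num_active", 4),
    ("maxvl", 6),
    ("vl", 6),
    ("src_offset", 6),
    ("pred_offset", 6),
)
STATE_BITS = sum(width for _, width in FIELDS)  # 4 + 6 + 6 + 6 + 6


def pack_state(**values):
    """Pack the named fields into a single integer, lowest field first."""
    word = 0
    shift = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack_state(word):
    """Inverse of pack_state: return a dict of field values."""
    out = {}
    for name, width in FIELDS:
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out


snapshot = pack_state(num_active=3, maxvl=8, vl=5, src_offset=2, pred_offset=0)
restored = unpack_state(snapshot)
```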

 that's an "ok" amount to associate with each instruction (so that
rollback can occur).  the 1D/2D/3D REMAP really isn't.

 realistically, the reg / predication CSR stacks aren't either.
now, we *might* be able to leverage the fact that some of them are
disabled through the "num_active" CSR (which will make a range of the
CSR stack "invisible" without actually modifying any of the entries).

 however the moment any of those CSR reg/pred stack entries are
modified, i really really strongly advocate that we stall and not
attempt to do any kind of speculative execution, not for a first
version, at least.

> We may want to add additional
> instructions that allow programming the rename table faster than the
> builtin csr instructions. I would suggest a csrrwi with a bigger immediate
> and a smaller csr selector.

 yes, ok, so if we can have "presets" - commonly-used read-only
configurations of CSR reg/pred entries - then the problem of which set
to use during speculative execution goes away.

 now, if those read-only configurations actually involve swapping in
and out alternative tables that can *also* be written to, then we have
a huge amount of flexibility, no stalling needed, the only penalty
being: now those alternative tables (16x 16-bit for reg, same for
pred) become part of the context-switch state.
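
 to put a number on that penalty: with 16x 16-bit entries for reg and
the same for pred, each writable bank adds 64 bytes of context.  the
number of banks is not decided anywhere in this thread, so num_banks
below is purely an assumption, as is the name context_bytes.

```python
ENTRIES = 16      # entries per table, from "16x 16-bit"
ENTRY_BITS = 16   # bits per entry
TABLES = 2        # one reg table plus one pred table


def context_bytes(num_banks):
    """Extra context-switch state, in bytes, for num_banks writable banks."""
    return num_banks * TABLES * ENTRIES * ENTRY_BITS // 8


one_bank = context_bytes(1)
four_banks = context_bytes(4)
```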

l.