[libre-riscv-dev] GPU design

Sun Dec 9 10:46:08 GMT 2018

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Sun, Dec 9, 2018 at 7:33 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> If we decide to go with my design and we end up with problems, it would be
> trivial to convert back to a split design,

 ok, so this is unfortunately another massive alarm-bell when it comes
to hardware design.  an extremely experienced friend of mine, who has
worked in the VLSI industry for 30 years, was asked by one of the
sponsors of this project, "what was the highest cause of project
failures?" and he answered immediately, "changes to the specification
and requirements"

 so initially i was going to say, "yes that's a good idea, we can
always roll back"... except now we're committed to a
double-exploration path, that could even go as far as chip layout
before it gets [potentially] cancelled.

> we just have to double the
> register-file size, and add an extra fp/int bit to the register index
> busses and change the default location of the fp registers in the SV rename
> stage; everything else doesn't need to change.

 ... except the SV specification.  this is an impact on the SV
specification: i need to document it, mark it as an experimental
feature on which no decision has been made.  that is uncertainty that
will completely knock confidence in SV, because it will be months, if
not a couple of *years*, before enough information is available.

> Note that the fp/int bit would need to be generated in the decode stage in
> either design anyway since the default mapping for fp registers is to 32-63
> and the default mapping for int registers is 0-31.

 in effect, yes.  i would envisage the numbering being carried through
the scoreboard.
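
 as a minimal sketch of the numbering being discussed (not anything from
the SV spec): the 5-bit RV register field plus an fp/int bit from decode
gives a single flat register number, using the default ranges quoted
above (int 0-31, fp 32-63).  the names unified_regnum, INT_BASE and
FP_BASE are hypothetical, made up purely for illustration.

```python
INT_BASE = 0   # default mapping for int registers: 0-31 (from the thread)
FP_BASE = 32   # default mapping for fp registers: 32-63 (from the thread)


def unified_regnum(is_fp: bool, idx: int) -> int:
    """Map a 5-bit RV register field plus an fp/int bit to a flat number.

    Hypothetical helper: it only illustrates the default mapping, not the
    SV rename stage or any remapping done through CSRs.
    """
    if not 0 <= idx < 32:
        raise ValueError("RV register fields are 5 bits wide")
    return (FP_BASE if is_fp else INT_BASE) + idx
```

 with this, x5 stays at 5 and f5 becomes 37, and that flat number is
what would be carried through the scoreboard in either design.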

> >  consequently, it's extremely risky.  and, more than that, there's
> > alternative (non-risky) options that are much more "standard".
> >
> I think the ease of converting the hw back should we encounter problems
> should mitigate most of the risk.

 except... at what point can it be *definitively* determined that it's
ok to proceed?  and, how much effort will that take?

> We would only need to change the compiler
> to not eliminate fp/int bitcasts in the register allocator to end up with a
> sub-optimal but usable compiler. I expect that it would take less than a
> week to convert the register allocator to take advantage of the newly-split
> design should we end up switching.

 plus time spent on evaluation, plus time spent on an experimental
llvm compiler, plus time spent on a non-standard experimental gcc
compiler, plus time spent on project management, plus time spent on
risk assessment, plus time spent on documenting it as a major change
to the SV Specification.

 now, if it was a "standard" part of gcc (and llvm) to have a feature
like this, i would say "yes let's do it".

 the fact that there doesn't exist commodity hardware that has this
feature has me deeply concerned.  it means there *is* no well-tested
path in the compiler tools on which it is possible to base the idea.

> >  also, we cannot just assume that llvm will be the only compiler.  we
> > need gcc as well.
> >
> Since it's designed primarily as a GPU and is compatible with RV32GC-only
> code, we can leave GCC for later if we need to since GCC can already
> generate usable code and since LLVM is the compiler that I'm planning on
> using for the performance sensitive parts of the 3D and video-decode
> drivers.

it's a GPGPU (+VPU), and i envisioned (eventually) there being
standard linux kernels and applications compiled up with SV enabled
(not just GPU or VPU tasks).

so, at some point, we would be switching over from how we initially discussed
things (top numbered regs ignored by linux kernel, no
context-switching, and a GPU - or VPU - process "pinned" per core), to
a model where the linux kernel can hypothetically context-switch the
entire lot, and other workloads get a chance to use those
higher-numbered registers.

> > > I'm assuming that register allocation is what you think will be most of the
> > > compiler problems that you're nervous about.
> >
>
> For me, the register allocator is what I envision being the most
> problematic on the sw side, hence why I am going to try to implement that
> first to see if the software is workable.

 this is where my project management instincts are beginning to kick
in.  *before* spending the time, i'd like a complete evaluation, so
that we have a really, really thorough idea of where the time (and
money) will need to go.

 above i've started to outline the list of tasks needed... it's
already a massive list.

> > * the shared workload, between VPU and GPU
> >
> The way I was envisioning it to work, independent of how the registers are
> laid out, is that the SV-only registers wouldn't be normally used outside
> of shaders/video-codecs

 yes.  i liked this idea.  it was however predicated on being able to
switch over to a more "standard" general-purpose computing scheme,
later.

 thus, it would be possible to e.g. compile up a standard parallel
compute library, and use it in general-purpose low-power parallel
processing scenarios.

 with a shared int/fp register file that idea is pretty much terminated.

 that in turn terminates all the *other* potential revenue areas in
which this processor could be marketed, because as i learned from
Aspex Semi, once you deviate from a standard compiler toolchain, the
market shrinks to a tiny handful of customers that can be counted on
the fingers of two or even one hand.

> > * the needs of the GPU for using the integer file (now reduced in
> > size) for storing pixels
> >
> I was never intending on storing more than a few pixels in the registers at
> any one time. I thought we had planned on the pixels being stored in an
> on-chip tile buffer, basically memory-mapped sram, probably shared between
> all GPU cores. One advantage of having a on-chip tile buffer is it can be
> used as memory for before DRAM initialization during boot.

 oh, ok - that didn't make it into the High-level architectural
requirements we discussed a couple months back: i've added a line for
it, now.

> > * that scalar RV hasn't done it
> >
> I presume that this is mostly because they wanted to save the bits in the
> instructions for specifying registers so they could use more than 24 or so
> fp registers at a time, we shouldn't have that problem as 128 is highly
> likely to be enough.

 whatever the reason (which we don't know, which is a problem, because
now we have to find *out* what the reason is, and that is time and
money spent), it's secondary to the fact that the *compilers* don't
support this feature.

 which in turn means: now we have to find out how to add this
completely unknown feature to the *compilers*, and, unlike SV, it is a
massive, massive and invasive change to the gcc and llvm codebase.

 which we have *no idea* how to do.

 which, in turn, means that we now have to spend time (and money) *finding out*.

> One other advantage of sharing int and fp register space is that, if 128
> turns out to not be enough, we can implement 256 registers and fp-heavy
> code can use like 220 of them and int-heavy code can likewise.

 SV is limited to 128 registers unless there's a massive redesign
carried out, which will take several weeks implementing in spike-sv,
plus require a major rethink on the way that the CSRs are compacted.

> > * that the compiler will need to generate "if else" blocks and/or
> > function calls on critical loop setup/teardown to dynamically cope
> > with runtime register allocation
> >
> The registers are allocated at compile time, not dynamically at runtime.

 that's even worse.  now the binaries are hard-coded to this one
specific core.  they're no longer portable.  whilst we have the source
code, so in theory "it's ok to recompile", actually it's *not* okay
because we're writing hand-coded assembler, initially.

 there are simply too many unknowns here, jacob.  it's time being
spent even just evaluating an idea that really, really is not needed.
i've already worked out that we can put rename-aware Reservation
Stations into the Function Units, in front of the ALUs.

> > please please think about that before committing the time.  can you do
> > the hardware *and* the compiler (both gcc and llvm) in the same 2-4
> > days?
> >
> I was planning on implementing the proof-of-concept in Rust, as a
> completely separate program, since I'm not familiar enough with either LLVM
> or GCC's register allocators to be able to do anything with them right now
> and it would take a while to get up-to-speed on them.

 *PRECISELY*.  that's *exactly* my point.  we *don't know* how much
time it will take, and, as a non-standard feature of both gcc and llvm
that *not one single successful mass-produced commodity hardware
processor has*, we're absolutely asking for trouble, here.

> I am not at all familiar with GCC's internals so I can't say for GCC.

 *EXACTLY*.  so now we need to spend the time evaluating that *before*
going ahead.

 this is an absolute nightmare time-sink, basically.

> I think
> GCC will require much more extensive modification irrespective of if we use
> my register file modifications since (at least the C/C++ frontend) only
> supports power-of-2 vector lengths.

 that's an interesting and important thing to know... quite annoying...
and *thinks*... probably something that will get fixed by the RVV
working group.  given the huge benefits of variable-length vectors,
i'm fairly confident that they won't tolerate power-of-2 vector
lengths, because it requires SIMD-style corner-case cleanup (even
though the vectorisation engine could actually do non-power-of-2).
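
 to illustrate what "SIMD-style corner-case cleanup" means here, a rough
sketch comparing how a loop of n elements gets chunked under the two
schemes.  the function names and the width of 4 are assumptions picked
for the example; nothing here models gcc, llvm or RVV itself.

```python
def simd_chunks(n, width=4):
    """Fixed power-of-2 SIMD: full-width chunks, then a scalar tail.

    Returns a list of (start_index, elements_processed) tuples.
    """
    full = n - n % width
    main = [(i, width) for i in range(0, full, width)]
    tail = [(i, 1) for i in range(full, n)]   # corner-case cleanup loop
    return main + tail


def vl_chunks(n, max_vl=4):
    """Variable-length vectors: each pass handles min(remaining, max_vl)."""
    chunks = []
    i = 0
    while i < n:
        vl = min(n - i, max_vl)   # the last pass simply gets a shorter length
        chunks.append((i, vl))
        i += vl
    return chunks
```

 for n=10 the fixed-width version needs two full chunks plus two scalar
cleanup iterations, whereas the variable-length version finishes in
three passes with no separate cleanup code.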

> Note that if we use my suggested register file modifications, it won't
> affect non-SV code since the registers are mapped to disjoint ranges by
> default.
>
> If when we design the hardware, we keep both designs in mind, it should be
> easy to use either design.

 ... except from a project management and risk management perspective
it's a total nightmare.  we're already maintaining one major deviation
from standard hardware (and associated compiler technology).  i'm
nervous about that, as it is.

 which brings me on to an important point: SV is, in essence, based on
the premise that the underlying "scalar" engine does not change.  in
its simplest form it goes in between the execution and decode engine:
job done.  of course, changing branches to accumulate predicate bits
means that the branch logic changes slightly, and the element-widths
are a huge spanner in the works where, again, it can *just* about be
shoe-horned into "standard scalar" with some pre-and-post-processing
rules.

 the shared register file idea is a whole new level.  it sounds _so_
simple, just allow the register numbering to be shared.  yet there is
not a *single* commodity-hardware product out there (the only one we
know of being an unmitigated disaster: intel MMX) which has the gcc /
llvm support on which to *begin* basing code modifications.

 this alone tells me that it is a nightmare which could jeopardise the
project's viability.  it's just too much.

>  Outside of the code for the actual register
> file, I think the only change will be maybe needing one more bit in the
> register number busses if we go with the split design,

 slightly ahead of you :)  i'd envisaged it being there anyway.

> assuming we don't
> have the hardware split into a fp half and an integer half, which I would
> recommend against anyway as some functional units can be trivially
> interconverted between fp and int ops (such as int or fp compare).

 no, exactly.  i've been studying the diagrams from the two chapters
of the CDC 6600 that mitch wrote: those inter-conversion points result
in entries in the "Dependency Matrices" (which are the key strategic
resource that's completely left out of all the academic literature).

 splitting the hardware into fp and int would be a royal pain in the
neck, as that 2D vertical - horizontal set of wires expressing the
register dependencies between the INT Function Units and the FP
Function Units would be stretched out to breaking point, instead of
being a really really compact arrangement [just like a wire mesh
fence].

 so yes - bad idea.  group them _together_: yes.  split them out: no.

> > is it possible to reduce the feedback loop latency in *any* way?
> I'm not sure what you mean.

 is there a way to assess how long this proposal will take, including
bug-fixing, that does *not* span literally months if not years before
we have a definitive answer?

 as proposed, the amount of time between "implementing test regfile
scheme" and "applications being released that *use* that scheme" is,
at a guess, somewhere around 12 to 18 months away.

 part of the reason for such a massive estimate is: we have absolutely
no idea of the amount of time and effort the compiler modifications
(to gcc and llvm) will take [excluding adding them to the list of
projects that need funding]

 now, if you said "it will take 2-4 days, the compiler will be done in
that time", *that's* a reasonable and acceptable feedback loop. with
latency on both debugging and decision-making measured in days (not
months or years), it would be absolutely fine.

 so, can you envisage a way - on both llvm *and gcc* - in which the
proposed feature may be tested *right* the way through to an actual,
real-world compiled-up binary, within days rather than months or
years?

 that's not a rhetorical question: if the answer's "yes" then my
project-management "massive alarm bells" can be turned down a couple
hundred decibels.

 however, even once (if) the answer is "yes", it will still be
necessary to do a cost evaluation, requiring an assessment of the full
cost in time and effort to get to an actual real compiled binary, plus
running on simulated hardware (or spike-sv).

 in addition to *that* it will be necessary to do a full impact
assessment on other areas where SV could be deployed.

 it's a *hell* of a lot of work, jacob, just to even do the
*assessments*, for something that seems so very very simple.

 ... or, we could *not do it*, we could leverage pre-existing compiler
paradigms, and use an *existing solution* which is to have Reservation
Stations (queues of src1/src2 operands to be processed by an ALU,
basically)
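
 a deliberately simplified sketch of that description: a bounded queue
of (src1, src2) operand pairs sitting in front of one ALU.  the class
and method names are hypothetical, and it leaves out everything that
makes a real Reservation Station interesting (waiting on operands that
are not yet ready, rename-awareness, result broadcast).

```python
from collections import deque


class ReservationStation:
    """Hypothetical model: a FIFO of (src1, src2) pairs feeding one ALU."""

    def __init__(self, alu_op, depth=4):
        self.alu_op = alu_op    # function of two operands, e.g. an adder
        self.depth = depth      # number of queue entries (assumed value)
        self.queue = deque()

    def issue(self, src1, src2):
        """Queue an operand pair; return False if full (issue must stall)."""
        if len(self.queue) >= self.depth:
            return False
        self.queue.append((src1, src2))
        return True

    def execute_one(self):
        """Pop the oldest operand pair and run it through the ALU."""
        if not self.queue:
            return None
        src1, src2 = self.queue.popleft()
        return self.alu_op(src1, src2)
```

 e.g. ReservationStation(lambda a, b: a + b, depth=2) accepts two
issue() calls, refuses a third until execute_one() has drained an
entry, and returns results in issue order.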

 remember that the whole idea of a merged int/fp regfile was primarily
as a way to *not need* different sized Architectural-Physical Register
Files, which was believed to be necessary as part of
register-renaming: it turns out that it's not, at all.

 oh.  i forgot: we would also need to assess the impact of variable
element-widths in combination with merged int/fp.  as in, setting an
elwidth of 16 on FP now stops parts of the corresponding *integer*
register from being used, as well.   absolute total coding nightmare.
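
 a small sketch of why that is awkward, under two loudly-stated
assumptions: registers are 64 bits wide, and elements are packed
contiguously starting at the base register.  regs_touched and REG_BITS
are hypothetical names, not anything from spike-sv.

```python
REG_BITS = 64  # assumption: 64-bit registers


def regs_touched(base, vl, elwidth):
    """Return {regnum: bits_used} for vl packed elements of elwidth bits.

    Assumes elwidth divides REG_BITS so that no element straddles two
    registers.
    """
    used = {}
    for i in range(vl):
        reg = base + (i * elwidth) // REG_BITS
        used[reg] = used.get(reg, 0) + elwidth
    return used
```

 six 16-bit FP elements starting at register 40 fill register 40
completely and half of register 41.  with a merged int/fp file,
register 41 is now also half-occupied from the integer side, which the
register allocator would have to track.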

l.