[libre-riscv-dev] GPU design

Jacob Lifshay programmerjake at gmail.com
Sun Dec 9 12:08:47 GMT 2018

Ok, to save time and prevent putting off the decision for too long, we can
just go with the split reg file.

Note that AMDGPU doesn't split between int and fp registers and it is
supported by both gcc and llvm, so it isn't a problem for compilers.

We definitely need to implement proper SV context switching, or else prevent
more than one process from using the SV registers at a time; otherwise it's a
security risk.

I still think it's a good idea to build the prototype register allocator,
but targeting the split reg file, to get the algorithms worked out before
getting into the complexity of existing code in llvm.
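
As a starting point for that prototype, here is a minimal sketch of a
linear-scan allocator with separate int and fp pools (the split reg file).
It is illustrative only: the Interval type, the function name and the tiny
pool sizes are assumptions, and it simply marks a value as spilled when its
class has no free register instead of choosing the best spill candidate.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    name: str   # virtual register name (hypothetical)
    cls: str    # register class: "int" or "fp"
    start: int  # first instruction index where the value is live
    end: int    # last instruction index where the value is live


def linear_scan(intervals, regs_per_class):
    """Give each interval a (class, index) register, or "spill" if its pool is empty."""
    free = {cls: list(range(n)) for cls, n in regs_per_class.items()}
    active = []      # intervals that currently hold a register
    assignment = {}
    for iv in sorted(intervals, key=lambda i: i.start):
        # Expire intervals that ended before this one starts and free their registers.
        still_active = []
        for old in active:
            if old.end < iv.start:
                free[old.cls].append(assignment[old.name][1])
            else:
                still_active.append(old)
        active = still_active
        if free[iv.cls]:
            reg = min(free[iv.cls])
            free[iv.cls].remove(reg)
            assignment[iv.name] = (iv.cls, reg)
            active.append(iv)
        else:
            # Simplification: a real allocator would pick a spill candidate.
            assignment[iv.name] = "spill"
    return assignment


demo = linear_scan(
    [
        Interval("a", "int", 0, 4),
        Interval("b", "int", 1, 2),
        Interval("c", "int", 3, 5),
        Interval("x", "fp", 0, 5),
        Interval("y", "fp", 2, 3),
    ],
    {"int": 2, "fp": 1},
)
print(demo)
```

With two int registers and one fp register, "c" reuses the register that
"b" frees, while "y" has to spill because the only fp register is still
held by "x": the two pools never lend registers to each other.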


On Sun, Dec 9, 2018, 02:46 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> ---
> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
> On Sun, Dec 9, 2018 at 7:33 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> > If we decide to go with my design and we end up with problems, it would be
> > trivial to convert back to a split design,
>  ok, so this is unfortunately another massive alarm-bell when it comes
> to hardware design.  an extremely experienced friend of mine, who has
> worked in the VLSI industry for 30 years, was asked by one of the
> sponsors of this project, "what was the highest cause of project
> failures?" and he answered immediately, "changes to the specification
> and requirements"
>  so initially i was going to say, "yes that's a good idea, we can
> always roll back"... except now we're committed to a
> double-exploration path, that could even go as far as chip layout
> before it gets [potentially] cancelled.
> > we just have to double the
> > register-file size, and add an extra fp/int bit to the register index
> > busses and change the default location of the fp registers in the SV rename
> > stage; everything else doesn't need to change.
>  ... except the SV specification.  this is an impact on the SV
> specification: i need to document it, mark it as an experimental
> feature where no decision has been made.  that is uncertainty that
> will completely knock confidence in SV, because it will be months, if
> not a couple of *years*, before enough information is available.
> > Note that the fp/int bit would need to be generated in the decode stage in
> > either design anyway since the default mapping for fp registers is to 32-63
> > and the default mapping for int registers is 0-31.
>  in effect, yes.  i would envisage the numbering being carried through
> the scoreboard.
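
A minimal sketch of the numbering discussed above, for illustration only.
The function names and the 7-bit bus width (enough for 128 registers) are
assumptions, not taken from the SV specification. The first function shows
the default mapping (int 0-31, fp 32-63) as a single shared number space;
the second shows the split design, where the fp/int bit travels as one
extra bit on the register-number bus.

```python
INT_BASE = 0   # default mapping for int registers: 0-31
FP_BASE = 32   # default mapping for fp registers: 32-63


def unified_index(is_fp: bool, reg: int) -> int:
    """Shared number space: decode turns a 5-bit scalar register into 0-63."""
    if not 0 <= reg < 32:
        raise ValueError("scalar RV register numbers are 5 bits (0-31)")
    return (FP_BASE if is_fp else INT_BASE) + reg


def split_index(is_fp: bool, reg: int, bus_bits: int = 7) -> int:
    """Split design: the fp/int bit sits above an (assumed) 7-bit register number."""
    if not 0 <= reg < (1 << bus_bits):
        raise ValueError("register number does not fit on the bus")
    return (int(is_fp) << bus_bits) | reg


print(unified_index(False, 5), unified_index(True, 5))
print(split_index(False, 5), split_index(True, 5))
```

Either way the decode stage has to know whether an operand is fp or int:
in the shared case the bit ends up as bit 5 of the default register
number, in the split case it is carried alongside the number.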
> > >  consequently, it's extremely risky.  and, more than that, there are
> > > alternative (non-risky) options that are much more "standard".
> > >
> > I think the ease of converting the hw back should we encounter problems
> > should mitigate most of the risk.
>  except... at what point can it be *definitively* determined that it's
> ok to proceed?  and, how much effort will that take?
> > We would only need to change the compiler
> > to not eliminate fp/int bitcasts in the register allocator to end up with a
> > sub-optimal but usable compiler. I expect that it would take less than a
> > week to convert the register allocator to take advantage of the newly-split
> > design should we end up switching.
>  plus time spent on evaluation, plus time spent on an experimental
> llvm compiler, plus time spent on a non-standard experimental gcc
> compiler, plus time spent on project management, plus time spent on
> risk assessment, plus time spent on documenting it as a major change
> to the SV Specification.
>  now, if it was a "standard" part of gcc (and llvm) to have a feature
> like this, i would say "yes let's do it".
>  the fact that there doesn't exist commodity hardware that has this
> feature has me deeply concerned.  it means there *is* no well-tested
> path in the compiler tools on which it is possible to base the idea.
> > >  also, we cannot just assume that llvm will be the only compiler.  we
> > > need gcc as well.
> > >
> > Since it's designed primarily as a GPU and is compatible with RV32GC-only
> > code, we can leave GCC for later if we need to since GCC can already
> > generate usable code and since LLVM is the compiler that I'm planning on
> > using for the performance sensitive parts of the 3D and video-decode
> > drivers.
> it's a GPGPU (+VPU), and i envisioned (eventually) there being
> standard linux kernels and applications compiled up with SV enabled
> (not just GPU or VPU tasks).
> so, at some point, we'd be switching over from how we initially discussed
> things (top numbered regs ignored by linux kernel, no
> context-switching, and a GPU - or VPU - process "pinned" per core), to
> a model where the linux kernel can hypothetically context-switch the
> entire lot, and other workloads get a chance to use those
> higher-numbered registers.
> > > > I'm assuming that register allocation is what you think will be most of the
> > > > compiler problems that you're nervous about.
> > >
> >
> > For me, the register allocator is what I envision being the most
> > problematic on the sw side, hence why I am going to try to implement that
> > first to see if the software is workable.
>  this is where my project management instincts are beginning to kick
> in.  *before* spending the time, i'd like a complete evaluation, so
> that we have a really, really thorough idea of where the time (and
> money) will need to go.
>  above i've started to outline the list of tasks needed... it's
> already a massive list.
> > > * the shared workload, between VPU and GPU
> > >
> > The way I was envisioning it to work, independent of how the registers are
> > laid out, is that the SV-only registers wouldn't be normally used outside
> > of shaders/video-codecs
>  yes.  i liked this idea.  it was however predicated on being able to
> switch over to a more "standard" general-purpose computing scheme,
> later.
>  thus, it would be possible to e.g. compile up a standard parallel
> compute library, and use it in general-purpose low-power parallel
> processing scenarios.
>  with a shared int/fp register file that idea is pretty much terminated.
>  that in turn terminates all the *other* potential revenue areas in
> which this processor could be marketed, because as i learned from
> Aspex Semi, once you deviate from a standard compiler toolchain, the
> market shrinks to a tiny handful of customers that can be counted on
> the fingers of two or even one hand.
> > > * the needs of the GPU for using the integer file (now reduced in
> > > size) for storing pixels
> > >
> > I was never intending on storing more than a few pixels in the registers at
> > any one time. I thought we had planned on the pixels being stored in an
> > on-chip tile buffer, basically memory-mapped sram, probably shared between
> > all GPU cores. One advantage of having an on-chip tile buffer is that it can
> > be used as memory before DRAM initialization during boot.
>  oh, ok - that didn't make it into the High-level architectural
> requirements we discussed a couple months back: i've added a line for
> it, now.
> > > * that scalar RV hasn't done it
> > >
> > I presume that this is mostly because they wanted to save the bits in the
> > instructions for specifying registers so they could use more than 24 or so
> > fp registers at a time; we shouldn't have that problem as 128 is highly
> > likely to be enough.
>  whatever the reason (which we don't know, which is a problem, because
> now we have to find *out* what the reason is, and that is time and
> money spent), it's secondary to the fact that the *compilers* don't
> support this feature.
>  which in turn means: now we have to find out how to add this
> completely unknown feature to the *compilers*, and, unlike SV, it is a
> massive, massive and invasive change to the gcc and llvm codebase.
>  which we have *no idea* how to do.
>  which, in turn, means that we now have to spend time (and money) *finding
> out*.
> > One other advantage of sharing int and fp register space is that, if 128
> > turns out to not be enough, we can implement 256 registers and fp-heavy
> > code can use like 220 of them and int-heavy code can likewise.
>  SV is limited to 128 registers unless there's a massive redesign
> carried out, which will take several weeks implementing in spike-sv,
> plus require a major rethink on the way that the CSRs are compacted.
> > > * that the compiler will need to generate "if else" blocks and/or
> > > function calls on critical loop setup/teardown to dynamically cope
> > > with runtime register allocation
> > >
> > The registers are allocated at compile time, not dynamically at runtime.
>  that's even worse.  now the binaries are hard-coded to this one
> specific core.  they're no longer portable.  whilst we have the source
> code, so in theory "it's ok to recompile", actually it's *not* okay
> because we're writing hand-coded assembler, initially.
>  there are simply too many unknowns here, jacob.  it's time being
> spent even just evaluating an idea that really, really is not needed.
> i've already worked out that we can put rename-aware Reservation
> Stations into the Function Units, in front of the ALUs.
> > > please please think about that before committing the time.  can you do
> > > the hardware *and* the compiler (both gcc and llvm) in the same 2-4
> > > days?
> > >
> > I was planning on implementing the proof-of-concept in Rust, as a
> > completely separate program, since I'm not familiar enough with either
> > LLVM's or GCC's register allocators to be able to do anything with them
> > right now, and it would take a while to get up-to-speed on them.
>  *PRECISELY*.  that's *exactly* my point.  we *don't know* how much
> time it will take, and, as a non-standard feature of both gcc and llvm
> that *not one single successful mass-produced commodity hardware
> processor has*, we're absolutely asking for trouble, here.
> > I am not at all familiar with GCC's internals so I can't say for GCC.
>  *EXACTLY*.  so now we need to spend the time evaluating that *before*
> going ahead.
>  this is an absolute nightmare time-sink, basically.
> > I think
> > GCC will require much more extensive modification irrespective of whether we
> > use my register file modifications, since it (at least the C/C++ frontend)
> > only supports power-of-2 vector lengths.
>  that's an interesting and important thing to know... quite annoying...
> and *thinks*... probably something that will get fixed by the RVV
> working group.  given the huge benefits of variable-length vectors,
> i'm fairly confident that they won't tolerate power-of-2 vector
> lengths, because it requires SIMD-style corner-case cleanup (even
> though the vectorisation engine could actually do non-power-of-2).
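
To illustrate the "corner-case cleanup" point, here is a small sketch that
contrasts the two loop shapes. It is plain Python standing in for vector
code; the function names, the width of 4 and the "setvl-style" step are
assumptions made for the example, not the behaviour of any particular
compiler or of the RVV or SV specifications.

```python
def add_fixed_width(a, b, width=4):
    """SIMD style: whole chunks of `width` elements, then a scalar cleanup loop."""
    n = len(a)
    out = [0] * n
    main = n - (n % width)          # largest multiple of width that fits
    for i in range(0, main, width):
        out[i:i + width] = [x + y for x, y in zip(a[i:i + width], b[i:i + width])]
    for i in range(main, n):        # corner-case cleanup for the leftovers
        out[i] = a[i] + b[i]
    return out


def add_variable_length(a, b, max_vl=4):
    """Variable-length style: each pass handles min(remaining, max_vl) elements."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = min(n - i, max_vl)     # stands in for a setvl-style request
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
    return out


a = list(range(7))
b = [10] * 7
print(add_fixed_width(a, b))
print(add_variable_length(a, b))
```

Both give the same result for 7 elements, but the fixed-width version
needs a second, separate loop for the last 3, while the variable-length
version handles them in its final pass with a shorter vector length.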
> > Note that if we use my suggested register file modifications, it won't
> > affect non-SV code since the registers are mapped to disjoint ranges by
> > default.
> >
> > If, when we design the hardware, we keep both designs in mind, it should be
> > easy to use either design.
>  ... except from a project management and risk management perspective
> it's a total nightmare.  we're already maintaining one major deviation
> from standard hardware (and associated compiler technology).  i'm
> nervous about that, as it is.
>  which brings me on to an important point: SV is, in essence, based on
> the premise that the underlying "scalar" engine does not change.  in
> its simplest form it goes in between the execution and decode engine:
> job done.  of course, changing branches to accumulate predicate bits
> means that the branch logic changes slightly, and the element-widths
> are a huge spanner in the works where, again, it can *just* about be
> shoe-horned into "standard scalar" with some pre-and-post-processing
> rules.
>  the shared register file idea is a whole new level.  it sounds _so_
> simple, just allow the register numbering to be shared.  yet there is
> not a *single* commodity-hardware product out there (the only one we
> know of being an unmitigated disaster: intel MMX) which has the gcc /
> llvm support on which to *begin* basing code modifications.
>  this alone tells me that it is a nightmare which could jeopardise the
> project's viability.  it's just too much.
> >  Outside of the code for the actual register
> > file, I think the only change will be maybe needing one more bit in the
> > register number busses if we go with the split design,
>  slightly ahead of you :)  i'd envisaged it being there anyway.
> > assuming we don't
> > have the hardware split into a fp half and an integer half, which I would
> > recommend against anyway as some functional units can be trivially
> > interconverted between fp and int ops (such as int or fp compare).
>  no, exactly.  i've been studying the diagrams from the two chapters
> of the CDC 6600 that mitch wrote: those inter-conversion points result
> in entries in the "Dependency Matrices" (which are the key strategic
> resource that's completely left out of all the academic literature).
>  splitting the hardware into fp and int would be a royal pain in the
> neck, as that 2D vertical - horizontal set of wires expressing the
> register dependencies between the INT Function Units and the FP
> Function Units would be stretched out to breaking point, instead of
> being a really really compact arrangement [just like a wire mesh
> fence].
>  so yes - bad idea.  group them _together_: yes.  split them out: no.
> > > is it possible to reduce the feedback loop latency in *any* way?
> > I'm not sure what you mean.
>  is there a way to assess how long this proposal will take, including
> bug-fixing, that does *not* span literally months if not years before
> we have a definitive answer?
>  as proposed, the amount of time between "implementing test regfile
> scheme" and "applications being released that *use* that scheme" is,
> at a guess, somewhere around 12 to 18 months away.
>  part of the reason for such a massive estimate is: we have absolutely
> no idea of the amount of time and effort the compiler modifications
> (to gcc and llvm) will take [excluding adding them to the list of
> projects that need funding]
>  now, if you said "it will take 2-4 days, the compiler will be done in
> that time", *that's* a reasonable and acceptable feedback loop. with
> latency on both debugging and decision-making measured in days (not
> months or years), it would be absolutely fine.
>  so, can you envisage a way - on both llvm *and gcc* - in which the
> proposed feature may be tested *right* the way through to an actual,
> real-world compiled-up binary, within days rather than months or
> years?
>  that's not a rhetorical question: if the answer's "yes" then my
> project-management "massive alarm bells" can be turned down a couple
> hundred decibels.
>  however, even once (if) the answer is "yes", it will still be
> necessary to do a cost evaluation, requiring an assessment of the full
> cost in time and effort to get to an actual real compiled binary, plus
> running on simulated hardware (or spike-sv).
>  in addition to *that* it will be necessary to do a full impact
> assessment on other areas where SV could be deployed.
>  it's a *hell* of a lot of work, jacob, just to even do the
> *assessments*, for something that seems so very very simple.
>  ... or, we could *not do it*, we could leverage pre-existing compiler
> paradigms, and use an *existing solution* which is to have Reservation
> Stations (queues of src1/src2 operands to be processed by an ALU,
> basically)
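
A minimal sketch of the "queue of src1/src2 operands in front of an ALU"
idea, for illustration only. The class and method names are assumptions,
and it deliberately leaves out what a real Reservation Station also needs
(tags for operands that are not ready yet, and the scoreboard logic that
decides when to issue). What it does show is that operand values are
captured at issue time, so a later write to the source register does not
disturb an operation that is already queued.

```python
from collections import deque
import operator


class ReservationStation:
    """Queue of pending (dest, src1, src2) entries in front of one ALU."""

    def __init__(self, alu_op, depth=4):
        self.alu_op = alu_op    # the ALU function, e.g. operator.add
        self.depth = depth      # assumed number of queue slots
        self.queue = deque()

    def issue(self, dest, src1, src2):
        """Capture operand values; return False if full (the issuer must stall)."""
        if len(self.queue) >= self.depth:
            return False
        self.queue.append((dest, src1, src2))
        return True

    def execute_one(self, regfile):
        """Pop the oldest entry, run it through the ALU, write the result back."""
        if not self.queue:
            return None
        dest, src1, src2 = self.queue.popleft()
        regfile[dest] = self.alu_op(src1, src2)
        return dest


regfile = {1: 3, 2: 4, 3: 0}
rs = ReservationStation(operator.add, depth=2)
accepted = rs.issue(3, regfile[1], regfile[2])  # operands copied in: 3 and 4
regfile[1] = 100                                # r1 is overwritten after issue
rs.execute_one(regfile)
print(regfile[3])
```

The result written to register 3 is 7, not 104, because the station holds
its own copy of the operands from the moment of issue.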
>  remember that the whole idea of a merged int/fp regfile was primarily
> as a way to *not need* different sized Architectural-Physical Register
> Files, which was believed to be necessary as part of
> register-renaming: it turns out that it's not, at all.
>  oh.  i forgot: we would also need to assess the impact of variable
> element-widths in combination with merged int/fp.  as in, setting an
> elwidth of 16 on FP now stops parts of the corresponding *integer*
> register from being used, as well.   absolute total coding nightmare.
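
One way to picture the elwidth concern is the rough sketch below. It rests
on assumptions that are not taken from the SV specification: 64-bit
registers, elements packed back to back starting at the base register, and
a merged file in which fp and int share the same register numbers. The
function names are hypothetical.

```python
REG_BYTES = 8  # assumption: 64-bit registers


def bytes_touched(base_reg, elwidth_bits, vl):
    """Map register number -> bytes used by vl packed elements of elwidth_bits each."""
    total = vl * elwidth_bits // 8
    touched = {}
    reg = base_reg
    while total > 0:
        used = min(total, REG_BYTES)
        touched[reg] = used
        total -= used
        reg += 1
    return touched


def overlap(a, b):
    """Register numbers that both allocations touch."""
    return sorted(set(a) & set(b))


# An fp vector of six 16-bit elements starting at register 40 ...
fp_vec = bytes_touched(40, 16, 6)
# ... and a single 64-bit integer placed in register 41 of the same file.
int_scalar = bytes_touched(41, 64, 1)
print(fp_vec, int_scalar, overlap(fp_vec, int_scalar))
```

Under these assumptions the fp vector fills register 40 and half of
register 41, so in a merged file an integer value in register 41 collides
with it even though only 4 of that register's 8 bytes are in use.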
> l.