[libre-riscv-dev] building a simple barrel processor

Fri Mar 8 19:01:00 GMT 2019

On Fri, Mar 8, 2019, 02:23 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Fri, Mar 8, 2019 at 9:03 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> > >  we'd therefore need to completely change the design strategy to a
> > > dual (split) CPU + GPU,
> >
> > add forwarding and skip idle harts (defined as harts executing wfi),
> > could have the low registers have more ports or maybe a 4-8 reg-per-hart
> > 3r1w cache.
> >
> > could alternatively have only first hart in each core have a fast mode,
> > linux can handle that thanks to ARM's big.LITTLE support in the scheduler
> > (as of 5.0).
>
>  it's getting complicated, already, isn't it?  and, whatever it is: at
> its core, the design is single-issue in-order pipelining, which is
> already a red flag.  on top of the already-escalating complexity, we
> have to fit SV (as a non-SIMD architecture), *and* fit subdivisions of
> the register file and the ALUs down to the 8-bit variable-length
> vectorisation level.
>
If we execute SV like it is a standard vector ISA, with the registers
split to the 32-bit level and the same ALU design we already have, that
won't make the scheduling circuitry any more complicated than that needed
for div/mod or any other instruction that takes more cycles to execute than
the number of harts per core. The instructions are simply passed through
the pipeline again in the next pipeline slot for the same hart.
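
A rough Python sketch of that re-issue behaviour, assuming a round-robin
barrel in which hart h owns every num_harts-th cycle. The function name, the
`lanes` parameter, and the ceil(vector_len / lanes) pass count are
illustrative assumptions, not part of any agreed design:

```python
def issue_slots(hart, num_harts, vector_len, lanes, first_slot=0):
    """Cycles in which one vector instruction occupies the pipeline.

    Assumption: the ALU handles `lanes` elements per pass, so the
    instruction is re-issued in the hart's next slot until all
    `vector_len` elements are done.
    """
    passes = -(-vector_len // lanes)  # ceiling division
    return [hart + (first_slot + k) * num_harts for k in range(passes)]


# Hart 2 of 8, 10 elements on a 4-lane ALU: three passes, 8 cycles apart.
print(issue_slots(hart=2, num_harts=8, vector_len=10, lanes=4))
```

The other harts keep their own slots in between, so no extra scheduling
state is needed beyond a per-hart "elements remaining" counter.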

>
>  i've spent something like 5 months research and planning the
> multi-issue OoO design, including weeks of research into a strategy
> that will allow us to handle instruction issue down to the 8-bit
> vector level, *without* interfering with the use of the exact same
> register file for 16, 32 and 64-bit: they co-exist.  that's incredibly
> unusual.
>
> in-order pipelines with SIMD-like underlying instructions were already
> rejected at a very early part of the decision tree.
>
> virtually *none* of the planning and research into the multi-issue OoO
> design will transfer over to a single-issue in-order pipeline design.
>
> if the barrel processor had any significant advantages (aside from the
> uniformity) i'd be jumping at it with enthusiasm.  if we had known of
> its existence six months ago, i would have welcomed a comprehensive
> and full analysis.
>
> as things stand, however, i am not seeing any significant advantages:
> i am seeing instead primarily disadvantages, particularly compared to
> having to abandon literally months of design and research effort into
> a much more flexible, comprehensive, and expandable design, and
> starting almost entirely from scratch.
>
One advantage is having simple control hw and not having to worry about
Spectre, since there is no speculative execution.

>
> examples of what constitutes a better design include the Q-Table
> "History" addition, which is an innovation even above and beyond what
> Intel, ARM and AMD have ever come up with.  the Q-Table "History"
> allows precise nameless register renaming, where the removal of
> register names provides an opportunity to skip register writes
> entirely [operand forwarding on steroids].
>
> the normal methods by which the same end-result is achieved are to either:
>
> * have a complete periodic [snapshot] "Historic State", to which the
> entire register and CSR state is "rolled back".
> * destroy the ENTIRETY of the current Function Unit Reservation State,
> roll back dozens of instructions, wait for the processor to stabilise,
> then proceed in SINGLE issue mode very slowly, switching off
> operand-forwarding and other critical power-reducing and performance
> optimisations.
>
> such a feature would be flat-out impossible to add to a single-issue
> in-order pipelined design, as the whole concept is critically
> dependent on the dynamic analysis of multiple in-flight instructions,
> based on the allocation to Reservation Stations / Function Units.  a
> single-issue in-order pipeline *has* no in-flight instructions, and
> *has* no Reservation Stations or Function Units.
>
> what i'm trying to get across here is: by comparison, a barrel
> processor is a huge technological step backwards, and is, i feel, a
> completely wrong fit for use as a hybrid CPU-GPU, and would be complex
> and require abandoning half a year's research if used as a dedicated
> GPU.
>
I am perfectly fine if we decide that a barrel processor is not a good fit.
I just think that the complexity is being over-estimated and we should at
least build an RV64IMA prototype, maybe including the fast hart 0 design in
it.

Note that I am not suggesting the fast hart 0 mode needs to support a
complete processor. I think switching back to the barrel processor
mode to execute non-RV64GC instructions is a good idea, since that will
reduce the complexity of the fast pipeline by a lot. To switch modes, the
pipeline would be flushed, so we would not have to implement all the
complexity of executing instructions during the transition. Basically, it's
a full-featured barrel processor that shares pipeline resources with a
minimal in-order RV64GC processor such that the pipeline runs in two
separate modes, without any temporal overlap.

For the mode selection algorithm, we could use the following:
if any hart other than hart 0 is not executing wfi, then use barrel mode.
else if a non-fast-mode instruction is encountered, then use barrel mode.
else if the number of clock cycles since the last non-fast-mode instruction
was encountered is more than 64, then use fast mode.
else, use barrel mode.
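
A minimal Python sketch of that selection rule, checked in the same order as
listed above. The function and argument names are hypothetical, and how the
hardware would track the cycle counter is left open:

```python
FAST_MODE_THRESHOLD = 64  # cycles, taken from the rule above


def select_mode(hart_in_wfi, instr_is_fast_capable, cycles_since_non_fast):
    """Return "barrel" or "fast" for the next cycle.

    hart_in_wfi: list of bools indexed by hart id (True = executing wfi).
    instr_is_fast_capable: False if hart 0's current instruction is a
        non-fast-mode instruction.
    cycles_since_non_fast: clock cycles since the last non-fast-mode
        instruction was encountered.
    """
    if not all(hart_in_wfi[1:]):      # some hart other than hart 0 is awake
        return "barrel"
    if not instr_is_fast_capable:     # non-fast-mode instruction encountered
        return "barrel"
    if cycles_since_non_fast > FAST_MODE_THRESHOLD:
        return "fast"
    return "barrel"


# Harts 1-3 idle, hart 0 running fast-capable code for 65 cycles.
print(select_mode([False, True, True, True], True, 65))
```

The threshold comparison is strict, so exactly 64 cycles still selects
barrel mode, matching "more than 64" above.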

>
> by contrast, even with the low clock rate, it *is* however perfect as
> an IO bit-banging soft-implementation of peripherals, and that's
> traditionally exactly what it's been used for.
>
>
> also, i remember now: i discussed / evaluated the idea of the
> single-porting with mitch alsup.  he suggested using reduced-ported
> register files and extending the pipeline to read op1 as a first
> stage, op2 as a second stage, op3 as a third, then have a 4-stage FMAC
> and finally a write stage, for a total of 8 stages (instead of the
> normal 6, where ops 1/2/3 are potentially done in a single stage).
>
> the problem was: to get the equivalent performance, you needed *FOUR*
> times the data.  i.e. if you wanted to do 4-vectors, the concept
> forces you to do *four* 4-vectors at once.
>
For the barrel processor, that problem doesn't exist since each hart has an
independent register bank and you can run operations on different banks
simultaneously. That lets each hart's register file have only a
single 128-bit (4x32-bit) r/w port for a 4x32-bit ALU, rather than the
512-bit r/w port that you were probably thinking of. It would operate
similarly to how Hwacha reads and writes registers (Figure 4 on page 12 in
http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-263.pdf) except
that it goes through harts rather than vector elements.
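
As a rough sanity check, here is a small Python sketch that counts port
conflicts on per-hart single-port banks. It assumes round-robin issue and
borrows the stage layout from the 8-stage pipeline quoted above (operand
reads in stages 0, 1, 2 and the write in stage 7). Applying that layout to
the barrel design is my assumption, and the function name is hypothetical:

```python
from collections import Counter


def bank_port_conflicts(num_harts, access_offsets, cycles=64):
    """List (cycle, hart) pairs where a hart's single r/w port is used twice.

    Each issued instruction touches its own hart's bank at
    issue_cycle + offset for every offset in access_offsets.
    """
    accesses = Counter()
    for issue_cycle in range(cycles):
        hart = issue_cycle % num_harts  # round-robin barrel issue
        for offset in access_offsets:
            accesses[(issue_cycle + offset, hart)] += 1
    return sorted(key for key, count in accesses.items() if count > 1)


# 8 harts: each bank sees at most one access per cycle.
print(bank_port_conflicts(8, (0, 1, 2, 7)))
# 2 harts: the same hart re-issues too soon and its port is oversubscribed.
print(len(bank_port_conflicts(2, (0, 1, 2, 7))) > 0)
```

Under these assumptions a single port per bank is enough whenever the access
offsets are distinct modulo the number of harts.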

Jacob