[libre-riscv-dev] building a simple barrel processor

Jacob Lifshay programmerjake at gmail.com
Fri Mar 29 23:03:11 GMT 2019


Sorry for taking so long to respond.

On Sat, Mar 9, 2019 at 5:19 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

>  the SIMD method has the known strong disadvantage that predicate
> masking results in reduced utilisation.  every element that is masked
> out is a wasted opportunity to do useful work.
>
I would argue that reduced utilization is not as strong a disadvantage,
for two reasons:
1. If the design is multi-cycle SIMD, fully masked-out cycles can be
skipped, recovering the lost utilization for a slight increase in
scheduling cost (see the sketch below).
2. For an increase in routing cost, elements can be packed when going
through the ALUs and unpacked again when in registers or memory.
SIMD also has the important advantage of increasing the ALU-to-core ratio
without the large increase in scheduling cost that comes with adding more
issue slots to an out-of-order processor. If the instruction set is
designed with a decent API (such as the SimpleV proposals), the compiler
won't be a problem.
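
To illustrate point 1, here is a minimal software sketch (purely
illustrative, not the actual hardware design; the lane width, function
name and mask encoding are my own assumptions) of a multi-cycle SIMD
unit that walks a vector in fixed-width beats and skips any beat whose
predicate slice is all zero:

LANES = 4  # hypothetical number of elements processed per beat

def masked_vector_add(a, b, mask):
    # element-wise a+b under a predicate mask, LANES elements per beat
    result = list(a)              # masked-out elements keep their old value
    beats_spent = 0
    for beat in range(0, len(a), LANES):
        mslice = mask[beat:beat + LANES]
        if not any(mslice):
            continue              # fully masked beat: no ALU cycle spent
        beats_spent += 1
        for lane, active in enumerate(mslice):
            if active:
                result[beat + lane] = a[beat + lane] + b[beat + lane]
    return result, beats_spent

# 16 elements with only the first 4 active: 1 beat spent instead of 4
r, beats = masked_vector_add(list(range(16)), list(range(16)), [1]*4 + [0]*12)
assert beats == 1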

>  note that there is a real and distinct difference between what an OoO
> can be *used* for and what is *needed*.  speculative branch prediction
> and execution is not *needed* in an OoO design: it's just that it's
> *possible* in an OoO design (where it is flat-out impossible in an
> in-order one).
>
Speculative branch prediction and execution are certainly possible in an
in-order design: every instruction in the pipeline behind a branch but
ahead of the execute stage is speculative, unless the designers stop
filling the pipeline after every instruction that could potentially trap
or jump until it is known whether it actually does (and I hope any
designer has a VERY good reason for building it that way, since it would
be very detrimental to performance).

> > I am perfectly fine if we decide that a barrel processor is not a good
> > fit, I just think that the complexity is being over-estimated and we should
> > at least build a RV64IMA prototype, maybe including the fast hart 0 design
> > in it.
>
>  i think we'll find that, hilariously, an in-order design is a
> degenerate case of the OoO one.
>
Yeah.

>  the key (sole) advantage of a barrel processor - as separate and
> distinct from the timing-attack resistance of a single-issue, uniform
> and zero-stall pipeline - is to provide real-time threading
> guarantees.
>
I agree that there is almost zero need for real-time hyperthreading
(assuming we don't care that much about timing-attack resistance).
However, you are missing a key advantage of the barrel processor: it
removes the need for most of the hardware that handles forwarding, stall
detection, and scheduling in general, while retaining the same (or
higher, because there is no need to stall) pipeline throughput. Most of
the scheduling HW needed is just a per-hart flag that says
run-next-instruction vs. continue-current-instruction (for long
operations such as a load that missed the L1 cache or a division
instruction).
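
As a rough sketch of how little scheduling state that flag implies (this
is just illustrative Python, with the hart count and field names as my
own assumptions, not a proposal for the actual design):

N_HARTS = 8

class Hart:
    def __init__(self, n):
        self.pc = 0x1000 * n      # arbitrary per-hart start address
        self.busy = False         # continue-current-instruction flag

harts = [Hart(n) for n in range(N_HARTS)]

def issue(cycle):
    # fixed rotation: hart selection needs no stall or forwarding checks
    hart = harts[cycle % N_HARTS]
    if hart.busy:
        return ("replay", cycle % N_HARTS, hart.pc)   # e.g. L1 miss, divide
    pc = hart.pc
    hart.pc += 4                                      # run-next-instruction
    return ("issue", cycle % N_HARTS, pc)

for cycle in range(16):
    print(issue(cycle))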

> > > the problem was: to get the equivalent performance, you needed *FOUR*
> > > times the data.  i.e. if you wanted to do 4-vectors, the concept
> > > forces you to do *four* 4-vectors at once.
> > >
> > for the barrel processor, that problem doesn't exist since each hart has an
> > independent register bank and you can run operations on different banks
> > simultaneously, allowing us to have each hart's register file have only a
> > single 128-bit (4x32-bit) r/w port for a 4x32-bit alu, rather than the
> > 512-bit r/w port that you were probably thinking of. It would operate
> > similarly to how Hwacha reads and writes registers (Figure 4 on page 12 in
> > http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-263.pdf) except
> > that it goes through harts rather than vector elements.
>
>  that design involves an NxN crossbar.  N-wide, N-ported.  the
> resource utilisation is *enormous*.  if you recall we went to a heck
> of a lot of trouble to avoid NxN crossbars, by including operand
> forwarding, "de-naming" (and restoration), placing the lane-routing
> *after* the operand computation, and a few other things besides.
>
Assuming a predicated FMA doesn't need multiple trips through the pipeline
to read all of its inputs, then for an 8-stage (8-hart) pipeline with a
64-bit-wide ALU (2 32-bit FMAs per cycle, i.e. 4 FLOPs/cycle), we would
have 8 single-ported SRAMs, each with a 64-bit-wide port, 4 64-bit read
buses from the SRAMs to the rest of the core, and 1 64-bit writeback bus
from the core back to the SRAMs. The read side is only a 64x8-to-64x4
crossbar, since the written-back values can go on a single 64-bit bus
broadcast to all SRAMs.
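
Here's a quick sanity check of that port budget in Python, under stage
assignments I've made up purely for illustration (operand reads in stages
1-4, writeback in stage 7): because each hart occupies a different
pipeline stage on any given cycle, each hart's single-ported SRAM sees at
most one access per cycle, and no more than 4 read buses plus 1 writeback
bus are ever in use.

N_HARTS = 8
READ_STAGES = {1, 2, 3, 4}   # assumed operand-read stages (rs1, rs2, rs3, predicate)
WRITE_STAGE = 7              # assumed writeback stage

for cycle in range(32):
    accesses_per_sram = [0] * N_HARTS
    reads, writes = 0, 0
    for stage in range(N_HARTS):
        hart = (cycle - stage) % N_HARTS   # hart whose instruction is at this stage
        if stage in READ_STAGES:
            accesses_per_sram[hart] += 1
            reads += 1
        elif stage == WRITE_STAGE:
            accesses_per_sram[hart] += 1
            writes += 1
    assert max(accesses_per_sram) <= 1     # a single SRAM port per hart suffices
    assert reads <= 4 and writes <= 1      # 4 read buses, 1 writeback bus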

>
>  it may not be clear that adding barrel slots simply cannot increase
> the total number of FLOPs: *all* it does is, interleave N execution
> threads and *correspondingly stretches the completion time of each by
> N as well*.

It actually stretches the completion time by less than N, since in a
traditional in-order design we still have to wait for the execution units
to finish, which reduces the per-thread performance penalty of a barrel
processor. This effect particularly helps when the program has a lot of
serial data dependencies, since the other threads keep the ALUs busy while
one thread is waiting for its results. I think this effect is one of the
key benefits of barrel processors.
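
A back-of-the-envelope example (the latency and chain length are made-up
numbers, only to illustrate the effect): with an 8-cycle ALU latency and a
chain of dependent operations, one hart of an 8-hart barrel finishes at
roughly the same time a single in-order thread would, while the other 7
harts keep the ALU busy in the gaps.

ALU_LATENCY = 8     # cycles until a dependent result is available (assumed)
N_HARTS = 8
CHAIN = 100         # length of the serial dependency chain per thread

single_thread_cycles = CHAIN * ALU_LATENCY               # each op waits for the last
barrel_per_thread_cycles = CHAIN * max(ALU_LATENCY, N_HARTS)

print(single_thread_cycles)       # 800
print(barrel_per_thread_cycles)   # 800: no extra per-thread stretch in this case
print(N_HARTS * CHAIN / barrel_per_thread_cycles)  # 1.0 ALU op issued per cycle overall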

>   increasing the total number of FLOPs can only be done by
> increasing the actual number of ALUs (and making sure that they're
> active), and adding extra barrel slots does *not* increase the number
> of ALUs.

If we have 4x32-bit ALUs per core, we should still have plenty of FLOPs no
matter which architecture we use (as long as it can keep the ALUs busy).

>
>  what it does instead is, if 1R1W SRAM is used for the register file,
> it increases the latency of the pipeline and multiplies (extends) the
> instruction completion time even further by the number of barrel
> slots.
>
Since we can use single-port SRAMs, we can probably use less power than a
smaller number of 2R1W (triple-port) SRAMs would need. Additionally, each
SRAM would be activated at most 5/8 of the time (for an 8-stage pipeline
executing predicated FMAs), saving further power by not needing to
activate every SRAM every cycle.
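
The 5/8 figure just comes from counting accesses per rotation (assuming,
as above, that a predicated FMA makes four operand reads plus one
writeback through its hart's single SRAM port, one per cycle, and that
each hart issues once every 8 cycles):

reads_per_fma = 4          # rs1, rs2, rs3 and the predicate register (assumed)
writes_per_fma = 1
rotation = 8               # cycles between successive issues of the same hart

duty_cycle = (reads_per_fma + writes_per_fma) / rotation
print(duty_cycle)          # 0.625, i.e. 5/8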

>
>  to compensate for that, the amount of parallelism required in the
> incoming data stream will *also* need to be multiplied by the number
> of barrel slots.
>
>  that then results in increased latency for *single* process
> performance... bear in mind that this is a single-issue design, so now
> a branch point results in extended execution stalling.
>
Actually, the latency is not increased that much: most instructions
complete in a single rotation through all the threads. Additionally, since
a branch can be resolved earlier in the pipeline (it doesn't need to write
anything to the register file), we can forward the next-pc information
directly to the fetch stage, resulting in 0 branch stalls even without
branch prediction.
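
A quick timing check of the zero-stall claim (the resolve stage here is my
own assumption, just to show the slack): the same hart doesn't fetch again
until 8 cycles after its branch was fetched, so as long as the branch
outcome is known before then, the next PC is ready in time.

N_HARTS = 8
BRANCH_RESOLVE_STAGE = 5   # assumed stage at which the branch outcome is known

resolve_cycle = BRANCH_RESOLVE_STAGE   # relative to the branch's own fetch
next_fetch_cycle = N_HARTS             # when the same hart fetches again

assert resolve_cycle <= next_fetch_cycle   # outcome ready: no bubble, no predictor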

Jacob

