[libre-riscv-dev] building a simple barrel processor

Sat Mar 9 13:18:17 GMT 2019

On Fri, Mar 8, 2019 at 7:01 PM Jacob Lifshay <programmerjake at gmail.com> wrote:

> if we execute SV like it is a standard vector isa, having the registers
> split to the 32-bit level and the same alu design we already have, that
> won't make the scheduling circuitry any more complicated than that needed
> for div/mod or any other instruction that takes more cycles to execute than
> the number of harts per core. The instructions are simply passed through
> the pipeline again in the next pipeline slot for the same hart.

 an out-of-order design is in effect a parallel processor with a means
and method of preserving the order of instructions.  vectorisation may
be made synonymous with multi-issue through the simple process of
pushing element-based operations into the existing multi-issue
instruction queue.

 there are three main ways in which an in-order design may do multiple
operations, to increase the overall OPs/sec:

 * SIMD
 * multiplying up the number of [in-order] pipelines.
 * adding more [SMP] cores.

note: increasing the number of slots in the barrel processor DOES NOT
increase the number of operations per second.

 the SIMD method has the known strong disadvantage that predicate
masking results in reduced utilisation.  every element that is masked
out is a wasted opportunity to do useful work.
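 to put a number on that: a minimal sketch, assuming a hypothetical
4-wide SIMD ALU and made-up predicate masks (neither the function name
nor the masks come from any real design).  it simply counts how many
lane-slots did useful work.

```python
def simd_utilisation(masks):
    """Fraction of SIMD lane-slots that did useful work.

    masks: one predicate mask per issued SIMD operation, each a list
    of booleans (True = element active, False = masked out).
    """
    total = sum(len(mask) for mask in masks)
    active = sum(1 for mask in masks for bit in mask if bit)
    return active / total if total else 0.0


# illustrative masks only: three 4-wide operations
example_masks = [
    [True, True, True, True],      # fully utilised
    [True, False, True, False],    # half the lanes wasted
    [False, False, False, True],   # three quarters wasted
]

# 7 of the 12 lane-slots are active
print("utilisation: %.2f" % simd_utilisation(example_masks))
```

 the ALU is occupied for all three operations regardless: the masked-out
lanes cannot be given to anything else.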

 multiplying up the number of pipelines... i don't even want to think
about how that works!  the scheduling would be hell.  i appreciate
that there are teams that may have successfully done it: i know enough
to know it would not be worth pursuing.

 adding more SMP cores quickly runs into difficulties, particularly at
the L1 cache coherence level.  2 SMP cores is easy: there is only one
data bus to consider between the two L1 caches.  4 SMP cores is more
challenging: there are 3 paths to consider.  the LEON3 Gaisler
Research work showed that 8 cores is the point at which contention
becomes a serious problem, reducing the performance by a *massive*
margin.  the figure of 25% reduced performance springs to mind (i read
the Gaisler paper a long time ago).

 to meet a target of 5 to 6 GFLOPs, if the clock rate is 800MHz, then
we need either:

 * 8 cores each capable of 1 FLOP per clock OR
 * 4 cores each capable of 2 FLOPs per clock OR
 * 2 cores each capable of 4 SIMD FLOPs per clock.

note, again, that the barrel is *NOT* in the *LEAST* bit relevant,
here, because, again, the number of slots has NO relation to the
overall total number of FLOPs per clock.

the 8 core variant would need to be 8-way SMP, meaning that its actual
performance would be around 75% of the theoretical maximum figure.
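 the arithmetic behind those three options, as a small sketch.  the
function names are made up for illustration, and the 0.75 SMP factor is
only the from-memory 25% figure mentioned above, not a measured number.

```python
CLOCK_HZ = 800e6  # 800MHz, as assumed in the text


def peak_gflops(cores, flops_per_clock, clock_hz=CLOCK_HZ):
    """Theoretical maximum: every ALU busy on every clock."""
    return cores * flops_per_clock * clock_hz / 1e9


def effective_gflops(cores, flops_per_clock, efficiency):
    """Peak figure derated by an assumed efficiency factor (0.0 to 1.0)."""
    return peak_gflops(cores, flops_per_clock) * efficiency


options = [
    ("8 cores x 1 FLOP/clock", 8, 1),
    ("4 cores x 2 FLOPs/clock", 4, 2),
    ("2 cores x 4 SIMD FLOPs/clock", 2, 4),
]

for name, cores, fpc in options:
    print("%-30s peak %.1f GFLOPs" % (name, peak_gflops(cores, fpc)))

# 8-way SMP with the (recalled, unverified) 25% contention penalty
print("8-way SMP at 75%%: %.1f GFLOPs" % effective_gflops(8, 1, 0.75))
```

 all three options have the same 6.4 GFLOPs peak.  note that the number
of barrel slots does not appear anywhere in the formula.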

the 4 core variant with a 2-issue (dual in-order pipeline) design
would be hell.  or, it could *only* be activated for vectorisation,
which means going back to a standard "Lanes" pipeline design, which is
effectively no different from SIMD.

the 2 core variant with a 4-wide SIMD ALU would have only a *maximum
theoretical* performance of 6.4 GFLOPs.  any elements predicated out
would be wasted resources.

hypothetically we could do a 4 core variant with a 2-wide SIMD ALU.

none of the options are particularly... satisfactory.

*none* of them are made "more performance-wise advantageous" by the
addition of hyperthreading (aka barrel slots).

the known disadvantages of SIMD become even more pronounced as the
size of the elements decreases.  we would still have to think through
how to jam 16-bit and 8-bit FP operations into SIMD, and i really *do
not* like SIMD.

overall i would therefore be mentally resisting a SIMD-based
architecture at every step of the way, with the feeling that the
opportunity had been lost to do something innovative that could
actually solve a long-standing serious problem.

> > as things stand, however, i am not seeing any significant advantages:
> > i am seeing instead primarily disadvantages, particularly compared to
> > having to abandon literally months of design and research effort into
> > a much more flexible, comprehensive, and expandable design, and
> > starting almost entirely from scratch.
> >
> One advantage is having simple control hw and not having to worry about
> spectre, since there is no speculative execution.

 that can be solved in an OoO design... by simply not adding any
speculative execution.

 the primary reason for going with an OoO architecture for SV is not
specifically to provide speculation, it's to take advantage of the
inherent parallel processing capabilities that go with having banks of
Function Units, dropping multiple vector elements per clock into the
matrices.

 note that there is a real and distinct difference between what an OoO
can be *used* for and what is *needed*.  speculative branch prediction
and execution is not *needed* in an OoO design: it's just that it's
*possible* in an OoO design (where it is flat-out impossible in an
in-order one).

 now, there *may* be circumstances where issuing of vector elements
results in over-allocation of internal resources, and contention for
bus bandwidth: if that occurs then we will have made a mistake in the
bus bandwidth specification, and it would need to be increased.

 this can easily be verified by adding statistics collection that
times how many cycles it takes from the moment of issue of any given
instruction, to the time the results are available.  if at any time
the difference is outside of expected behaviour, that is a timing
delay, and we know that we have an internal resource contention issue
to deal with.
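 a rough sketch of what that statistics collection could look like in a
simulator.  everything here is hypothetical: the class, the tags, the
"fmul" name and the expected-latency table are illustration only, not
part of any existing code.

```python
from collections import defaultdict


class LatencyStats:
    """Records issue-to-result latency per instruction and flags outliers."""

    def __init__(self, expected_max):
        # expected_max: op name -> largest latency (cycles) considered normal
        self.expected_max = expected_max
        self.in_flight = {}                 # tag -> (op, issue cycle)
        self.latencies = defaultdict(list)  # op -> observed latencies
        self.violations = []                # (tag, op, latency) outliers

    def issue(self, tag, op, cycle):
        self.in_flight[tag] = (op, cycle)

    def complete(self, tag, cycle):
        op, issued_at = self.in_flight.pop(tag)
        latency = cycle - issued_at
        self.latencies[op].append(latency)
        if latency > self.expected_max[op]:
            # longer than expected: likely internal resource contention
            self.violations.append((tag, op, latency))
        return latency


# usage: an assumed 4-cycle FP multiply, one of which gets delayed
stats = LatencyStats({"fmul": 4})
stats.issue(0, "fmul", cycle=10)
stats.issue(1, "fmul", cycle=11)
stats.complete(0, cycle=14)   # 4 cycles: as expected
stats.complete(1, cycle=18)   # 7 cycles: flagged
print(stats.violations)
```

 any entry in the violations list points at an instruction that waited
longer than the design says it should have.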

> I am perfectly fine if we decide that a barrel processor is not a good fit,
> I just think that the complexity is being over-estimated and we should at
> least build a RV64IMA prototype, maybe including the fast hart 0 design in
> it.

 i think we'll find that, hilariously, an in-order design is a
degenerate case of the OoO one.

 it *should* actually be possible to include an in-order design in the
OoO one, through setting the Function Unit Matrix size to the length
of the pipeline, and allocating *all* operations to route through the
one (single) ALU pipeline.

 or... ok, not just one: one FP ALU pipeline and one INT ALU pipeline.

 note that such a degenerate engine would be unable to perform more
vector computations than "standard" (scalar) instructions could,
because the (single pipeline) FUs would be the bottleneck.  i.e. the
total number of FLOPs and OPs would be the same, regardless of whether
they came from vector or from scalar instructions.

the reason for setting the FU Matrix size equal to the pipeline length
is because it is necessary to always have as many pending input
latches (and output latches waiting for their corresponding results)
as there are pipeline stages.
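 a toy model of that sizing argument.  the assumptions are mine: a
single-issue engine, one fully-pipelined ALU, and each Function Unit
slot (input latch plus output latch) held from issue until the result
emerges from the end of the pipeline.

```python
from collections import deque


def sustained_throughput(n_fu_slots, pipeline_depth, cycles=1000):
    """Completed operations per cycle for the toy model described above."""
    in_flight = deque()  # completion cycle of each op holding an FU slot
    completed = 0
    for cycle in range(cycles):
        # results emerging from the pipeline free their FU slot
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
            completed += 1
        # single-issue: at most one new op per cycle, if a slot is free
        if len(in_flight) < n_fu_slots:
            in_flight.append(cycle + pipeline_depth)
    return completed / cycles


depth = 4
for slots in (1, 2, 4, 8):
    print("%d FU slots, %d-stage pipeline: %.2f ops/cycle"
          % (slots, depth, sustained_throughput(slots, depth)))
```

 with fewer slots than stages the pipeline runs partly empty; with more
slots than stages nothing is gained, because the single pipeline is
the bottleneck.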

> Note that I am not suggesting we need a complete processor to be supported
> by the fast hart 0 mode, I think switching back to the barrel processor
> mode to execute non-RV64GC instructions is a good idea, since that will
> reduce the complexity of the fast pipeline by a lot.

 the key (sole) advantage of a barrel processor - as separate and
distinct from the timing-attack resistance of a single-issue, uniform
and zero-stall pipeline - is to provide real-time threading
guarantees.

 i am not convinced that there is any need - at all - for real-time
hyperthreading.  if there *is* a need, then we should discuss that as
a *separate* issue from whether a uniform zero-stall single-issue
design (with SIMD to provide parallelism) should be considered.

 i.e. if we need hyperthreading, that can just as easily be added on
top of the OoO core design as it can be on top of the uniform design.

 so my question is: does the GPU or VPU *really need* hyperthreading?
is there any GPU or VPU task that *requires* subdivision into
real-time threads with extremely low, predictable latency?

> > the problem was: to get the equivalent performance, you needed *FOUR*
> > times the data.  i.e. if you wanted to do 4-vectors, the concept
> > forces you to do *four* 4-vectors at once.
> >
> for the barrel processor, that problem doesn't exist since each hart has an
> independent register bank and you can run operations on different banks
> simultaneously, allowing us to have each hart's register file have only a
> single 128-bit (4x32-bit) r/w port for a 4x32-bit alu, rather than the
> 512-bit r/w port that you were probably thinking of. It would operate
> similarly to how Hwacha reads and writes registers (Figure 4 on page 12 in
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-263.pdf) except
> that it goes through harts rather than vector elements.

 that design involves an NxN crossbar.  N-wide, N-ported.  the
resource utilisation is *enormous*.  if you recall we went to a heck
of a lot of trouble to avoid NxN crossbars, by including operand
forwarding, "de-naming" (and restoration), placing the lane-routing
*after* the operand computation, and a few other things besides.

 it may not be clear that adding barrel slots simply cannot increase
the total number of FLOPs: *all* it does is interleave N execution
threads and *correspondingly stretch the completion time of each by
N as well*.  increasing the total number of FLOPs can only be done by
increasing the actual number of ALUs (and making sure that they're
active), and adding extra barrel slots does *not* increase the number
of ALUs.
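 a minimal simulation of that point, assuming one ALU that accepts one
operation per cycle and strict round-robin slot selection (a simplified
model for illustration, not a description of any proposed hardware):

```python
def run_barrel(n_slots, ops_per_thread):
    """Round-robin barrel: one ALU, one op per cycle, n_slots threads.

    Returns (ops per cycle overall, finish cycle of each thread).
    """
    if n_slots < 1 or ops_per_thread < 1:
        raise ValueError("need at least one slot and one op per thread")
    remaining = [ops_per_thread] * n_slots
    finish = [None] * n_slots
    cycle = 0
    while any(r > 0 for r in remaining):
        slot = cycle % n_slots          # the barrel rotates every cycle
        if remaining[slot] > 0:
            remaining[slot] -= 1        # the single ALU does one op
            if remaining[slot] == 0:
                finish[slot] = cycle + 1
        cycle += 1
    return (n_slots * ops_per_thread) / cycle, finish


for n in (1, 2, 4, 8):
    throughput, finish = run_barrel(n, ops_per_thread=100)
    print("%d slots: %.2f ops/cycle, last thread done at cycle %d"
          % (n, throughput, max(finish)))
```

 the ops/cycle figure stays at 1.00 for every slot count, while the time
for any one thread to finish its 100 operations grows in proportion to
the number of slots.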

 what it does instead is, if 1R1W SRAM is used for the register file,
it increases the latency of the pipeline and multiplies (extends) the
instruction completion time even further by the number of barrel
slots.

 to compensate for that, the amount of parallelism required in the
incoming data stream will *also* need to be multiplied by the number
of barrel slots.

 that then results in increased latency for *single* process
performance... bear in mind that this is a single-issue design, so now
a branch point results in extended execution stalling.

 i really am not perceiving any strong advantages, here.  unless i've
missed something, the "simplicity" instead results in disadvantages at
pretty much every turn.

l.