[libre-riscv-dev] building a simple barrel processor

Sat Mar 30 03:40:51 GMT 2019

On Sat, Mar 30, 2019 at 12:29 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> > he'd tell me that he'll go use a 4-SMP rocket-chip, which has been
> > silicon-proven to do 1.5ghz a number of times, and he'll license
> > Vivante's GC800 3D engine for USD $250,000.
> >
> Yeah, makes sense. I'd probably make the same decision.

 nngggh :)

> > can you clarify?
> >
> Yeah, a barrel processor by definition follows strict round robin, so each
> thread effectively runs at 1/Nth of the clock frequency of the whole
> processor, even if all but 1 of the threads are sleeping.
>
> My proposal for increasing performance when single-threaded programs need
> higher speed is to create a hybrid design between a barrel processor and a
> simple RISC processor (so, not only a barrel processor) by overlaying on
> the barrel processor's pipeline a simple RISC pipeline,

 ok, so it solves the single-execution performance issue, however it's
no longer a simple straightforward barrel processor.
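
to make the quoted round-robin point concrete, here is a minimal sketch (not taken from the rv32 code; the names barrel_schedule and awake are made up for illustration) of strict round-robin slot allocation.  a slot that belongs to a sleeping thread is simply wasted:

```python
# sketch only: barrel_schedule and awake are hypothetical names

def barrel_schedule(n_threads, n_cycles, awake):
    """Strict round robin: cycle c always belongs to thread c % n_threads.

    A slot owned by a sleeping thread is wasted; it is never handed
    to another thread.  Returns instructions issued per thread.
    """
    issued = [0] * n_threads
    for cycle in range(n_cycles):
        thread = cycle % n_threads
        if awake[thread]:
            issued[thread] += 1
    return issued


# 4 hardware threads, only thread 0 has any work to do
issued = barrel_schedule(4, 1000, [True, False, False, False])
print(issued)  # [250, 0, 0, 0]
```

with only one thread awake, that thread still gets just 250 of the 1000 cycles, i.e. 1/Nth of the clock.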

 a simple barrel processor, like the rv32 one, should be possible to
do in about 2-3 weeks flat.  especially if the migen code here is
debugged and used:

 https://git.libre-riscv.org/?p=rv32.git;a=tree;

my key concern is that the simple barrel processor concept has so much
scope creep beyond what a barrel processor is best suited to
(guaranteed-time I/O processing suited to doing peripheral
software-emulation) that it takes over and runs into a multi-man-month
project.

essentially at its heart, the barrel processor is a single-core
design.  there's no performance increase by having the time-slicing,
so it cannot be argued that the time-slicing is *essential* for our
needs (as a GPU or a VPU).

all that the barrel *actually* provides - as long as the rest of the
infrastructure also has strict time-critical guarantees - is strict
time-critical guarantees.

and, given that we will need L1 caches, L2 caches and TLB for virtual
memory for CPU, VPU and GPU tasks [ok, for *easy* software
implementations of those tasks, i.e. just being able to take ffmpeg
and hit "compile"], the strict time-critical aspects of the actual
barrel scheduling are out the window.

(i.e. if we wanted to use the proposed design for the purposes for
which a barrel processor is designed - as an I/O processor - we would
need to *bypass* the TLB *and* the L2 cache *and* the L1 cache
entirely, in order to get the required strict time guarantees, i.e.
would probably need to run applications strictly out of a small SRAM).

therefore, sad to say, there *are* no benefits to the barrel (as a CPU
/ GPU / VPU processor): it's merely an augmentation / "nice-to-have",
which, ultimately, makes the proposal a straightforward "single-issue
SIMD" one.

and we already discussed and evaluated single-issue SIMD designs, back
at the beginning of the evaluation process, and concluded that they
would not be a good way forward.

any other augmentations - partial register file access during "barrel"
mode - are additional complications that, again, go way outside of the
"simple" remit, and they have me concerned, particularly as we know that
there's no actual performance increase that may *ever* be achieved by
the addition of the barrel.

additional complexity for no significant gain does not seem to be a
good trade-off!

i know you keep coming back to a single-issue SIMD design (leaving the
barrel augmentation aside), and also have proposed use of 1R1W SRAMs
for the register file several times: SIMD *really* is not going to cut
it, and the use of 1R1W SRAMs has ramifications that took me a lot of
time to understand.

on the 1R1W SRAMs: mitch alsup recommended their use to me, by
flattening out the operand read and write phases.  he wrote out a
really good ASCII-art representation that i can't find at the moment.

it took me about three weeks to realise that the consequences of
extending FMAC from the usual 5 out to an 8-stage pipeline meant that
it was necessary to have FOUR separate banks of 1R1W SRAMs if you want
to do 4-quads in a single cycle (as e.g. a SIMD operation), which
means that you now have to have data parallelism granularity of
SIXTEEN floating-point numbers.

that is: once the 8-stages have been laid out and there are four
separate banks of 1R1W SRAMs [one per SIMD element, because if you do
not have 4 parallel separate banks of 1R1W, then clearly you CANNOT do
4 operations, you can only do ONE], if you do not have batches of
sixteen floating-point numbers to process, the performance *will* be
adversely affected.

it should also be clear from the above that those 4 banks are forced
to be separate lanes.  so, that would mean that if the vector
processing to be carried out is in r0, r4, r8 and r12, you are screwed
because modulo 4 on all of those is 0,0,0,0 which means that they're
all from the exact same 1R1W lane bank.

however if it was in r0, r5, r10 and r7, it would be fine, because the
modulo 4 of those register numbers is 0,1,2 and 3 (i.e. the lane banks
are all independent).
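
the two register sets above can be checked with a few lines of python.  this is only a sketch: it assumes the bank is the register number modulo 4 (as in the example above) and one read port per bank, and the names bank_of and read_cycles are made up:

```python
from collections import Counter

N_BANKS = 4  # one 1R1W bank per SIMD lane, as described above


def bank_of(reg):
    # assumed mapping: bank is the register number modulo N_BANKS
    return reg % N_BANKS


def read_cycles(regs):
    """Cycles needed to read all regs when each bank has one read port."""
    per_bank = Counter(bank_of(r) for r in regs)
    return max(per_bank.values())


for regs in ((0, 4, 8, 12), (0, 5, 10, 7)):
    print([bank_of(r) for r in regs], read_cycles(regs))
# [0, 0, 0, 0] 4
# [0, 1, 2, 3] 1
```

the first set serialises onto a single bank (four cycles for four reads); the second spreads across all four banks and reads in one.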

and there's absolutely no way to compensate for that, except by...
adding more ports to the SRAM.  it's no advantage to try to add
multi-ported lane-crossing multiplexers to gain access to another
bank, because there *is no extra port that the multiplexer could read
or write from*.

so, ultimately, the 1R1W SRAMs are great... *if* the workload is
*guaranteed* to be massively (and exclusively) parallel, hence why you
see 1R1W SRAMs in Vector Processors and GPUs.

... we chose to design a hybrid core, so if we stick to that, it
ultimately means that 1R1W SRAMs for the register file are not a good
design choice.

regarding SIMD: i don't know if it was clear or not: the 6600-derived
design i came up with has *internal* SIMD with
architecturally-transparent allocation from a dynamic-width Vector
front-end.  i mentioned it several weeks back, that the "tail" of the
SIMD operation (the bits that would not fit into a 4-wide or 8-wide
SIMD ALU because there's only 3 or 7 or 2 of them) may be allocated to
the 8/16-bit ALUs *transparently*.
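
a very loose illustration of the "tail" idea, counting elements only.  the function name split_elements and the list of ALU widths are hypothetical, and the actual allocation in the 6600-derived design is not specified here, so treat this purely as a sketch of "cover the vector with progressively narrower units":

```python
def split_elements(n_elements, alu_widths=(8, 4, 2, 1)):
    """Greedily cover n_elements with progressively narrower ALUs.

    alu_widths is a hypothetical list of available SIMD widths (in
    elements), widest first; it must end in 1 so any tail can be covered.
    """
    plan = []
    remaining = n_elements
    for width in alu_widths:
        while remaining >= width:
            plan.append(width)
            remaining -= width
    return plan


print(split_elements(7))                        # [4, 2, 1]
print(split_elements(11))                       # [8, 2, 1]
print(split_elements(3, alu_widths=(4, 2, 1)))  # [2, 1]
```

the program only ever asks for "7 elements"; which units end up doing the work is decided by the hardware, not by the compiler writer.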

this transparent-allocation trick is *NOT POSSIBLE* to perform on a
single-issue *HARD*/explicit SIMD processor.

actually, it's not even possible to do on a standard OoO design: it's
only the cascading register-blocking concept that allowed it to even
be considered.

regarding multi-issue instruction allocation: it's a misconception
that it's hard.  or problematic.  correction: *on a 6600-style
scoreboard* with mitch alsup's *augmentations*, it's not that hard
(however, on a strict Tomasulo design, it *is* hard).

basically, multi-issue on a scoreboard system, which already has
write-hazard dependency wires, is nothing more difficult than making
sure that the later-issued instructions get an additional write-hazard
(write-block) dependency on the PRIOR-issued instructions.

it's really that simple.
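
as a sketch (plain python, not migen, and not the actual scoreboard logic; issue_batch and commit_order are made-up names), the extra dependency is one bit per later-issued instruction, pointing at the instruction issued just before it:

```python
def issue_batch(n_issue):
    """Build an n x n write-hazard matrix for n instructions issued together.

    blocked_by[later][earlier] == True means `later` may not commit its
    write until `earlier` has.  Each instruction gets one extra bit set
    against the instruction issued just before it.
    """
    blocked_by = [[False] * n_issue for _ in range(n_issue)]
    for later in range(1, n_issue):
        blocked_by[later][later - 1] = True
    return blocked_by


def commit_order(blocked_by):
    """Repeatedly commit any instruction with no outstanding hazard."""
    n = len(blocked_by)
    done, order = set(), []
    while len(done) < n:
        for i in range(n):
            if i not in done and not any(blocked_by[i]):
                order.append(i)
                done.add(i)
                for row in blocked_by:  # committing i clears its column
                    row[i] = False
                break
    return order


print(commit_order(issue_batch(4)))  # [0, 1, 2, 3]
```

the commit order comes out as the issue order purely from the hazard bits: no FIFO is involved.  this assumes the bits form a simple chain with no cycles.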

it's basically overloading the single-bit NxN Matrix of write hazards
to create a bit-based "linked list" that preserves the instruction
execution order, where normally (and certainly in the Tomasulo design)
one imagines that it is absolutely strictly necessary to preserve that
order by using a FIFO [or a round-robin SRAM with an incremental
head/tail memory counter, this is the "usual" way a Tomasulo ROB is
done: you don't move the data, you move the indices pointing *to* the
head/tail of the data].
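
for comparison, the "usual" index-moving arrangement mentioned in the brackets above looks roughly like this.  it is a generic circular buffer sketch, not any particular Tomasulo implementation, and the class name RingROB is made up:

```python
class RingROB:
    """Fixed-size reorder buffer: entries stay put, only indices move."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0   # oldest entry, next to retire
        self.tail = 0   # next free slot
        self.count = 0

    def allocate(self, insn):
        if self.count == len(self.slots):
            raise RuntimeError("ROB full: issue must stall")
        self.slots[self.tail] = insn
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1

    def retire(self):
        if self.count == 0:
            raise RuntimeError("ROB empty")
        insn = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return insn


rob = RingROB(4)
for insn in ("add", "mul", "ld"):
    rob.allocate(insn)
print(rob.retire())            # add
rob.allocate("st")
rob.allocate("sub")            # tail has wrapped round to slot 0
print(rob.slots, rob.head, rob.tail)
```

the scoreboard approach replaces the head/tail counters with the chain of write-hazard bits.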

now, in a "standard" multi-issue scoreboard design, normally there
would be a combinatorial block that allowed up to N-issue
"instruction-dependent" write-hazards to be DEASSERTed in any given
cycle (assuming that all other write hazards were clear, and the
instruction(s) had moved to the COMMIT phase), i.e. we can chop the
last "N" entries off the end of that "linked list" rather than just
the one.

this *would* be slightly complex to implement... except by virtue of
the way that the Register File is subdivided (remember we discussed
that HI32/LO32 Odd/Even 4-way 32-bit subdivision scheme?), it is NOT
necessary to do such complex combinatorial logic... because the
Function Units are subdivided into 4 separate independent banks.

this subdivision is one where the instruction has been *pre-routed* to
the ALU / FU bank that handles access to that particular HI32/LO32
Odd/Even Register Bank.

therefore, *by design* it is possible to do up to 4-issue @ 32-bit
*WITHOUT* any additional complexity at the instruction commit phase.

it does however have some implications / ramifications, namely that a
64-bit operation is at maximum dual-issue, and that furthermore, if
there are two instructions issued one after the other, both with odd
*OR* both with even destination registers, those two instructions MUST
be routed to the same ALU / FU Bank, and consequently they will be
*SINGLE* issue at the commit phase.
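
a rough sketch of that odd/even restriction for 64-bit operations.  the model is an assumption made for illustration (in-order grouping, bank chosen by destination register parity, at most one odd and one even per cycle), and commit_groups is a made-up name:

```python
def commit_groups(dest_regs, max_issue=2):
    """Group consecutive 64-bit instructions that can commit together.

    Assumed model: a 64-bit destination occupies both the HI32 and LO32
    halves of either the odd or the even bank, so at most two
    instructions (one odd, one even) can commit in the same cycle.
    """
    groups, current, used = [], [], set()
    for reg in dest_regs:
        parity = reg % 2
        if parity in used or len(current) == max_issue:
            groups.append(current)
            current, used = [], set()
        current.append(reg)
        used.add(parity)
    if current:
        groups.append(current)
    return groups


print(commit_groups([1, 2, 3, 4]))  # [[1, 2], [3, 4]]
print(commit_groups([2, 4, 6, 8]))  # [[2], [4], [6], [8]]
```

alternating odd/even destinations commit two per cycle; a run of all-even destinations falls back to one per cycle.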

(likewise for 32-bit vector elements, but the description in english
words is more involved so i'll leave it out).

this is where there is an issue with the design as far as timing
attacks are concerned.  i believe that these can be mitigated by
having a "single instruction issue" mode, where instruction issue is
deliberately curtailed - under the control of the kernel - and that
this mode can be requested by processes that need high security.

also, note: yes, this is very similar to the problem of 1R1W SRAMs
described above, except that it's not the same, because (as we discussed
several weeks ago) the register banks here are 2R1W (or 3R1W more likely).  where there
are *three* resource contention issues compounding each other with
single-issue 4-bank (4-lane) 1R1W SRAMs, there is only *one* resource
contention issue with a multi-issue 4-bank 2R1W (or 3R1W), reducing
the design pressure from "severe / critical" to "moderate /
tolerable".

why only 1 resource contention issue instead of 3?  because an OoO
design may have some instructions that work on other register banks,
and those can be allocated to Function Units even though there may be
a quite large batch that hit one bank really hard.... and that is NOT
POSSIBLE TO DO in a single-issue design, which, again, makes it the
*compiler writer's* problem to solve.

my point is, then, that i have confidence that there exists a design
which can do the job, where the only aspects that i am not certain
about are ones that we need to solve regardless of what architecture
is chosen: L1/L2/TLB, AMO, FENCE, LR/SC and so on.

the OoO aspects and the multi-issue aspects, i *am* confident on, and
i am confident that they will provide a really good flexible basis for
a really stonking-good design, whereas single-issue SIMD is *well-known*
right across the industry to be problematic, and expects the compiler
writer to "sort it out".

not least, i will really really enjoy the innovative aspects of the
design that i came up with: the "nameless history" Q-Table idea, the
cascading register sub-divisioning and transparent use of
progressively smaller SIMD units.  these are features that have never
been seen or tried in *any* modern processor design: not by Intel, not
by ARM, not by AMD, not by *anyone*!

whereas, straight hard/explicit SIMD has been done ad nauseam
*despite* driving compiler writers absolutely nuts for decades.  to
perpetuate that paradigm... i just... i can't bring myself to support
or endorse something that is known to be a failure, not when there's a
potential *really good* - and innovative - alternative.

so does that help explain?  ultimately: yes to the barrel processor,
as a *short* learning exercise, and *only* as its intended original
purpose: as an I/O co-processor.  no SIMD, no FPU: basically the rv32
migen conversion debugged, made operational, and then barrel-ified.
that *can* be justified, because if we don't write a migen one, we'll
have to do something like pick the kestrel 53000 or picorv32 (or the
original verilog rv32) as a management / boot co-processor.

bottom line: a boot / management / IO processor we *need*.  a
barrel-ified single-issue SIMD design we don't: it basically does not
fit the requirements, and the sacrifices needed to *make* a barrel
processor useful for general-purpose workloads make it not *be* a
barrel processor, introducing rather alarming complexity and
non-uniformity into the design to compensate... where the end-result
has no performance or other benefits that would make the additional
complexity worthwhile, that i can see.

l.