[Libre-soc-dev] SVP64 bclrl

Thu Apr 7 11:09:29 BST 2022

On Thu, Apr 7, 2022 at 5:15 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> In other words, so we can have some semblance of efficiency,

(being able to jump over fully-masked-out instructions that would
 be NOPs at runtime, yes)

> we need a SIMD branch unit...

or something that has the same end-result as one, or could
conceivably be implemented as such by a Hardware Architect
making an arbitrary implementation-dependent decision to do so:
yes

> that basically does exactly what I described,

in effect

> just that conceptually it's running VL separate branches rather than 1
> vector-reduce-branch.

basically, yes.  any Micro-Architect may *choose* to implement it
as one single instruction: that's entirely down to them.

> Also, all the modern CPU block diagrams I've seen generally have 1 or
> 2 branch units, not 8.

that'll be because it's easier to draw and also very few people in the
world have seen the inside of an Intel or AMD x86 processor, or the Apple
M1 or the IBM POWER10.

i saw some numbers somewhere for the IBM POWER10: over 1,000
in-flight instructions and 8-way multi-issue.  there's no way that's going
to have only 1-2 branch units.  likewise i've heard from people who
know the Intel and AMD x86 designs: they also have 1,000+ in-flight
instructions, and much larger numbers of branch units.

On Thu, Apr 7, 2022 at 5:19 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> Another reason to treat branches internally like
> vector-reduce-branches

which isn't happening - not at the ISA level.  anything that causes
Hardware Architects to have to abandon their High-Performance
Multi-Issue OoO micro-architecture is completely out of the
question.

at a micro-architectural level, which is up to the Hardware
Micro-Architect to decide: maybe.

however as i keep reminding you: that decision is not something that
should propagate up to "us as SVP64 ISA designers".  be *aware*
of and think through: yes. make changes to SVP64 so that it badly
damages the SVP64 paradigm and causes Hardware Architects
who we can reasonably assume will be on the OPF ISA WG
voting Board to reject SVP64 outright as too disruptive and
unworkable to implement: absolutely not.

> is that otherwise the branch-predictor in the
> fetch pipeline has to predict exactly *how many* branches in a SVP64
> branch will be taken,

indeed.

> otherwise we'll stop the speculative shadow at
> the wrong spot and basically every SVP64 branch will always be a
> branch misprediction -- terrible.

this is easily thought through and not a problem.  or, more accurately:
"exactly the same problem as Scalar Branches".

Scalar Branch Prediction is based on the CR bit.  if you can predict
*or have a cached-copy of the CR* then there's no problem. if you
have a cached copy actually in the Issue Engine then there *is*
no prediction needed: you effectively turn the branch-conditional
directly into an unconditional branch.  likewise only if the top bit
of CTR is set, you know what's happening there, too.
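
to illustrate the point about a cached CR copy, here is a minimal sketch
in python.  it is only a toy model: the function name and its arguments
are hypothetical, CTR is ignored entirely, and it is not a description of
any real Issue Engine.

```python
from typing import Optional, Tuple


def resolve_or_predict(cached_cr_bit: Optional[bool],
                       branch_if_set: bool,
                       predictor_guess: bool) -> Tuple[bool, bool]:
    """Toy model of a branch-conditional at issue time (hypothetical).

    cached_cr_bit   -- the CR bit if a valid cached copy exists, else None
    branch_if_set   -- True if the branch is taken when the CR bit is 1
    predictor_guess -- what the branch predictor would guess

    Returns (taken, speculative).
    """
    if cached_cr_bit is not None:
        # valid cached copy: the outcome is known, so the conditional
        # branch is effectively an unconditional one (or a NOP)
        taken = (cached_cr_bit == branch_if_set)
        return taken, False
    # no valid cached copy: fall back on prediction, result is speculative
    return predictor_guess, True
```

with a valid cached bit the second element of the result is always False,
i.e. nothing speculative is in flight for that branch.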

if the cache of CR is invalid or you decide not to have such a cache,
then tough, you're into branch-prediction territory, and hence why
having 4 to 8 Branch Units (to cope with the 1,000+ in-flight instructions)
is not uncommon.

here's the thing: SVP64-Branch is in *exactly* the same position.

it still is based on CR

it still is based on CTR

the only addition is the predicate masks, which now also need to
be read (or cached - oh look, they're also CR Fields, or else
one of r3/r10/r31)

bottom line is that there would be (apart from the addition of reading
or caching the predicate masks) exactly the same issue faced
by the internal micro-architecture if there were 8+ *Scalar*
branch-conditional instructions in a row:

     bc 0
     bc 1
     bc 2
     ...
     bc 7

SVP64 simply makes the expression of such "more compact":

    sv.bc {vector_of_CR_Fields_to_test}
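
the equivalence can be sketched in python as a loop of scalar branches.
this is a deliberately simplified model under stated assumptions: the
function name is hypothetical, every element shares one branch target,
a masked-out element is treated as a NOP, and CTR, the BO field and any
SVP64-specific branch modes are left out.

```python
from typing import List, Optional


def sv_bc_as_scalar_branches(cr_bits: List[bool],
                             target: int,
                             predicate_mask: Optional[List[bool]] = None
                             ) -> Optional[int]:
    """Model a vector of branch-conditionals as VL scalar ones in a row.

    cr_bits        -- one tested CR bit per element (length is VL)
    target         -- branch target shared by all elements (assumption)
    predicate_mask -- per-element enable bits; None means all enabled

    Returns the target if any enabled element branches, else None
    (meaning: fall through to the next instruction).
    """
    if predicate_mask is None:
        predicate_mask = [True] * len(cr_bits)
    for cr_bit, enabled in zip(cr_bits, predicate_mask):
        if not enabled:
            continue  # masked-out element: behaves as a NOP
        if cr_bit:
            return target  # first taken branch leaves the sequence
    return None
```

for example, sv_bc_as_scalar_branches([False, False, True, False], 0x100)
gives 0x100, and with a mask that disables element 2 it falls through.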

l.