[Libre-soc-dev] svp64 review

Sun Jul 24 22:59:32 BST 2022

On Sun, Jul 24, 2022 at 10:11 PM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
(actually, jacob bachmeyer, a few days ago) wrote:

> > https://ftp.libre-soc.org/simple_v_spec.pdf
>
> A few comments from a quick partial review:

always appreciated

>     In chapter 3, "vertical" vector mode as described is ridiculous --
> that is exactly equivalent to a software loop and therefore a complete
> waste to support in hardware.

right. ok. so some background here is Mitch Alsup's VVM Extension
for his My 66000 ISA. Mitch has spent something like the past... 3? years
on comp.arch explaining to anyone prepared to listen about the benefits
of Vertical-First loop constructs.

in Mitch's Vertical-First LOOP system, it is assumed that high-performance
systems will utilise GBOoO to store the entirety of the LOOP instructions
in in-flight registers.

thus it becomes possible, very easily, to go, "huh, we're doing yet another
loop, let's merge all the identically-issued *scalar* operations from the
previous loop in a zip-up with the new ones".  repeat, repeat, repeat, and
you can blat an entire sequence of *scalar* instructions into *vector*
(actually, SIMD) ALUs.

there are some limitations:

1) the input and output "vectors" can only be LDs and STs respectively
2) you can only do one inner loop
3) if there are not enough in-flight Reservation Stations you have to
    fall back to Scalar-only looping, which is perfectly reasonable

the LOOP preamble instruction helps identify all loop-invariant
registers plus identifies the counter register.

it's extremely neat.
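
a rough sketch of that "zip-up" in C, purely as a software model and
not Mitch's actual hardware mechanism: scalar_loop runs LD, MUL, ST one
iteration at a time, whilst merged_loop groups up to WIDTH consecutive
iterations and performs each operation across the whole group, the way
a SIMD ALU would. the function names, WIDTH and the loop body are all
made up for illustration.

```c
#include <stddef.h>

#define WIDTH 4   /* hypothetical number of SIMD lanes */

/* plain scalar loop: one LD, one MUL, one ST per iteration */
void scalar_loop(const int *src, int *dst, size_t n, int k)
{
    for (size_t i = 0; i < n; i++) {
        int t = src[i];   /* LD */
        t = t * k;        /* MUL by the loop-invariant k */
        dst[i] = t;       /* ST */
    }
}

/* same loop with up to WIDTH iterations merged: each scalar operation
 * is performed for the whole group before moving to the next one */
void merged_loop(const int *src, int *dst, size_t n, int k)
{
    int t[WIDTH];
    for (size_t base = 0; base < n; base += WIDTH) {
        size_t left = n - base;
        size_t lanes = (left < (size_t)WIDTH) ? left : (size_t)WIDTH;
        for (size_t l = 0; l < lanes; l++) t[l] = src[base + l]; /* LDs  */
        for (size_t l = 0; l < lanes; l++) t[l] = t[l] * k;      /* MULs */
        for (size_t l = 0; l < lanes; l++) dst[base + l] = t[l]; /* STs  */
    }
}

/* returns 1 if both loops give identical output (n limited to 64) */
int loops_agree(size_t n)
{
    int src[64], a[64], b[64];
    if (n > 64)
        return 0;
    for (size_t i = 0; i < n; i++)
        src[i] = (int)i - 7;
    scalar_loop(src, a, n, 3);
    merged_loop(src, b, n, 3);
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

the merge is only valid here because the iterations are independent
of each other and the "vectors" are memory (limitation 1 above).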

> Any optimizations that can be applied
> there could also be applied to ordinary "for" loops and "svstep.bc" is
> nothing more than a dedicated LOOP opcode (similar to the same
> instruction from the original 8086).

yes, isn't it great? a high-performance implementation can apply
the same trick as above, but SVP64 is not limited to Memory-only
Vectors: it can use registers as well.

there's an hphint which, when set, tells the hardware that up to that
many Scalar Registers (actually, Vector elements) are "safe" to
read/write in parallel.  example:

for (i = 0; i < 100; i++)
    a[i] = a[i+5]*a[i];

this can only be done safely in batches of up to 5, because a[i+5] is
overwritten 5 iterations after it is read. hphint would be set to 5 to
make that clear to the underlying hardware which performs the
in-flight-merging trick described by Mitch Alsup.

>     In chapter 4, we finally start to get to the "meat" of the
> proposal.  You have a serious misunderstanding of the x86 "REP" prefix.

probably.  honestly it's a throw-away "analogous concept" comment.
if people understand "the thing afterwards can be repeated" then
that's enough to get them started. beyond that initial statement,
getting people to understand "repeating" by seeing it from an
existing ISA, there is absolutely no further use for x86 or REP
of any kind.

> The misunderstanding is that there is no "Sub-PC" in x86

there's no connection to x86 at this point.  the "REP" analogy
is already done and finished.

*no* ISA except SVP64 has Sub-Program-Counters.

> -- repeatable
> operations update the relevant general registers as they proceed, and
> the saved PC value on an interrupt or exception will point to the
> repeatable operation.

the Sub-PC "state" (SVSTATE) is likewise saveable and restorable
(relevant later)

> I have a change to Simple-V that would allow you to throw most of the
> current limits out of the proverbial window.

ah... by this point in time (over 2 years of development) we're just about
to put SVP64 into the OpenPOWER Foundation "External RFC Process",
and have a Simulator, thousands of unit tests, a working HDL
Reference Implementation, and 5 months of work on binutils.

major changes at this point would be... difficult, shall we say.
that said i'm happy to go through this because we have to demonstrate
completeness.

> Simple-V does *not* "march
> across the register file", instead Simple-V *replaces* selected ISA
> scalar registers with sliding windows onto the vector register memory
> during a vector loop.

so, a completely separate regfile / register-memory area for vectors
from scalars?  is that right?

if you're proposing a separate vector regfile / register-memory-area
the downside of that is that you then have to add inter-regfile transfer
instructions between the scalar regfile and the [new] vector
regfile/reg-mem-area.

with SVP64 being a bare-minimum RISC-paradigm extension
of the Scalar Power ISA, one of its key strengths is that it only
requires 5 (five) additional "management" instructions to turn
that Scalar Power ISA into a Scalable Vector ISA.
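
the "one regfile, no transfer instructions" point can be pictured with
a toy C model. it is not the spec's pseudocode: predication,
element-width overrides, REMAP and the per-operand scalar/vector
marking are all left out, and toy_cpu, sv_add and demo_sv_add are
names invented for this sketch. a prefixed scalar add simply becomes
VL scalar adds over consecutive registers of the same (128-entry) GPR
file.

```c
#include <stdint.h>

#define NUM_GPRS 128   /* SVP64 extends the GPR file to 128 entries */

typedef struct {
    uint64_t gpr[NUM_GPRS];   /* one regfile for scalars and vectors */
} toy_cpu;

/* the unmodified scalar operation: RT <- RA + RB */
void scalar_add(toy_cpu *cpu, int rt, int ra, int rb)
{
    cpu->gpr[rt] = cpu->gpr[ra] + cpu->gpr[rb];
}

/* prefixed form with all three operands treated as vectors:
 * the same scalar operation, repeated for each element */
void sv_add(toy_cpu *cpu, int rt, int ra, int rb, int vl)
{
    for (int i = 0; i < vl; i++)
        scalar_add(cpu, rt + i, ra + i, rb + i);
}

/* demo: r16..r19 = r8..r11 + r12..r15, returns r19 */
uint64_t demo_sv_add(void)
{
    toy_cpu cpu = {{0}};
    for (int i = 0; i < 4; i++) {
        cpu.gpr[8 + i]  = 10 + (uint64_t)i;
        cpu.gpr[12 + i] = 100 * (uint64_t)i;
    }
    sv_add(&cpu, 16, 8, 12, 4);
    return cpu.gpr[19];   /* 13 + 300 */
}
```

a scalar instruction can read r19 afterwards directly, with no move
from a separate vector register file.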

>  (Your current pseudocode still describes marching
> across the register file.)  This is very similar to the "vector tail"
> model I was proposing as "RVP lanes" a few years ago.

that was as far back as 2018, wasn't it? :)  i do remember you
using the phrase "RVP lanes".

> The proposed "sub-PC" represents a problem for exception handling,

ah!  no, amazingly, it doesn't!  i've been extremely strict about this,
and designed both the Simulator and the HDL to be precise-exception
capable.

anything - anything at all - that prevents or prohibits exceptions
in the middle of processing a Vector element batch is immediately
rejected.  the "State" information is kept to:

* SVSTATE (contains the element index sub-step counters)
* SVLR (the SVSTATE equivalent of LR)
* SVSHAPE0-3 (the "REMAP" SPRs for hardware index reforming)

if the REMAP areas are zero you do not need to save/restore
the four SVSHAPE SPRs.

SVSTATE is saved/restored into SVSRR1, just as PC is saved/restored
in SRR0 and MSR is saved/restored in SRR1.

in other words, i took the concept of "Sub-PC" very seriously and
treated it literally as part of the [absolutely] critical Context, aka
a peer of PC and MSR.
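
as a hedged illustration of why that gives precise exceptions, here is
a toy C model. the real SVSTATE has several fields and a defined bit
layout, none of which is reproduced: a single srcstep integer stands in
for it, svsrr1 stands in for SVSRR1, and clearing srcstep on trap entry
is an assumption of this model. all function names are invented.

```c
#include <stdint.h>

typedef struct {
    uint64_t gpr[128];
    int vl;        /* vector length */
    int srcstep;   /* element index: stands in for SVSTATE */
    int svsrr1;    /* stands in for SVSRR1 */
} ctx_cpu;

/* in-place vector add, RT[i] <- RT[i] + RB[i].  runs until finished,
 * or until `budget` elements have executed (a pretend interrupt).
 * returns 1 when the whole vector is complete, 0 if interrupted. */
int sv_add_run(ctx_cpu *c, int rt, int rb, int budget)
{
    while (c->srcstep < c->vl) {
        if (budget-- == 0)
            return 0;              /* interrupted mid-vector */
        int i = c->srcstep;
        c->gpr[rt + i] = c->gpr[rt + i] + c->gpr[rb + i];
        c->srcstep++;              /* element done: advance Sub-PC */
    }
    c->srcstep = 0;                /* instruction finished */
    return 1;
}

/* save on trap entry, restore on return, like PC in SRR0 */
void trap_entry(ctx_cpu *c)  { c->svsrr1 = c->srcstep; c->srcstep = 0; }
void trap_return(ctx_cpu *c) { c->srcstep = c->svsrr1; }

/* returns 1 if every element was executed exactly once */
int demo_interrupt(void)
{
    ctx_cpu c = {{0}, 0, 0, 0};
    c.vl = 8;
    for (int i = 0; i < 8; i++) {
        c.gpr[8 + i]  = (uint64_t)i;
        c.gpr[16 + i] = 10 * (uint64_t)i;
    }
    if (sv_add_run(&c, 8, 16, 3))  /* stops after 3 elements */
        return 0;
    trap_entry(&c);                /* handler is free to use vectors */
    trap_return(&c);
    if (!sv_add_run(&c, 8, 16, 100))
        return 0;
    for (int i = 0; i < 8; i++)
        if (c.gpr[8 + i] != 11 * (uint64_t)i)
            return 0;              /* skipped or re-executed element */
    return 1;
}
```

the add is done in place so that re-executing an already-completed
element after the trap would give a visibly wrong answer.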

> but
> the 8086 "REP" prefix provides precedent for an easy solution:  use a
> general-purpose (scalar) register as the loop variable.  Actually, using
> a (programmer-chosen) scalar register as the control-flow loop variable

i thought about it, and realised that it made Register Hazard Management
for Multi-Issue OoO designs really, *really* complicated.

at least a separate SVSTATE SPR does not interfere with the
RaW/WaR Hazard Management of the GPRs.  it can be cached
and passed around at the peer-level of PC and MSR, which
is *really* important in a Multi-Issue context.

> With a few restrictions on allowed operations related to inter-lane data
> transfers, a vector loop can then, for most operations and on
> appropriate hardware, be unrolled (by hardware) across however many
> vector lanes are actually implemented, with the loop variable advanced
> by N (number of implemented vector lanes) on each pass through the loop.

deep breath: this is unfortunately a completely different design paradigm
for which i would have to rethink the entire implementation strategy that
i've been holding in my head for over 30 months.

> If Simple-V really is intended to march across the register file, then I
> propose an alternate "FlexiVec" as I previously described.  The
> interesting possibility with "FlexiVec" is that it can scale all the way
> down to the baseline scalar ISA (with MAXVL=1) and up to arbitrarily
> large "hybrid GPU" designs with thousands of vector lanes driven by a
> single control unit.

IBM POWER9 and POWER10 took a different strategy: 8-way
multi-issue OoO execution.  POWER10 i think has two 128-bit SIMD
ALU pipelines per core, which is completely mad.

what i very much did not want was for IBM, who are (obviously)
on the OPF ISA WG and who have now handed over control of the ISA
to the OPF after being its custodians and designers for 25 years, to
freak out.

with IBM having already implemented such an astoundingly-powerful
multi-issue engine, it made prudent sense to propose SVP64 as "merely
leveraging what IBM already has".

it is telling that IBM did *not* extend the VSX ISA to 256 or 512 bit: instead
they increased the number of 128-bit multi-issue ALUs.

l.