[Libre-soc-dev] svp64 review and "FlexiVec" alternative

lkcl luke.leighton at gmail.com
Tue Jul 26 12:37:33 BST 2022


On Tue, Jul 26, 2022 at 6:08 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
> Well then, it looks like you may actually have a chance to get Simple-V
> adopted after all, if the WG Chair favors it.

neutral and fair, i would say. Paul's extremely smart, i expect him to
properly...
you-know-what-i'm-saying.

> > what opcode(s) are duplicated?
>
> The loop branch.

ahh ok.

> The SVSTEP instruction is effectively equivalent to an
> ordinary branch instruction.  (In fact, FlexiVec would *use* the
> ordinary BC instruction at the end of a vector loop.)

right. ok, i initially considered merging svstep into branch,
but verrrry quickly backed out of that.  svstep is a separate
instruction that may be used to perform the same job as
"vmiota".

applying the RISC paradigm i kept svstep separate from sv.branches
as it really did get extremely complex.
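
(for clarity: "iota" here just means producing the element indices
0..VL-1 into a vector, which can then drive indexed addressing or
predication.  a trivial Python sketch, names mine, not the actual
SV pseudocode:)

    def iota(VL):
        # produce the element indices 0..VL-1, which is the core job
        # that vmiota (and, in SV, svstep) gets used for
        return list(range(VL))

    print(iota(8))   # [0, 1, 2, 3, 4, 5, 6, 7]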

> > in other words and this is extremely important the VVM ISA Extension
> > of MyISA66000 *does not in any way* impose require or punish a
> > specific micro-architecture.
>
> It does appear that I have basically drawn inspiration from Simple-V and
> reinvented VVM as "FlexiVec".  :-)

basically... yes :)  and i was in absolutely no position to appreciate
that, back in 2018, and i don't think anyone else was, either. it really
does take a... mental paradigm shift to understand VVM/FlexiVec.

> The main sticking point that I see with Simple-V is the way Simple-V
> uses the main register file.

i didn't - don't - want IBM freaking out about adding yet another
regfile to the Power ISA.  or, worse, having *our* time wasted
trying to fit on top of VSX.

> There is, of course, the alternative (unless you know something that I
> do not) that they simply reject Simple-V because they decide that the
> secondary program counter is too many architectural resources to allocate.

at which point, sadly, we have to say "we tried our best" and proceed
with following the set procedures on page xii of v3.1, and use Sandbox
opcodes

    Facilities described in proposals that are not adopted
    into the architecture may be implemented as Custom
    Extensions using the architecture sandbox.

which will of course get quickly out of hand but the key is that we
will at least have warned them and given them the chance.  and
as long as there is a "downgrade" mode (full strict Power ISA 3
compliance) we're not in violation of the EULA either

    https://openpowerfoundation.org/blog/final-draft-of-the-power-isa-eula-released/

> Right, this is the problem I see with Simple-V:  best performance
> requires multi-issue OoO.

given that it's well-known within the HPC Supercomputing world that
multi-issue OoO is "just what you do", i don't see this as a problem.

one of our team had the difference between A2I and A2O, and why A2O
was better, explained to them by an IBM Engineer. it went waaay
over their head but they got the primary take-away message: don't,
for goodness' sake, do an in-order system if you want anything remotely
approaching decent resource utilisation.

>  FlexiVec can give optimal performance with
> multi-issue OoO or an in-order scalar unit driving a chain of vector
> lanes.

i'm going to jump ahead because i reviewed the rest of what you
wrote

> > these kinds of conversations also turn up wonderful gems such as
> > FlexiVec, which i am still getting to grips with, looking like it is a
> > Vertical-First ISA.
>
> If I understand the "Vertical-First" term correctly, FlexiVec is exactly
> so, just like the old Cray "classic" vector model.

ah, right, Cray-style Vectors are definitely "Horizontal": it's literally
"for i in range(VL): VEC[RT][i] = VEC[RA][i] + VEC[RB][i]"

> > can you express it in pseudocode at all? i like to make sure i
> > properly understand, and these are such subtle complex concepts it is
> > really challenging to be clear.
>
> I will try a program example, in this case adding two arrays of
> integers:  (register numbers made up without reference to standard ABI
> and code untested; I /think/ I have this right)

no problem

>         [...declaring R20, R21, R22 as vectors of words]
>         fvsetup R7
>         ; begin vector loop
>     1:  lwaux   R20, R3, R10
>         bdnz    1b
>         ; end vector loop

(shortening)

> FlexiVec is activated by a write to a vector register; in the example
> above, the "lwaux R20" instruction.  Activating FlexiVec clears the
> physical scalar registers configured for vector use; these are
> subsequently used for vector offset tracking and referred to as "pR20",
> "pR21", and "pR22" below.

ok.  this is enough for me to be able to say, definitively, that this is
Mitch Alsup's "VVM"...

> The vector length is VL := MIN(MAXVL, CTR).  Since CTR=32 (32 element

... adapted to use CTR as the counter loop variable :)

> For the next iteration, VL is now 12, since CTR<MAXVL.  Each instruction
> proceeds analogously, with vector offsets 0, 4, 8.  This time VL=12,
> CTR=12, 12 - 12 = 0 -> CTR, so the loop branch is not taken, and
> FlexiVec is deactivated.

yep.  it's VVM, pretty much exactly.
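
just to confirm my understanding of the VL/CTR stepping, a small
Python model of the strip-mining (MAXVL and the element count here
are made-up numbers, not the ones in your example):

    MAXVL = 8      # hardware strip length (made up)
    CTR   = 20     # total number of elements (made up)

    passes = []
    while CTR > 0:
        VL = min(MAXVL, CTR)   # elements handled this trip round the loop
        # ... the vectorised loop body operates on VL elements here ...
        CTR -= VL              # bdnz-style decrement by the strip length
        passes.append(VL)

    print(passes)   # [8, 8, 4] -> three trips, then FlexiVec deactivates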

> For an out-of-order multi-issue implementation, the vector lanes are
> emulated by issuing the relevant element-wise operations to the
> available execution ports.

right.  this is where Mitch's expertise kicks in, and to be absolutely
honest i do not know the full details (the "whys") as well as he does.
i remember him saying: you need to hold the entire loop in in-flight
Reservation Stations of the OoO Engine in order to be able to safely
Vectorise VVM Loops.

beyond the reach of the in-flight RSes it is *not safe* to engage
the Vectorisation and you must - *must* - fall back to Scalar operation
[which is perfectly fine and safe to do].

VVM also explicitly identifies (in the equivalent of fvsetup) those registers
that are loop-invariant, in order to save on RaW/WaR Hazards. this
is also extremely important.
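
a very crude Python sketch of that "fits in the in-flight RSes or
fall back to scalar" decision - numbers and names are mine, not Mitch's:

    RS_ENTRIES     = 64   # in-flight Reservation Station slots (made up)
    LOOP_BODY_UOPS = 5    # e.g. load, load, add, store, branch

    def max_safe_strip(rs_entries, body_uops):
        # how many elements' worth of the loop body can be in flight at
        # once.  loop-invariant registers need only one in-flight copy,
        # so they do not multiply up with the element count.
        return rs_entries // body_uops

    strip = max_safe_strip(RS_ENTRIES, LOOP_BODY_UOPS)
    if strip >= 2:
        print("vectorise: up to", strip, "elements in flight per strip")
    else:
        print("fall back to scalar execution")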

>  Here, N is the number of simultaneous issue
> ports available instead of the number of vector lanes and MAXVL is
> determined by the availability of scratch registers in the OoO
> microarchitecture to hold the vector elements.

yes. the correct term for scratch registers is "in-flight Reservation Stations".
if you are familiar with the Tomasulo Algorithm (the most well-known one)
that should give an "ah ha!" moment.

> > it's sounding like a cross between VVM and the ETA-10 (CDC 205).
>
> Some implementations might be.  The idea is that FlexiVec is, well,
> flexible here.

:)

after you described the assembler i was able to tell it's definitely VVM
and not ETA-10-like.  ETA-10 was a "Memory-to-Memory" Vector ISA
where you had instructions which set the memory-location of where
RA and RB would load from, and where RT would store to.

https://groups.google.com/g/comp.arch/c/KoDjjzpomVI/m/J_3X2XrjAgAJ

there was also an explicit "operand-forwarding-chaining" instruction
to avoid the hit of memory-to-memory-to-memory which plagued the
ILLIAC-IV.

> > the moment it becomes LOAD-PARTPROCESS-SPILL-PARTPROCESS-STORE then
> > due to the insanely heavily repeated workloads you end up with a
> > noncompetitive unsaleable product due to its power consumption.
> >
> > we have to be similarly very very careful.
>
> The idea of FlexiVec for Power ISA is that every operation normally
> available in the Fixed-Point Facility, Floating-Point Facility, and
> Vector Facility (VMX/VSX) [(!!!)] would be available vectorized when
> those facilities are extended using FlexiVec.  (Yes, in theory, FlexiVec
> could extend VSX too!)

indeed.  the problem is that, like ILLIAC-IV, VVM and FlexiVec rely
heavily - exclusively - on Memory as the "sole means to create the
concept of vectors".

to avoid the problem of write-back-to-memory-only-to-read-it-again
you have to have some extremely smart LD/ST in-flight buffer
infrastructure in order not to overload L1 cache: something that
matters all the more once Virtual Memory and TLB lookups are engaged.

thus we come back to Jeff Bush's wisdom (and research) that for
GPU workloads it is more power-efficient to stick to
LOAD-INREGSCOMPUTE-STORE.
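
a back-of-the-envelope illustration of why, counting L1 traffic for a
two-operation chain (t = a*b then out = t+c) over word-sized elements -
numbers purely illustrative:

    N    = 1024   # elements
    WORD = 4      # bytes each

    # memory-to-memory style: the intermediate t goes out to memory and
    # comes straight back in again between the two operations.
    mem_to_mem = N * WORD * (2 + 1      # read a, b; write t
                             + 1 + 1    # read t back; read c
                             + 1)       # write out

    # load-compute-in-registers-store: the intermediate stays in regs.
    in_regs = N * WORD * (3             # read a, b, c
                          + 1)          # write out

    print(mem_to_mem, in_regs)   # 24576 vs 16384 bytes of L1 traffic

(VVM's in-flight LD/ST buffering exists precisely to catch the naive
case: that's the "extremely smart infrastructure" referred to above.)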

and that's really why SV exists.  if i hadn't spent several months
talking with Jeff and understanding his work, and how everything
he did was driven by performance/watt (pixels/watt) metric
measurement, i would not have known.

> > this is all standard fare, it has all been in place literally for
> > decades, now. SVSTATE and SVSRR1 (and HSVSRR1) therefore literally get
> > a "free ride" off the back of an existing
> > astonishingly-well-documented spec and associated implementation.
>
> There is still an incremental software cost.

yes. the biggest one is that on a context-switch you now have 128 GPRs,
128 FPRs and 16 32-bit CRs [actually will probably make it 8 64-bit ones]

sigh.  it is what it is.  we discussed "usage-tagging" to help cut that down.
i.e. using the predicated compress/expand ld/st you can avoid saving/restoring
those registers which haven't actually been used.  long story.  didn't finish it
yet.
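
roughly what the usage-tagging idea looks like - entirely a sketch,
the actual mechanism never got finished:

    NUM_GPRS = 128

    def save_used(regfile, used_mask):
        # compress-store: pack only the registers actually touched
        return [regfile[i] for i in range(NUM_GPRS)
                if (used_mask >> i) & 1]

    def restore_used(save_area, used_mask):
        # expand-load: scatter the saved values back to their reg numbers
        regfile = [0] * NUM_GPRS
        it = iter(save_area)
        for i in range(NUM_GPRS):
            if (used_mask >> i) & 1:
                regfile[i] = next(it)
        return regfile

    # a task that only ever touched r3, r10 and r64: save 3 values not 128
    mask  = (1 << 3) | (1 << 10) | (1 << 64)
    regs  = list(range(NUM_GPRS))
    saved = save_used(regs, mask)
    assert restore_used(saved, mask)[64] == 64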

>  To be fair, FlexiVec has
> similar costs, since it also adds thread context.  FlexiVec, however,
> can be ignored by the system unless a task switch is to be performed, so
> the runtime cost is very slightly lower.

a big advantage of VVM is that you only actually have Scalar regs
to save/restore because the Vectors aren't actually Vectors at all,
they're batched Memory operations.

> IBM may have known what they were doing, but I am fairly sure that Intel
> and AMD are stuck with condition codes because the original 8086 used a
> FLAGS register to control conditional branches.  If x86 condition codes
> are useful like that, I am convinced that Intel had a lucky guess all
> those years ago.

:)  yeah Power ISA 1 was... 1993? 94? that's the original IBM research
paper - they do a damn thorough job.

small snippet of insight: China ICT's Loongson MIPS64 processor achieved
70% of x86 native speed with a JIT compiler that had hard-emulation of 200
x86 instructions.  the biggest problem?  **TEN** MIPS64 instructions
required to emulate one single x86 branch instruction!

why?

because there are no Condition Codes in MIPS64.

oink.
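
for a flavour of where those ten instructions go, a rough sketch of
emulating just "sub eax, ebx ; jbe target" when there is no flags
register - every flag has to be materialised explicitly (the expansion
is illustrative only):

    MASK32 = 0xFFFFFFFF

    def emulate_sub_jbe(eax, ebx):
        result = (eax - ebx) & MASK32    # the subtract itself
        zf = 1 if result == 0 else 0     # extra op: zero flag
        cf = 1 if eax < ebx else 0       # extra op: carry (borrow) flag
        take_branch = (cf | zf) != 0     # extra ops: combine and test
        return result, take_branch

    print(emulate_sub_jbe(5, 7)[1])   # True  -> branch taken (5 <= 7)
    print(emulate_sub_jbe(9, 2)[1])   # False -> fall through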

> On the other hand, I view RISC-V as an experimental architecture in "how
> simple can we make it?" and I am uncertain if we would even have
> OpenPOWER if RISC-V did not exist as competition.

ah.  right.   opening up the Power ISA was initiated by Mendy and
Hugh *well over ten years ago*.  several other people helped out
within IBM, but yes, they had to fend off multiple hyper-enthusiastic
enquiries of the form "hey! you know this RISC-V thing! IBM! should!
like, do the same thing!"

and Hugh had to amusingly explain to them that it had already started
loooong before RISC-V existed, but that IBM wanted to make absolutely
absolutely certain that everything was properly in place.

including a 20+ year MASSIVE patent portfolio.

which, ahem, cough cough, RISC-V doesn't have.  and is infringing
on IBM patents. cough.

> This does not change my views on Simple-V; just that Simple-V is too far
> along in development to meaningfully change at this point.

the most important take-away is the insights from Jeff Bush,
and his extremely in-depth focus on performance/watt (pixels/watt).

> > i'll add a preamble chapter.
>
> I suggest splitting the document.  Put Simple-V and its instructions in
> one document and the SVP64-independent instructions in a separate
> proposal -- or multiple proposals.  Break the huge block into more
> manageable chunks.

IBM has a problem with multiple documents (and with external websites
in general).  with the entire 384-page document being only 1.4 MB i
considered it prudent to just give them the one "thing" to pass around
in email.

plus, i am following the style of Power ISA 3 itself, which is multiple
books.

> Eh, I am fairly sure that Linux was adapted to POWER, not the other way
> around.  :-)

well, the first versions would not have had VSX/VMX at all, because the
first Power CPUs simply didn't have it.  PackedSIMD came later, so even
there it was an incremental process.

> > what probably did it though was the fact that Microwatt, A2O and A2I
> > are all in exactly the same position: they are all on-track for SFS
> > and SFFS Compliancy Level and absolutely nowhere near Linux Compliancy
> > Level.
>
> Wait... are SFS and SFFS abbreviations for "Scalar Fixed" and "Scalar
> Float"?

Scalar Fixed-Point Subset and Scalar Fixed-Point plus Floating-Point
Subset, i believe.

> If so, then BE is mandatory for those after all, and there is a
> *different* (32-bit) Linux port that runs on the SFS/SFFS platform,
> separate from the Linux Compliancy Subset, which is for 64-bit LE
> Linux.  (Confused yet?)

you'll love that anyone can add anything else they want *on top*
of what's mandatory.

> > so what is happening instead (at last) is that the message has been
> > finally understood, future patches will be "#ifdef VSX" and "#ifdef
> > MMA" respecting the *Compliancy* Level, NOT the IBM product make/model
> > number. Tulio's libc6 patch dealing with that has landed, finally,
> > about 6 months ago.
>
> Really, the GNU libc maintainers should have been pushing back on that
> -- the GNU policy is supposed to be to test features rather than
> processor models.

yes.  except in this case nobody did the proper review.

> Is this the POWER9 service processor I have heard about that runs its
> very own Linux-based system, such that, after power-on, the machine
> "plays dead" for about a minute while the service processor boots?

yes.  it's called a BMC - a Baseboard Management Controller.  every AMD
and Intel x86 server has one.  usually a Nuvoton WPCM450 or an ASpeed
2100, 2500 or 2600.  they're as insecure as you'd expect them to be.
Intel developed their own BMC IC.

l.


