[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Wed Jul 27 01:10:39 BST 2022

lkcl wrote:
> On Tue, Jul 26, 2022 at 6:08 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
>   
> [...]
>> The main sticking point that I see with Simple-V is the way Simple-V
>> uses the main register file.
>>     
>
> i didn't - don't - want IBM freaking out about adding yet another
> regfile to the Power ISA.  or, worse, having *our* time wasted
> trying to fit on top of VSX.
>   

This is one of the key features of FlexiVec:  the vector registers are 
selected using existing architectural registers as handles.

>> There is, of course, the alternative (unless you know something that I
>> do not) that they simply reject Simple-V because they decide that the
>> secondary program counter is too many architectural resources to allocate.
>>     
>
> at which point, sadly, we have to say "we tried our best" and proceed
> with following the set procedures on page xii of v3.1, and use Sandbox
> opcodes
>
>     Facilities described in proposals that are not adopted
>     into the architecture may be implemented as Custom
>     Extensions using the architecture sandbox.
>
> which will of course get quickly out of hand but the key is that we
> will at least have warned them and given them the chance.  and
> as long as there is a "downgrade" mode (full strict Power ISA 3
> compliance) we're not in violation of the EULA either
>
>     https://openpowerfoundation.org/blog/final-draft-of-the-power-isa-eula-released/
>   

You could also roll SVP64 as a custom extension for the initial 
revisions of Libre-SOC hardware and propose FlexiVec as another 
solution.  :-)  (Or slip the hardware schedule ("oops, Simple-V turned 
out to be a blind alley") and propose FlexiVec as a Contribution.)

>> Right, this is the problem I see with Simple-V:  best performance
>> requires multi-issue OoO.
>>     
>
> given that it's well-known within the HPC Supercomputing world that
> multi-issue OoO is "just what you do", i don't see this as a problem.
>
> one of our team had it explained to them by an IBM Engineer, the
> difference between A2I and A2O, why A2O was better. it went waaay
> over their head but they got the primary take-away message: don't
> for goodness do an in-order system if you want anything remotely
> approaching decent resource utilisation.
>   

GPUs are OoO microarchitectures?  I had the impression that past a 
certain level of complexity, with certain (GPU-like) constraints on the 
processing model, OoO becomes infeasible.

It is also worth noting here that IBM is known for advanced CPUs and is 
/not/ known for advanced graphics hardware.

> [...]
>> FlexiVec is activated by a write to a vector register; in the example
>> above, the "lwaux R20" instruction.  Activating FlexiVec clears the
>> physical scalar registers configured for vector use; these are
>> subsequently used for vector offset tracking and referred to as "pR20",
>> "pR21", and "pR22" below.
>>     
>
> ok.  this is enough for me to be able to say, definitively, that this is
> Mitch Alsup's "VVM"...
>
>   
>> The vector length is VL := MIN(MAXVL, CTR).  Since CTR=32 (32 element
>>     
>
> ... adapted to use CTR as the counter loop variable :)
>   

So we are on the same page and FlexiVec is also a tenable solution.

>> For the next iteration, VL is now 12, since CTR<MAXVL.  Each instruction
>> proceeds analogously, with vector offsets 0, 4, 8.  This time VL=12,
>> CTR=12, 12 - 12 = 0 -> CTR, so the loop branch is not taken, and
>> FlexiVec is deactivated.
>>     
>
> yep.  it's VVM, pretty much exactly.
>
>   
>> For an out-of-order multi-issue implementation, the vector lanes are
>> emulated by issuing the relevant element-wise operations to the
>> available execution ports.
>>     
>
> right.  this is where Mitch's expertise kicks in, and to be absolutely
> honest i do not know the full details (the "whys") as well as he does.
> i remember him saying: you need to hold the entire loop in in-flight
> Reservation Stations of the OoO Engine in order to be able to safely
> Vectorise VVM Loops.
>
> beyond the reach of the in-flight RSes it is *not safe* to engage
> the Vectorisation and you must - *must* - fall back to Scalar operation
> [which is perfectly fine and safe to do].
>
> VVM also explicitly identifies (in equivalent of fvsetup) those registers
> that are loop-invariant, in order to save on RaW/WaR Hazards. this
> is also extremely important
>   

In practice, I think FlexiVec requires all non-vector registers to be 
either memory addresses (incremented as the loop works through the data) 
or invariant.  Otherwise, any change to scalar register values would 
have effects varying with VL, since scalar operations are executed only 
once for every VL elements.  I suspect that this is why Alsup requires 
the entire loop be in-flight, since VVM probably does not distinguish 
between vector and scalar registers in the way that FlexiVec does.
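
To make the point about scalar registers concrete, here is a rough
Python model of the loop bookkeeping (not FlexiVec assembler).  The
function name, the parameter names, and the 4-byte element size are my
assumptions for illustration; only VL := MIN(MAXVL, CTR) and the
CTR=32, MAXVL=20 numbers come from the example upthread.

```python
def strip_mine(ctr, maxvl, base_addr=0, elem_size=4):
    """Model the scalar side of a FlexiVec-style loop.

    Returns one (VL, working_pointer) pair per pass through the loop body.
    elem_size=4 is an assumption (word-sized elements).
    """
    passes = []
    addr = base_addr
    while ctr > 0:
        vl = min(maxvl, ctr)        # VL := MIN(MAXVL, CTR)
        passes.append((vl, addr))
        # The scalar update runs once per pass, so it has to cover
        # VL elements, not one element.
        addr += vl * elem_size
        ctr -= vl                   # loop branch falls through at CTR == 0
    return passes


print(strip_mine(ctr=32, maxvl=20))   # [(20, 0), (12, 80)]
```

A scalar register that is neither invariant nor advanced by a multiple
of VL like `addr` here would end up with a value that depends on how
many passes the hardware happened to need.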

>>  Here, N is the number of simultaneous issue
>> ports available instead of the number of vector lanes and MAXVL is
>> determined by the availability of scratch registers in the OoO
>> microarchitecture to hold the vector elements.
>>     
>
> yes. the correct term for scratch registers is "in-flight Reservation Stations".
> if you are familiar with the Tomasulo Algorithm (most well-known one)
> that should give an "ah ha!" moment.
>   

The term "scratchpad" is much shorter and easier to type!  :-)

>>> it's sounding like a cross between VVM and the ETA-10 (CDC 205).
>>>       
>> Some implementations might be.  The idea is that FlexiVec is, well,
>> flexible here.
>>     
>
> :)
>
> after you described the assembler i was able to tell it's definitely VVM
> and not ETA-10-like.  ETA-10 was a "Memory-to-Memory" Vector ISA
> where you had instructions which set the memory-location of where
> RA and RB would load from, and where RT would store to.
>
> https://groups.google.com/g/comp.arch/c/KoDjjzpomVI/m/J_3X2XrjAgAJ
>
> there was also an explicit "operand-forwarding-chaining" instruction
> to avoid the hit of memory-to-memory-to-memory which plagued the
> ILLIAC-IV.
>   

Right.  Power ISA is a load/store architecture, so FlexiVec for Power 
ISA is a load/store model.

>>> the moment it becomes LOAD-PARTPROCESS-SPILL-PARTPROCESS-STORE then
>>> due to the insanely heavily repeated workloads you end up with a
>>> noncompetitive unsaleable product due to its power consumption.
>>>
>>> we have to be similarly very very careful.
>>>       
>> The idea of FlexiVec for Power ISA is that every operation normally
>> available in the Fixed-Point Facility, Floating-Point Facility, and
>> Vector Facility (VMX/VSX) [(!!!)] would be available vectorized when
>> those facilities are extended using FlexiVec.  (Yes, in theory, FlexiVec
>> could extend VSX too!)
>>     
>
> indeed.  the problem is that, like ILLIAC-IV, VVM and FlexiVec rely
> heavily - exclusively - on Memory as the "sole means to create the
> concept of vectors".
>   

Yes and no.  Fundamentally, memory is the only place to ultimately store 
vectors.  The entire point of vectors is to enable (efficient) 
operations on blocks of data that do not fit in the register file.

> to avoid the problem of write-back-to-memory-only-to-read-it-again
> you have to have some extremely smart LD/ST in-flight buffer
> infrastructure in order not to overload L1 cache: something that's
> a high priority when engaging Virtual Memory and TLB lookups.
>   

Avoided in FlexiVec; the example was "A + B -> C" because that is 
simple, but FlexiVec is by no means limited to only 3 vector registers.  
The number of available vector registers in FlexiVec is limited by the 
availability of scalar registers to store the working pointers, which 
must be in the fixed-point GPRs.  An implementation using only the 
fixed-point register file is thus limited to 14 vectors, but an 
implementation using FlexiVec with the floating-point or VSX register 
files would be able to use the entire GPR file to track addresses and 
use up to 30 vectors simultaneously with maximal efficiency or the 
entire 64-slot VSX regfile as vector handles if working pointers are 
spilled to the stack.  (The MAXVL=20 value used in the example would be 
more plausible on real hardware in this latter case of using all 64 VSX 
slots as vector handles.)  Intermediate values are expected to be held 
in the vector scratchpad (which is accessed as the FlexiVec vector 
registers) rather than being spilled to memory.

In fact, this is a limitation of function calls in FlexiVec loops:  you 
/cannot/ spill a vector register to the stack because you do not know 
its length, so functions must be specially written for the loops that 
will call them.  This may be usable to save on code size, but generally 
FlexiVec loops are expected to be flat, for best performance.

> thus we come back to Jeff Bush's wisdom (and research) that for
> GPU workloads it is more power-efficient to stick to
> LOAD-INREGSCOMPUTE-STORE.
>   

FlexiVec is intended for this model.  Generally, a FlexiVec loop has the 
form "load inputs; compute in vectors; store outputs".
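
As a sketch of that loop shape (plain Python standing in for vector
hardware; the function name and the access counters are hypothetical,
purely for illustration), here is D = (A + B) * C with the intermediate
sum kept in a "vector register" rather than written back to memory:

```python
def flexivec_style_loop(a, b, c, maxvl):
    """Compute d[i] = (a[i] + b[i]) * c[i] in strips of at most maxvl.

    Returns (d, loads, stores), where loads/stores count element-wise
    memory accesses made by this model.
    """
    n = len(a)
    d = [0] * n
    loads = stores = 0
    i = 0
    while i < n:
        vl = min(maxvl, n - i)
        # load inputs
        va, vb, vc = a[i:i + vl], b[i:i + vl], c[i:i + vl]
        loads += 3 * vl
        # compute in vectors: vt is an intermediate that never touches memory
        vt = [x + y for x, y in zip(va, vb)]
        vd = [t * z for t, z in zip(vt, vc)]
        # store outputs
        d[i:i + vl] = vd
        stores += vl
        i += vl
    return d, loads, stores


d, loads, stores = flexivec_style_loop(
    [1, 2, 3, 4, 5], [10] * 5, [2] * 5, maxvl=2)
print(d, loads, stores)   # [22, 24, 26, 28, 30] 15 5
```

In this model the memory traffic is three loads and one store per
element regardless of MAXVL; the intermediate adds nothing to it.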

> and that's really why SV exists.  if i hadn't spent several months
> talking with Jeff and understanding his work, and how everything
> he did was driven by performance/watt (pixels/watt) metric
> measurement, i would not have known.
>   

I hate to say this, but I do not think that you will get the performance 
you want with Simple-V and any existing CPU ISA.  You will probably need 
to develop a new GPU-type ISA, with very long register files.

>>> this is all standard fare, it has all been in place literally for
>>> decades, now. SVSTATE and SVSRR1 (and HSVSRR1) therefore literally get
>>> a "free ride" off the back of an existing
>>> astonishingly-well-documented spec and associated implementation.
>>>       
>> There is still an incremental software cost.
>>     
>
> yes. the biggest one is that on a context-switch you now have 128 GPRs,
> 128 FPRs and 16 32-bit CRs [actually will probably make it 8 64-bit ones]
>
> sigh.  it is what it is.  we discussed "usage-tagging" to help cut that down.
> i.e. using the predicated compress/expand ld/st you can avoid saving/restoring
> those registers which haven't actually been used.  long story.  didn't finish it
> yet.
>   

This is another likely sticking point for Simple-V that FlexiVec 
avoids.  (Admittedly, FlexiVec avoids it by pushing the problem of 
saving/restoring vector state to hardware, but this is unavoidable because 
the actual full vector context is implementation-defined in FlexiVec.)

>>  To be fair, FlexiVec has
>> similar costs, since it also adds thread context.  FlexiVec, however,
>> can be ignored by the system unless a task switch is to be performed, so
>> the runtime cost is very slightly lower.
>>     
>
> a big advantage of VVM is that you only actually have Scalar regs
> to save/restore because the Vectors aren't actually Vectors at all,
> they're batched Memory operations.
>   

FlexiVec is a hybrid between VVM and "classic" Cray vectors, then.  
FlexiVec vectors *do* actually exist in hardware somewhere, although the 
null implementation uses the scalar registers to store single-element 
"vectors" and an OoO implementation can use scratchpads instead of 
dedicated vector storage.  Perhaps FlexiVec is effectively the VVM 
programming model applied to "classic" vectors.
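
A toy Python model of that claim (my own sketch; `vector_add` and its
parameters are made up for illustration and say nothing about how real
hardware would do it): the same loop gives the same answer whether the
implementation offers single-element "vectors" or MAXVL=20, and only
the number of passes changes.

```python
def vector_add(a, b, maxvl):
    """c[i] = a[i] + b[i], in strips of VL = min(maxvl, remaining).

    Returns (c, passes), where passes is the number of trips through
    the loop body.
    """
    c = []
    pos = 0
    remaining = len(a)
    passes = 0
    while remaining > 0:
        vl = min(maxvl, remaining)
        c.extend(x + y for x, y in zip(a[pos:pos + vl], b[pos:pos + vl]))
        pos += vl
        remaining -= vl
        passes += 1
    return c, passes


a = list(range(32))
b = [100] * 32
null_result, null_passes = vector_add(a, b, maxvl=1)    # "null" implementation
wide_result, wide_passes = vector_add(a, b, maxvl=20)   # MAXVL=20, as in the example

print(null_result == wide_result, null_passes, wide_passes)   # True 32 2
```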

This also explains why VVM vectorization requires the entire loop be 
concurrently in-flight.

> [...]
>> On the other hand, I view RISC-V as an experimental architecture in "how
>> simple can we make it?" and I am uncertain if we would even have
>> OpenPOWER if RISC-V did not exist as competition.
>>     
>
> ah.  right.   opening up the Power ISA was initiated by Mendy and
> Hugh *well over ten years ago*.

Ah yes, about the time Sun was doing OpenSPARC?

>> This does not change my views on Simple-V; just that Simple-V is too far
>> along in development to meaningfully change at this point.
>>     
>
> the most important take-away is the insights from Jeff Bush,
> and his extremely in-depth focus on performance/watt (pixels/watt).
>   

Which means that Simple-V may not be a suitable fit for Power ISA any 
more than it fit in RISC-V.  OpenSPARC or another high-register-count 
ISA might be useful, or possibly a dedicated Libre-SOC GPU architecture, 
with an OpenPOWER (sans vector facilities) control unit in the actual SOC.

>>> i'll add a preamble chapter.
>>>       
>> I suggest splitting the document.  Put Simple-V and its instructions in
>> one document and the SVP64-independent instructions in a separate
>> proposal -- or multiple proposals.  Break the huge block into more
>> manageable chunks.
>>     
>
> IBM has a problem with multiple documents (and with external websites
> in general).  with the entire 384 page document being only 1.4 mb i
> considered it prudent to just give them only the one "thing" to pass around
> in email.
>
> plus, i am following the style of Power ISA 3 itself, which is multiple
> books.
>   

OK, then split the document and recombine it into multiple "books" in a 
single PDF, with each book a freestanding sub-proposal.  Simple-V would 
be one of these, and the independent instructions would be one or more, 
in manageable, theme-oriented chunks.  Every chunk adopted reduces the 
pressure on EXT022 for a Custom Extension for the pieces the OpenPOWER 
Foundation does not adopt.

-- Jacob