[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Thu Jul 28 17:17:10 BST 2022

finally got some time.

On Wed, Jul 27, 2022 at 1:10 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:

> You could also roll SVP64 as a custom extension for the initial
> revisions of Libre-SOC hardware and propose FlexiVec as another
> solution.  :-)  (Or slip the hardware schedule ("oops, Simple-V turned
> out to be a blind alley") and propose FlexiVec as a Contribution.)

i would do so if i had not had over a year to think it through and had not come up with Vertical-First Mode.

VVM/FlexiVec specifically and fundamentally rely on LD/ST.

i feel you are drastically underestimating the power penalty of GPU/VPU memory accesses which are sustained *per clock* at least TEN TIMES that of CPU workloads.  plus reliance on LDST bandwidth increases the pressure.

let us take FFT, or better DCT, as an example: DCT has more cosine coefficients, and you cannot re-use DCT coefficients from one layer to the next.

let us take the maxim from Jeff Bush's work to do as much in-regs as possible.

therefore i designed the FFT and DCT REMAP subsystem *specifically* to be in-place, in-regs, the entire triple-loop.

that means in Horizontal-Mode that *all* coefficients when performing multiple FFTs or DCTs are read-only and can be stored in-regs.

in VVM / FlexiVec on the other hand you have *no other option* but to either re-compute the cosine coefficients on-the-fly (needs about 8 instructions inside the inner loop) or you are forced to use yet another LD-as-a-vector-substitute which puts additional unnecessary pressure on an already-extreme LD/ST microarchitecture.

(get that wrong and you will have stalls due to LDST bottlenecks)
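
to make the coefficient point concrete, here is a rough python sketch. it is a naive O(N^2) DCT-II, *not* the REMAP triple-loop, and every function name is made up for illustration. the first version builds the cosine table once and treats it as read-only across a whole batch (the analogue of keeping coefficients in-regs); the second recomputes the cosine inside the inner loop every time.

```python
import math

def dct_coeffs(n):
    # hypothetical helper: cosine table for an unnormalised DCT-II of size n,
    # computed once and afterwards only ever read
    return [[math.cos(math.pi * (2 * j + 1) * k / (2 * n)) for j in range(n)]
            for k in range(n)]

def dct_with_table(x, table):
    # the inner loop only multiplies and accumulates: no cos(), no extra loads
    return [sum(c * v for c, v in zip(row, x)) for row in table]

def dct_recompute(x):
    # same result, but the coefficient is recomputed for every (k, j) pair
    n = len(x)
    return [sum(math.cos(math.pi * (2 * j + 1) * k / (2 * n)) * x[j]
                for j in range(n))
            for k in range(n)]

# one table serves every DCT in the batch
batch = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
table = dct_coeffs(4)
results = [dct_with_table(x, table) for x in batch]
```

for a batch of B transforms of size N the table version evaluates N*N cosines in total, the recompute version B*N*N.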

in SVP64 Vertical-First Mode you can also *reuse* the results of a Vector Computation not just as input but also as output, switching between Horizontal-First and Vertical-First at will and as necessary.

a good example is FFT for which complex fmadd/sub (four of them) is required.  we decided not to add complex-fmadd/sub right now because it is too much.

simply by using Vertical-First it is possible to get away with a batch of Scalar temporary regs: inputs are sourced from Vector regs, and outputs, after going through *multiple* Scalar temporary regs, end up back in Vector regs.

after having done one phase of FFT in Vertical-First, you go back to completing the rest of the algorithm in Horizontal-First.
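
a minimal python sketch of that idea, assuming a generic radix-2 butterfly (this is *not* the actual Libre-SOC instruction sequence, and the function and variable names are invented for illustration). the python lists stand in for Vector regs, the t0..t3/tr/ti locals stand in for Scalar temporaries, and the loop over i plays the role of Vertical-First element stepping.

```python
def butterfly_vertical_first(ar, ai, br, bi, wr, wi):
    # ar/ai, br/bi: real/imag parts of the two butterfly inputs ("Vector regs")
    # wr/wi: real/imag parts of the twiddle factors (read-only)
    # computes, in place:  a' = a + w*b   and   b' = a - w*b
    vl = len(ar)
    for i in range(vl):          # one element at a time
        # "Scalar temporaries": partial products of the complex multiply w*b
        t0 = br[i] * wr[i]
        t1 = bi[i] * wi[i]
        t2 = br[i] * wi[i]
        t3 = bi[i] * wr[i]
        tr = t0 - t1             # real part of w*b
        ti = t2 + t3             # imag part of w*b
        # results go back into the "Vector regs"
        # (b is written first so the old value of a is still available)
        br[i] = ar[i] - tr
        bi[i] = ai[i] - ti
        ar[i] = ar[i] + tr
        ai[i] = ai[i] + ti

# a = 1+2j, b = 3+4j, w = 1j
ar, ai, br, bi = [1.0], [2.0], [3.0], [4.0]
butterfly_vertical_first(ar, ai, br, bi, [0.0], [1.0])
```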

i *know* that VVM / FlexiVec cannot do that.

also, i am keenly aware that Mitch's expertise here led him to design VVM as it is, because of decades of experience and even then it was a good couple of years in the making.  no, function calls inside VVM loops are not permitted and he has endeavoured to explain why, and it would take many months to comprehend.

that "learning curve" i know is *so* in-depth that i can foresee that its proper assessment for inclusion in SVP64, including discussion, specification, simulator, unit tests, Compliance Validation Suite and so on will be of the order of a year.  that's a hell of a commitment.

> GPUs are OoO microarchitectures?

not normally, no.  i mentioned we'll likely do wide.FAT further down in my previous reply.

>   I had the impression that past a
> certain level of complexity, with certain (GPU-like) constraints on the
> processing model, OoO becomes infeasible.

within the realm of 4-8 cores for embedded low to mid end SoCs (typically MALI 400 MP or Vivante GC800/1000, where if you are handling 1920x1080 @ 30fps you're doing well), i believe it's feasible.

we are not aiming for 120 watts, here, as a first ASIC. we're aiming for a maximum of *3.5* watts for the entire SoC, including a 0.5 watt budget for the DDR Memory interface.

> It is also worth noting here that IBM is known for advanced CPUs and is
> /not/ known for advanced graphics hardware.

yes. and the silver lining on that is that they left the Scalar ISA pretty much untouched.  VSX was (is) the primary focus, but also you have to understand that their business revolves around IO throughput and handling massive data sets.

this makes it perfect for applying SVP64 precisely because the Scalar ISA is so lean.

> > ... adapted to use CTR as the counter loop variable :)
> >  
>
> So we are on the same page and FlexiVec is also a tenable solution.

i didn't say that :) i am not discouraging it, i just know how much work is involved.
now all that is needed is to put in an NLnet Grant proposal and someone to write it up.

> > VVM also explicitly identifies (in equivalent of fvsetup) those registers
> > that are loop-invariant, in order to save on RaW/WaR Hazards. this
> > is also extremely important
> >  
>
> In practice, I think FlexiVec requires all non-vector registers to be
> either memory addresses (incremented as the loop works through the data)
> or invariant.  Otherwise, any change to scalar register values would
> have effects varying with VL, since scalar operations are only executed
> once per every VL elements.  

no, i distinctly recall seeing assembler examples using scalar registers as intermediaries where Mitch outlined how the exact same Auto-SIMD-i-fication could be applied to them, *if* they were correspondingly identified as being useable as such, by the LOOP initialisation instruction.

this is down to his gate-level architectural expertise.

> I suspect that this is why Alsup requires
> the entire loop be in-flight, since VVM probably does not distinguish
> between vector and scalar registers in the way that FlexiVec does.

it does indeed.  at this point i would suggest getting onto comp.arch and asking him.  i do not recall enough to answer precisely.

> In fact, this is a limitation of function calls in FlexiVec loops:  you
> /cannot/ spill a vector register to the stack because you do not know
> its length,

correct.  you have to let the OoO Engine flush and fall back to Scalar *or* you pull Shadow Cancellation on all but those elements not relevant and then fall back to scalar...

> so functions must be specially written for the loops that
> will call them.  

Mitch very specifically forbids functions within loops.  or, you *might* be able to have them but the LOOP will fall back to Scalar behaviour.

> This may be usable to save on code size, but generally
> FlexiVec loops are expected to be flat, for best performance.

yes.

> I hate to say this, but I do not think that you will get the performance
> you want with Simple-V and any existing CPU ISA.  You will probably need
> to develop a new GPU-type ISA, with very long register files.

Jacob answered this already.  MALI, Broadcom VideoCore IV, Vivante and AMDGPU all have 128 registers.

> FlexiVec is a hybrid between VVM and "classic" Cray vectors, then.

it really isn't. a Cray Scalable Vector ISA is specifically defined as Horizontal-First Scalable (elements are 100% processed in full up to VL before moving to the next instruction).

VVM and FlexiVec are very specifically Memory-based Vertical-First (instructions are processed in a loop in full, before moving to the next element).

in VVM/FlexiVec it is VL-based Index-incrementing that is the *outer* loop.

in Cray traditional Vectors it is VL-based Index-incrementing that is the *inner* loop.

SVP64 has both modes.  the programmer may not only choose, they can even push the damn SVSTATE onto the stack and flip between one and the other in the middle!
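
the two loop orders can be sketched in a few lines of python. this is purely illustrative: the "program" is a made-up list of (dst, op, src1, src2) element-wise operations over named lists, not real SVP64, VVM or FlexiVec code.

```python
def horizontal_first(program, regs, vl):
    # Cray-style: each instruction runs over all VL elements before the
    # next instruction starts, so the element index is the *inner* loop
    for dst, op, a, b in program:
        for i in range(vl):
            regs[dst][i] = op(regs[a][i], regs[b][i])
    return regs

def vertical_first(program, regs, vl):
    # VVM/FlexiVec-style: every instruction runs for element i before
    # moving on to element i+1, so the element index is the *outer* loop
    for i in range(vl):
        for dst, op, a, b in program:
            regs[dst][i] = op(regs[a][i], regs[b][i])
    return regs

def fresh_regs():
    # hypothetical register contents for the demo
    return {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40],
            "t": [0] * 4, "z": [0] * 4}

# z = (x + y) * x, as two element-wise instructions
program = [("t", lambda a, b: a + b, "x", "y"),
           ("z", lambda a, b: a * b, "t", "x")]

h = horizontal_first(program, fresh_regs(), 4)
v = vertical_first(program, fresh_regs(), 4)
```

for element-wise code like this both orders give the same answer; the difference is in what the hardware has to keep live, and for how long.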

> FlexiVec vectors *do* actually exist in hardware somewhere, although the
> null implementation uses the scalar registers to store single-element
> "vectors" and an OoO implementation can use scratchpads instead of
> dedicated vector storage.  Perhaps FlexiVec is effectively the VVM
> programming model applied to "classic" vectors.

my understanding from what you have explained in that assembler example is that they are exactly the same underlying concept.  VVM creates the appearance or effect of Vectors from LDST, so does FlexiVec.

if you have something different in mind, i need to see more assembler examples (apologies)

> > ah.  right.   opening up the Power ISA was initiated by Mendy and
> > Hugh *well over ten years ago*.
>
> Ah yes, about the time Sun was doing OpenSPARC?

sounds about right

> > the most important take-away is the insights from Jeff Bush,
> > and his extremely in-depth focus on performance/watt (pixels/watt).
> >  
>
> Which means that Simple-V may not be a suitable fit for Power ISA any
> more than it fit in RISC-V.  OpenSPARC or another high-register-count
> ISA might be useful, or possibly a dedicated Libre-SOC GPU architecture,
> with an OpenPOWER (sans vector facilities) control unit in the actual SOC.

again, as jacob explained, now you know why we increased the number of 64 bit regs to 128.

this is why there are *nine* bits in the EXTRA area of the precious 24 bit prefix dedicated to extending RA, RB, RC, RT and RS, and FRA...FRS, and CR Field numbering, from 32 entries to 128 entries.

combined with element-width overrides you get space for 256 FP32s or 512 FP16s *and* space for 256 INT32s, or 512 INT16s, or 1024 INT8s.
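
the arithmetic behind those numbers, as a trivial python check. the only inputs are the 128 registers of 64 bits each, per register file, stated above; the function name is made up.

```python
REGS_PER_FILE = 128   # after SVP64 extends numbering from 32 to 128
REG_BITS = 64

def elements_per_file(elwidth_bits):
    # how many elements of the given width fit in one 128 x 64-bit regfile
    return REGS_PER_FILE * REG_BITS // elwidth_bits

capacity = {w: elements_per_file(w) for w in (64, 32, 16, 8)}
print(capacity)  # {64: 128, 32: 256, 16: 512, 8: 1024}
```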

> > plus, i am following the style of Power ISA 3 itself, which is multiple
> > books.
> >  
>
> OK, then split the document and recombine it into multiple "books" in a
> single PDF, with each book a freestanding sub-proposal.  

that's what it is.  that's exactly how it is.  if you reload the pdf you'll see the wording which says precisely "these are independent".

> Simple-V would
> be one of these, and the independent instructions would be one or more,
> in manageable, theme-oriented chunks. 

yep. that's exactly how it is.

> Every chunk adopted reduces the
> pressure on EXT022 for a Custom Extension for the pieces the OpenPOWER
> Foundation does not adopt.

yes.  they're all high-profile general-purpose.

l.
