[Libre-soc-dev] svp64 review and "FlexiVec" alternative

lkcl luke.leighton at gmail.com
Sun Jul 31 08:32:37 BST 2022


[jacob, hi, just as a preamble: having followed Mitch Alsup's posts on comp.arch for over 3 years now, please trust me when i say FlexiVec === VVM and VVM === FlexiVec.  every advantage that you describe, every implementation option, every capability, every limitation: they are exactly the same, and i have seen all of them raised and discussed on comp.arch already, not just once but multiple times.

once i understood what VVM was i immediately also understood its limitations, and set it aside because those limitations are unresolvable: the power consumption caused by the critical and complete dependence on LD/ST as its sole means of autovectorisation, whilst elegant and simple, is unfortunately also its biggest flaw.]

On Sun, Jul 31, 2022 at 2:57 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
>
> lkcl wrote:
> > VVM/FlexiVec specifically and fundamentally rely on LD/ST.
> >  
>
> Sort of; LD/ST are the input/output operations for a vector calculation,
> yes, but intermediate values are held in the vector unit registers. 

it is not possible to access vector elements outside of a lane, except by pushing them down to ST and LDing them back again.

it is not possible to perform anything but the most rudimentary horizontal sums (a single scalar being an alias).

it is not possible to use a loop-invariant array of vector constants except by LDing them each time.
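
a concrete C sketch of the first limitation (hypothetical code, not from the thread): a prefix sum carries a cross-element dependency, so under FlexiVec/VVM the only way element i can see element i-1's result is via ST followed by LD:

----
    /* prefix sum: x[i] depends on the result for x[i-1].  in a
     * FlexiVec/VVM loop there is no cross-lane register access, so
     * the dependency can only be carried through memory (ST then LD),
     * or the loop falls back to scalar. */
    void prefix_sum(int *x, int n)
    {
        for (int i = 1; i < n; i++)
            x[i] += x[i - 1];
    }
----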

> The
> FlexiVec model works for calculations that "run along" some number of
> elements, where each step uses the same offsets in the array. 

yes.  this is a severe limitation.

> For
> example, FlexiVec can trivially do A[i] + B[i] -> C[i] and (by shifting
> the base address of B) A[i] + B[i+k] -> C[i], but can only do A[i] +
> B[i] + B[i+k] -> C[i] by introducing a fourth vector D aliased to B with
> offset k and computing A[i] + B[i] + D[i] -> C[i].

exactly.  and that workaround puts yet more pressure on LDST.  it is a severe limitation (one that SV does not have).
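
spelling the quoted workaround out in C (a hypothetical sketch, not code from the thread): D is a fourth vector handle aliased to B at offset k, so B's data is streamed from memory a second time, one extra load per element:

----
    /* A[i] + B[i] + B[i+k] -> C[i] under the FlexiVec model: B[i+k]
     * cannot be reused from B's already-loaded elements, so it is
     * streamed again as a separate vector D = B + k. */
    void add3(const int *A, const int *B, int *C, int k, int n)
    {
        const int *D = B + k;              /* fourth vector handle */
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i] + D[i];     /* 3 loads + 1 store per element */
    }
----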

> The only memory accesses in a proper FlexiVec loop are (1) loading the
> input operands (once per element) and (2) storing the results (again
> once per element).  How do you accomplish *any* calculation without
> loading inputs and storing results?  (Note that in a graphics
> application, that final store could be direct to the framebuffer.)

you missed the point completely: some intermediary results may remain in registers, saving power consumption by never hitting the L1 Cache or the TLB Virtual-Memory lookups/misses at all.

> > let us take FFT or better DCT as an example because there are more cosine coefficients, you cannot re-use DCT coefficients per layer.
> > that means in Horizontal-Mode that *all* coefficients when performing multiple FFTs or DCTs are read-only and can be stored in-regs.
> >  
>
> This also limits the maximum size of the transform that you can perform
> with Simple-V to one where all data fits in the registers.

in practice, for large Matrix Multiply, large DCT and large FFT, this is not a problem.

each of those algorithms has had, for decades, well-known solutions that decompose the work into small subkernels sized to the available registers.
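
for example (a generic illustrative sketch, assuming n is a multiple of the block size; not project code): the classic register-blocked matrix-multiply subkernel holds a small tile of the output in registers for the entire inner loop:

----
    #define BLK 4   /* tile sized to the available registers */

    /* n assumed to be a multiple of BLK; row-major storage.  the
     * acc tile is live across the whole k loop, so a compiler can
     * keep it entirely in registers. */
    void matmul_blocked(const float *A, const float *B, float *C, int n)
    {
        for (int i = 0; i < n; i += BLK)
            for (int j = 0; j < n; j += BLK) {
                float acc[BLK][BLK] = {{0}};
                for (int k = 0; k < n; k++)
                    for (int bi = 0; bi < BLK; bi++)
                        for (int bj = 0; bj < BLK; bj++)
                            acc[bi][bj] += A[(i+bi)*n + k] * B[k*n + j+bj];
                for (int bi = 0; bi < BLK; bi++)
                    for (int bj = 0; bj < BLK; bj++)
                        C[(i+bi)*n + j+bj] = acc[bi][bj];
            }
    }
----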


> > in VVM / FlexiVec on the other hand you have *no other option* but to either re-compute the cosine coefficients on-the-fly (needs about 8 instructions inside the inner loop) or you are forced to use yet another LD-as-a-vector-substitute which puts additional unnecessary pressure on an already-extreme LD/ST microarchitecture.
> >  
>
> The expected implementation would have the coefficients precomputed
> (possibly at compile-time) and stored in memory somewhere.

exactly.  and the only way to get them? use LD.

as i said, this increases power consumption and adds design pressure on the LDST data pathways.

these are O(N^2) costs and they are VERY high.

(unacceptably high)


> > (get that wrong and you will have stalls due to LDST bottlenecks)
> >  
>
> Ah, but the LD/ST operations repeated during a FlexiVec loop are
> perfectly predictable.  

the predictability is not relevant if the cost of even having the extra LD/STs is a 50% increase in overall power consumption, pushing the product outside competitive margins with existing products.

> > in SVP64 Vertical-First Mode you can also *reuse* the results of a Vector Computation not just as input but also as output, switching between Horizontal-First and Vertical-First at will and as necessary.
> >  
>
> The price SVP64 pays for this flexibility is a hard limit on the maximum
> amount of data that can be involved.

only at really *really* high performance (NVIDIA GPUs, 120W+, TFLOPs going on PFLOPs) does this become an issue.  we are ~7 years away from that level, counting from the moment funding drops in the door.  i can live with that, particularly given the focus on "more achievable" initial goals such as a 3.5W SoC.

>  Worse, that hard limit is
> determined by the ISA because it is based on the architectural register
> file size.

it turns out that having an ISA-defined fixed amount means that binaries are stable.  RVV and ARM's SVE2 are running smack into this one, and i can guarantee it's going to get ugly (esp. for SVE2).

see the comparison table on p2 (reload the latest pdf), or footnote 21:
https://libre-soc.org/openpower/sv/comparison_table/

>  FlexiVec avoids this limit because the FlexiVec vector
> registers are hidden behind the architectural registers used as vector
> handles and may each hold any number of elements.

for a price (power consumption) that i judge to be too high.
VVM (aka FlexiVec) is off the table in SV until that power consumption
problem is fixed.

> > after having done one phase of FFT in Vertical-First, you go back to completing the rest of the algorithm in Horizontal-First.
> >
> > i *know* that VVM / FlexiVec cannot do that.
> >  
>
> Perhaps I misunderstand the problem, but FlexiVec should be able to do
> that by expanding the complex operations and allocating a few more
> vector registers.

how? the vector "registers" only exist as auto-allocated resources.  they are completely inaccessible outside of a particular "i" within any given loop: iteration "i" cannot *ever* access the elements of "i+1".

(except of course by ST and LD afterwards, at which point we are *yet again* hammering memory beyond already-unacceptable limits, and further running into memory aliasing).

by contrast SVP64 is quite happy to generate a Fibonacci Series of adds by *deliberately* summing off-by-one on vector input/output.
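
a minimal C model of that trick (illustrative only: not real SVP64 assembler, and the register-file array is hypothetical).  because SV guarantees strict element-by-element program order, a vector add whose destination overlaps its sources off-by-one picks up results already written by earlier elements:

----
    #include <stdio.h>

    int main(void)
    {
        int regs[16] = {1, 1};        /* seed values in "r0" and "r1" */
        int VL = 10;
        /* models an sv.add-style op with dst=r2, srcA=r0, srcB=r1:
         * element i+2 sees the results of elements i and i+1 */
        for (int i = 0; i < VL; i++)
            regs[i + 2] = regs[i] + regs[i + 1];
        for (int i = 0; i < VL + 2; i++)
            printf("%d ", regs[i]);   /* 1 1 2 3 5 8 13 21 34 55 89 144 */
        return 0;
    }
----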

> Complex FMADD is A*B+C -> D == (Ar + jAi)*(Br +jBi) + (Cr + jCi) -> (Dr
> + jDi).

from fixed coefficients, yes.

>     1:  lwaux   R20, R3, R12    ; REAL(A)
>         lwaux   R21, R4, R12    ; IMAG(A)
>         lwaux   R22, R5, R12    ; REAL(B)
>         lwaux   R23, R6, R12    ; IMAG(B)
>         lwaux   R24, R7, R12    ; REAL(C)
>         lwaux   R25, R8, R12    ; IMAG(C)

in SV two of those (the coefficients, i.e. REAL(B) and IMAG(B)) may be stored in-regs (actual, real vector regs, not in-flight autovectorised ones), which saves 25% of the loop's LD/ST traffic: absolutely massive.

Mitch saves this 25% by having highly efficient hardware SIN/COS with low latency.

if he had not been able to do that (or if a different algorithm had longer latency on re-computing loop-invariant coefficients) then the recalculation unfortunately becomes the critical path.

at which point *your only option* is LDing the loop-invariant coefficients and we are back to unacceptable power consumption pressure.
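
for reference, the real/imaginary decomposition behind the quoted loop is Dr = Ar*Br - Ai*Bi + Cr and Di = Ar*Bi + Ai*Br + Ci.  a plain C rendering (hypothetical function, a sketch only) makes the per-element memory traffic explicit:

----
    /* complex A*B + C -> D, split into real/imaginary parts.
     * streamed naively: six loads + two stores per element.  if B
     * holds loop-invariant coefficients, SV can pin Br/Bi in real
     * vector registers, removing two of the eight memory ops (25%). */
    void cfmadd(const float *Ar, const float *Ai,
                const float *Br, const float *Bi,
                const float *Cr, const float *Ci,
                float *Dr, float *Di, int n)
    {
        for (int i = 0; i < n; i++) {
            Dr[i] = Ar[i]*Br[i] - Ai[i]*Bi[i] + Cr[i];
            Di[i] = Ar[i]*Bi[i] + Ai[i]*Br[i] + Ci[i];
        }
    }
----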


> In theory, spilling vector working addresses is minimal additional
> pressure, 

if you think it through i believe you will find that the Register Hazard Management for that is pretty hairy.


> Function calls are theoretically possible inside FlexiVec loops, but
> with some severe caveats, such as a /very/ non-standard ABI when
> FlexiVec is active.

i suggested the idea to Mitch a year ago; he wouldn't go near it.
given his experience i take that to be a bad sign, since
FlexiVec===VVM / VVM===FlexiVec.

> These lead to some interesting possibilities for a vector coprocessor
> using Simple-V and a large architectural register file.  Consider the
> overall architecture of the CDC 6600, which had a big compute processor
> and a much smaller I/O processor (with 10 hardware threads on the I/O
> processor); analogously, the Power ISA core would be the I/O processor
> dispatching work units to the vector processor.

yes.  see
https://libre-soc.org/openpower/sv/SimpleV_rationale/

the concept you describe above went onto the roadmap after learning of Snitch and EXTRA-V.  it's the next logical step and i'd love to talk about it with you at a good time.

when you get to create these "Coherent Synchronous Satellite" cores they need not be as fast as the main core: they can be much slower and more efficiently designed (no L2 cache needed if they run at Memory speed; Snitch even bypasses L1 *and the register files entirely*, knocking off an absolutely astonishing 85% of power consumption).

read the paper above and the links to the other academic works (Snitch, ZOLC and EXTRA-V).  the novel bit, if there is one, is the "remote Virtual Memory Management" over OpenCAPI.


> This is also why I suggest that FlexiVec is likely to be a better "fit"
> for the Power ISA.

really, trust me, i've done the analysis (over a year ago): it really isn't, once you get into the details.

> Then this is a subtle difference between VVM and FlexiVec.  (I think.)

really, seriously, there isn't one.

> > correct.  you have to let the OoO Engine flush and fall back to Scalar *or* you pull Shadow Cancellation on all but those elements not relevant and then fall back to scalar...
> >  
>
> Not quite:  spilling a FlexiVec vector register is /not/ /possible/ --
> the called function /does/ /not/ /know/ how many words will be written
> by STORE.  The main FlexiVec loop does not know this either.

exactly.  so given speculative execution you could cancel (roll back, throw away) certain *lanes* to a known point and execute in scalar from that point, into the function.

of course, if the function is called in every loop iteration that exercise is pointless: you might as well just do scalar.  but, to be honest, i stopped thinking this through when Mitch did not pick up on it.

> I think that Simple-V would work better with more than 128 registers.

128 registers is already at the point where an L1 *register* cache (yes, a cache for registers) is going to be a real need, in order to avoid slowing down the max clock rate.

plus, Multi-Issue reg-renaming can help reuse regs.  for certain computationally-heavy workloads that doesn't work; we just have to live with it for now.

> There is a small but important distinction here:  FlexiVec processes an
> /element group/ (defined as VL elements) on each iteration of the loop.

yes, this is the autovectorisation quantity.  you have to watch out for memory aliasing, e.g. a = b+16 followed by a[i] *= b[i]: the input and output arrays overlap.

this one was the subject of a LOT of comp.arch discussion of VVM.
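
a small C demonstration of that hazard (hypothetical values, a sketch only): with a aliased to b at a 16-byte offset, processing an element group of VL=8 with all reads before all writes gives a different answer from strict element-by-element order:

----
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        int b1[12], b2[12], tmp[8];
        for (int i = 0; i < 12; i++)
            b1[i] = b2[i] = i + 2;
        int *a1 = b1 + 4, *a2 = b2 + 4;   /* a = b + 16 bytes */

        /* strict element-by-element order (what SV guarantees) */
        for (int i = 0; i < 8; i++)
            a1[i] *= b1[i];

        /* element group of VL=8: all reads before any write, as an
         * autovectoriser might do if the aliasing goes undetected */
        for (int i = 0; i < 8; i++)
            tmp[i] = a2[i] * b2[i];
        memcpy(a2, tmp, sizeof tmp);

        printf("%d vs %d\n", a1[4], a2[4]);   /* 120 vs 60: they differ */
        return 0;
    }
----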

in SVP64 this "element group" quantity is the "hphint".  in Horizontal-First Mode it indicates the amount of the vector that is guaranteed to have no cross-element Register Hazards: a[i] has no dependence on b[i+N] for any N within the "element group", even through memory aliasing.
(this is not a guess, it is a declaration, made from knowledge possessed by the compiler)

in Vertical-First Mode it is the exact quantity of Horizontal elements executed: exactly equivalent to what you term VL in FlexiVec, but explicit.  the hardware is not permitted to choose the number of elements (except under explicitly-requested Fail-First conditions, one of the Modes).

> Perhaps the example above (complex multiply-add) will shed some
> additional light on the issue?

confirming that FlexiVec===VVM and is subject to the same pressure on LDST, yes.

> In fact, as far as I understand at the moment, VVM very likely *is* a
> valid FlexiVec implementation.  Another valid implementation is SIMT,
> where the scalar unit drives a chain of vector processing elements that
> operate in parallel.  The SIMT version is the long hardware explanation
> previously given, while the VVM "hold the vector loop in-flight" model
> is a valid implementation for an OoO processor.

no; again, apologies, i did not mention this before (repeating 3+ years of comp.arch discussion is impractical: it is often 40-80 messages a day, *sustained*).

SIMT was indeed another possible micro-architectural implementation of VVM (===FlexiVec), again already discussed on comp.arch.  so were SIMD, Scalar, and GBOoO: four implementation options in all.

> I believe this to be an argument in favor of FlexiVec at the ISA level:
> radically different hardware implementations can both efficiently
> process the same software loop.

the exact same microarchitectures - all of them - may be used with SV as well.  without the inherent LDST pressure.

> > combined with element-width overrides you get space for 256 FP32s or 512 FP16s *and* space for 256 INT32s, or 512 INT16s, or 1024 INT8s.
> >  
>
> Is there another bit to allow REX-prefixing without changing the meaning
> of the prefixed opcode or would this also fill the entire prefixed
> opcode map in one swoop?

we're requesting 25% of the v3.1 64-bit EXT001 space already.
one more bit would double that to 50%, and i would expect the OPF ISA WG to freak out at that.

> The part III preamble should probably be more specific that "each
> chapter in this part is a freestanding proposal" and III.1 should
> probably not be named "SV Vector ops" then.

yes, good point, done.  i'd renamed "SV Vector ops" yesterday.  stating that each chapter is standalone: good idea.

l.


