[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Jacob Bachmeyer jcb62281 at gmail.com
Mon Aug 1 05:35:54 BST 2022


lkcl wrote:
> [jacob, hi, just as a preamble: having been following Mitch Alsup's posts on comp.arch for over 3 years now, please trust me when i say FlexiVec === VVM and VVM === FlexiVec.  every advantage that you describe, every implementation option, every capability, and every limitation, everything you describe, they are exactly the same, i have seen all of them raised and discussed on comp.arch already, not just once but multiple times.
>
> once i understood what VVM was i immediately also understood its limitations and set it aside because those limitations are unresolvable: the power consumption caused by the critical and complete dependence on LDST as its sole means of being able to autovectorise, whilst elegant and simple, is unfortunately also its biggest flaw.]
>   

Since you evidently know more about what VVM is than I do, I will take 
your word for it that FlexiVec is my quasi-independent (by way of 
Simple-V) reinvention of VVM.

> On Sun, Jul 31, 2022 at 2:57 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
>   
>> lkcl wrote:
>>     
>>> VVM/FlexiVec specifically and fundamentally rely on LD/ST.
>>>  
>>>       
>> Sort of; LD/ST are the input/output operations for a vector calculation,
>> yes, but intermediate values are held in the vector unit registers. 
>>     
>
> it is not possible to access vector elements outside of a lane, except by pushing down to ST and LD back again.
>   

Correct; FlexiVec is intended for driving bulk calculations.  Individual 
elements are insignificant in these workloads.

> it is not possible to perform anything but the most rudimentary horizontal sums (a single scalar being an alias)
>   

Strictly speaking, FlexiVec cannot perform horizontal sums *at* *all*.  
A horizontal sum requires ordinary scalar computation.  A multiple-issue 
OoO implementation could use the same hardware pathways to perform a 
horizontal sum, but that is ordinary OoO loop optimization, not FlexiVec.
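
To make the split concrete, here is a rough scalar C sketch (purely 
illustrative, names are mine): the elementwise stage is the kind of loop 
FlexiVec could drive; the reduction afterwards is ordinary scalar code.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only.  Stage 1 is pure per-element work
 * (loads in, stores out), which is what FlexiVec vectorises; stage 2 is
 * the horizontal sum, done as plain scalar code afterwards.
 */
static int64_t dot_product(const int32_t *a, const int32_t *b,
                           int64_t *tmp, size_t n)
{
    /* stage 1: elementwise multiply -- vectorisable */
    for (size_t i = 0; i < n; i++)
        tmp[i] = (int64_t)a[i] * b[i];

    /* stage 2: horizontal sum -- ordinary scalar computation */
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += tmp[i];
    return sum;
}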

> it is not possible to use a vector loop invariant array of constants except by LDing them.
>   

Correct; FlexiVec works on "strings" of elements taken in parallel.  
(But LOAD-GATHER and STORE-SCATTER are easily expressed using vectors of 
addresses.)
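
To be clear about what I mean by "vectors of addresses", here is the 
per-element form as a plain C sketch; the loop body is what the hardware 
would repeat across an element group (the indexing scheme and names are 
only for illustration):

#include <stddef.h>
#include <stdint.h>

/* Sketch only: `idx` is itself an ordinary input vector, and each element
 * of it supplies the address (here, an offset) for a gather or scatter.
 */
static void gather_scale_scatter(const uint32_t *table,
                                 const size_t *idx,
                                 uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t x = table[idx[i]];   /* LOAD-GATHER: address comes from a vector */
        out[idx[i]] = x * 2;          /* STORE-SCATTER: address comes from a vector */
    }
}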

>> The
>> FlexiVec model works for calculations that "run along" some number of
>> elements, where each step uses the same offsets in the array. 
>>     
>
> yes.  this is a severe limitation.
>   

A severe limitation shared by every other current vector computing model 
I have seen.  (I am still considering Simple-V to be a "future" model as 
there is no hardware yet.)

>> For
>> example, FlexiVec can trivially do A[i] + B[i] -> C[i] and (by shifting
>> the base address of B) A[i] + B[i+k] -> C[i], but can only do A[i] +
>> B[i] + B[i+k] -> C[i] by introducing a fourth vector D aliased to B with
>> offset k and computing A[i] + B[i] + D[i] -> C[i].
>>     
>
> exactly.  which puts pressure on LDST. that is a severe limitation (one that SV does not have).
>   

Admitted; this limitation is necessary to ensure that a Hwacha-like SIMT 
hardware implementation is possible.  (That is the direction that I 
believe maximum performance systems will ultimately need to go.  There 
are few hard limits on how large an SIMT array can get, only latency 
tradeoffs.)
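
For reference, the aliasing trick quoted above amounts to nothing more 
than a rebased pointer; a minimal scalar C sketch (illustrative only):

#include <stddef.h>
#include <stdint.h>

/* D is simply B with its base address shifted by k elements, so the loop
 * body only ever uses the single running index i.  The caller must ensure
 * that i + k stays within B.
 */
static void add3(const int32_t *A, const int32_t *B,
                 int32_t *C, size_t n, size_t k)
{
    const int32_t *D = B + k;         /* the "fourth vector": B rebased by k */
    for (size_t i = 0; i < n; i++)
        C[i] = A[i] + B[i] + D[i];    /* i.e. A[i] + B[i] + B[i+k] */
}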

>> The only memory accesses in a proper FlexiVec loop are (1) loading the
>> input operands (once per element) and (2) storing the results (again
>> once per element).  How do you accomplish *any* calculation without
>> loading inputs and storing results?  (Note that in a graphics
>> application, that final store could be direct to the framebuffer.)
>>     
>
> you missed the point completely that some intermediary results may remain in registers, saving power consumption by not hitting L1 Cache or TLB Virtual Memory lookups/misses.
>   

I believe that we are talking past each other here:  FlexiVec can have 
intermediate values that are never written to memory, as in the complex 
multiply-add sample below.

>>> let us take FFT or better DCT as an example because there are more cosine coefficients, you cannot re-use DCT coefficients per layer.
>>> that means in Horizontal-Mode that *all* coefficients when performing multiple FFTs or DCTs are read-only and can be stored in-regs.
>>>  
>>>       
>> This also limits the maximum size of the transform that you can perform
>> with Simple-V to one where all data fits in the registers.
>>     
>
> in practice for large Matrix Multiply, large DCT and large FFT this is not a problem.
>
> each of those algorithms has, for decades, solutions that perform small subkernels (using available registers).
>   

Which introduces the same memory bandwidth pressure that plagues 
FlexiVec, does it not?  :-)

>>> in VVM / FlexiVec on the other hand you have *no other option* but to either re-compute the cosine coefficients on-the-fly (needs about 8 instructions inside the inner loop) or you are forced to use yet another LD-as-a-vector-substitute which puts additional unnecessary pressure on an already-extreme LD/ST microarchitecture.
>>>  
>>>       
>> The expected implementation would have the coefficients precomputed
>> (possibly at compile-time) and stored in memory somewhere.
>>     
>
> exactly.  and the only way to get them? use LD.
>
> as i said this increases power consumption and increases design pressure on LDST data pathways.
>
> these are O(N^2) costs and they are VERY high.
>
> (unacceptably high)
>   

I disagree on the exponent there:  I expect that pipelining (which any 
high performance system will use extensively) reduces that to O(N) in 
number of loads per loop iteration.  The predictability of the loop 
(mentioned next) is important here, because it assures that not only 
will pipelining be effective, but that hardware prefetching and the use 
of a store queue can keep the memory interface busy while the vector 
unit is calculating.  Further, prefetch buffers for FlexiVec can be 
independent of the regular caches, aside from assuring coherency.

>>> (get that wrong and you will have stalls due to LDST bottlenecks)
>>>  
>>>       
>> Ah, but the LD/ST operations repeated during a FlexiVec loop are
>> perfectly predictable.  
>>     
>
> the predictability is not relevant, if the cost of even having the extra LD/STs is a 50% increase in overall power consumption pushing the product out of competitive margins with existing products.
>   

How do existing products avoid that power consumption?  (I expect that 
they do not, therefore FlexiVec would indeed be competitive.)

>>> in SVP64 Vertical-First Mode you can also *reuse* the results of a Vector Computation not just as input but also as output, switching between Horizontal-First and Vertical-First at will and as necessary.
>>>  
>>>       
>> The price SVP64 pays for this flexibility is a hard limit on the maximum
>> amount of data that can be involved.
>>     
>
> at really *really* high performance (NVIDIA GPUs, 120W+, TFLOPs going on PFLOPs) this becomes an issue.  we are ~7 years away from that level, from the moment funding drops in the door.  i can live with that, particularly given the initial focus on "more achievable" initial goals such as a 3.5W SoC.
>   

Fair enough.  FlexiVec is something that can scale easily all the way 
from "tiny SoC" (likely using "FlexiVec-Null") to huge wafer-scale GPUs.

>>  Worse, that hard limit is
>> determined by the ISA because it is based on the architectural register
>> file size.
>>     
>
> turns out that having an ISA-defined fixed amount means that binaries are stable.  RVV and ARM with SVE2 are running smack into this one, and i can guarantee it's going to get ugly (esp. for SVE2).
>   

Interesting way to view that as a tradeoff.  Precise programmer-level 
optimization opportunities versus wide hardware scalability with fixed 
program code...

> see the comparison table on p2 (reload latest pdf). or footnote 21
> https://libre-soc.org/openpower/sv/comparison_table/
>   

...and ARM managed to bungle both of those with SVE2 if that is correct.

FlexiVec and RVV align on the issue of VL-independence, however -- I see 
that as an important scalability feature.  Instead of trying to tune for 
the hardware MAXVL (which is actually variable in FlexiVec; the hardware 
commits to a MAXVL at each FVSETUP), FlexiVec (and probably also RVV) 
takes the approach of "I have this much data; process it as quickly as 
hardware resources permit."  The use of an ISA-provided loop counter in 
the Power ISA probably helps here, since it allows hardware to predict 
the loop exactly.

This scalability is an important feature for FlexiVec:  the programmer 
will get the optimal performance from each hardware implementation 
(optimal for that hardware) with the /exact/ /same/ /loop/, including 
the null case where the loop simply runs on the scalar unit.

>>  FlexiVec avoids this limit because the FlexiVec vector
>> registers are hidden behind the architectural registers used as vector
>> handles and may each hold any number of elements.
>>     
>
> for a price (power consumption) that i judge to be too high.
> VVM (aka FlexiVec) is off the table in SV until that power consumption
> problem is fixed.
>   

You already have the null implementation.  :-)  (System software could 
simply resume upon the illegal instruction trap for FVSETUP.)

Also, since when has power consumption /ever/ been a concern for IBM?  :-)

>>> after having done one phase of FFT in Vertical-First, you go back to completing the rest of the algorithm in Horizontal-First.
>>>
>>> i *know* that VVM / FlexiVec cannot do that.
>>>  
>>>       
>> Perhaps I misunderstand the problem, but FlexiVec should be able to do
>> that by expanding the complex operations and allocating a few more
>> vector registers.
>>     
>
> how? the vector "registers" only exist as auto-allocated resources.  they are completely inaccessible outside of a particular "i" within any given loop.  "i" cannot *ever* access the elements of "i+1"
>   

Correct, but the example of complex multiply-add does not need such 
accesses, even if the vectors are stored interleaved (as struct Complex 
{ uint32_t real; uint32_t imag; } or so).  In the latter case, simply 
adjust the memory access strides to skip every other element.  In the 
example given, this would be accomplished by loading 8 into R12 instead 
of 4 and deriving the base addresses for the imaginary vectors from the 
base addresses of the combined vectors.
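
A quick C sketch of that stride adjustment, purely illustrative (the 
arithmetic is just the complex FMADD written out; the layout is the 
interleaved struct above):

#include <stddef.h>
#include <stdint.h>

struct Complex { uint32_t real; uint32_t imag; };

/* Instead of two packed arrays advanced by 4 bytes per element, the
 * interleaved layout is walked with an 8-byte stride, and the imaginary
 * stream is the same base address offset by one field (4 bytes).
 */
static void cmadd(const struct Complex *A, const struct Complex *B,
                  const struct Complex *C, struct Complex *D, size_t n)
{
    for (size_t i = 0; i < n; i++) {          /* stride = sizeof(struct Complex) = 8 */
        uint32_t ar = A[i].real, ai = A[i].imag;
        uint32_t br = B[i].real, bi = B[i].imag;
        D[i].real = ar * br - ai * bi + C[i].real;  /* Dr = Ar*Br - Ai*Bi + Cr */
        D[i].imag = ar * bi + ai * br + C[i].imag;  /* Di = Ar*Bi + Ai*Br + Ci */
    }
}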

> (except by ST of course and LD afterwards and we are *yet again* hammering memory beyond already-unacceptable limits) and further running into memory aliasing.
>
> by contrast SVP64 is quite happy to do a Fibonnaci Series of adds by *deliberately* summing off-by-one on vector input/output.
>   

This limitation in FlexiVec corresponds to permitting SIMT hardware 
implementation:  moving data between lanes is simply infeasible in the 
general case for those.

>> Complex FMADD is A*B+C -> D == (Ar + jAi)*(Br +jBi) + (Cr + jCi) -> (Dr
>> + jDi).
>>     
>
> from fixed coefficients, yes.
>
>   
>>     1:  lwaux   R20, R3, R12    ; REAL(A)
>>         lwaux   R21, R4, R12    ; IMAG(A)
>>         lwaux   R22, R5, R12    ; REAL(B)
>>         lwaux   R23, R6, R12    ; IMAG(B)
>>         lwaux   R24, R7, R12    ; REAL(C)
>>         lwaux   R25, R8, R12    ; IMAG(C)
>>     
>
> in SV these two (coefficients) or sorry it might be B may be stored in-regs (actual, real vector regs not in-flight autovectorised ones) which saves 25% LD/ST which is absolutely massive.
>
> Mitch saves this 25% by having highly efficient hardware SIN/COS with low latency.
>
> if he had not been able to do that (or a different algorithm had longer latency on re-computing loop-invariant coefficients) then the recalculation unfortunately becomes the critical path.
>
> at which point *your only option* is LDing the loop-invariant coefficients and we are back to unacceptable power consumption pressure.
>   

Is the problem here that you are trying to minimize memory accesses 
categorically?

>> In theory, spilling vector working addresses is minimal additional
>> pressure, 
>>     
>
> if you think it through i believe you will find that the Register Hazard Management for that is pretty hairy.
>   

No worse than spilling ordinary pointers to the stack.  The vector 
working addresses themselves are just ordinary pointers.

>> Function calls are theoretically possible inside FlexiVec loops, but
>> with some severe caveats, such as a /very/ non-standard ABI when
>> FlexiVec is active.
>>     
>
> i suggested the idea to Mitch, a year ago. he wouldn't go near it.
> given his experience i take that to be a bad sign given that
> FlexiVec===VVM / VVM===FlexiVec.
>   

Like I said, severe caveats.  While calling functions in a FlexiVec 
loop is /technically/ possible (and cannot be forbidden, due to the 
existence of FlexiVec-Null), the most likely result would be a program 
that works fine on a scalar-only processor and crashes horrendously on a 
vector-capable machine.  (Of course, this would be 100% the programmer's 
fault: attempting to spill a vector register (or calling a function 
that will do so) while FlexiVec is active would be documented as a 
programming error, even if it happens to work on some processors.)

>> These lead to some interesting possibilities for a vector coprocessor
>> using Simple-V and a large architectural register file.  Consider the
>> overall architecture of the CDC 6600, which had a big compute processor
>> and a much smaller I/O processor (with 10 hardware threads on the I/O
>> processor); analogously, the Power ISA core would be the I/O processor
>> dispatching work units to the vector processor.
>>     
>
> yes.  see
> https://libre-soc.org/openpower/sv/SimpleV_rationale/
>
> the concept you describe above is on the roadmap, after learning of Snitch and EXTRA-V. it's the next logical step and i'd love to talk about it with you at a good time.
>   

I have been turning an outline for a "Libre-SOC VPU ISA strawman" over 
in my head for a while.  Are you suggesting that I should write it down 
after all?

> when you get to create these "Coherent Synchronous Satellite" cores they need not be so fast as the main core, they can be much slower and more efficiently designed (no L2 cache needed, if they run at Memory speed. Snitch even bypasses L1 *and the register files entirely* knocking off an absolutely astonishing 85% power consumption).
>
> read the paper above and the links to other academic works Snitch, ZOLC and EXTRA-V. the novel bit if there is one is the "remote Virtual Memory Management" over OpenCAPI.
>   

I was thinking of just having a VPU interrupt hit the Power core when 
the VPU encounters a TLB miss.  The Power hypervisor (or hardware) then 
installs the relevant mapping and sends the VPU back on its way.  (The 
VPU can run in a different address space from its host processor's 
problem state, allowing the Power core's OS to task-switch a different 
program in when a thread blocks waiting for the VPU.)

> [...]
>>> correct.  you have to let the OoO Engine flush and fall back to Scalar *or* you pull Shadow Cancellation on all but those elements not relevant and then fall back to scalar...
>>>  
>>>       
>> Not quite:  spilling a FlexiVec vector register is /not/ /possible/ --
>> the called function /does/ /not/ /know/ how many words will be written
>> by STORE.  The main FlexiVec loop does not know this either.
>>     
>
> exactly.  so given speculative execution you could cancel (roll back, throw away) certain *lanes* to a known point and execute in scalar from that point, into the function.
>
> of course if the function is called in every loop that exercise is pointless you might as well just do scalar.  but, to be honest, i stopped thinking this through when Mitch did not pick up on it.
>   

No, FlexiVec does not require speculative execution -- an attempt to 
spill a vector register with a nonincrementing store will cause only the 
last element to be written to that address.  (Well, every element is 
written, but the last element is the "last overwrite" in this case...)  
The subsequent attempt to reload the vector (with a nonincrementing 
load) will instead splat a single value across the vector.  Oops.  If an 
incrementing store is used, all VL elements will be written, causing 
memory corruption on a sufficiently-large machine.  Oops.
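
A toy C model of that failure mode, in case it helps (this is 
emphatically *not* FlexiVec code, just the resulting memory behaviour):

#include <stddef.h>
#include <stdint.h>

/* "Spilling" a vector through a non-incrementing store leaves only the
 * last element at the spill slot; the non-incrementing reload then splats
 * that one value back across all VL elements.
 */
static void broken_spill_reload(uint32_t *vec, size_t vl, uint32_t *spill_slot)
{
    for (size_t e = 0; e < vl; e++)   /* same address for every element */
        *spill_slot = vec[e];         /* only vec[vl-1] survives */

    for (size_t e = 0; e < vl; e++)   /* same address for every element */
        vec[e] = *spill_slot;         /* every element becomes vec[vl-1] */
}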

I suppose you could reserve enough space for the /entire/ block of data 
the vector loop processes (since VL is bounded by CTR), but once again, 
that means a non-standard ABI (and saving/restoring effectively the 
entire vector state inside the vector loop just flushed your performance 
down the proverbial toilet).

>> I think that Simple-V would work better with more than 128 registers.
>>     
>
> 128 registers is already at the point where a L1 *register* cache (yes, a cache for registers) is going to be a real need, in order to avoid slowing down max clock rate.
>
> plus, Multi-Issue reg-renaming can help reuse regs.  certain computationally heavy workloads that doesn't work.  we just have to live with it for now.
>   

Extend the main pipeline with a multi-segment register file; each 
segment may pass the value it was given from the previous segment or 
read a value which will be passed along in turn.  Since the registers 
are wider than the number of bits required to select them, this only 
requires one additional bit to indicate if the value on the ALU input 
bus is a register number or a register value.  Decode simply submits an 
immediate as a value, otherwise register numbers are sent down the 
chain.  When the correct register bank is reached, the register number 
is replaced with its contents, the flag is changed to "value", and the 
data sent on.  A Machine Check Exception occurs if the ALU receives a 
register number instead of a value on either input port.  (This is a 
"cannot happen".)

If, instead, the full register number is sent down the chain along with 
its value, register writes can propagate the other direction through the 
register banks and each stage can also replace a "forward" value with 
the "writeback" value if the register numbers match.  Alternately, 
decode can track register hazards and stall the pipeline as needed.  
These are solved problems, although the solutions do increase latency.
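
If it helps, here is a toy C model of the basic pass-along step in the 
first variant (all names are mine, not a real design):

#include <stdint.h>

enum tag { TAG_REGNUM, TAG_VALUE };

/* One extra bit rides alongside the operand: is the payload a register
 * number or already a value?
 */
struct operand {
    enum tag tag;
    uint64_t payload;   /* register number while TAG_REGNUM, data once TAG_VALUE */
};

/* One register-file segment in the chain. */
struct segment {
    unsigned first_reg, last_reg;   /* register numbers held by this bank */
    uint64_t regs[32];              /* assume <= 32 registers per segment */
};

/* Pass an operand through one segment on its way to the ALU input bus. */
static struct operand segment_pass(const struct segment *seg, struct operand op)
{
    if (op.tag == TAG_REGNUM &&
        op.payload >= seg->first_reg && op.payload <= seg->last_reg) {
        op.payload = seg->regs[op.payload - seg->first_reg]; /* substitute contents */
        op.tag = TAG_VALUE;                                  /* flag flips to "value" */
    }
    return op;   /* otherwise forwarded unchanged to the next segment */
}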

>> There is a small but important distinction here:  FlexiVec processes an
>> /element group/ (defined as VL elements) on each iteration of the loop.
>>     
>
> yes. this is the autovectorisation quantity.  you have to watch out for memory aliasing.  a = b+16,  a[i] *= b[i]
>   

Yes, aliasing between inputs and outputs is a programming error in 
FlexiVec.  Aliasing between /inputs/ is fine, as long as none of them 
are written in the loop.  (Such aliasing is an error because the 
visibility of the updates depends on VL.)
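
A small C model of why that has to be an error (chunked processing 
stands in for element groups; the numbers and names are only for the 
sketch):

#include <stddef.h>
#include <stdint.h>

/* Process n elements in groups of vl: all loads of a group happen before
 * any of its stores.  If a and b alias (e.g. a == b + 16), the visible
 * result now depends on vl -- which is exactly why such aliasing must be
 * declared a programming error.
 */
static void chunked_mul(uint32_t *a, const uint32_t *b, size_t n, size_t vl)
{
    uint32_t tmp[64];                        /* assume vl <= 64 for the sketch */
    for (size_t i = 0; i < n; i += vl) {
        size_t chunk = (n - i < vl) ? n - i : vl;
        for (size_t e = 0; e < chunk; e++)   /* all loads of the group... */
            tmp[e] = a[i + e] * b[i + e];
        for (size_t e = 0; e < chunk; e++)   /* ...then all its stores */
            a[i + e] = tmp[e];
    }
}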

> this one was the subject of a LOT of comp.arch discussion of VVM.
>
> in SVP64 this "element group" quantity is the "hphint".  in Horizontal-First Mode it indicates the amount of the vector that is guaranteed to not have any Register Hazards cross-elements.  a[i] has no dependence on b[i+N] for any N within the "element group" even through memory aliasing.
> (this is not a guess it is a declaration, from knowledge possessed by the compiler)
>
> in Vertical First Mode it is the exact quantity of Horizontal elements executed. exactly equivalent to what you term VL in FlexiVec. but it is explicit.  the hardware is not permitted to choose the number of elements (except under explicitly requested Fail-First conditions, one of the Modes).
>   

In FlexiVec, MAXVL is /the/ hardware scalability parameter.  VL is 
simply MIN(MAXVL,CTR); i.e. either the maximum that the hardware can 
provide, or all remaining elements.  This avoids the need for an 
explicit "SETVL" instruction.

> [...]
>>> combined with element-width overrides you get space for 256 FP32s or 512 FP16s *and* space for 256 INT32s, or 512 INT16s, or 1024 INT8s.
>>>  
>>>       
>> Is there another bit to allow REX-prefixing without changing the meaning
>> of the prefixed opcode or would this also fill the entire prefixed
>> opcode map in one swoop?
>>     
>
> we're requesting 25% of the v3.1 64-bit EXT001 space already.
> one more bit would require 50% and i would expect the OPF ISA WG to freak out at that.
>   

Then I suspect they may freak out either way:  if I am reading the 
opcode maps correctly, simply adding REX prefixes without that 
additional "REX-only" bit would cause /collisions/ between existing 
64-bit opcodes and REX-prefixed versions of some 32-bit opcodes.  
Perhaps I am misreading the Power ISA opcode maps?


-- Jacob


