[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Sun Jul 31 02:57:40 BST 2022

lkcl wrote:
> finally got some time.
>
> On Wed, Jul 27, 2022 at 1:10 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
>
>   
>> You could also roll SVP64 as a custom extension for the initial
>> revisions of Libre-SOC hardware and propose FlexiVec as another
>> solution.  :-)  (Or slip the hardware schedule ("oops, Simple-V turned
>> out to be a blind alley") and propose FlexiVec as a Contribution.)
>>     
>
> i would do so if i had not had over a year to think it through and had not come up with Vertical-First Mode.
>
> VVM/FlexiVec specifically and fundamentally rely on LD/ST.
>   

Sort of; LD/ST are the input/output operations for a vector calculation, 
yes, but intermediate values are held in the vector unit registers.  The 
FlexiVec model works for calculations that "run along" some number of 
elements, where each step uses the same offsets in the array.  For 
example, FlexiVec can trivially do A[i] + B[i] -> C[i] and (by shifting 
the base address of B) A[i] + B[i+k] -> C[i], but can only do A[i] + 
B[i] + B[i+k] -> C[i] by introducing a fourth vector D aliased to B with 
offset k and computing A[i] + B[i] + D[i] -> C[i].
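
A minimal Python sketch of that aliasing trick, for illustration only.  The
function name and the list-based "memory" are hypothetical; the point is just
that D is the same storage as B viewed from a base shifted by k, so every
vector in the loop is still indexed by the same running element number i.

```python
def flexivec_sum3(a, b, k, n):
    """Compute C[i] = A[i] + B[i] + B[i+k] for i in range(n)."""
    # D is not a copy of B: it is B's storage seen from base offset k.
    d_base = k
    c = []
    for i in range(n):
        # Every operand uses the same index i; only the bases differ.
        c.append(a[i] + b[i] + b[d_base + i])  # D[i] aliases B[i+k]
    return c


a = [1, 2, 3, 4]
b = [10, 20, 30, 40, 50, 60]
print(flexivec_sum3(a, b, 2, 4))  # [41, 62, 83, 104]
```

Note that B must hold at least n+k elements for the aliased view to stay in
bounds.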

> i feel you are drastically underestimating the power penalty of GPU/VPU memory accesses which are sustained *per clock* at least TEN TIMES that of CPU workloads.  plus reliance on LDST bandwidth increases the pressure.
>   

The only memory accesses in a proper FlexiVec loop are (1) loading the 
input operands (once per element) and (2) storing the results (again 
once per element).  How do you accomplish *any* calculation without 
loading inputs and storing results?  (Note that in a graphics 
application, that final store could be direct to the framebuffer.)

> let us take FFT or better DCT as an example because there are more cosine coefficients, you cannot re-use DCT coefficients per layer.
>
> let us take the maxim from Jeff Bush's work to do as much in-regs as possible.
>
> therefore i designed the FFT and DCT REMAP subsystem *specifically* to be in-place, in-regs, the entire triple-loop.
>
> that means in Horizontal-Mode that *all* coefficients when performing multiple FFTs or DCTs are read-only and can be stored in-regs.
>   

This also limits the maximum size of the transform that you can perform 
with Simple-V to one where all data fits in the registers.
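
To put rough numbers on that limit, here is a small Python sketch using the
register-file figures quoted further down in this message (128 registers of
64 bits each, with element-width overrides).  The helper names are
hypothetical, and treating one whole register file as data storage for an
in-place complex transform is a simplifying assumption:  it ignores
coefficients and temporaries, so the real ceiling is lower.

```python
REGS = 128      # registers per file, as quoted from the SVP64 side
REG_BITS = 64   # bits per register


def max_elements(elwidth_bits):
    """Elements of the given width that fit in one register file."""
    return REGS * REG_BITS // elwidth_bits


def max_complex_points(elwidth_bits):
    # Assumption: data only, two elements (real, imaginary) per point.
    return max_elements(elwidth_bits) // 2


for width in (32, 16, 8):
    print(width, max_elements(width), max_complex_points(width))
# 32 256 128
# 16 512 256
# 8 1024 512
```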

> in VVM / FlexiVec on the other hand you have *no other option* but to either re-compute the cosine coefficients on-the-fly (needs about 8 instructions inside the inner loop) or you are forced to use yet another LD-as-a-vector-substitute which puts additional unnecessary pressure on an already-extreme LD/ST microarchitecture.
>   

The expected implementation would have the coefficients precomputed 
(possibly at compile-time) and stored in memory somewhere.

> (get that wrong and you will have stalls due to LDST bottlenecks)
>   

Ah, but the LD/ST operations repeated during a FlexiVec loop are 
perfectly predictable.  The loop itself is perfectly predictable, 
trivially:  if CTR > VL, the loop will run another iteration, so 
hardware can already be prefetching the next set of inputs while the 
computation proceeds, and simply stack the store operations into a 
write-behind queue, flushed while the computation proceeds on the next 
loop iteration.
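
A sketch of why the access pattern is known in advance, written as a Python
model rather than hardware.  It assumes that CTR counts elements and drops by
VL on each pass, which is my reading of the "CTR > VL" test above;
`element_groups` is a hypothetical name.  Given only n and VL, every group's
starting element and size can be listed before any data is touched, which is
what lets a prefetcher run ahead and a write-behind queue drain behind.

```python
def element_groups(n, vl):
    """Return (start, count) for each loop pass over n elements (n >= 1)."""
    groups = []
    ctr, start = n, 0
    while True:
        count = min(ctr, vl)          # the last group may be short
        groups.append((start, count))
        if ctr > vl:                  # the test above: another pass follows
            ctr -= vl
            start += count
        else:
            break
    return groups


print(element_groups(10, 4))  # [(0, 4), (4, 4), (8, 2)]
```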

> in SVP64 Vertical-First Mode you can also *reuse* the results of a Vector Computation not just as input but also as output, switching between Horizontal-First and Vertical-First at will and as necessary.
>   

The price SVP64 pays for this flexibility is a hard limit on the maximum 
amount of data that can be involved.  Worse, that hard limit is 
determined by the ISA because it is based on the architectural register 
file size.  FlexiVec avoids this limit because the FlexiVec vector 
registers are hidden behind the architectural registers used as vector 
handles and may each hold any number of elements.

> a good example is FFT for which complex fmadd/sub (four of them) is required.  we decided not to add complex-fmadd/sub right now because it is too much.
>
> simply using Vertical-First it is possible to get away with using a batch of Scalar temporary regs, inputs sourced from Vector regs, outputs after going through *multiple* Scalar temporary regs, end up back in Vector regs.
>
> after having done one phase of FFT in Vertical-First, you go back to completing the rest of the algorithm in Horizontal-First.
>
> i *know* that VVM / FlexiVec cannot do that.
>   

Perhaps I misunderstand the problem, but FlexiVec should be able to do 
that by expanding the complex operations and allocating a few more 
vector registers.

Complex FMADD is A*B+C -> D == (Ar + jAi)*(Br +jBi) + (Cr + jCi) -> (Dr 
+ jDi).

Expanding the multiplication yields (Ar*Br + Ar*jBi + jAi*Br + jAi*jBi) 
+ (Cr + jCi) -> (Dr + jDi).

Collecting terms and reducing j^2 yields ((Ar*Br - Ai*Bi) + j(Ar*Bi + 
Ai*Br)) + (Cr + jCi) -> (Dr + jDi).

Splitting into independent real and imaginary sub-calculations yields:

    REAL:  Ar*Br - Ai*Bi + Cr -> Dr
    IMAG:  Ar*Bi + Ai*Br + Ci -> Di
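
As a sanity check on the algebra, the split can be compared against Python's
built-in complex type.  This is only a check of the two formulas above;
`complex_fmadd_split` is a hypothetical helper name.

```python
def complex_fmadd_split(ar, ai, br, bi, cr, ci):
    """Return (Dr, Di) for D = A*B + C using only real arithmetic."""
    dr = ar * br - ai * bi + cr   # REAL:  Ar*Br - Ai*Bi + Cr -> Dr
    di = ar * bi + ai * br + ci   # IMAG:  Ar*Bi + Ai*Br + Ci -> Di
    return dr, di


a, b, c = complex(3, 4), complex(1, -2), complex(5, 6)
d = a * b + c
print(complex_fmadd_split(3, 4, 1, -2, 5, 6))  # (16, 4)
print(d)                                       # (16+4j)
```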

We can now assign vector registers.  We will need 6 inputs, 2 
intermediate temporaries, and 2 outputs also used as temporaries, for a 
total of 10 vector registers.  Assuming that we are working in 
fixed-point fractional multiplication (such that the high halves of the 
product words can be ignored; otherwise I do not know how to write it in 
Power assembler), the code using FlexiVec is along these lines:

----

	[address of REAL(A) in R3, IMAG(A) in R4]
	[address of REAL(B) in R5, IMAG(B) in R6]
	[address of REAL(C) in R7, IMAG(C) in R8]
	[address of REAL(D) in R9, IMAG(D) in R10]
	[vector length in elements in R2]
	[assemble vector configuration in memory at address in R11...]
	[...declaring R20 - R29 as vectors of words]
	fvsetup	R11
	; we are using 32-bit elements, so 4 bytes per element
	li	R12, 4
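	; update-form accesses add R12 to the base before each access, so
	; bias every base by -4 to make the first access land on element 0
	addi	R3, R3, -4
	addi	R4, R4, -4
	addi	R5, R5, -4
	addi	R6, R6, -4
	addi	R7, R7, -4
	addi	R8, R8, -4
	addi	R9, R9, -4
	addi	R10, R10, -4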
	; load counter
	mtctr	R2
	; begin vector loop
    1:	lwaux	R20, R3, R12	; REAL(A)
	lwaux	R21, R4, R12	; IMAG(A)
	lwaux	R22, R5, R12	; REAL(B)
	lwaux	R23, R6, R12	; IMAG(B)
	lwaux	R24, R7, R12	; REAL(C)
	lwaux	R25, R8, R12	; IMAG(C)
	; interleave real/imaginary calculations
	;  real parts in even vector registers
	;  imaginary parts in odd vector registers
	mullw	R28, R20, R22	; Ar*Br -> Dr (accumulating)
	mullw	R29, R20, R23	; Ar*Bi -> Di (accumulating)
	mullw	R26, R21, R23	; Ai*Bi -> Tr
	mullw	R27, R21, R22	; Ai*Br -> Ti
	subf	R28, R26, R28	; Dr - Tr -> Dr
	add	R29, R27, R29	; Di + Ti -> Di
	add	R28, R28, R24	; Dr + Cr -> Dr
	add	R29, R29, R25	; Di + Ci -> Di
	; store outputs
	stwux	R28, R9, R12	; REAL(D)
	stwux	R29, R10, R12	; IMAG(D)
	bdnz	1b
	; end vector loop

----
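
For readers who do not read Power assembler, here is a rough Python model of
the loop above.  It is a sketch under stated assumptions, not a definition of
FlexiVec:  each "vector register" is modelled as a Python list holding one
element group of at most vl elements, word width and overflow are ignored,
and `complex_fmadd_loop` is a hypothetical name.  The variable names mirror
the register numbers in the listing.

```python
def complex_fmadd_loop(a_re, a_im, b_re, b_im, c_re, c_im, vl):
    """Model D = A*B + C over arrays, vl elements per loop pass."""
    n = len(a_re)
    d_re, d_im = [0] * n, [0] * n
    for base in range(0, n, vl):              # one pass of the loop at 1:
        g = slice(base, min(base + vl, n))    # this pass's element group
        r20, r21 = a_re[g], a_im[g]           # loads: REAL(A), IMAG(A)
        r22, r23 = b_re[g], b_im[g]           # loads: REAL(B), IMAG(B)
        r24, r25 = c_re[g], c_im[g]           # loads: REAL(C), IMAG(C)
        r28 = [x * y for x, y in zip(r20, r22)]   # Ar*Br -> Dr
        r29 = [x * y for x, y in zip(r20, r23)]   # Ar*Bi -> Di
        r26 = [x * y for x, y in zip(r21, r23)]   # Ai*Bi -> Tr
        r27 = [x * y for x, y in zip(r21, r22)]   # Ai*Br -> Ti
        r28 = [d - t for d, t in zip(r28, r26)]   # Dr - Tr -> Dr
        r29 = [d + t for d, t in zip(r29, r27)]   # Di + Ti -> Di
        r28 = [d + c for d, c in zip(r28, r24)]   # Dr + Cr -> Dr
        r29 = [d + c for d, c in zip(r29, r25)]   # Di + Ci -> Di
        d_re[g], d_im[g] = r28, r29               # stores: REAL(D), IMAG(D)
    return d_re, d_im


args = ([3, 1, 0], [4, 1, 2], [1, 2, -1], [-2, 0, 3], [5, 0, 1], [6, 1, 1])
print(complex_fmadd_loop(*args, vl=2))  # ([16, 2, -5], [4, 3, -1])
```

The result does not depend on vl in this model, which is the property that
lets software stay ignorant of the hardware vector length.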

This is using about half of the upper limit for efficient simultaneous 
vectors with the fixed-point register file, but if FlexiVec were to be 
used with VSX, we could use still more vectors efficiently.  You can 
reach a bit farther by spilling vector working addresses (note:  *not* 
the actual vector data!) to the stack instead of keeping them all in 
registers throughout the loop, but optimal performance requires holding 
the working addresses in registers.  In the above example, R3 - R10 are 
used for this purpose.

In theory, spilling vector working addresses adds minimal pressure, 
since the vector unit is expected to have its own (wider) path to memory 
anyway.  An OoO implementation could spill/load working addresses 
to/from the stack while the vector unit is calculating or even also 
accessing memory:  the relevant part of the stack should fit in L1 
cache, and the scalar unit probably has a private L1 cache w.r.t. the 
vector unit.

> also, i am keenly aware that Mitch's expertise here led him to design VVM as it is, because of decades of experience and even then it was a good couple of years in the making.  no, function calls inside VVM loops are not permitted and he has endeavoured to explain why, and it would take many months to comprehend.
>   

Function calls are theoretically possible inside FlexiVec loops, but 
with some severe caveats, such as a /very/ non-standard ABI when 
FlexiVec is active.

> [...]
>>   I had the impression that past a
>> certain level of complexity, with certain (GPU-like) constraints on the
>> processing model, OoO becomes infeasible.
>>     
>
> within the realm of 4-8 cores for embedded low to mid end SoCs typically MALI 400 MP or Vivante GC800/1000 where if you are handling 1920x1080 @ 30fps you're doing well, i believe it's feasible.
>
> we are not aiming for 120 watts, here, as a first ASIC. we're aiming for a maximum *3.5* watts, the entire SoC including a 0.5 watt budget for the DDR Memory interface.
>   

These lead to some interesting possibilities for a vector coprocessor 
using Simple-V and a large architectural register file.  Consider the 
overall architecture of the CDC 6600, which had a big compute processor 
and a much smaller I/O processor (with 10 hardware threads on the I/O 
processor); analogously, the Power ISA core would be the I/O processor 
dispatching work units to the vector processor.

>> It is also worth noting here that IBM is known for advanced CPUs and is
>> /not/ known for advanced graphics hardware.
>>     
>
> yes. and the silver lining on that is that they left the Scalar ISA pretty much untouched.  VSX was (is) the primary focus, but also you have to understand that their business revolves around IO throughput and handling massive data sets.
>
> this makes it perfect for applying SVP64 precisely because the Scalar ISA is so lean.
>   

This is also why I suggest that FlexiVec is likely to be a better "fit" 
for the Power ISA.

> [...]
>>> VVM also explicitly identifies (in equivalent of fvsetup) those registers
>>> that are loop-invariant, in order to save on RaW/WaR Hazards. this
>>> is also extremely important
>>>  
>>>       
>> In practice, I think FlexiVec requires all non-vector registers to be
>> either memory addresses (incremented as the loop works through the data)
>> or invariant.  Otherwise, any change to scalar register values would
>> have effects varying with VL, since scalar operations are only executed
>> once per every VL elements.  
>>     
>
> no, i distinctly recall seeing assembler examples using scalar registers as intermediaries where Mitch outlined how the exact same Auto-SIMD-i-fication could be applied to them, *if* they were correspondingly identified as being useable as such, by the LOOP initialisation instruction.
>
> this is down to his gate-level architectural expertise.
>   

Then this is a subtle difference between VVM and FlexiVec.  (I think.)

> [...]
>> In fact, this is a limitation of function calls in FlexiVec loops:  you
>> /cannot/ spill a vector register to the stack because you do not know
>> its length,
>>     
>
> correct.  you have to let the OoO Engine flush and fall back to Scalar *or* you pull Shadow Cancellation on all but those elements not relevant and then fall back to scalar...
>   

Not quite:  spilling a FlexiVec vector register is /not/ /possible/ -- 
the called function /does/ /not/ /know/ how many words will be written 
by STORE.  The main FlexiVec loop does not know this either.

>> so functions must be specially written for the loops that
>> will call them.  
>>     
>
> Mitch very specifically forbids functions within loops.  or, you *might* be able to have them but the LOOP will fall back to Scalar behaviour.
>   

A function call within a FlexiVec loop would be better described as an 
out-of-line assembler macro than a proper function.

> [...]
>> I hate to say this, but I do not think that you will get the performance
>> you want with Simple-V and any existing CPU ISA.  You will probably need
>> to develop a new GPU-type ISA, with very long register files.
>>     
>
> Jacob answered this already.  MALI Broadcom VideoCore IV Vivante AMDGPU all have 128 registers.
>   

I think that Simple-V would work better with more than 128 registers.

>> FlexiVec is a hybrid between VVM and "classic" Cray vectors, then.
>>     
>
> it really isn't. a Cray Scalable Vector ISA is specifically defined as Horizontal-First Scalable (elements are 100% processed in full up to VL before moving to the next instruction).
>
> VVM and FlexiVec are very specifically Memory-based Vertical-First (instructions are processed in a loop in full, before moving to the next element)
>
> VVM/FlexiVec it is VL-based Index-incrementing that is the *outer* loop.
>
> Cray traditional Vectors it is VL-based Index-incrementing that is the *inner* loop.
>   

There is a small but important distinction here:  FlexiVec processes an 
/element group/ (defined as VL elements) on each iteration of the loop.  
Each vector instruction completes VL elements and may iterate to do so.  
This iteration may be variable:  for example, memory access instructions 
may always stop at page boundaries when issued as parallel operations 
across a vector chain and restart in the middle of the element group 
after the next page is resolved.  Each instruction processes VL elements 
before the PC advances to the next instruction.
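
The ordering can be made concrete with a small Python trace generator.  This
is an illustrative model only:  `group_at_a_time` and the three-instruction
"program" are hypothetical, and the sub-iteration within an instruction (page
boundaries and so on) is not modelled.

```python
def group_at_a_time(n, vl, program):
    """List (instruction, element) pairs in the order they complete."""
    trace = []
    for base in range(0, n, vl):                 # one loop pass per group
        for instr in program:                    # PC advances per instruction
            for i in range(base, min(base + vl, n)):
                trace.append((instr, i))         # all of the group, then next
    return trace


print(group_at_a_time(4, 2, ["ld", "add", "st"]))
# [('ld', 0), ('ld', 1), ('add', 0), ('add', 1), ('st', 0), ('st', 1),
#  ('ld', 2), ('ld', 3), ('add', 2), ('add', 3), ('st', 2), ('st', 3)]
```

In this model, vl = 1 gives strict element-at-a-time order (the null
implementation), while vl >= n gives the order of a classic vector machine
running each instruction over the whole array.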

> [...]
>> FlexiVec vectors *do* actually exist in hardware somewhere, although the
>> null implementation uses the scalar registers to store single-element
>> "vectors" and an OoO implementation can use scratchpads instead of
>> dedicated vector storage.  Perhaps FlexiVec is effectively the VVM
>> programming model applied to "classic" vectors.
>>     
>
> my understanding from what you have explained in that assembler example is that they are exactly the same underlying concept.  VVM creates the appearance or effect of Vectors from LDST, so does FlexiVec.
>
> if you have something different in mind, i need to see more assembler examples (apologies)
>   

Perhaps the example above (complex multiply-add) will shed some 
additional light on the issue?

In fact, as far as I understand at the moment, VVM very likely *is* a 
valid FlexiVec implementation.  Another valid implementation is SIMT, 
where the scalar unit drives a chain of vector processing elements that 
operate in parallel.  The SIMT version is the long hardware explanation 
previously given, while the VVM "hold the vector loop in-flight" model 
is a valid implementation for an OoO processor.

I believe this to be an argument in favor of FlexiVec at the ISA level:  
radically different hardware implementations can both efficiently 
process the same software loop.

> [...]
>>> the most important take-away is the insights from Jeff Bush,
>>> and his extremely in-depth focus on performance/watt (pixels/watt).
>>>  
>>>       
>> Which means that Simple-V may not be a suitable fit for Power ISA any
>> more than it fit in RISC-V.  OpenSPARC or another high-register-count
>> ISA might be useful, or possibly a dedicated Libre-SOC GPU architecture,
>> with an OpenPOWER (sans vector facilities) control unit in the actual SOC.
>>     
>
> again, as jacob explained, now you know why we increased the number of 64 bit regs to 128.
>
> this is why there are *nine* bits in the EXTRA area of the precious 24 bit prefix dedicated to extending RA, RB, RC, RT and RS, and FRA...FRS, and CR Field numbering, from 32 entries to 128 entries.
>
> combined with element-width overrides you get space for 256 FP32s or 512 FP16s *and* space for 256 INT32s, or 512 INT16s, or 1024 INT8s.
>   

Is there another bit to allow REX-prefixing without changing the meaning 
of the prefixed opcode or would this also fill the entire prefixed 
opcode map in one swoop?

>>> plus, i am following the style of Power ISA 3 itself, which is multiple
>>> books.
>>>  
>>>       
>> OK, then split the document and recombine it into multiple "books" in a
>> single PDF, with each book a freestanding sub-proposal.  
>>     
>
> that's what it is.  that's exactly how it is.  if you reload the pdf you'll see the wording which says precisely "these are independent".
>   

The part III preamble should probably be more specific that "each 
chapter in this part is a freestanding proposal" and III.1 should 
probably not be named "SV Vector ops" then.

-- Jacob