[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Sun Aug 7 00:20:09 BST 2022

lkcl wrote:
> On Wed, Aug 3, 2022 at 10:31 PM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
>   
>> Why would that same parallel tree reduction mode (invisibly selected by
>> hardware)
>>     
>
> ... and made abundantly and absolutely clear in the spec that it is
> 100% without fail absolute guaranteed absolute without fail 100%
> deterministic under absolute all and any circumstances as specifically
> laid out in this executable pseudocode:
> https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/preduce.py;hb=HEAD
>   

The requirement for FlexiVec is that all parallel implementations must 
produce the same results as the null implementation.  There is always 
the option of doing a reduction in VL steps, each taking N cycles, 
simply shifting the values lane-by-lane towards the scalar unit, which 
does the actual calculation.
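
To illustrate why that requirement has teeth for floating point, here is 
a minimal sketch in Python.  The function names are hypothetical, and 
the pairwise tree below is just one plausible schedule; it is neither 
the preduce.py algorithm nor anything FlexiVec specifies.  It only 
shows that a strictly sequential ("null") sum and a tree-shaped sum can 
disagree for FP addition while agreeing for integers.

```python
def sequential_sum(values):
    """Null-implementation style: one element at a time, in order."""
    acc = values[0]
    for v in values[1:]:
        acc = acc + v
    return acc


def tree_sum(values):
    """Hypothetical pairwise tree: add neighbours, halve, repeat."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An odd element out is carried to the next level unchanged.
            nxt.append(level[-1])
        level = nxt
    return level[0]


# binary64 cannot represent 1e16 + 1, so the grouping decides the answer.
fp_values = [1e16, 1.0, -1e16, 1.0]
print(sequential_sum(fp_values))  # 1.0
print(tree_sum(fp_values))        # 0.0

# Integer addition is associative, so both schedules agree.
int_values = [1, 2, 3, 4, 5]
print(sequential_sum(int_values), tree_sum(int_values))  # 15 15
```

A parallel FP reduction is therefore only conforming if it reproduces 
the sequential grouping, or if the spec rules the difference out.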

Certain unavoidable deviations would be ruled out in the spec as 
programming errors.

>> There are other possible hardware tricks, such as using
>> wider-than-normal floating point for the invisible intermediate sums to
>> avoid rounding errors,
>>     
>
> the hard and inviolate rule has been set that the sub-vector
> element enumeration shall without fail be 100% Precise-Interruptible
> at any point in time and saveable/restorable.
>   

FlexiVec has always met this -- this is the reason that scalar registers 
are suggested to be internally used to track the progress of vector 
operations.
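
As a rough sketch of the idea (in Python, with hypothetical names, and 
assuming the progress state is nothing more than an element index plus 
a running accumulator), an element-at-a-time reduction can be stopped 
after any element, have that state saved and restored, and then 
continue to the same result:

```python
def reduce_steps(values, state=None, budget=None):
    """Run up to `budget` element steps of a sequential sum.

    `state` is an (index, accumulator) pair standing in for the scalar
    registers that would track progress; this layout is an assumption.
    Returns (state, done).
    """
    index, acc = state if state is not None else (0, 0.0)
    steps = 0
    while index < len(values) and (budget is None or steps < budget):
        acc = acc + values[index]
        index += 1
        steps += 1
    return (index, acc), index == len(values)


values = [1e16, 1.0, -1e16, 1.0]

# Uninterrupted run.
full_state, full_done = reduce_steps(values)

# "Interrupted" after two elements, state saved, then resumed.
saved, done = reduce_steps(values, budget=2)
resumed, resumed_done = reduce_steps(values, state=saved)

print(full_state == resumed)  # True
```

Everything needed to resume lives in the saved pair, so there is no 
hidden intermediate to lose.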

> an invisible wider-than-normal FP register has absolutely no
> possible place to be saved and therefore has no place in any
> ISA of this type.
>   

Wider-than-normal FP values would only exist in the relevant pipeline 
latches during a reduction.

> other Vector ISAs make the conscious decision to have such
> intermediary hardware and usually the penalties are that (a)
> the instructions are explicit vector-sum operations and (b)
> it is prohibited to interrupt the hardware in the middle of
> such summations OR it must be necessary to roll-back
> and re-begin the entire instruction.
>   

The latter would be expected: the reduction collects sums across all 
vector lanes, so holding a temporary until the instruction has actually 
completed uninterrupted (and can then commit) would not be an issue.

> none of these things i judged to be acceptable hence the
> hard rule of sticking to element-based operations.  if you
> want wider intermediate results use wider scalar elements.
>   

It turns out that using wider intermediates for parallel FP reduction 
may not work anyway, since the wider intermediate results could also 
avoid rounding that /would/ occur in a scalar calculation...
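
A small Python sketch of that effect, using binary32 elements with a 
binary64 accumulator as a stand-in for "wider-than-normal" hardware 
(the helper names are hypothetical, and the choice of widths is only an 
assumption for illustration):

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def scalar_sum_f32(values):
    """Scalar loop: every partial sum is rounded back to binary32."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


def wide_sum_f32(values):
    """Wider intermediate: accumulate in binary64, round once at the end."""
    acc = 0.0
    for v in values:
        acc += to_f32(v)
    return to_f32(acc)


# 2**24 + 1 is not representable in binary32, so the scalar loop drops
# each +1.0, while the wide accumulator keeps both of them.
values = [16777216.0, 1.0, 1.0]
print(scalar_sum_f32(values))  # 16777216.0
print(wide_sum_f32(values))    # 16777218.0
```

The wide result is arguably the more accurate one, but it is not the 
result the scalar calculation produces, which is the problem.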

...the other possibility is to simply declare FP "fuzzy" as it typically 
has been.  The issue here for FlexiVec is how strictly its host 
architecture specifies FP.  (I suspect Power ISA is quite exact here but 
have not checked.)

-- Jacob