[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Wed Aug 3 22:31:31 BST 2022

Jacob Lifshay wrote:
> On Tue, Aug 2, 2022 at 9:53 PM Jacob Bachmeyer via Libre-soc-dev
> <libre-soc-dev at lists.libre-soc.org> wrote:
>   
>> lkcl wrote:
>>     
>>> i have a feeling that Mitch worked out how to do it.  FMAC
>>> having in effect a Scalar accumulator (src==dest) whilst
>>> other operands get tagged as vectors, HW can detect that and
>>> go "ah HA! what you *actually* want here is a horizontal
>>> sum, let me just microcode that for you".
>>>
>>>       
>> Well, now that I think about it, yes, FlexiVec *can* express a
>> horizontal sum by accumulating into a scalar register.  Hardware
>> recognizes this very simply:  an ADD targeting a scalar register RX,
>> using that same RX and a vector register RY.  This will also work with
>> the null implementation.
>>     
>
> Do note that this trick only works well for integer add. Floating
> point add is not associative, so it must be run serially (assuming the
> semantics are equivalent to running the code serially from element 0
> to the end). SVP64 specifically has an O(log N) parallel tree
> reduction mode to work around that.

Why would that same parallel tree reduction mode (invisibly selected by 
hardware) not be suitable for each VL-element group, followed by serial 
accumulation of group sums into a scalar register?
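
To make the question concrete, here is a rough Python model of that
scheme: a pairwise tree reduction inside each VL-element group, then
serial accumulation of the group sums into a scalar. The names
tree_reduce, grouped_sum and serial_sum are made up for illustration,
and Python floats (IEEE 754 doubles) stand in for whatever element type
the hardware would use. This is only a sketch and not a description of
either SVP64 or FlexiVec.

```python
def tree_reduce(group):
    """Pairwise (O(log N) depth) sum of one VL-element group."""
    vals = list(group)
    if not vals:
        return 0.0
    while len(vals) > 1:
        # Add adjacent pairs; an odd trailing element is carried unchanged.
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]


def grouped_sum(values, vl, acc=0.0):
    """Tree-reduce each group of up to vl elements, then add the group
    sums serially into the scalar accumulator acc."""
    for start in range(0, len(values), vl):
        acc = acc + tree_reduce(values[start:start + vl])
    return acc


def serial_sum(values, acc=0.0):
    """Reference semantics: strictly serial, element 0 to the end."""
    for v in values:
        acc = acc + v
    return acc


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16, 1.0]
    print(serial_sum(data), grouped_sum(data, vl=4))
```

For integers the two orders always agree. For floating point they can
differ, as the sample data above shows, so the scheme only fits if the
architecture does not promise results identical to the serial loop.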

There are other possible hardware tricks, such as using
wider-than-normal floating point for the invisible intermediate sums to
reduce rounding error, or simply running the FP accumulation serially:
shift the values across the vector lanes (access to the adjacent lane
is feasible in an SIMT vector unit) and accumulate them in the scalar
unit.
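
As a small illustration of the wider-intermediate idea, the Python
sketch below emulates single-precision elements by round-tripping
through struct, and compares a serial sum that rounds to single
precision after every add against one that keeps the running sum in
double precision and rounds once at the end. The helper names to_f32,
narrow_sum and wide_sum are hypothetical, and using a double as the
"wider" format is an assumption made purely for the example.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest IEEE 754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def narrow_sum(values):
    """Serial sum with the accumulator rounded to single after each add."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


def wide_sum(values):
    """Serial sum kept in double precision, rounded to single once."""
    acc = 0.0
    for v in values:
        acc = acc + to_f32(v)
    return to_f32(acc)


if __name__ == "__main__":
    # 2**24 followed by four ones: each 2**24 + 1 is a tie in single
    # precision and rounds back down, so the narrow sum never moves.
    data = [16777216.0, 1.0, 1.0, 1.0, 1.0]
    print(narrow_sum(data), wide_sum(data))
```

The wide version still executes in strict element order, so it keeps
the serial semantics while losing less to intermediate rounding.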

-- Jacob