[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Wed Aug 3 22:48:59 BST 2022

On Wed, Aug 3, 2022, 14:31 Jacob Bachmeyer <jcb62281 at gmail.com> wrote:

> Jacob Lifshay wrote:
> > Do note that this trick only works well for integer add, floating
> > point add is not associative so must be run serially (assuming the
> > semantics are equivalent to running the code serially from element 0
> > to the end). SVP64 specifically has an O(log N) parallel tree
> > reduction mode to work around that.
>
> Why would that same parallel tree reduction mode (invisibly selected by
> hardware) not be suitable for each VL-element group, followed by serial
> accumulation of group sums into a scalar register?
>

because it gives a different answer due to rounding in a different order.

svp64 is designed such that every implementation will give the bit-exact
same answer, which is a very useful property.

the tree reduction mode has to be explicitly specified so the results will
be guaranteed to match rounding in a tree-reduction pattern, rather than a
serial pattern, so, no, invisibly selecting tree reduction won't work for
fp ops.
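
a minimal sketch of the ordering problem, in Python rather than SVP64 assembly. the function names are made up for illustration, and the pairwise tree below is a generic adjacent-pair reduction, not necessarily the exact schedule the SVP64 reduction mode specifies:

```python
def serial_sum(xs):
    """Add elements strictly in order, starting from element 0."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def tree_sum(xs):
    """Add adjacent pairs level by level (generic O(log N) depth tree)."""
    level = list(xs)
    if not level:
        return 0.0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over to the next level
        level = nxt
    return level[0]


# 1.0 is far below the rounding granularity of 2**60 in double precision,
# so where it gets absorbed depends on the order of the adds.
vals = [2.0**60, 1.0, -(2.0**60), 1.0]
print(serial_sum(vals))  # 1.0
print(tree_sum(vals))    # 0.0
```

same inputs, same add operation, two different answers, which is why the reduction order has to be part of the instruction's definition.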

>
> There are other possible hardware tricks, such as using
> wider-than-normal floating point for the invisible intermediate sums to
> avoid rounding errors,

those also give a third set of results, so are unsuitable unless the
instruction is explicitly specified to do that.
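
a rough illustration of the "third set of results", assuming single-precision elements and a double-precision hidden accumulator (both choices are assumptions for the example, and the helper names are hypothetical). `struct` is used only to round a Python float to the nearest float32:

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(xs):
    """Serial sum with every intermediate result rounded to float32."""
    acc = 0.0
    for x in xs:
        acc = f32(acc + f32(x))
    return acc


def sum_wide_then_round(xs):
    """Serial sum kept in double, rounded to float32 once at the end."""
    acc = 0.0
    for x in xs:
        acc += f32(x)
    return f32(acc)


# float32 values near 2**25 are spaced 4 apart, so each +1.0 is lost
# when rounding happens after every add, but not when it happens once.
vals = [2.0**25, 1.0, 1.0, 1.0, 1.0]
print(sum_f32(vals))              # 33554432.0
print(sum_wide_then_round(vals))  # 33554436.0
```

the wider accumulator is arguably more accurate here, but it still disagrees with the plain serial result, so it cannot be substituted silently.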

> or simply running a FP accumulate serially,

serial fp accumulate is really slow: you need the output from the
previous element's add before the current element's add can run, giving
an execution time for a 64-element vector on the order of 192 clock
cycles (64 adds at 3-cycle fadd latency), whereas the tree reduction
algorithm has a latency (assuming wide enough execution units) of 18
clock cycles (log2(64) levels * 3-cycle fadd latency) -- more than 10x
faster.
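
the arithmetic behind those figures, as a back-of-envelope model. it assumes one dependent add per element for the serial case, one add per tree level with enough execution units to run a whole level at once, and ignores issue and forwarding overhead. the function names are illustrative only:

```python
def serial_latency(n, fadd_latency):
    """Each add waits for the previous one: n dependent adds back to back."""
    return n * fadd_latency


def tree_latency(n, fadd_latency):
    """One add latency per tree level; ceil(log2(n)) levels in total."""
    levels = (n - 1).bit_length()  # integer ceil(log2(n)) for n >= 1
    return levels * fadd_latency


n, lat = 64, 3
print(serial_latency(n, lat))  # 192
print(tree_latency(n, lat))    # 18
print(serial_latency(n, lat) / tree_latency(n, lat))
```

for 64 elements the ratio comes out a little under 10.7.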

Jacob