[Libre-soc-dev] SVP64 parallel map-reduce idea

Mon Jun 14 15:37:31 BST 2021

On Sun, Jun 13, 2021, 15:26 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On 6/13/21, Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> > The example actually runs the reduce twice, once with a non-trivial
> > predicate, followed by once with an all-true predicate. you probably
> > confused the two outputs with each other.
>
> ah ok thank you.
>
> hm.
>
> then... a couple of options present themselves:
>
> 1) the algorithm you wrote is used as-is (which is a good one) as the
> canonical definition for parallel mapreduce
>
> 2) given that there is a serial mapreduce we don't have a parallel
> one, instead replace it with prefix-sum.
>

I'd say this option is a no-go, precisely because serial reduce gives a
different result than tree reduce for most fp ops (in particular fp add and
fp mul), meaning our cpu can't replace serial fp reduction with tree
reduction precisely because it gives a different result. Also, for input
numbers that are about the same size, tree reduce fp add gives numerically
superior results to serial reduce, since late in the serial addition we're
adding numbers around x and VL*x, resulting in losing the log2(VL) lsb
bits. tree reduction (for all-true predicate) ends up adding numbers around
2^N * x and 2^N * x, reducing precision loss.

I'd say go with option #3:
3) use the algorithm I provided for tree reduction, replace serial
reduction with tree prefix-sum.

for tree prefix-sum, maybe use:
https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_2:_Work-efficient
O(n) operations, O(log n) serial depth
-- basically it's an algorithm A which is a vector-index-reversed version
of the tree reduction algorithm I gave followed by something like a
time-reversed version of algorithm A. I only thought about the predicate
all-true version, it may need modification for non-all-true.

One other important note for the tree reduce operation: muladds need to
have the multiplies done as part of the initial copy-to-temp loop, only
adds are done in the tree-reduce loop.

>
> the serial one has an advantage that it inherently can and has to cope
> with carry-in, carry-out.
>

for carry-in/out (which is much less important imho) we just use:
adde r10.v, r20.v, r30.v # vector elements ordered from lsb to msb
since each vector element's adde is semantically executed after the
previous one, the carry is propagated from the previous element's adde and
to the next element's adde.
no serial reduce needed.

>
> question: does the following result in prefix sum behaviour:
>
> gpr(RT) = gpr(RA)
> for j = 0 to vl-1:
>    for i = 0 to j:
>       gpr(rt+j) = OP( gpr(ra+i), gpr(rb+i) )
>

I'd have to think through if that's a prefix sum (got vaccinated yesterday
so not feeling the best), however:
I think it doesn't matter because it's O(n^2) -- prefix sum serial is n
operations, tree-based prefix sum is at most O(n*log n) (when using the
least efficient tree prefix sum, O(n) for better ones; if assuming inner
loop elements are run in parallel: O(log n))

why would you pick an algorithm that takes 4096 steps when there are better
(faster) algorithms that take less than 1/50 as many steps (both serial
prefix-sum and work-reduced tree prefix sum (wikipedia link above))?!

Jacob