[Libre-soc-dev] SVP64 parallel map-reduce idea

Sat Jun 12 00:31:50 BST 2021


On Sat, Jun 12, 2021 at 12:05 AM Jacob Lifshay <programmerjake at gmail.com>
wrote:

>
> nope -- remember there's 1 predicate bit for each whole subvector, not one
> per subvector element. so f32x4x64 needs 64 predicate bits, each bit
> predicates a whole f32x4 subvector.

i meant the value - the temporary result - not the SPR.  i wasn't
referring to the predicate.  ah, you thought i was.  ok.  right.  yes,
that's solved already: CR bits can be read on-the-fly, and for INT pred,
well, it's just one INT read, which is tolerable.

> A reduce over subvectors should produce
> a subvector as the result,

that's an additional separate option - reminder of the spec:
https://libre-soc.org/openpower/sv/svp64/

00 1 SVM CRM subvector reduce mode, SUBVL>1

> we could get away with recalculating just the predicate bits -- that should
> be waay easier to do.
>

yes, and if progress has got past the first layer there's no need for
the predicate.  which, in turn, tells us that it should be possible to
have a re-entrant algorithm which reads only the bits of the predicate
as needed.

you may be somewhat behind on development, Jacob: even the
TestIssuer reads predicate bits in advance.  CRs shouldn't have
to, but Cesar found it easier to "accumulate" all CR bits into a
mask.  the mask has to be re-created every time on interrupt return,
which is horribly slow, but the code's clear and readable, which is
more important.

> > there is a specific advantage of allowing the target (destination)
> > vector to be used as temporary storage: combined with a fixed
> > (predictable) algorithm there *might* be use-cases for the *whole*
> > temporary results rather than throwing them away.
> >
>
> yup! I can imagine how the temporary results are used to produce something
> like a prefix-sum.
>

yehyeh.  it gets particularly complex if the sources overlap, and... hmmm...
overlapping with the destination, that's... it gets complicated there.

for "standard" SVP64, Program Order is used to make iterative sums:
x[i+1] = y[i] + x[i] is done by issuing an instruction where the
destination overlaps the source, offset by one element.
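
to make the overlap trick concrete, here is a minimal python sketch.  it is
purely illustrative and is not the spec pseudocode: the flat "regs" list,
the function name and the register numbers are all made up for the example.
the only thing it assumes is that elements are processed strictly in
Program Order, one after the other.

```python
def vector_add_program_order(regs, dest, src_a, src_b, vl):
    """Element-by-element add, strictly in order (hypothetical helper).

    regs is a flat list standing in for a register file.  Because each
    element is written before the next one is read, a destination that
    overlaps a source (offset by one) turns the add into a running sum.
    """
    for i in range(vl):
        regs[dest + i] = regs[src_a + i] + regs[src_b + i]
    return regs


# x lives at regs[0..4], y lives at regs[8..11] (arbitrary choice)
regs = [1, 0, 0, 0, 0, 0, 0, 0, 10, 20, 30, 40]
# x[i+1] = y[i] + x[i]: destination is x offset by one, sources are y and x
vector_add_program_order(regs, dest=1, src_a=8, src_b=0, vl=4)
print(regs[0:5])  # [1, 11, 31, 61, 101]
```

each x[i+1] picks up the freshly-written x[i], so the result is a running
total of y on top of the initial x[0].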

but allowing the same thing for parallel reduce? can't get my head round it.
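
for comparison, a rough python sketch of a tree (parallel-style) reduce that
uses the vector itself as temporary storage.  this is NOT Jacob's
pseudo-code and NOT the SVP64 definition: it is one possible fixed schedule,
with a made-up function name, just to show which partial results get left
behind in the vector when nothing is thrown away.

```python
from operator import add


def tree_reduce_in_place(vec, op):
    """Pairwise tree reduction, intermediates stored back into vec.

    Hypothetical schedule: at each layer, element i absorbs element
    i + step, then step doubles.  Within one layer the operations are
    independent, so they could in principle run in parallel.
    The final result ends up in vec[0].
    """
    n = len(vec)
    step = 1
    while step < n:
        for i in range(0, n - step, step * 2):
            vec[i] = op(vec[i], vec[i + step])
        step *= 2
    return vec


v = tree_reduce_in_place([1, 2, 3, 4, 5, 6, 7, 8], add)
print(v)  # [36, 2, 7, 4, 26, 6, 15, 8]
```

v[0] is the full sum, and the leftover entries (7 = 3+4, 26 = 5+6+7+8,
15 = 7+8) are the partial sums of the fixed schedule.  here source and
destination are the same vector by construction; what happens when a
*different* source vector partially overlaps the destination is exactly
the part this sketch does not answer.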

> >
> > if you read the sections on Reduce mode you'll see it's clearly
> > laid out, the ability to use the vector result for temporary
> > calculations.
> >
> yup! that's what the pseudo-code I wrote does.
>

ah, star.  it's quite late, i'm going to have to read it several times,
possibly even make it executable and print out some "results"
(plus indices) to get it fully.

l.