[Libre-soc-dev] SVP64 parallel map-reduce idea

Sat Jun 12 00:05:07 BST 2021

On Fri, Jun 11, 2021, 15:44 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> ---
> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
>
>
> On Fri, Jun 11, 2021 at 10:29 PM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > On Fri, Jun 11, 2021, 11:06 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >
> > > after implementing the scalar map-reduce in ISACaller, it occurred to
> me
> > > that a fixed (predictable, useful, non-ambiguous, clear) algorithm for
> > the
> > > Program Order in which the reductions has to take place might be a good
> > > idea.
> > >
> >
> > Here's my proposal:
> > we say all reductions behave as-if the following tree-reduction algorithm
> > with handling for predication is executed, even if the reduction
> operation
> > is not associative and/or not commutative (e.g. we use the same algorithm
> > for reduction with subtraction or division or atan2, every instruction
> that
> > can encode reduction in the SVP64 prefix):
> >
> >
> >
> https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3e49f5f64a00edcfc47d8114e23c76ed
>
>
> recorded here
> https://libre-soc.org/openpower/sv/svp64/appendix/?updated#index14h1
>
>
> /// reference implementation of proposed SimpleV reduction semantics.
> > ///
> > /// `temp_pred` is a non-user-visible register that can be stored in some
> > /// SPR if the reduction is interrupted,
>
>
> i'd strongly like this not to be the case.  there are an alarmingly high
> number of SPRs in OpenPOWER and we are going to need to add
> dozens more.
>

this would be one 64-bit register (maybe multiplexed into some other
register?) that is only required if we want to save redoing a cheap (just
predicate bits) calculation --- we probably want to just redo the
calculation and not need another SPR.

>
> also SPRs are 64-bit maximum.

yup.

  if using subvectors, now that's 4x 64-bit
> SPRs needed.
>

nope -- remember there's 1 predicate bit for each whole subvector, not one
per subvector element. so f32x4x64 needs 64 predicate bits, each bit
predicates a whole f32x4 subvector. A reduce over subvectors should produce
a subvector as the result, not a single element (though perhaps we need a
way to switch between the different possible reduction modes -- single
subvector result (where each op in the reduction tree is a f32x4 + f32x4 ->
f32x4 op, going f32x4x64 -> f32x4), single element result (going f32x4x64
-> f32), or reduction within subvectors (going f32x4x64 -> f32x64)).

>
> or we can just restart the
> > reduction
> > /// from the beginning since it will produce the same results.
> >
>
> yeah i don't like this either.
>

we could get away with recalculating just the predicate bits -- that should
be waay easier to do.

>
> the current draft allows the implementation to use the target
> (destination) register(s) - all of them - as temporary storage.
>
> it also states clearly that the srcstep and dststep may be used
> for the purposes of storing whatever indices state is required.
>
>     while step < vl {
>         step *= 2;
>         for i in (0..vl).step_by(step) {
>             let other = i + step / 2;
>
> intuitively i expect these (step and other) should be perfectly fine to fit
> into srcstep and dststep respectively.
>
> there is a specific advantage of allowing the target (destination)
> vector to be used as temporary storage: combined with a fixed
> (predictable) algorithm there *might* be use-cases for the *whole*
> temporary results rather than throwing them away.
>

yup! I can imagine how the temporary results are used to produce something
like a prefix-sum.

>
> if you read the sections on Reduce mode you'll see it's clearly
> laid out, the ability to use the vector result for temporary
> calculations.
>
yup! that's what the pseudo-code I wrote does.

Jacob