[Libre-soc-dev] SVP64 parallel map-reduce idea

Fri Jun 11 23:44:15 BST 2021

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Fri, Jun 11, 2021 at 10:29 PM Jacob Lifshay <programmerjake at gmail.com>
wrote:

> On Fri, Jun 11, 2021, 11:06 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > after implementing the scalar map-reduce in ISACaller, it occurred to me
> > that a fixed (predictable, useful, non-ambiguous, clear) algorithm for
> the
> > Program Order in which the reductions has to take place might be a good
> > idea.
> >
>
> Here's my proposal:
> we say all reductions behave as-if the following tree-reduction algorithm
> with handling for predication is executed, even if the reduction operation
> is not associative and/or not commutative (e.g. we use the same algorithm
> for reduction with subtraction or division or atan2, every instruction that
> can encode reduction in the SVP64 prefix):
>
>
> https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3e49f5f64a00edcfc47d8114e23c76ed

recorded here
https://libre-soc.org/openpower/sv/svp64/appendix/?updated#index14h1

/// reference implementation of proposed SimpleV reduction semantics.
> ///
> /// `temp_pred` is a non-user-visible register that can be stored in some
> /// SPR if the reduction is interrupted,

i'd strongly like this not to be the case.  there are an alarmingly high
number of SPRs in OpenPOWER and we are going to need to add
dozens more.

also SPRs are 64-bit maximum.  if using subvectors, now that's 4x 64-bit
SPRs needed.

or we can just restart the
> reduction
> /// from the beginning since it will produce the same results.
>

yeah i don't like this either.

the current draft allows the implementation to use the target
(destination) register(s) - all of them - as temporary storage.

it also states clearly that the srcstep and dststep may be used
for the purposes of storing whatever indices state is required.

    while step < vl {
        step *= 2;
        for i in (0..vl).step_by(step) {
            let other = i + step / 2;

intuitively i expect these (step and other) should be perfectly fine to fit
into srcstep and dststep respectively.

there is a specific advantage of allowing the target (destination)
vector to be used as temporary storage: combined with a fixed
(predictable) algorithm there *might* be use-cases for the *whole*
temporary results rather than throwing them away.

if you read the sections on Reduce mode you'll see it's clearly
laid out, the ability to use the vector result for temporary
calculations.
https://libre-soc.org/openpower/sv/svp64/appendix/

l.