[Libre-soc-isa] [Bug 697] SVP64 Reduce Modes

Wed Feb 2 11:44:09 GMT 2022

https://bugs.libre-soc.org/show_bug.cgi?id=697

--- Comment #8 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > there is already a "reverse gear" bit in SVP64
> 
> i don't think that does what we might need here...

the bit is available, so it can be made to mean whatever is needed.

https://libre-soc.org/openpower/sv/normal/

0-1     2       3 4     description
00      0       dz sz   normal mode
00      1       0 RG    scalar reduce mode (mapreduce), SUBVL=1
00      1       1 /     parallel reduce mode (mapreduce), SUBVL=1

so there's one bit spare which can still be made available
for a parallel-reduce reverse-gear mode

> reversed means it reduces like so (@ are adds;
> this one is more efficient on arm):

the scalar-reduce is a misnomer, it basically - note the careful/clumsy
negation - "prevents the vector-loop from stopping just because the
result is a scalar"

normally when a result is a scalar, the vector-loop terminates at the
very first element (because it's a scalar, duh).

however in scalar-reduce mode the looping does *not* stop, and so you
can do this:

 ADD r3.s, r3.s, r10.v

and that will do:

 for i in range(VL):
      # r3 is scalar, so its register index does not advance with i
      iregs[3] = iregs[3] + iregs[10+i]

the scalar-reduce bit tells the hardware-loop *NOT* to stop just because
there has been one result dropped into r3 already.  actual code:

 for i in range(VL):
      # r3 is scalar, so its register index does not advance with i
      iregs[3] = iregs[3] + iregs[10+i]
      if not scalar_reduce_mode:
          if RT is scalar:
              break
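
a minimal runnable sketch of the loop above, assuming a flat Python list stands in for the integer register file. the function name `sv_add` and its `*_is_vec` flags are hypothetical (not part of any spec), and a scalar operand is modelled simply as one whose register index does not advance with `i`:

```python
def sv_add(iregs, rt, ra, rb, VL, rt_is_vec, ra_is_vec, rb_is_vec,
           scalar_reduce_mode=False):
    """Sketch of an SVP64-style vector loop wrapped around a scalar add."""
    for i in range(VL):
        # a vector operand steps through consecutive registers; a scalar stays put
        dst = rt + (i if rt_is_vec else 0)
        src1 = ra + (i if ra_is_vec else 0)
        src2 = rb + (i if rb_is_vec else 0)
        iregs[dst] = iregs[src1] + iregs[src2]
        # normally a scalar destination ends the loop after the first element
        if not scalar_reduce_mode and not rt_is_vec:
            break
    return iregs


regs = [0] * 32
regs[3] = 100
regs[10:14] = [1, 2, 3, 4]

# ADD r3.s, r3.s, r10.v with VL=4, without and with scalar-reduce mode
normal = sv_add(list(regs), 3, 3, 10, 4, False, False, True)
reduced = sv_add(list(regs), 3, 3, 10, 4, False, False, True,
                 scalar_reduce_mode=True)
```

without the mode bit only the first element of r10.v is added into r3; with it, all VL elements are accumulated.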

the other trick is partial sum: by specifying the same register in src
and dest, offset by one, you get the *effect* of a reduction:

 ADD r4.v, r3.v, r10.v

 for i in range(VL):
      iregs[4+i] = iregs[3+i] + iregs[10+i]

thus you end up with *partial results* (partial sum) and, oh
look, the last element happens to be the full reduced sum.
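
the partial-sum effect can be checked with a small runnable sketch, again assuming a Python list as the register file and strictly sequential element execution. `partial_sum_add` is a hypothetical helper name:

```python
def partial_sum_add(iregs, rt, ra, rb, VL):
    """Vector add where RT = RA + 1, so each result feeds the next element."""
    for i in range(VL):
        iregs[rt + i] = iregs[ra + i] + iregs[rb + i]
    return iregs


regs = [0] * 32
regs[3] = 100                 # initial value that seeds the running sum
regs[10:14] = [1, 2, 3, 4]    # the vector being summed

# ADD r4.v, r3.v, r10.v with VL=4
partial_sum_add(regs, 4, 3, 10, 4)
```

r4..r7 now hold the running sums and r7 (the last element) holds the seed plus the full total.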

reverse-gear is *needed* here because you might want the result
to be at the start.

 for i in reversed(range(VL)):
      iregs[4+i] = iregs[3+i] + iregs[10+i]

also you might want the reverse-gear even when the result is a
scalar because by starting at the opposite end of the results,
FMACs (or FADDs) may end up producing different results depending
on the accuracy *and* order of the numbers being operated on.
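
a small illustration of the ordering point, using Python floats (IEEE 754 double precision). `scalar_reduce_fadd` and its `reverse_gear` flag are hypothetical names used only for this sketch:

```python
def scalar_reduce_fadd(acc, values, reverse_gear=False):
    """Accumulate values into acc one element at a time, forwards or backwards."""
    n = len(values)
    order = reversed(range(n)) if reverse_gear else range(n)
    for i in order:
        acc = acc + values[i]
    return acc


values = [1e16, 1.0, 1.0]
forward = scalar_reduce_fadd(0.0, values)
backward = scalar_reduce_fadd(0.0, values, reverse_gear=True)
```

going forwards, each 1.0 is lost to rounding against 1e16; going backwards, the two 1.0s combine into 2.0 first, which is large enough to survive. same inputs, same operation, different answer.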

note that the pseudocode above says how things must *look* as
far as *programmers* are concerned.  it does *NOT* dictate how
the hardware is actually implemented, just that it must have the
same net effect *as* the pseudocode.

if hardware implementors discover that the above pseudo-code
algorithms can be implemented as parallel tree-reduce *great*
[but what they can't do is implement a hardware algorithm that,
 under certain circumstances, produces a different result from
 that which would be produced by the pseudo-code]
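
as a sketch of what "same net effect" means: for wrapping 64-bit integer add, which is associative, a pairwise tree gives the same answer as the sequential pseudo-code, so a tree would be a legal implementation. the same is not true of floating-point add in general. both function names here are hypothetical:

```python
MASK64 = (1 << 64) - 1


def sequential_reduce(values):
    """Reference order: one element at a time, as in the pseudo-code."""
    acc = 0
    for v in values:
        acc = (acc + v) & MASK64
    return acc


def tree_reduce(values):
    """Pairwise tree: add neighbours, carry an odd leftover up a level."""
    vals = list(values)
    while len(vals) > 1:
        paired = [(vals[i] + vals[i + 1]) & MASK64
                  for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            paired.append(vals[-1])
        vals = paired
    return vals[0] if vals else 0


data = [MASK64, 1, 2, 3, 4]   # includes a wrap-around
seq_result = sequential_reduce(data)
tree_result = tree_reduce(data)
```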

so that's scalar-reduce.  moving on to parallel-reduce...

here's the algorithm (bottom of the page):
https://libre-soc.org/openpower/sv/svp64/appendix/

with that bit spare (bit 4) it can be allocated for anything-that-is-needed:
reverse-gear, reverse-mapping, or banana-fishing. something useful preferably.

my concern with that algorithm is that it critically relies on creating
state (the vi nonpredicated array) where there's actually no space to
put that state if there is an interrupt to be serviced in the middle
of the reduction operation

(all SVP64 operations are currently designed and required to be re-entrant)

with vi being indices, that's a hell of a lot of state. HOWEVER i *believe*
it may be possible to shoe-horn this algorithm into the SVREMAP system.

jacob could you please try morphing the algorithm into something that
looks pretty much exactly like this:

def reduction_yielder_for_RA():
    for something in something:
       yield offset_idx_for_RA

def reduction_yielder_for_RB():
    for something in something:
       yield offset_idx_for_RB

def reduction_yielder_for_RT():
    for something in something:
       yield offset_idx_for_RT

for i in range(some_function_of(VL)):
     RA_offs = yield reduction_yielder_for_RA()
     RB_offs = yield reduction_yielder_for_RB()
     RT_offs = yield reduction_yielder_for_RT()
     regs[RT+RT_offs] = regs[RA+RA_offs] + regs[RB+RB_offs]

note that that is quite literally the standard SVP64 vector loop,
but oh look! there happen to be some offsets added, how did those
get there? :)
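
to make the shape concrete, here is a runnable sketch of that structure. the offset schedule is an assumption made purely for illustration (a plain non-predicated in-place tree, result landing in element 0); it is NOT the algorithm from the appendix, and `tree_steps` and `remapped_add` are hypothetical names:

```python
def tree_steps(VL):
    """Hypothetical schedule: (left, right) element pairs of an in-place tree."""
    step = 1
    while step < VL:
        for i in range(0, VL - step, step * 2):
            yield i, i + step
        step *= 2


def reduction_yielder_for_RA(VL):
    for left, _right in tree_steps(VL):
        yield left


def reduction_yielder_for_RB(VL):
    for _left, right in tree_steps(VL):
        yield right


def reduction_yielder_for_RT(VL):
    for left, _right in tree_steps(VL):
        yield left


def remapped_add(regs, RT, RA, RB, VL):
    """The standard element loop; only the offsets come from elsewhere."""
    offsets = zip(reduction_yielder_for_RT(VL),
                  reduction_yielder_for_RA(VL),
                  reduction_yielder_for_RB(VL))
    for RT_offs, RA_offs, RB_offs in offsets:
        regs[RT + RT_offs] = regs[RA + RA_offs] + regs[RB + RB_offs]
    return regs


regs = [0] * 16
regs[8:13] = [1, 2, 3, 4, 5]
remapped_add(regs, 8, 8, 8, 5)   # RT = RA = RB = r8, VL = 5
```

the computation line never changes; swapping in a different schedule only means swapping the three yielders.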

the separation of the offsets from the computation itself will
fit directly into the SVREMAP system, and allow the front-end
multi-issue engine to run (independently) creating multiple
sequences of offsets in one clock cycle, to hit the back-end
parallel SIMD ALUs with as many operations as they can handle.

SVREMAP *can* store state (and is saved on an interrupt
context-switch) so you *can* do analysis of the predicate
bits and do-something-else-because-of-them().
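
a sketch of why offsets-from-a-schedule helps with re-entrancy, under the assumption (mine, for illustration only) that the position in the schedule is the only extra state to save. the tree schedule and the names `tree_schedule` and `run_steps` are hypothetical:

```python
def tree_schedule(VL):
    """Hypothetical (RT, RA, RB) offset triples for an in-place tree reduction."""
    sched = []
    step = 1
    while step < VL:
        for i in range(0, VL - step, step * 2):
            sched.append((i, i, i + step))
        step *= 2
    return sched


def run_steps(regs, base, VL, start, count):
    """Run up to `count` element operations from `start`; return the new position."""
    sched = tree_schedule(VL)
    pos = start
    while pos < len(sched) and pos < start + count:
        rt, ra, rb = sched[pos]
        regs[base + rt] = regs[base + ra] + regs[base + rb]
        pos += 1
    return pos


straight = [0] * 16
straight[8:13] = [1, 2, 3, 4, 5]
interrupted = list(straight)

run_steps(straight, 8, 5, 0, 100)             # uninterrupted run
saved = run_steps(interrupted, 8, 5, 0, 2)    # "interrupt" after two operations
run_steps(interrupted, 8, 5, saved, 100)      # resume from the saved position
```

because every intermediate value lives in the register file and the schedule is a pure function of VL and the step, stopping and resuming gives the same registers as running straight through.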

but please please note jacob that what we *cannot* have is *two*
different operations (unless one of them can be "faked-up" by
one of the operands being zero or one)

what we *cannot* have is:

for i in range(some_function_of(VL)):
     RA_offs = yield reduction_yielder_for_RA()
     RB_offs = yield reduction_yielder_for_RB()
     RT_offs = yield reduction_yielder_for_RT()

     if some_condition:
        regs[RT+RT_offs] = regs[RA+RA_offs] + regs[RB+RB_offs]
     else:
        regs[RT+RT_offs] = regs[RA+RA_offs]

the *only* way that could conceivably be done - and even this
is a bit iffy:

     if some_condition:
        src2 = regs[RB+RB_offs]
     else:
        src2 = 0

     regs[RT+RT_offs] = regs[RA+RA_offs] + src2

i.e. use zeroing (which would have to be assumed/implied, because
there's just no room in the SV RM Mode table for another bit)

it would still be iffy for multiply-reduce because the definition
of zeroing is, well, to put a zero in, not a one.
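
a quick sketch of why zeroing is fine for add but wrong for multiply: the substituted operand has to be the identity of the operation. `reduce_with_fill` and its `fill` parameter are hypothetical names for illustration:

```python
import operator


def reduce_with_fill(op, acc, values, mask, fill):
    """Single-operation reduce: masked-out elements are replaced by `fill`."""
    for v, m in zip(values, mask):
        src2 = v if m else fill
        acc = op(acc, src2)
    return acc


values = [2, 3, 4, 5]
mask = [True, False, True, True]   # element 1 is masked out

add_zeroed = reduce_with_fill(operator.add, 0, values, mask, fill=0)
mul_zeroed = reduce_with_fill(operator.mul, 1, values, mask, fill=0)
mul_ones = reduce_with_fill(operator.mul, 1, values, mask, fill=1)
```

with a zero substituted, the add skips the masked element as intended, but the multiply collapses to zero; only a fill of one gives the intended product.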
