[Libre-soc-dev] twin predication and svp64

Fri Dec 11 05:36:34 GMT 2020

On Thu, Dec 10, 2020, 21:00 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
> > On Thu, Dec 10, 2020, 19:12 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >
> >> jacob it took me a while to spot the predication table added to svp64,
> >> i made some more notes and some questions.
> >>
> >
> > Ok. I probably should have mentioned I added it.
>
> got there in the end.
>
> > Please don't insert a text block in the middle of a table, instead you
> > could add * or similar to items and add a footnote at the end, since that
> > doesn't block flow.
>
> tired, late, signalled "you remove it after review".
>
> >>
> >> twin predication is critically important, it is how we cover vgather
> >> scatter reduce splat insert, masses more.
> >>
> >
> > actually twin predication may be used quite a bit less than you might
> > think:
> > 1. splat: covered by a vector dest and scalar src
>
> except single bit in scalar twin pred src is equivalent to macro-op
> mv.x merged in.
>

that's a different kind of splat that's way less common. what I meant was
like:
v = ...;
for i in 0..23 {
    dest[i] = v; // v is a scalar
}

what you seem to have meant is:
src = ...; // src is a vector
for i in 0..23 {
    // splat since n doesn't depend on i
    dest[i] = src[n]; // get a specific element from the src vec
}
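
A minimal runnable sketch of the two kinds of splat above. The function names and the small vector length are illustrative assumptions, not anything from the svp64 spec:

```python
def splat_scalar(v, vl):
    """Vector dest, scalar src: every dest element gets the scalar v."""
    return [v for _ in range(vl)]


def splat_element(src, n, vl):
    """Vector dest, vector src: every dest element gets src[n].

    It is still a splat because n does not depend on the loop index.
    """
    return [src[n] for _ in range(vl)]


print(splat_scalar(7, 4))                      # [7, 7, 7, 7]
print(splat_element([10, 11, 12, 13], 2, 4))   # [12, 12, 12, 12]
```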

>
> > 2. gather/scatter (register to register, not load/store): twin
> > predication isn't actually powerful enough for a lot of what
> > scatter/gather is used for (majority of scatter/gather?) -- e.g. twin
> > predication can't do:
> > dest = [src[3], src[7], src[2], src[5], src[1], src[0], src[4], src[6]];
> >
> > so, mv.x would be used instead.
>
> and twin pred applies to mv.x
>

yeah, but it's not necessary for mv.x to work, and you can't emulate mv.x
using twin predication (well, technically you could, just like you can
emulate it without using vector ops at all, but it would take many
instructions and be inefficient).
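
To make the "many instructions" point concrete, here is a rough sketch. It assumes mv.x behaves as an indexed move (dest[i] = src[idx[i]]) and that a twin-predicated move with a single bit set in each mask moves exactly one element. All function names are hypothetical:

```python
def mv_x(src, idx):
    """Indexed move: dest[i] = src[idx[i]], so any permutation is one op."""
    return [src[k] for k in idx]


def twin_pred_single_bit_mv(src, dest, src_bit, dest_bit):
    """Model of ONE twin-predicated move with masks 1<<src_bit, 1<<dest_bit.

    Assumption: with a single bit set in each mask, exactly one element
    is moved, from src[src_bit] to dest[dest_bit].
    """
    dest = list(dest)
    dest[dest_bit] = src[src_bit]
    return dest


def emulate_mv_x(src, idx):
    """Emulate mv.x with one twin-predicated move per element."""
    dest = [None] * len(idx)
    ops = 0
    for j, i in enumerate(idx):
        dest = twin_pred_single_bit_mv(src, dest, i, j)
        ops += 1
    return dest, ops


src = [10, 11, 12, 13, 14, 15, 16, 17]
idx = [3, 7, 2, 5, 1, 0, 4, 6]   # the permutation quoted above
perm = mv_x(src, idx)            # one instruction
emulated, ops = emulate_mv_x(src, idx)
print(perm)                      # [13, 17, 12, 15, 11, 10, 14, 16]
print(emulated == perm, ops)     # True 8
```

The emulation needs one op per element (8 here, and it would also need the masks set up each time), against a single mv.x.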

>
> > 3. vector compaction, expansion (take all elements with mask bit set to 1
> > and move to a compact list, as well as the inverse op): twin predication
> > is good at this, since twin predication is exactly equivalent to a
> > compaction followed by an expansion.
> >
> > In fact, twin predication with either the src or dest set to ALWAYS is a
> > simple way to encode expansion or compaction respectively, assuming we
> > just stop when either the src or dest index reaches VL
>
> mandatory to do so.
>
> > and don't error if they
> > have differing numbers of set bits.
>
> you're starting to get it.
>
> > My idea for how twin predication would work is that one or the other mask
> > could come from an integer reg, so I wouldn't worry as much about
> > needing 2 cr-based predicates, though that wouldn't be a reserved encoding.
>
> two int regs as additional dependency hazards are not such a big deal
> (actually they are vector chain blockers, we established that last
> month)
>

they only block chaining if the mask comes from a vector compare or similar
instead of a scalar op. I'd expect the mask to come from a scalar li or
other alu op more often. I don't anticipate GPU shaders using twin
predication all that often; shaders will mostly just be masked using
compare results and boolean combinations of compare results, where src and
dest masks are always identical.
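
For reference, a small sketch of twin predication modelled as a compaction followed by an expansion, as described in the quoted text above. The function names, the integer bitmask representation (bit i covers element i) and the ALWAYS constant are assumptions made for illustration:

```python
def compact(src, mask, vl):
    """Pack the src elements whose mask bit is 1 into a dense list."""
    return [src[i] for i in range(vl) if (mask >> i) & 1]


def expand(packed, dest, mask, vl):
    """Write packed elements, in order, into dest slots whose mask bit is 1.

    zip() stops at the shorter input, so differing numbers of set bits
    are not an error: the move simply ends early.
    """
    dest = list(dest)
    slots = [j for j in range(vl) if (mask >> j) & 1]
    for j, value in zip(slots, packed):
        dest[j] = value
    return dest


def twin_pred_mv(src, dest, src_mask, dest_mask, vl):
    """Twin-predicated move modelled as compaction then expansion."""
    return expand(compact(src, src_mask, vl), dest, dest_mask, vl)


VL = 4
ALWAYS = (1 << VL) - 1           # all mask bits set
src = [10, 11, 12, 13]
zeros = [0, 0, 0, 0]

print(twin_pred_mv(src, zeros, 0b1010, ALWAYS, VL))  # compaction: [11, 13, 0, 0]
print(twin_pred_mv(src, zeros, ALWAYS, 0b1010, VL))  # expansion:  [0, 10, 0, 11]
print(twin_pred_mv(src, zeros, 0b0110, 0b0110, VL))  # same masks: [0, 11, 12, 0]
print(twin_pred_mv(src, zeros, 0b0001, 0b0110, VL))  # 1 vs 2 bits: [0, 10, 0, 0]
```

With identical src and dest masks (the shader case) the result is the same as an ordinary single-predicated move: element i goes to element i wherever the mask bit is set.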

>
> > I'd imagine the masks would often be computed using a `li` or `andi`
> > right before the twin-predicated instruction, so that would work out well.
> > before the twin-predicated instruction, so that would work out well.
> >
> >>
> >> oh.  i have an idea for the reserved encoding in predication: 1<<r3.
> >> single bit.
> >>
> >
> > sounds like a really good idea!
>
> no idea why it took so long to think up.
>
> > This would allow us to optimize it to a
> > single element op at the decode stage, single-cycle adding directly to
> > the register numbers after adjusting for elwidth and turning into a nop
> > if r3 >= VL, preventing issuing many useless elements only to be masked
> > out.
>
> unfortunately this assumes that reading the regfile is possible and
> acceptable at the decode phase, which it most definitely is not.
>

no, what I meant is it would be a single compound op that is issued to a
FU, the FU reads r3, then computes the reg numbers and reads the src regs,
then does the underlying element op, then writes the dest regs. I'd expect
r3 to change all the time, so stalling in decode won't work.
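
Roughly, the compound op could be modelled like this. It is only a behavioural sketch: the flat register list, the function and parameter names are made up for illustration, and elwidth adjustment is ignored (one element per register is assumed):

```python
def single_bit_pred_op(regs, r3_value, vl, src_base, dest_base, op):
    """Behavioural model of the FU-side compound op for predicate 1 << r3.

    Steps: read r3, turn into a nop if r3 >= VL, otherwise compute the
    element's register numbers, read src, apply op, write dest.
    Assumption: default elwidth, so element n lives in register base + n.
    """
    if r3_value >= vl:
        return list(regs)            # nop: the single set bit is beyond VL
    regs = list(regs)
    src_reg = src_base + r3_value    # register number of the one live src
    dest_reg = dest_base + r3_value  # register number of the one live dest
    regs[dest_reg] = op(regs[src_reg])
    return regs


regs = list(range(16))               # register n initially holds n
done = single_bit_pred_op(regs, 2, 4, 8, 0, lambda x: x + 100)
skipped = single_bit_pred_op(regs, 5, 4, 8, 0, lambda x: x + 100)
print(done[2])                       # 110 (register 10 held 10, plus 100)
print(skipped == regs)               # True, since r3 >= VL gives a nop
```

Only one element op is ever performed, instead of issuing VL elements and masking out all but one.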

doing it in the decode pipe is only a good idea for VL since VL rarely
changes and is worth a pipe flush.

Jacob