[Libre-soc-dev] twin predication and svp64

Fri Dec 11 07:36:11 GMT 2020

On Thu, Dec 10, 2020, 22:10 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> > what you seem to have meant is:
> > src = ...; // src is a vector
> > for i in 0..23 {
> >
> > }
>
> yes.  no.  dest[i] = src[i]
>
> now apply predication to the src, where one bit only is set in that
> src predicate.
>
> pred is:
>
>      if srcpred & (1<<i)
>             dest[i] = src[i]
>
> and if srcpred *equals* 1<<n *then* it is *as if* the op was
> macro-fused with mv.x n
>

can you fully write that out in pseudo-code since that sounds like a
single-element mv operation (a "vector extract" op, basically
scalar_dest=vec_src[index]) and not a splat at all. the key part of what
makes a splat is that one value/element is duplicated and written into
every element of the dest vector (except masked-off vector elements, of
course).
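To make the distinction concrete, here is a minimal Python sketch of the three operations as I understand them. The function names are hypothetical and only for illustration; predicates are modelled as plain integers with bit i controlling element i, and `predicated_mv` is just the quoted `if srcpred & (1<<i)` loop written out.

```python
def vector_extract(src, index):
    # scalar_dest = vec_src[index]: one element in, one scalar out
    return src[index]

def splat(value, dest, mask):
    # one value duplicated into every dest element that is not masked off
    return [value if (mask >> i) & 1 else d for i, d in enumerate(dest)]

def predicated_mv(src, dest, srcpred):
    # the quoted loop: if srcpred & (1 << i): dest[i] = src[i]
    return [s if (srcpred >> i) & 1 else d
            for i, (s, d) in enumerate(zip(src, dest))]

src = [10, 11, 12, 13]
dest = [0, 0, 0, 0]
one_bit = predicated_mv(src, dest, 1 << 2)       # moves a single element
everywhere = splat(src[2], dest, 0b1111)         # duplicates one element
```

With a single bit set in the predicate, the predicated move touches exactly one dest element, whereas the splat writes the same value to all unmasked elements.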

>
>
>
> >>
> >> > 2. gather/scatter (register to register, not load/store): twin
> >> > predication isn't actually powerful enough for a lot of what
> >> > scatter/gather is used for (majority of scatter/gather?) -- e.g.
> >> > twin predication can't do:
> >> > dest = [src[3], src[7], src[2], src[5], src[1], src[0], src[4], src[6]];
>

that is just 1 vector mv.x instruction.
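As a rough sketch, assuming vector mv.x behaves as dest[i] = src[index[i]] (the helper name `mv_x` is made up for this example), the permutation quoted above is a single indexed gather:

```python
def mv_x(src, indices):
    # assumed semantics: dest[i] = src[indices[i]] for every element i
    return [src[i] for i in indices]

src = [10, 11, 12, 13, 14, 15, 16, 17]
indices = [3, 7, 2, 5, 1, 0, 4, 6]   # the permutation from the quoted example
dest = mv_x(src, indices)
```

The index vector can be any sequence of element numbers, including repeats, which is what makes it fully general.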

>> >
> >> > so, mv.x would be used instead.
> >>
> >> and twin pred applies to mv.x
> >>
> >
> > yeah, but it's not necessary for mv.x to work and you can't emulate mv.x
> > using twin predication (well, technically you could,
>
> see above.
> set src or dest pred equal to 1<<r3 and it is exactly equivalent.
>

only for that specific mask, I was talking about the fully general vector
case.
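A small model of why the general case doesn't reduce to twin predication. This assumes twin predication pairs the k-th set bit of the source predicate with the k-th set bit of the dest predicate (my reading of the scheme, not something spelled out in this thread); the helper names are hypothetical.

```python
def twin_pred_pairs(srcpred, dstpred, vl):
    # assumed pairing: k-th enabled src element goes to k-th enabled dest element
    src_idx = [i for i in range(vl) if (srcpred >> i) & 1]
    dst_idx = [j for j in range(vl) if (dstpred >> j) & 1]
    return list(zip(src_idx, dst_idx))   # (src element, dest element) pairs

def is_order_preserving(pairs):
    # true when both src and dest element numbers strictly increase together
    return all(a[0] < b[0] and a[1] < b[1] for a, b in zip(pairs, pairs[1:]))

# single-bit source mask: behaves like extracting element 3 into dest element 0
single = twin_pred_pairs(1 << 3, 0b1, 8)

# a compaction-style pairing: still strictly increasing on both sides
compact = twin_pred_pairs(0b10101010, 0b00001111, 8)

# the arbitrary permutation dest[j] = src[perm[j]] written as (src, dest) pairs
perm = [3, 7, 2, 5, 1, 0, 4, 6]
perm_pairs = [(s, j) for j, s in enumerate(perm)]
```

Under that assumption every pairing twin predication can produce is order-preserving, while the permutation above is not, so it needs mv.x (or several passes).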

>
> very confusing though.
>
>
> >> two int regs as additional dependency hazards are not such a big deal
> >> (actually they are vector chain blockers, we established that last
> >> month)
> >>
> >
> > they only block chaining if the mask comes from a vector compare or
> > similar instead of a scalar op,
>
> need to think this through.
>
>
> > no, what I meant is it would be a single compound op that is issued to a
> > FU, the FU reads r3,
>
> again: it is very important that you understand how the architecture works.
>
> Arithmetic FUs may *not* read or write regfiles.  ever.  (you may be
> referring to a Predicate FU)
>
> they are supplied with all the operands they need, and proceed only
> when they have them all (one exception: AGEN in LDST).
>

It would work like AGEN in LD/ST in that it would wait for the input
corresponding to r3, then wait for the inputs corresponding to the selected
element (computed from the r3 value), ignoring all others, then do the
op, then write outputs. If it doesn't need a particular input, it can
finish and write the output before that input becomes ready.

>
> they produce results and wait until the DMs tell them they are free
> and clear to write them.
>
>
>
> > then computes the reg numbers and reads the src regs,
>
> doesn't work that way.
>
> the reg numbers must be calculated externally by a special "predicate
> manager" which receives the scalar int, breaks it into bits and pulls
> shadow cancel or proceed lines on the FUs which were allocated
> elements.
>
> (leave aside that SIMD units would need multiple predicate bits for now).
>
>
> > then does the underlying element op, then writes the dest regs.
>
> ok so you are referring to the arith FU which means you definitely are
> not aware of the Predicate FU for INT preds and its connection to
> shadows.
>

it would be an *augmented* arith FU -- those are also useful for
conditional move and int/fp select operations.

For those ops it is way more efficient to calculate which input is needed
and then do the regfile read (if that element is not the output of another
in-flight instruction), rather than read all possible inputs and have 64
input latches. Those augmented FUs could also be quite useful for vector
mv.x, since each FU handles 1 element of a mv.x.
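A purely software illustration of the read-count argument, not a description of any actual FU design. `CountingRegfile` and both select functions are hypothetical names; the model ignores in-flight results and only counts regfile reads.

```python
class CountingRegfile:
    """Toy register file that counts how many reads are performed."""
    def __init__(self, values):
        self.values = values
        self.reads = 0

    def read(self, regnum):
        self.reads += 1
        return self.values[regnum]

def select_read_all(regfile, candidates, sel):
    # read every possible input first (needs one latch per candidate)
    latched = [regfile.read(r) for r in candidates]
    return latched[sel]

def select_read_one(regfile, candidates, sel):
    # compute which input is needed, then read only that register
    return regfile.read(candidates[sel])

rf_all = CountingRegfile(list(range(100, 164)))
rf_one = CountingRegfile(list(range(100, 164)))
candidates = list(range(64))
a = select_read_all(rf_all, candidates, 5)
b = select_read_one(rf_one, candidates, 5)
```

Both variants produce the same result; the difference is 64 reads and latches versus 1.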

>
> they are completely separate entities.
>
> CR predicates on the other hand, one CR pred is wired direct to one FU
> shadow ( or even simpler just the write-reg result mask, no need for
> Pred FU, but this risks FU running empty if CR is zero)
>
>
> > I'd expect
> > r3 to change all the time, so stalling in decode won't work.
> >
> > doing it in the decode pipe is only a good idea for VL since VL rarely
> > changes
>
> mmm this is not quite true.

It's true for GPU code -- basically the entire shader will execute with
fixed VL.

> for small loops, let us say 10 elements, if
> MVL=8 then it gets set to 8 then 2 very quickly, esp. if the loop is
> only a few insns.
>
> most memcpys are less than 16 bytes.

for memcpy with compile-time constant size (vast majority, e.g. struct
copy), we can use setvli, which can be executed in-order in the decode
pipe, no pipe flush needed. This is part of the reason I advocated for
setvli to be non-complicated. If it's just a little smaller, it can compile
directly to a 64-bit load and a 64-bit store or similar code for other
sizes.
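For reference, the 10-element / MVL=8 case quoted above works out as follows, assuming each loop iteration sets VL = min(remaining, MVL). That rule and the helper name `vl_sequence` are assumptions for illustration only.

```python
def vl_sequence(n, mvl):
    # strip-mining: each pass handles min(remaining, MVL) elements
    seq = []
    remaining = n
    while remaining > 0:
        vl = min(remaining, mvl)
        seq.append(vl)
        remaining -= vl
    return seq

small_loop = vl_sequence(10, 8)    # VL values seen by the 10-element loop
exact_fit = vl_sequence(16, 8)     # size is a multiple of MVL
```

When n is a compile-time constant, the whole VL sequence is known statically, so each VL value can be an immediate rather than coming from a register.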

>
> > and is worth a pipe flush.
>
> which is the only reason why VL can be added to "state", alongside PC and
> MSR.

> yes hypothetically we can do speculative VL issue/execution by
> assuming VL will be a certain value just like with PC on branch
> prediction.
>
> i would prefer that we not do this initially, stalling instead at
> changes to VL.
>

yup, that's basically what I meant by pipe flush.

Jacob