[Libre-soc-dev] twin predication and svp64

Fri Dec 11 06:09:33 GMT 2020

On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:

> what you seem to have meant is:
> src = ...; // src is a vector
> for i in 0..23 {
>
> }

yes.  no.  dest[i] = src[i]

now apply predication to the src, where one bit only is set in that
src predicate.

pred is:

     if srcpred & (1<<i)
            dest[i] = src[i]

and if srcpred *equals* 1<<n *then* it is *as if* the op was
macro-fused with mv.x n
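
for illustration, a minimal sketch of the predicated loop above in python. this is a toy model only: registers are plain lists, and the names `predicated_copy`, `src`, `dest` and `srcpred` are assumptions made for the example, not part of any spec.

```python
def predicated_copy(src, dest, srcpred):
    """Copy src[i] to dest[i] only where bit i of srcpred is set."""
    for i in range(len(src)):
        if srcpred & (1 << i):
            dest[i] = src[i]
    return dest


src = [10, 11, 12, 13, 14, 15, 16, 17]
n = 5
# with srcpred equal to 1<<n, exactly one element (element n) is processed
out = predicated_copy(src, [0] * len(src), 1 << n)
print(out)
```

with a single bit set, the whole vector op collapses to one element operation on the element selected by n.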

>>
>> > 2. gather/scatter (register to register, not load/store): twin predication
>> > isn't actually powerful enough for a lot of what scatter/gather is used for
>> > (majority of scatter/gather?) -- e.g. twin predication can't do:
>> > dest = [src[3], src[7], src[2], src[5], src[1], src[0], src[4], src[6]];
>> >
>> > so, mv.x would be used instead.
>>
>> and twin pred applies to mv.x
>>
>
> yeah, but it's not necessary for mv.x to work and you can't emulate mv.x
> using twin predication (well, technically you could,

see above.
set src or dest pred equal to 1<<r3 and it is exactly equivalent.

very confusing though.
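
to make it a bit less confusing, here is a rough sketch, assuming a simplified model of a twin-predicated mv in which masked-out source and dest elements are each skipped independently. `twin_pred_mv` and `mv_x` are hypothetical helper names for this example, and the skipping loop is my simplification, not the definitive pseudocode.

```python
def twin_pred_mv(src, dest, srcpred, destpred, vl):
    """Toy twin-predicated move: skip masked-out src and dest elements."""
    i = j = 0
    while True:
        # advance i to the next enabled source element
        while i < vl and not (srcpred >> i) & 1:
            i += 1
        # advance j to the next enabled destination element
        while j < vl and not (destpred >> j) & 1:
            j += 1
        if i >= vl or j >= vl:
            break
        dest[j] = src[i]
        i += 1
        j += 1
    return dest


def mv_x(src, index):
    """Toy indexed move: pick the element selected by a scalar index."""
    return src[index]


src = [10, 11, 12, 13, 14, 15, 16, 17]
r3 = 5
# srcpred = 1<<r3 with a single dest element behaves like mv.x r3
single = twin_pred_mv(src, [0] * 8, 1 << r3, 1 << 0, 8)
print(single[0], mv_x(src, r3))
# several bits set: elements are compacted, but their order is preserved
packed = twin_pred_mv(src, [0] * 8, 0b10101010, 0b00001111, 8)
print(packed)
```

the second call shows the limitation raised in the quoted text: in this model the selected elements always keep their relative order, so an arbitrary permutation needs one single-bit mv per element (or mv.x).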

>> two int regs as additional dependency hazards are not such a big deal
>> (actually they are vector chain blockers; we established that last
>> month)
>>
>
> they only block chaining if the mask comes from a vector compare or similar
> instead of a scalar op,

need to think this through.

> no, what I meant is it would be a single compound op that is issued to a
> FU, the FU reads r3,

again: it is very important that you understand how the architecture works.

Arithmetic FUs may *not* read or write regfiles.  ever.  (you may be
referring to a Predicate FU)

they are supplied with all the operands they need, and proceed only
when they have them all (one exception: AGEN in LDST).

they produce results and wait until the DMs tell them they are free
and clear to write them.

> then computes the reg numbers and reads the src regs,

doesn't work that way.

the reg numbers must be calculated externally by a special "predicate
manager" which receives the scalar int, breaks it into bits and pulls
shadow cancel or proceed lines on the FUs which were allocated
elements.

(leave aside that SIMD units would need multiple predicate bits for now).
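
a toy software model of that idea, purely as a sketch: the function name `predicate_manager`, the FU names and the "proceed"/"cancel" strings are all assumptions for illustration, and real hardware would of course be signals on shadow lines, not a dict.

```python
def predicate_manager(pred, allocated_fus):
    """Break a scalar predicate into bits, one per allocated FU.

    allocated_fus[i] is the FU that was allocated element i.
    Returns which shadow line (proceed or cancel) is pulled on each FU.
    """
    lines = {}
    for i, fu in enumerate(allocated_fus):
        lines[fu] = "proceed" if (pred >> i) & 1 else "cancel"
    return lines


# four elements issued to four FUs, predicate 0b0101 enables elements 0 and 2
print(predicate_manager(0b0101, ["fu0", "fu1", "fu2", "fu3"]))
```

the point of the sketch is that the arithmetic FUs never look at the predicate register themselves: they are only told to proceed or cancel.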

> then does the underlying element op, then writes the dest regs.

ok so you are referring to the arith FU which means you definitely are
not aware of the Predicate FU for INT preds and its connection to
shadows.

they are completely separate entities.

CR predicates on the other hand: one CR pred is wired direct to one FU
shadow (or even simpler, just the write-reg result mask: no need for
Pred FU, but this risks the FU running empty if the CR is zero)

> I'd expect
> r3 to change all the time, so stalling in decode won't work.
>
> doing it in the decode pipe is only a good idea for VL since VL rarely
> changes

mmm this is not quite true.  for small loops, let us say 10 elements with
MVL=8, VL gets set to 8 then 2 very quickly, esp. if the loop is
only a few insns.
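
the arithmetic, as a small sketch assuming VL is simply set to min(remaining elements, MVL) on each loop iteration (`vl_sequence` is a made-up name for the example):

```python
def vl_sequence(n_elements, mvl):
    """Return the VL values seen by a strip-mined loop over n_elements."""
    seq = []
    remaining = n_elements
    while remaining > 0:
        vl = min(remaining, mvl)  # VL is capped at MVL
        seq.append(vl)
        remaining -= vl
    return seq


print(vl_sequence(10, 8))
```

so VL changes on every iteration of a short loop like this one.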

most memcpys are less than 16 bytes.

> and is worth a pipe flush.

which is the only reason why VL can be added to "state", alongside PC and MSR.

yes hypothetically we can do speculative VL issue/execution by
assuming VL will be a certain value just like with PC on branch
prediction.

i would prefer that we not do this initially, stalling instead at
changes to VL.

l.