[Libre-soc-dev] twin predication and svp64

Fri Dec 11 20:18:55 GMT 2020

On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:

>> * the src pred is set to "all 1s"
>> * the src is set to "scalar"
>> * the dest pred is set to "all 1s"
>> * the dest is set to "vector"
>>
>
> yes, but that happens for any scalar -> vector op, not just twin predicated
> ones, so I wouldn't call it a benefit of twin predication specifically.

the pseudocode for arithmetic ops is completely different: single
predication only.  one predicate applies to *both* srcs *and* dest.
the reason is that when you also throw elwidths into the mix the
routing gets really hairy.
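
a minimal sketch of the single-predicated arithmetic case, assuming a flat python list stands in for the INT regfile and ignoring elwidths entirely.  the names (sv_add, ra_vec, rb_vec) are illustrative only, not taken from the spec:

```python
def sv_add(regs, rd, ra, rb, VL, pred, ra_vec=True, rb_vec=True):
    """Single-predicated element-wise add (illustrative sketch).

    One mask gates the sources *and* the destination: if bit i of
    pred is clear, element i is neither read nor written.
    """
    for i in range(VL):
        if not (pred >> i) & 1:
            continue  # masked-out element: skipped entirely
        a = regs[ra + (i if ra_vec else 0)]  # a scalar source always reads ra
        b = regs[rb + (i if rb_vec else 0)]  # a scalar source always reads rb
        regs[rd + i] = a + b                 # destination assumed vector here
    return regs


regs = list(range(32))
sv_add(regs, rd=16, ra=0, rb=8, VL=4, pred=0b0101)
# elements 0 and 2 are written, elements 1 and 3 are left untouched
```

there is only one index (i) and one mask, which is the difference from the twin-predicated mv.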

yes twin predication in some cases covers cases that single predicated
arithmetic src1/2 also covers.

for example, addi r3, r4, 0 is an *arithmetic* version of "mr".  so is
ori r3, r4, 0

should we start stripping out potential combinations of
twin-predicated mv just because there are single-predicated pseudo-ops
that do exactly the same job?

i do not feel that this would be a productive use of our time when we
have so much to do.

>> * the src pred is set to "all 1s"
>> * the src is set to "scalar"
>> * the dest pred is set to "1<<r3"
>> * the dest is set to "vector"
>>
>
> this happens for any scalar -> vector with a mask of 1<<r3, twin
> predication is not necessary for it to work.

i hear what you are saying: it is good to have identified that there
are redundant ways to achieve the same thing (like there are multiple
ways to do mr, such as addi r3, r4, 0 and ori as well). however i do
not believe that any action should be taken.

>>      ireg[rs] = ireg[rd+ireg[r3]]
>>
>
> no, you get:
> ireg[rd+ireg[r3]] = ireg[rs]

sigh.  logic dyslexia kicking in.  well spotted.

>>
> Ok, I think the issue is that when I was saying mv.x, I meant the
> vectorized version:
> for i in 0..VL {
>     let idx = reg[ra + i];
>     if idx >= VL {
>         trap();
>     }
>     reg[rd + i] = reg[rb + idx];
> }

ahh riiight.   yes.  this one.  i don't even know what to call it. i
do remember we last discussed it... a year ago?

yes absolutely, twin predication cannot, "on its own" cover this case
[ it can however augment it, in very interesting - read completely
loopy mind-corkscrewing - ways].

twin predication is like an ordered sequenced back-to-back VGATHER VSCATTER.

as such, because the predicate is a single read (in INT pred... ok ok,
2x INTs), it has less impact on the OoO execution engine and needs
fewer resources.
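
a minimal sketch of the twin-predicated mv loop, to make the "back-to-back VGATHER VSCATTER" picture concrete.  it assumes a python list for the regfile, integer bitmask predicates, and the hypothetical name twin_pred_mv.  it is a software model of the behaviour discussed in this thread, not the spec pseudocode:

```python
def twin_pred_mv(regs, rd, rs, VL, src_pred, dest_pred, src_vec, dest_vec):
    """Twin-predicated mv (illustrative sketch).

    The source index i skips elements masked out by src_pred, the
    destination index j skips elements masked out by dest_pred, and
    the two advance independently.
    """
    i = j = 0
    while i < VL and j < VL:
        if src_vec:
            while i < VL and not (src_pred >> i) & 1:
                i += 1  # skip masked-out source elements
        if dest_vec:
            while j < VL and not (dest_pred >> j) & 1:
                j += 1  # skip masked-out destination elements
        if i >= VL or j >= VL:
            break       # ran out of active elements on one side
        regs[rd + j] = regs[rs + i]
        if src_vec:
            i += 1
        if dest_vec:
            j += 1
        else:
            break       # scalar destination: one write only
    return regs


r3 = 2
regs = list(range(32))
# scalar src, vector dest, dest pred = 1<<r3:  regs[rd+r3] = regs[rs]
twin_pred_mv(regs, rd=16, rs=8, VL=4, src_pred=0b1111,
             dest_pred=1 << r3, src_vec=False, dest_vec=True)

regs = list(range(32))
# vector src, scalar dest, src pred = 1<<r3:  regs[rd] = regs[rs+r3]
twin_pred_mv(regs, rd=16, rs=8, VL=4, src_pred=1 << r3,
             dest_pred=0b1111, src_vec=True, dest_vec=False)
```

in this model, if r3 is VL or more the mask has no active element inside the vector and nothing is written.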

the vectorised mv.x where the index/offset is also vectorised, this is
*very* hard on the OoO engine, and the DMs.

the discussions we had, i recall we decided to make the offset
relative, as you outline in the pseudocode:

     let idx = reg[ra + i];
     reg[rd + i] = reg[rb + idx];

most mv.x would be:

    let idx = reg[ra + i];
    reg[rd + i] = reg[idx];

for this latter you have to reserve the entire regfile.  i.e. stall the
entire frickin execution.

with the "relative" version you can at least get away with:

   * read hazards on all regs from RA to RA+VL-1
   * write hazards RT to RT+VL-1

and start dropping the read and write hazards when the predicate(s)
have been read (by cancelling zero-bit-predicate ops by pulling their
Shadow "fail" flag).
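
a minimal sketch of those two hazard windows, and of dropping entries once the predicate is known.  the function names are hypothetical, sets of register numbers stand in for the Dependency Matrix rows, and only the two windows listed above are modelled:

```python
def relative_mvx_hazards(ra, rt, VL):
    """Register numbers that must be reserved at issue (illustrative)."""
    reads = set(range(ra, ra + VL))   # RA .. RA+VL-1
    writes = set(range(rt, rt + VL))  # RT .. RT+VL-1
    return reads, writes


def drop_masked(hazards, base, pred):
    """Keep only hazards whose element has its predicate bit set.

    Elements with a zero predicate bit are cancelled, so their
    hazards can be released as soon as the predicate has been read.
    """
    return {r for r in hazards if (pred >> (r - base)) & 1}


reads, writes = relative_mvx_hazards(ra=8, rt=16, VL=4)
live_writes = drop_masked(writes, base=16, pred=0b0101)
```

compare with the non-relative form, where any register could be the source and the reserved set would have to be the whole regfile.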

> also, by setting ra[0..VL] to [5, 5, 3, 3, 4, 4, 4]
> you can get in 1 vector mv.x instruction:
> dest = [src[5], src[5], src[3], src[3], src[4], src[4], src[4]];
> which isn't possible if the mv.x adds idx to rd instead of rb.

ok.  so turning 1<<r3 into a mv.x cannot be achieved in some cases.  i
can live with that for a first implementation.

optimise later.
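
for reference, a runnable sketch of the quoted vector mv.x (relative form, index added to rb) using the index pattern from the example above.  the Trap exception, the register numbers and the src values 100..106 are made up for illustration:

```python
class Trap(Exception):
    """Stand-in for a hardware trap (illustrative only)."""


def vector_mvx(regs, rd, ra, rb, VL):
    """Vector mv.x, relative form: regs[rd+i] = regs[rb + regs[ra+i]]."""
    for i in range(VL):
        idx = regs[ra + i]
        if idx >= VL:
            raise Trap(f"index {idx} out of range for VL={VL}")
        regs[rd + i] = regs[rb + idx]
    return regs


VL = 7
regs = [0] * 64
regs[8:8 + VL] = [5, 5, 3, 3, 4, 4, 4]    # index vector at ra=8
regs[16:16 + VL] = list(range(100, 107))  # src vector at rb=16
vector_mvx(regs, rd=32, ra=8, rb=16, VL=VL)
print(regs[32:32 + VL])  # [105, 105, 103, 103, 104, 104, 104]
```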

>>     ireg[rs+ireg[r3]] = ireg[rd]
>>
>
> no, you get:
> ireg[rd] = ireg[rs+ireg[r3]]

sigh.  thank you for spotting this.

> which is scalar mv.x except it doesn't trap if r3 is out of range and just
> doesn't write rd

excessive ranges should have already been checked.  yes that involves
a pre-analysis of the predicate bits, or it is simply the case that
the exception is thrown back at the issue phase:

    if reg# + VL >= Len(regfile) trap()

this ensures that even when predicate is 1<<r3 there is no possibility
for trying to access beyond end of regfile.
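
a minimal sketch of that issue-phase check.  the function name and the regfile size of 128 are assumptions for illustration, and the comparison is copied from the line above as written:

```python
def issue_phase_trap(reg, VL, regfile_len=128):
    """Return True if the instruction should trap at issue (illustrative).

    Uses the condition exactly as written in the thread.  Note that
    ">=" also rejects the case where the last element (reg+VL-1)
    lands exactly on the final register.
    """
    return reg + VL >= regfile_len


issue_phase_trap(120, 4)  # False: elements 120..123 are well inside
issue_phase_trap(126, 4)  # True:  elements 126..129 run off the end
```

because the check only looks at reg# and VL, it does not depend on which predicate bits happen to be set.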

question: if r3 is greater than VL, should a trap be thrown?

>>
>> which is a different *type* of mv.x operation, but it is still a mv.x
>> operation.
>>
>> it gets exceptionally weird if we apply twin-predication *to* mv.x.  i'm
>> not going to go there quite just yet :)
>>
>>
>>
>> > only for that specific mask, I was talking about the fully general vector
>> > case.
>> >
>>
>> you've lost me,
>
>
> what I meant was the pseudo-code I wrote earlier which is the vector mv.x.

yehyeh got the context now.

> You can't replace the fully general vector mv.x with a single
> twin-predicated vector mv no matter how hard you try,

i don't (never have).

tpred however fits into the Dep Matrices nicely with less blocking
resources, where vectorised mv.x is veeery heavy.

> Replacing a scalar mv.x with twin predicated vector mv is possible, but
> seems less efficient unless we have the special hw support for reading r3
> then the selected input I mentioned earlier.

... jacob i really really do not wish to get into discussions of
alternative designs of the level of complexity and associated time
that is involved.

we are at least 8 months behind schedule and simply do not have the
funding available to cover it.

1<<r3 is a dead simple idea that the Predication Unit can do with a
Binary to Unary Encoder, taking r3, turning it to a unary mask, and
chucking it at the Shadow Cancel/Success wires that lead to the FUs
under Shadow Conditions.

in the case where r3 is a straight INT predicate, the *binary* bits
(as-is) are chucked straight at the same Shadow wires.  where
inversion is applied, r3 is simply bitinverted before being chucked at
Shadow Wires.
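
a minimal sketch of the three mask sources just described.  the mode strings and the function name are hypothetical, and truncating the mask to VL bits is an assumption of this model:

```python
def predicate_mask(mode, r3, VL):
    """Turn the value of r3 into a per-element mask (illustrative)."""
    full = (1 << VL) - 1
    if mode == "1<<r3":
        return (1 << r3) & full  # binary to unary: a single active bit
    if mode == "int":
        return r3 & full         # bits used as-is
    if mode == "int_inverted":
        return ~r3 & full        # bit-inverted before use
    raise ValueError(f"unknown mode {mode!r}")


predicate_mask("1<<r3", 2, VL=4)             # 0b0100
predicate_mask("int", 0b1010, VL=4)          # 0b1010
predicate_mask("int_inverted", 0b1010, VL=4) # 0b0101
```

all three produce the same kind of mask, so whatever consumes the mask does not need to know which mode was used.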

this is brain dead simple.   suboptimal in the case of 1<<r3, but
trivial to add.

*later* we can add macro-op detection and optimisation.

but not right now.  we simply don't have time.

>> and because you're not familiar with SV and
>> twin-predication, can you come back to this once it's clear?
>>
>
> AFAIK I am familiar with SV and twin predication...

not at the hardware level.  Predication Units need to be separated
from the Arithmetic Units that they cover, with Shadows.

this is the only sane way to do it.

the INT Predication Units have the predicate src reg(s) as Read Hazard
Dependencies.  they are also given absolute top priority on reg read
to INT regfile.

Shadows are raised across all Arithmetic FUs covered by the predicate.

THE FUs DO NOT RECEIVE THE INT PREDICATE DIRECTLY.

that would be insanely expensive because every FU in the vector would
need the entire int reg.

the *Predicate FU* receives the one and only one read copy of the INT predicate.

it then *distributes* those bits out to Shadow cancel/success.

this is very simple.
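
a minimal software model of that arrangement.  the class name, the "success"/"cancel" strings and the read counter are all made up for illustration.  the real thing is wires and Shadow logic, not python:

```python
class PredicateFU:
    """Reads the INT predicate once and fans the bits out (illustrative)."""

    def __init__(self, int_regfile):
        self.int_regfile = int_regfile
        self.reads = 0  # counts INT regfile reads made by this unit

    def resolve(self, pred_reg, VL, invert=False):
        pred = self.int_regfile[pred_reg]  # the one and only read
        self.reads += 1
        if invert:
            pred = ~pred
        # one outcome per Arithmetic FU under shadow: a set bit releases
        # the element op, a zero bit cancels it
        return ["success" if (pred >> i) & 1 else "cancel"
                for i in range(VL)]


regfile = [0] * 32
regfile[3] = 0b0110
pfu = PredicateFU(regfile)
outcomes = pfu.resolve(pred_reg=3, VL=4)
# outcomes == ["cancel", "success", "success", "cancel"], pfu.reads == 1
```

each FU in this model only ever sees its own one-bit outcome, never the predicate register itself.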

please understand however that i am barely managing to keep a map of
this in my head.

i DO NOT wish to go over alternative designs at this incredibly late
stage when we should have been implementing this one over eight months
ago.

it was hard enough adding CRs and thank god that CR predication is a
simplification not a complication.

i really, *really* want to get focussed on getting the current designs
in my head out onto HDL so that we at least have one design.  later it
can be analysed.

nuts.  i managed to accidentally delete something on this phone and
can't undo edit.  will reread.

l.