[Libre-soc-dev] twin predication and svp64

Luke Kenneth Casson Leighton lkcl at lkcl.net
Fri Dec 11 04:59:50 GMT 2020

On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
> On Thu, Dec 10, 2020, 19:12 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>> jacob it took me a while to spot the predication table added to svp64,
>> i made some more notes and some questions.
> Ok. I probably should have mentioned I added it.

got there in the end.

> Please don't insert a text block in the middle of a table, instead you
> could add * or similar to items and add a footnote at the end, since that
> doesn't block flow.

tired, late, signalled "you remove it after review".

>> twin predication is critically important, it is how we cover vgather
>> scatter reduce splat insert, masses more.
> actually twin predication may be used quite a bit less than you might think:
> 1. splat: covered by a vector dest and scalar src

except that a single bit in a scalar twin-pred src is equivalent to
having a macro-op mv.x merged in.

> 2. gather/scatter (register to register, not load/store): twin predication
> isn't actually powerful enough for a lot of what scatter/gather is used for
> (majority of scatter/gather?) -- e.g. twin predication can't do:
> dest = [src[3], src[7], src[2], src[5], src[1], src[0], src[4], src[6]];
> so, mv.x would be used instead.

and twin pred applies to mv.x
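
As a rough illustration of point 2 (a sketch only, not the spec): twin
predication walks src and dest in ascending element order, so it can only
produce order-preserving mappings, whereas the permutation quoted above
needs an indexed move. The helper names below are hypothetical, and
mv.x is assumed here to mean simply dest[i] = src[idx[i]]; the actual
mv.x encoding is not described in this thread.

```python
def mv_x(src, idx):
    """Assumed mv.x-style indexed move: dest[i] = src[idx[i]]."""
    return [src[k] for k in idx]


def is_order_preserving(idx):
    """True if the source indices are strictly increasing, i.e. the
    mapping is one that a pair of predicate masks could express."""
    return all(a < b for a, b in zip(idx, idx[1:]))


# the permutation from the example above
idx = [3, 7, 2, 5, 1, 0, 4, 6]
src = [100, 101, 102, 103, 104, 105, 106, 107]
dest = mv_x(src, idx)
```

Here is_order_preserving(idx) is False, which is why this particular
gather falls to mv.x rather than to predicate masks alone.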

> 3. vector compaction, expansion (take all elements with mask bit set to 1
> and move to a compact list, as well as the inverse op): twin predication is
> good at this, since twin predication is exactly equivalent to a compaction
> followed by an expansion.
> In fact, twin predication with either the src or dest set to ALWAYS is a
> simple way to encode expansion or compaction respectively, assuming we just
> stop when either the src or dest index reaches VL

mandatory to do so.

> and don't error if they
> have differing numbers of set bits.

you're starting to get it.
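
A minimal sketch of the behaviour described in point 3, assuming a plain
twin-predicated register move with integer bitmask predicates. The
function name and argument layout are made up for illustration; only the
stopping rule (halt when either index reaches VL, no error on differing
numbers of set bits) comes from the discussion above.

```python
def twin_pred_mv(src, dest, srcmask, destmask, VL):
    """Twin-predicated move: compaction on src followed by expansion on dest."""
    out = list(dest)
    i = j = 0  # i: source element index, j: destination element index
    while True:
        # skip source elements whose predicate bit is clear
        while i < VL and not (srcmask >> i) & 1:
            i += 1
        # skip destination elements whose predicate bit is clear
        while j < VL and not (destmask >> j) & 1:
            j += 1
        # stop when either index reaches VL; differing popcounts are not an error
        if i >= VL or j >= VL:
            break
        out[j] = src[i]
        i += 1
        j += 1
    return out


VL = 5
ALWAYS = (1 << VL) - 1  # all predicate bits set
src = [10, 11, 12, 13, 14]
zeros = [0] * VL

compacted = twin_pred_mv(src, zeros, 0b10110, ALWAYS, VL)   # dest ALWAYS
expanded = twin_pred_mv(src, zeros, ALWAYS, 0b10110, VL)    # src ALWAYS
mismatch = twin_pred_mv(src, zeros, 0b00011, 0b11100, VL)   # 2 bits vs 3 bits
```

With dest set to ALWAYS the selected source elements are packed to the
front (compaction); with src set to ALWAYS the first few source elements
are spread to the selected destination slots (expansion).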

> My idea for how twin predication would work is that one or the other mask
> could come from an integer reg, so I wouldn't worry as much about needing 2
> cr-based predicates, though that wouldn't be a reserved encoding.

two int regs as additional dependency hazards are not such a big deal
(actually they are vector chain blockers we established that last

> I'd imagine the masks would often be computed using a `li` or `andi` right
> before the twin-predicated instruction, so that would work out well.
>> oh.  i have an idea for the reserved encoding in predication: 1<<r3.
>> single bit.
> sounds like a really good idea!

no idea why it took so long to think up.

> This would allow us to optimize it to a
> single element op at the decode stage, single-cycle adding directly to the
> register numbers after adjusting for elwidth and turning into a nop if
> r3 >= VL, preventing issuing many useless elements only to be masked out.

unfortunately this assumes that reading the regfile is possible and
acceptable at the decode phase, which it most definitely is not.

doing so requires the entire issue to grind to a total stop, wait for
all hazards on r3 to clear, read it, *then* continue.

the best that can be done (and it is a pretty awful, complex and
fraught solution) is to treat the predicate as a cached piece of
"state" (like VL or MSR) and to only block (grind to a total halt) on
a write hazard.

if you remember the mv.x discussion we had, we decided to make mv.x
relative so that VL could be used to eliminate the majority of

similar logic applies here because the 1<<r3 is effectively equivalent
to making the loop counter 0..VL-1 exactly equal to r3.
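
To make that equivalence concrete, here is a small behavioural sketch. It
is an illustration only: the element-wise add, the function names and the
list-based "registers" are assumptions, and it says nothing about when
r3 can actually be read in a real pipeline.

```python
def predicated_add_loop(ra, rb, rt, mask, VL):
    """Ordinary predicated vector add: loop over 0..VL-1, test each mask bit."""
    out = list(rt)
    for i in range(VL):
        if (mask >> i) & 1:
            out[i] = ra[i] + rb[i]
    return out


def single_element_add(ra, rb, rt, r3, VL):
    """Same result with the loop counter pinned to r3; a nop if r3 >= VL."""
    out = list(rt)
    if r3 < VL:
        out[r3] = ra[r3] + rb[r3]
    return out


VL = 4
ra = [1, 2, 3, 4]
rb = [10, 20, 30, 40]
rt = [0, 0, 0, 0]

# compare the two forms for r3 both inside and outside 0..VL-1
results = [
    (predicated_add_loop(ra, rb, rt, 1 << r3, VL),
     single_element_add(ra, rb, rt, r3, VL))
    for r3 in range(8)
]
```

For every r3 the two forms agree, and for r3 >= VL both leave rt
untouched, since the single set bit lies outside the 0..VL-1 range.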

> wish I had thought of that :)


