[Libre-soc-dev] twin predication and svp64
programmerjake at gmail.com
Fri Dec 11 03:57:58 GMT 2020
On Thu, Dec 10, 2020, 19:12 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> jacob it took me a while to spot the predication table added to svp64,
> i made some more notes and some questions.
Ok. I probably should have mentioned I added it.
Please don't insert a text block in the middle of a table; instead you
could mark items with * or similar and add a footnote at the end, since
that doesn't break the flow.
> twin predication is critically important, it is how we cover vgather
> scatter reduce splat insert, masses more.
Actually, twin predication may be used quite a bit less than you might think:
1. splat: covered by a vector dest and scalar src
2. gather/scatter (register to register, not load/store): twin predication
isn't actually powerful enough for a lot of what scatter/gather is used for
(the majority of scatter/gather?) -- e.g. twin predication can't do:
    dest = [src, src, src, src, src, src, src, src];
so mv.x would be used instead.
3. vector compaction, expansion (take all elements with mask bit set to 1
and move to a compact list, as well as the inverse op): twin predication is
good at this, since twin predication is exactly equivalent to a compaction
followed by an expansion.
In fact, twin predication with either the src or dest set to ALWAYS is a
simple way to encode expansion or compaction respectively, assuming we just
stop when either the src or dest index reaches VL and don't error if they
have differing numbers of set bits.
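
To make the compaction/expansion equivalence concrete, here is a minimal
Python sketch of a twin-predicated move. It is only a model under stated
assumptions, not the SVP64 definition: the function name `twin_pred_mv` is
made up for illustration, masks are plain integers with bit i covering
element i, masked-out dest elements are assumed to be left unchanged, and
the loop uses the "stop when either index reaches VL" rule described above.

```python
def twin_pred_mv(src, dest, src_mask, dest_mask, vl):
    """Hypothetical model of a twin-predicated move (not the SVP64 spec).

    The i-th set bit of src_mask is paired with the i-th set bit of
    dest_mask. The loop stops, without error, as soon as either index
    reaches vl.
    """
    out = list(dest)  # assumption: masked-out dest elements stay unchanged
    i = j = 0
    while True:
        # advance to the next source element whose mask bit is set
        while i < vl and not (src_mask >> i) & 1:
            i += 1
        # advance to the next destination slot whose mask bit is set
        while j < vl and not (dest_mask >> j) & 1:
            j += 1
        if i >= vl or j >= vl:
            break
        out[j] = src[i]
        i += 1
        j += 1
    return out


VL = 4
ALWAYS = (1 << VL) - 1  # every mask bit set
src = [10, 20, 30, 40]
zeros = [0] * VL

# dest mask ALWAYS: selected src elements are packed to the front (compaction)
compacted = twin_pred_mv(src, zeros, 0b1010, ALWAYS, VL)
# src mask ALWAYS: src elements are spread over the selected slots (expansion)
expanded = twin_pred_mv(src, zeros, ALWAYS, 0b1010, VL)
# both masks set: same result as compaction followed by expansion
twin = twin_pred_mv(src, zeros, 0b1010, 0b0101, VL)
two_step = twin_pred_mv(compacted, zeros, ALWAYS, 0b0101, VL)
# differing numbers of set bits: the loop just stops early
short = twin_pred_mv(src, zeros, 0b0001, ALWAYS, VL)
```

Both indices only ever move forward in this model, so no source element
can land in more than one dest slot and element order is always preserved.
That is the limitation behind item 2 above.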
My idea for how twin predication would work is that one or the other mask
could come from an integer reg, so I wouldn't worry as much about needing 2
cr-based predicates, though that wouldn't be a reserved encoding.
I'd imagine the masks would often be computed using a `li` or `andi` right
before the twin-predicated instruction, so that would work out well.
> oh. i have an idea for the reserved encoding in predication: 1<<r3.
> single bit.
Sounds like a really good idea! This would allow us to optimize it to a
single-element op at the decode stage: add r3 (adjusted for elwidth)
directly to the register numbers in a single cycle, and turn the
instruction into a nop if r3 >= VL. That avoids issuing many useless
elements only to have them masked out.
wish I had thought of that :)
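
A rough Python sketch of that decode-stage shortcut, for illustration only.
The function names are hypothetical, and the register layout (elements
packed contiguously into 64-bit registers starting at a base register) is
an assumption made for this example rather than something stated in this
thread.

```python
def active_elements(pred_mask, vl):
    """Generic path: walk all VL elements and keep those with the mask bit set."""
    return [i for i in range(vl) if (pred_mask >> i) & 1]


def single_bit_decode(r3_value, vl, base_reg, elwidth_bits, reg_bits=64):
    """Shortcut for a 1<<r3 predicate: resolve the single element at decode.

    Returns None (treated as a nop) when r3 >= VL, otherwise the
    (register number, element slot within that register) pair.
    Assumes elements are packed contiguously into reg_bits-wide registers.
    """
    if r3_value >= vl:
        return None  # nothing to issue
    elems_per_reg = reg_bits // elwidth_bits
    return (base_reg + r3_value // elems_per_reg, r3_value % elems_per_reg)


# With VL=8 and 32-bit elements, two elements fit in each 64-bit register.
hit = single_bit_decode(5, vl=8, base_reg=16, elwidth_bits=32)
miss = single_bit_decode(8, vl=8, base_reg=16, elwidth_bits=32)

# The generic masked loop selects the same single element, but only after
# stepping through every element position.
generic_hit = active_elements(1 << 5, 8)
generic_miss = active_elements(1 << 8, 8)
```

The shortcut only has to look at the value in r3, whereas the generic path
visits all VL positions to find at most one active element.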