[Libre-soc-isa] [Bug 527] SV Predication needs to be defined

Sun Nov 15 22:42:11 GMT 2020

https://bugs.libre-soc.org/show_bug.cgi?id=527

--- Comment #1 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
continuing from https://bugs.libre-soc.org/show_bug.cgi?id=238#c12

PowerISA decision-making is fundamentally designed around Condition registers
(with outliers such as the ternary FP select instruction).

cmp, all Rc=1 instructions: all of them create CRs which give zero, nonzero,
GE, LT, GT, LE at the time that the result is created.  CR operations allow
those to be combined to produce more compkex comparisions. Branch and isel can
then be used to make decisions.

the first observation: to try to work "outside" of this 25 year established
framework seems... unwise.

the simplest logical way forward would be to simply extend the pre-existing use
of CR - with little to no modification - into the parallel (vector) domain,
esp. given the use of a multi-issue OOO Engine.

* vectorisation of crand (etc) provides a means to not have to flip to INT
regs, saving on instruction count.
* NOT vectorising CR operations at all would actually require special-casing in
the decoder!
* vectorisation of cmp produces a vector of CRs: conceptually this is a
no-brainer and fits perfectly with the SV paradigm.
* vectorisation of isel likewise fits perfectly.  a vector of CRs chooses
elements of a vector of INTs.

by contrast getting CR results out of a vector of 64 CRs and into an integer y4
bitpredicate is actually quite complex: the data routing has to involve up to
64x4 bits of data, selecting every 4th bit and putting them into one int:

* VL=4 cmp ra, rb, cr8 produces 4 CRs
* 4 sets of CRs contain 4 sets of eq/+ve/-ve
* to turn that into a scalar int predicate requires selecting every 4th bit
  (but which bit? bit 0 of every CR? bit 3?)
* performed multiple times if vectorised LE/GT/GE/LT is needed, which because
Rc=1 has been destroyed and replaced with a meagre 1-bit CR per element
effectively means that the arithmetic op has to be repeated just to get a 2nd
bit

modifying instructions to abandon 25 years of Rc=1 and try to force-fit one bit
per element into a scalar integer is a nonstarter (which of the 4 bits? bear in
mind that 2 bits are used to discern "LE" from "LT")

everything about trying to abandon CRs and shove things into a scalar int reg
(then specialcase that int as a new regtype) adds extra work, has problems,
unanswered questions and introduces data bottlenecks that get worse and worse
the more they are investigated.

by total contrast the use of crand etc which are *specifically* designed for
the purpose of manipulating CRs at the bitlevel are dead easy to vectorise and
dead easy to use to produce "combined" (complex) vector decisions. 

by using those same CRs as predicates where one bit of each CR is applied (bit
0) per element operation, crand etc. can be used as-is: zero modifications or
additions required other than SV vector for-looping.

CRs are so obviously easy to fit onto predication (in the easy case) it's
actually difficult to justify any other option.

this just leaves leveraging of operations suxh as cntlz and other bitmanip
operations, which is where things come unstuck, no matter what scheme is used.

* standard Cray-style predication requires new opcodes
* RVV style predication (LSB of standard vectors) is very wasteful and also
requires new opcodes
* no opcodes exist in PowerISA to select individual CR bits to jam into int
scalar regs (or back): mfocr is mask-based (FXM)

even if opcodes are created which transfer every 4th bit of the main CR reg
(its full vector version) into a scalar INT (and back) the use of popcount
(etc) bitmanip scalar opcodes prevents Vector Chaining
(although to be honest this is probably true in other Vector ISAs as well)

two hypothetical new instructions wouls do this job:

   (VL=5) mfcr4 r1, CR8, bit0
   =>
      r1[0] <= CR8[0]
      r1[1] <= CR9[0]
      r1[2] <= CR10[0]
      r1[3] <= CR11[0]
      r1[4] <= CR12[0]

i.e. up to 64 CR bit0s get copied into the one scalar intreg (and mtcr4 copies
them back).

implementation-wise the full 32x CR regfile read/write ports would be used
(just like they are for mfcr/mtcr).  it would be a variant of mfocr / mtocr
using an alternative meaning for the FXM field.

let us imagine that there is a scenario where complex multi-CR data selection
is required.  first element must be selected from a vector within a certain
range:

* vector compare greater than 50 produces CR vector 1
* vector compare less than 100 produces CR vector 2
* vector crand (and a few other cr ops) on CRV1 and CRV2 produces CRV3

then to find which is the first of those:

* CRV3 is transferred into a scalar int using mfcr4
* scalar INT "findfirst" locates the 1st bit set to 1 in that intreg
* mtcr4 transfers the scalar int back to CRs, spreading the bits out to be
applied as predicate bits (one CR, one element).

not that i actually like the number of instructions needed here, to the extent
where i am seriously considering proposing mask operations based on CRs.  which
i like even less due to their serious specialisation and that they would still
be operating on every 4th bit.

bitmanip "bitselect" operations (shuffle/bperm) hypothetically could help here
although even efficient implementations are known to be as expensive as
shift/rot.

bottom line is, there really are no good options here.

-- 
You are receiving this mail because:
You are on the CC list for the bug.