[Libre-soc-bugs] [Bug 558] gcc SV intrinsics concept

Thu Jan 14 19:25:52 GMT 2021

https://bugs.libre-soc.org/show_bug.cgi?id=558

--- Comment #60 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
preamble 1: this is so long that i am going to need to do a followup summary.
this alone may take me half an hour.  patience appreciated.

preamble 2

found this:
https://gcc.gnu.org/onlinedocs/gccint/Condition-Code.html

which is not hugely informative but at least gives hints

and this:
https://gcc.gnu.org/onlinedocs/gccint/Machine-Independent-Predicates.html#Machine-Independent-Predicates

which shows that lt/gt/le... etc (BO Branch style CR tests) are representable
in CRs.

preamble 3: the following is very relevant.  it is important to note that
where RVV gcc work leads, we will have an easier (less expensive) time
following.

https://www.embecosm.com/2018/09/09/supporting-the-risc-v-vector-extension-in-gcc-and-llvm/
https://gcc.gnu.org/legacy-ml/gcc/2018-09/msg00037.html

BUT... it may be the case that gcc is simply "left behind" with the primary
focus being on LLVM.  instead, we may have to be the ones that take the
initiative.

(In reply to Alexandre Oliva from comment #58)
> I haven't been able to keep up with this in detail (sorry, my attention has
> been temporarily diverted), but I'm a little worried about how to represent
> a "shuffled" CR register file map, if I get the right idea of what's being
> proposed.

[i appreciate your concerns.  for context here: we need your input to
determine, rather quickly, if the CR remap is viable.  i am USD 8,000 in debt
on credit cards and NLnet does not donate for time, only for results.  i
apologise that this puts us under pressure, i.e. we need to make fast
pragmatic decisions, even if they turn out mid-to-long-term to have been
wrong.]

providing some perspective: it is and it isn't shuffled.  one way to imagine it
is:

* thing indexed 0 has 2 names: CR0 and CR0.0
* thing indexed 1 has 1 name: CR0.1
* ...
* thing indexed 15 has 1 name: CR0.15
* thing indexed 16 has 2 names: CR1 and CR1.0
* thing indexed 17 has 1 name: CR1.1
* ...
* thing indexed 127 has 1 name: CR7.15

thus it is still sequential and linear, it's just that the scalar v3.0B
registers have been placed into positions that are now every 16th in the
sequence.
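
as a minimal sketch of that linear view (the function names `cr_names` and `cr_index` are hypothetical, made up for illustration, and not part of any binutils or gcc interface), the mapping between index and name could be written as:

```python
def cr_index(n, mm=0):
    """Linear index of CRn.MM: n in the upper 3 bits, MM in the lower 4."""
    if not (0 <= n < 8 and 0 <= mm < 16):
        raise ValueError("expected CR0..CR7 and MM in 0..15")
    return (n << 4) | mm


def cr_names(idx):
    """All names for the CR at linear index idx (0..127)."""
    if not 0 <= idx < 128:
        raise ValueError("CR index out of range")
    n, mm = idx >> 4, idx & 0b1111
    names = [f"CR{n}.{mm}"]
    if mm == 0:
        # every 16th index also carries the scalar v3.0B name
        names.insert(0, f"CR{n}")
    return names


print(cr_names(0))    # ['CR0', 'CR0.0']
print(cr_names(17))   # ['CR1.1']
```

the point of the sketch is only that the index stays sequential: the scalar names land on indices 0, 16, 32, ... 112.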

if on the other hand you think of it as grouped by element, with all CRn.0
numbered sequentially first, followed by all CRn.1 and so on (CR0.0 CR1.0 ...
CR7.0 indexed 0-7, then CR0.1 CR1.1 ... CR7.1 indexed 8-15, and so on) *then*
it is discontiguous and yes it would cause misunderstandings and a lot of
trouble (and, as you point out below, be unworkable for Vectors)

it was 9 days before i was able to grasp this difference fully and that alone
is of serious concern.

if the conceptual numbering of scalar CRs can indeed be simply shifted up 4
bits, given a different name format (CRn.MM) that binutils recognises and sorts
out, would that be viable?

> The key concepts that GCC deals with for purposes of register allocation are
> requirements of instructions (constraints, as in extended asms) and modes
> (closely related with types).
> 
> CRs and flags in general are dealt with without caring about their internal
> representation.  
> [ ... ]

interesting. and that's why they don't end up in the frontend because of the
abstraction.  you'd need to bring out *CCmodes* into the frontend in order to
apply __attribute__ to them, and that's clearly not going to be happening.

> It's all modelled abstractly, as if the condition code register held the
> result of the compare rather than whatever bits the underlying hardware
> uses,

sounds sensible to me.

> So you won't see anything in GCC that cares that CRs are 4-bits wide and use
> one bit for EQ, one for LT, one for GT, and one for UN, in whatever order
> that is.  This solves some potential problems for us, because endianness of
> those bits is not an issue.  

also elwidth overrides are meaningless so don't enter into the conversation at
all.

> There's nothing in the IR that enables reinterpretation of CR bits as an
> integral quantity, or vice-versa.

*click*... this may cause problems for what i called "crweird" instructions.

ah wait... there is precedent: isel and setb.  these interact to select/set INT
regs based on a CCode (CR).

to give some context: the crweird instructions are a way to transfer
CRs-as-predicates into scalar INTs and vice-versa.  we need these so as not to
have to add duplicate instructions (same functionality, one on CRs one on INTs)

    mtcrweird: RA, BB, mask.mode

    reg = (RA|0)
    lsb = reg[63] # MSB0 numbering sigh
    n0 = mask[0] & (mode[0] == lsb)
    n1 = mask[1] & (mode[1] == lsb)
    n2 = mask[2] & (mode[2] == lsb)
    n3 = mask[3] & (mode[3] == lsb)
    CR{BB}.eq = n0
    CR{BB}.lt = n1
    CR{BB}.gt = n2
    CR{BB}.ov = n3

you can see that if this is Vectorised it is intended to put arithmetic bit 1
of RA into CR{BB+1}, bit 2 into CR{BB+2}, and so on.

hence the name "weird".
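
a rough executable model of the pseudocode above, for illustration only.  assumptions: `mask` and `mode` are passed as 4-entry lists of 0/1 indexed exactly as in the pseudocode (which sidesteps the MSB0 question), the CR field is modelled as a dict, and the Vectorised form takes arithmetic bit i of RA for CR{BB+i}.  the Python names are hypothetical.

```python
def mtcrweird(ra_value, mask, mode, element=0):
    """Compute the four CR bits for one element.

    ra_value: integer contents of RA
    mask, mode: lists of four 0/1 values, indexed as in the pseudocode
    element: which arithmetic bit of RA to test (0 = least significant)
    """
    bit = (ra_value >> element) & 1
    n = [mask[i] & int(mode[i] == bit) for i in range(4)]
    return {"eq": n[0], "lt": n[1], "gt": n[2], "ov": n[3]}


def mtcrweird_vector(ra_value, bb, mask, mode, vl):
    """Vectorised sketch: element i writes CR{bb+i} from bit i of RA."""
    return {bb + i: mtcrweird(ra_value, mask, mode, i) for i in range(vl)}


# bit 0 of RA is 0, bit 1 is 1: only CR4 gets its eq bit set
print(mtcrweird_vector(0b10, 3, [1, 0, 0, 0], [1, 1, 1, 1], 2))
```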

now, if this *cannot be represented* in gcc we are in a bit of... um...
schtuck.

one potential conceptual route is to internally "typecast" the CCodes into
predication bits (including potentially the transformation process)

another would be for predication to just take a range of CR-vectors, say to the
register allocator, "MINE! hands off!" and the CCodes side never talks to the
predication side.

>  Indeed, CCmodes generally do not pass the
> TARGET_MODES_TIEABLE_P predicate with other modes, meaning you cannot
> reinterpret a CCmode "quantity" in a register as another mode, as you often
> can reinterpret a wide integral mode as a narrow one, and vice-versa, when
> the machine, the ABI and the compiler keep them extended under uniform
> conventions.

ah.  small diversion needed, come back to numbering in a minute.

we do really need the ability to consider CRs as predicate bits, otherwise if
we have to use only integers we lose a huge amount of capability, and the
hardware becomes either unmanageably complex or severely
performance-compromised.

now, is that done by having CCmodes also be predicates in gcc, or not?

question: can "CCodes-with-compares" at least be "typecast" to a new kind of
CCode, a "predicate" CCode?  or to an underlying existing predicate type in
gcc?

i assumed that this would be possible, at least in some fashion, even if it
requires some hoops to jump through.

for the concept behind CRs-as-predicates i copied how the Branch BO field
works, because Branches test CCodes to do if/else in scalar in exactly the
same way that you use predicates to perform vector "variants" of if/else:

    if x > y: # cmp creates CR
       # branch on CR with BO created here
       x -= 5

vector version would be:

    VectorCMP x,y # creates vector of CRs
    svp.mask=CR,BO=gt addi x.v, -5
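
the scalar/vector correspondence above can be modelled in a few lines of Python.  this is only a behavioural sketch: the function names are hypothetical, a CR field is modelled as a dict of flags, and nothing here reflects actual SV encoding.

```python
def vector_cmp(xs, ys):
    """Model of a vector compare: one CR field per element."""
    return [{"lt": x < y, "gt": x > y, "eq": x == y} for x, y in zip(xs, ys)]


def predicated_addi(xs, imm, crs, test="gt"):
    """Add imm only to elements whose CR field passes the chosen test."""
    return [x + imm if cr[test] else x for x, cr in zip(xs, crs)]


x = [10, 2, 7, 4]
y = [3, 9, 7, 1]
crs = vector_cmp(x, y)                 # creates vector of CRs
result = predicated_addi(x, -5, crs)   # mask=CR, test=gt
print(result)                          # [5, 2, 7, -1]
```

each element is handled exactly as the scalar if/else would handle it, with the CR test taking the place of the branch.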

anyway.  back to numbering.

> Now, the problem with "shuffled" register ordering is that the controls GCC
> uses to tell how modes and registers relate are TARGET_HARD_REGNO_MODE_OK,
> that tells whether a quantity in a given machine mode can be held in a given
> register, and TARGET_HARD_REGNO_NREGS, that tells how many *consecutive*
> registers are needed to hold that mode, starting at a given register.

right.  and MVL, which is *very specifically only allowed to be set statically
by an immediate*, defines that quantity.

it's changeable on a "per-setmvli" basis, but it *is* a static compile-time
quantity.

thus the compiler may decide, at the simplest crudest level, "screw it, MVL is
hardcoded to 4 or 8 or 16 and that's the end of it", which effectively turns SV
into a type of brain-dead but functional predication-capable SIMD ISA, or it
may be a bit more intelligent about it and decide on a per-function basis what
the best allocation is, to help avoid register spill.
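
to illustrate why a static MVL maps naturally onto a "how many consecutive registers" question, here is a small sketch.  it assumes (this is an assumption, not something stated above) that elements are packed end to end into 64-bit registers; `nregs_for_vector` is a hypothetical helper, not a gcc hook.

```python
def nregs_for_vector(mvl, elwidth_bits=64, reg_bits=64):
    """Consecutive registers needed for mvl elements of elwidth_bits each.

    ASSUMPTION: elements are packed contiguously into reg_bits-wide
    registers, so the answer is a ceiling division.
    """
    if mvl < 1:
        raise ValueError("MVL must be at least 1")
    total_bits = mvl * elwidth_bits
    return -(-total_bits // reg_bits)   # ceiling division on integers


for mvl in (4, 8, 16):
    print(mvl, nregs_for_vector(mvl))
```

because MVL is an immediate, every input to this calculation is known at compile time.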

> In order for wider-than-register modes to be held in a set of registers,
> those registers *have* to be contiguous in GCC's internal notion of the
> register file. 

this is why i described the conceptual numbering for CRs as "being viewable as
contiguous if you don't mind interspersing the scalar CRs every 16th index"

however, Alexandre, just a heads-up: REMAP *COMPLETELY* obliterates the
expectation of linear numbering, by design.

this is something that NEON supports hard-coded in LDST, called "Structure
Packing", and it is also now in RVV 0.9.  typical uses include Matrix
Multiplies and getting all the RRRRR and GGGGG and BBBBB into contiguous
registers where the data was actually stored as RGBRGBRGB.

just so you know: REMAP can be applied to the *entire* ISA.  any arithmetic
vector op, any MV, any LDST.
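
the RGB case mentioned above can be sketched as an index pattern.  this shows only the kind of non-linear element ordering involved, not how REMAP itself is specified or implemented; the function name is hypothetical.

```python
def structure_unpack_indices(num_pixels, channels=3):
    """Source index for each destination slot when grouping by channel.

    Interleaved data stores channel c of pixel p at index p*channels + c.
    """
    return [p * channels + c
            for c in range(channels)
            for p in range(num_pixels)]


pixels = ["R0", "G0", "B0", "R1", "G1", "B1", "R2", "G2", "B2"]
order = structure_unpack_indices(3)
print(order)                        # [0, 3, 6, 1, 4, 7, 2, 5, 8]
print([pixels[i] for i in order])   # all R, then all G, then all B
```

the element order 0 3 6 1 4 7 2 5 8 is exactly the kind of sequence that breaks any assumption of linear register numbering.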

> It is sometimes the case that the contiguity is not relevant
> for the architecture, e.g., if there isn't any opcode that operates on pairs
> of registers holding a double-word value, but these often appear when a pair
> of consecutive registers holds a double-precision floating-point value, or a
> widening multiply necessarily sets a pair of neighbor registers.

yes, there are a number of instruction examples in many ISAs that support this
double-op SIMD and widen/narrow, it is no surprise then that gcc has had to
understand this.

it will become particularly interesting, a long LONG way down the line, how
SV's polymorphic elwidth overrides end up being implemented, ultimately.

intermediary steps there will clearly have to involve avoiding different
src-dest overrides on arithmetic operations initially, and using (inefficient)
patterns of MVs that "mirror" the same widening-narrowing explicit opcodes
typically added to SIMD architectures.

that will give breathing space to allow a full research investigation into how
to add polymorphic elwidth overrides to arithmetic ops.

i mean in a generic fashion, rather than as special-cased for certain specific
instructions.

this btw is going to happen rather a lot: the "abstraction" of SV means that
the compromises taken by most ISAs (only certain ops have saturation, only
certain ops have widen/narrow) *do not have to be taken*

>  When this
> happens, the order of registers in the abstract register file in the
> compiler has to match the order and the grouping required by the machine,
> otherwise the allocation won't get things right.

understood.  this instinctively is why i really do not like the vertical
stratification.  scalar registers are no longer accessible in a contiguous
block.

also this is one of the things that is making me slightly nervous about the
CRn.MM numbering: it doesn't match precisely one-for-one with the INT/FP
arrangement.

as in: yes you can rearrange the naming so that it *looks* contiguous, but try
accessing them as scalar and it all goes sideways.  literally.

when treated as Vectors-of-results that generate corresponding Vectors-of-CRs
the numbering matches.  the names are weird but the numbering matches.

however the moment you try to access those values as *scalar* despite the fact
that they were just produced by an instruction just before, all hell breaks
loose.

not only can you not *get* access to CR3.15 directly for example (you have to
insert a predicated MV operation to copy it to CR3 or CR3.8 for example) you
have to run a calculation to work out the FP/INT reg it's associated with. 
something like:

     (idx&0b1111)<<3 | (idx&0b1110000)>>4

that's the relationship between CR numbering and INT/FP numbering.
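
written out as a function (the name `cr_index_to_reg` is hypothetical, and the original expression is only given as "something like"), with the CR4.1 and CR3.15 cases worked through:

```python
def cr_index_to_reg(idx):
    """INT/FP register number associated with linear CR index idx.

    idx is CRn.MM in linear form: n in bits 4-6, MM in bits 0-3.
    """
    if not 0 <= idx < 128:
        raise ValueError("CR index out of range")
    mm = idx & 0b1111
    n = (idx & 0b1110000) >> 4
    return (mm << 3) | n


print(cr_index_to_reg((4 << 4) | 1))    # CR4.1  -> 12
print(cr_index_to_reg((3 << 4) | 15))   # CR3.15 -> 123
```

the mapping swaps the 3-bit and 4-bit fields, which is why consecutive CR indices land 8 apart in the INT/FP numbering.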

(no Jacob, just to emphasise again: making all INT/FP/CR numbering the same by
applying the same N.MM remapping isn't ok, unfortunately, because the entire
hardware design of the last 18 months would need to be abandoned and rethought)

> When it comes to vectors of gprs and fprs, we didn't have the problem I'm
> concerned about: the vector modes can just require N contiguous registers,
> and since they appear as neighbors in the abstract register file, that works
> just fine. 

it's "obvious" in other words.  and, in addition, once a Vector of INT/FP
results is computed, if access to those is required explicitly by scalar then
as long as the Vector was kept to the lower half of the regfile they are also
accessible directly *and accessible linearly as well*.

* Vector add may start at r0, r4, r8, ...
   r120, r124 and progresses linearly.
   vector at r0 progresses r0 r1 r2 r3 r4

* Scalar access may be at any of r0-r63
  so it is only the upper range of Vectors
  that cannot be accessed.
  (without a Vector mv, that is)

CRs on the other hand:

* Vector CRs may start at CR0.0 CR0.8
  CR1.0 CR1.8 ... CR7.0 CR7.8
  vectors progress CR0.0 CR0.1 CR0.2

* Scalar access MAY NOT even refer AT ALL
  to CR1.1, CR1.2 throughout the FULL
  RANGE of the regfile, ALL the way to
  CR7.2 and CR7.15.

this discontiguity is why the "slightly weird" algorithm of treating scalar
numbering differently from Vector was added.
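
the contrast between the two lists can be sketched as follows.  the GPR rule (scalar access to r0-r63) is taken from the list above; the CR rule is an ASSUMPTION inferred from the CR3/CR3.8 and CR4/CR4.8 examples in this comment, namely that scalar ops can only name CRn.0 and CRn.8.  all names are hypothetical.

```python
def gpr_vector(start, vl):
    """Registers touched by a vector of length vl starting at r<start>."""
    return [start + i for i in range(vl)]


def gpr_scalar_ok(r):
    """Scalar access may be at any of r0-r63."""
    return 0 <= r < 64


def cr_vector(n, mm, vl):
    """(n, MM) pairs touched by a CR vector starting at CRn.MM."""
    base = (n << 4) | mm
    return [((base + i) >> 4, (base + i) & 0b1111) for i in range(vl)]


def cr_scalar_ok(n, mm):
    # ASSUMPTION: only CRn.0 and CRn.8 are nameable by scalar ops
    return mm in (0, 8)


print([gpr_scalar_ok(r) for r in gpr_vector(0, 5)])
print([cr_scalar_ok(n, mm) for n, mm in cr_vector(4, 0, 3)])
```

under these assumptions every element of a low GPR vector stays reachable by scalar ops, while only the first element of the CR vector does.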

> Unlike other wide types, the WORDS_BIG_ENDIAN predicate doesn't
> affect the expected significance of partial values split across multiple
> registers in vector types, so we're fine in this regard.

ok.

> However, if there are opcodes that require different groupings or orderings
> of CRs, there will be a representation problem.  E.g., if we need CR12 to be
> right next to CR4 because of some opcode that takes a pair of CRs by naming
> CR4 and affecting CR4 and CR12 as a V2CC quantity, they'd have to be
> neighbors for this V2CC allocation to be possible. 

right.  well, the only "fly in the ointment" is mfcr and the fxm version when
used with multiple bits (which i think i'm right in saying you're not supposed
to do but all hardware supports it).

the example involving CR4 and CR12 is actually realistic when CR12 is "renamed"
correctly to CR4.1 (4+8=12)

accessing CR4 and CR4.1 *can* be done under a Vector op. they are contiguous.

they can NOT repeat NOT be accessed sequentially via a SCALAR op.  CR4.1 is not
even accessible AT ALL.

CR4 and CR4.8 *would* be accessible contiguously via scalar, but not CR4.1

> But if in other
> circumstances we use say a V8CC quantity starting at CR0 to refer to
> CR0..7's 32 bits, then those 8 CRs would have to be consecutive in the
> register file, without room for CR12 after CR4.

the idea is - was - i have already convinced myself it's a bad idea - that a
V8CC would be "CR0.0 CR0.1 ... CR0.7"

> So please be careful with creative register ordering, to avoid creating
> configurations that may end up impossible to represent without major surgery
> in the compiler.

by going through it, above, i've basically convinced myself it's not just a bad
idea to do vertical sequencing, it's a *really* bad idea.

> Also, keep in mind that, even if some configurations might be possible to
> represent with the knobs I mentioned above, the rs6000/powerpc port has a
> huge legacy of variants, so whatever we come up with sort of has to fit in
> with *all* that legacy.  E.g., IIRC 32-bit ppc variants have long used
> consecutive 32-bit FPRs for (float+float) double-precision-ish values, and
> consecutive 32-bit GPRs to hold 64-bit values.  There were ABI requirements
> to that effect, that required the abstract register file in the compiler,

wait.... whuuu???

oh god... this is the VSX/Altivec thing isn't it?  where the INT/FP regfiles
are combined then "recast" to a 32-entry 128 bit SIMD regfile, something like
that?

please please bear in mind we are doing *nothing* like that!  we did consider
it (a long while back), to basically merge the FP and INT regfiles on top of
each other.

to reiterate and emphasise: we are going *nowhere near* VSX, which i consider
to be a harmful legacy ISA, good as it was in 2001 it's time for it to be
retired, and if we do ever "support" it in 2 or more years time it will be
under serious protest and with absolute bare minimum attention, resources,
performance and impact on the existing HDL.

there is a reason why NXP has abandoned OpenPOWER, and that reason is: VSX.

in technical terms anything that you "learn" from VSX, anything involving
regfile typecasting such as the one for rs6000, these truly need to be set
aside.

unlike rs6000:

* the SV INT regfile is polymorphic on
  *elwidth* not the type (except int and
  logical, exactly as in v3.0B, long
  before SV existed)

* likewise the FP regfile is polymorphic
  on width, but may NOT be typecast to INT
  (or logical ops)

you CANNOT put a raw integer into an FP register (or vice-versa) and then have
it fed to the other pipeline (FP or INT) as if it belonged in the other regfile.

in SV, exactly as with v3.0B:

* GPR is GPR, FPR is FPR.
* GPR operations are only possible on the GPR regfile
* FP operations are only possible on the FP regfile

in rs6000, as i understand it, if you perform an INT VSX operation on VSX
registers numbered in a certain range the result is stored in the *SCALAR* FP
regfile, correct?

i ask because we are NOT repeat NOT doing that in SV.  considered it.  rejected
it.

> and also that in debug information, to use the register ordering implied by
> the architecture.  If we were to require the introduction of intervening
> registers, for purposes of vectorization, between registers that such old
> arches need as neighbors, insurmountable conflicts will arise.

i am not quite following, inasmuch as, due to SIMD being harmful and the
sheer overwhelming quantity of opcodes involved, there is no intention in my
mind to support any of VSX, and if we do it will be seriously (and very
deliberately) performance compromised, so badly that software developers work
very hard to avoid using VSX entirely.

with that in mind i am slightly confused.  are you saying:

* "if there is any intention to support VSX in *addition* to SV it will be
difficult to do so"

if so that is not in the slightest bit a problem because if there is even the
tiniest chance that SV is compromised by VSX, then, metaphorically and
clinically, VSX gets shot in the head: problem goes away.

* "even if you DON'T intend to "support" VSX, the way that gcc support for
legacy hardware such as rs6000 is written, SV will *still* be challenging,
despite SV not even being close, at all, to how VSX works."

the former is not a problem at all (VSX is a harmful SIMD ISA, it is a real
easy choice to say "goodbye VSX")

the latter would, *deep breath*, require some further investigation.

i hope the former.
