[Libre-soc-bugs] [Bug 558] gcc SV intrinsics concept

Thu Jan 14 19:47:09 GMT 2021

https://bugs.libre-soc.org/show_bug.cgi?id=558

--- Comment #61 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #60)
> preamble 1 this is so long that i am going to need to do a followup summary.
> this alone may take me half an hour.  patience appreciated.
> 
> preamble 2
> 
> found this:
> https://gcc.gnu.org/onlinedocs/gccint/Condition-Code.html
> 
> which is not hugely informative but at least gives hints
> 
> and this:
> https://gcc.gnu.org/onlinedocs/gccint/Machine-Independent-Predicates.html#Machine-Independent-Predicates
> 
> which shows that lt/gt/le... etc (BO Branch style CR tests) are
> representable in CRs.
> 
> preamble 3: the following is very relevant, and important to note that where
> RVV gcc work leads, we will have an easier (less expensive) time following.
> 
> https://www.embecosm.com/2018/09/09/supporting-the-risc-v-vector-extension-in-gcc-and-llvm/
> https://gcc.gnu.org/legacy-ml/gcc/2018-09/msg00037.html
> 
> BUT... it may be the case that gcc is simply "left behind" with the primary
> focus being on LLVM.  instead, we may have to be the ones that take the
> initiative.
> 
> 
> 
> (In reply to Alexandre Oliva from comment #58)
> > I haven't been able to keep up with this in detail (sorry, my attention has
> > been temporarily diverted), but I'm a little worried about how to represent
> > a "shuffled" CR register file map, if I get the right idea of what's being
> > proposed.
> 
> [appreciated your concerns.  for context here: we need your input to
> determine, rather quickly, if the CR remap is viable.  i am USD 8,000 in
> debt on credit cards and NLnet does not donate for time, only for results. 
> i apologise that this puts us under pressure, i.e. we need to make fast
> pragmatic decisions, even if they turn out mid-to-long-term to have been
> wrong.]
> 
> providing some perspective: it is and it isn't shuffled.  one way to imagine
> it is:
> 
> * thing indexed   0 has 2 names: CR0 and CR0.0
> * thing indexed   1 has 1 name:  CR0.1
> * ...
> * thing indexed  15 has 1 name:  CR0.15
> * thing indexed  16 has 2 names: CR1 and CR1.0
> * thing indexed  17 has 1 name:  CR1.1
> * ...
> * thing indexed 127 has 1 name:  CR7.15
> 
> thus it is still sequential and linear, it's just that the scalar v3.0B
> registers have been placed into positions that are now every 16th in the
> sequence.
> 
> if on the other hand you think of it as being grouped by sub-index, with all
> CRn.0 numbered first (CR0.0 CR1.0 ... CR7.0 as indexed 0-7), then all CRn.1
> (CR0.1 CR1.1 ... as indexed 8-15) and so on, *then* it is discontiguous and
> yes it would cause misunderstandings and a lot of trouble (and, as you point
> out below, it would be unworkable for Vectors)
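
A minimal sketch of the two ways of numbering described above, assuming 8 CRs (CR0..CR7) with 16 sub-indices each as in the naming list. The function names are hypothetical and only illustrate the difference between the two views; they are not part of any spec or tool.

```python
# Hypothetical helpers: the 8 x 16 layout is taken from the naming list above.
NUM_CR = 8      # CR0..CR7
NUM_SUB = 16    # CRn.0..CRn.15


def linear_index(n, mm):
    """Index of CRn.MM when the scalar CRs sit at every 16th position."""
    return n * NUM_SUB + mm


def grouped_index(n, mm):
    """Index of CRn.MM when all CRn.0 come first, then all CRn.1, and so on."""
    return mm * NUM_CR + n


def vector_is_contiguous(index_fn, n, length):
    """True if CRn.0 .. CRn.(length-1) occupy consecutive indices."""
    idxs = [index_fn(n, mm) for mm in range(length)]
    return all(b - a == 1 for a, b in zip(idxs, idxs[1:]))
```

Under the first view a vector starting at CR4.0 walks consecutive indices; under the second view the same vector strides by 8, which is the discontiguity being discussed.
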
> 
> it was 9 days before i was able to grasp this difference fully and that
> alone is of serious concern.
> 
> if the conceptual numbering of scalar CRs can indeed be simply shifted up 4
> bits, given a different name format (CRn.MM) that binutils recognises and
> sorts out, would that be viable?
> 
> 
> > The key concepts that GCC deals with for purposes of register allocation are
> > requirements of instructions (constraints, as in extended asms) and modes
> > (closely related with types).
> > 
> > CRs and flags in general are dealt with without caring about their internal
> > representation.  
> > [ ... ]
> 
> interesting. and that's why they don't end up in the frontend because of the
> abstraction.  you'd need to bring out *CCmodes* into the frontend in order
> to apply __attribute__ to them, and that's clearly not going to be happening.
> 
> > It's all modelled abstractly, as if the condition code register held the
> > result of the compare rather than whatever bits the underlying hardware
> > uses,
> 
> sounds sensible to me.
> 
> 
> > So you won't see anything in GCC that cares that CRs are 4-bits wide and use
> > one bit for EQ, one for LT, one for GT, and one for UN, in whatever order
> > that is.  This solves some potential problems for us, because endianness of
> > those bits is not an issue.  
> 
> also elwidth overrides are meaningless so don't enter into the conversation
> at all.
>  
> > There's nothing in the IR that enables reinterpretation of CR bits as an
> > integral quantity, or vice-versa.
> 
> *click*... this may cause problems for what i called "crweird" instructions.
> 
> ah wait... there is precedent: isel and setb.  these interact to select/set
> INT regs based on a CCode (CR).
> 
> to give some context: the crweird instructions are a way to transfer
> CRs-as-predicates into scalar INTs and vice-versa.  we need these so as not
> to have to add duplicate instructions (same functionality, one on CRs one on
> INTs)
> 
>     mtcrweird: RA, BB, mask.mode
> 
>     reg = (RA|0)
>     lsb = reg[63] # MSB0 numbering sigh
>     n0 = mask[0] & (mode[0] == lsb)
>     n1 = mask[1] & (mode[1] == lsb)
>     n2 = mask[2] & (mode[2] == lsb)
>     n3 = mask[3] & (mode[3] == lsb)
>     CR{BB}.eq = n0
>     CR{BB}.lt = n1
>     CR{BB}.gt = n2
>     CR{BB}.ov = n3
> 
> you can see if that is Vectorised it is intended to put arithmetic bit 1 of
> RA into CR{BB+1} etc etc etc.
> 
> hence the name "weird".
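
A rough executable rendering of the mtcrweird pseudocode above, in case it helps to experiment with it. This is only a sketch: `mask` and `mode` are modelled as plain 4-element lists of 0/1 indexed exactly as in the pseudocode, and the vectorised helper assumes (from the sentence about "arithmetic bit 1 of RA into CR{BB+1}") that element i simply tests arithmetic bit i of RA. Both function names are made up for illustration.

```python
def mtcrweird(ra_value, mask, mode):
    """Return the eq/lt/gt/ov bits that the pseudocode writes to CR{BB}.

    ra_value: integer contents of RA (or 0)
    mask, mode: sequences of four 0/1 values, indexed as in the pseudocode
    """
    lsb = ra_value & 1  # reg[63] in MSB0 numbering is the least significant bit
    bits = [mask[i] & int(mode[i] == lsb) for i in range(4)]
    return dict(zip(("eq", "lt", "gt", "ov"), bits))


def mtcrweird_vector(ra_value, mask, mode, vl):
    """Assumed vectorised form: element i tests arithmetic bit i of RA,
    and its result is the one destined for CR{BB+i}."""
    return [mtcrweird(ra_value >> i, mask, mode) for i in range(vl)]
```

For example, `mtcrweird(1, [1, 1, 1, 1], [1, 0, 1, 0])` sets eq and gt and clears lt and ov, because only the entries whose mode bit equals the LSB of RA survive the mask.
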
> 
> now, if this *cannot be represented* in gcc we are in a bit of... um...
> schtuck.
> 
> one potential conceptual route is to internally "typecast" the CCodes into
> predication bits (including potentially the transformation process)
> 
> another would be for predication to just take a range of CR-vectors, say to
> the register allocator, "MINE! hands off!" and the CCodes side never talks
> to the predication side.
> 
> 
> 
> >  Indeed, CCmodes generally do not pass the
> > TARGET_MODES_TIEABLE_P predicate with other modes, meaning you cannot
> > reinterpret a CCmode "quantity" in a register as another mode, as you often
> > can reinterpret a wide integral mode as a narrow one, and vice-versa, when
> > the machine, the ABI and the compiler keep them extended under uniform
> > conventions.
> 
> ah.  small diversion needed, come back to numbering in a minute.
> 
> we do really need the ability to consider CRs as predicate bits, otherwise
> if we have to use only integers we lose a huge amount of capability, and the
> hardware becomes either unmanageably complex or severely
> performance-compromised.
> 
> now, whether that's done by having CCmodes also act as predicates in gcc, or not?
> 
> question: can "CCodes-with-compares" at least be "typecast" to a new kind of
> CCode, a "predicate" CCode?  or to an underlying existing predicate type in
> gcc?
> 
> i assumed that this would be possible, at least in some fashion, even if it
> requires some hoops to jump through.
> 
> for the concept behind CRs-as-predicates i copied how the Branch BO field
> works, because Branches test CCodes to do if/else in scalar in exactly the
> same way that you use predicates to perform vector "variants" of if/else:
> 
>     if x > y: # cmp creates CR
>        # branch on CR with BO created here
>        x -= 5
> 
> vector version would be:
> 
>     VectorCMP x,y # creates vector of CRs
>     svp.mask=CR,BO=gt addi x.v, -5
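
The scalar/vector parallel above can be modelled in a few lines. This is a sketch only: the CR is reduced to a dict of lt/gt/eq flags, the helper names are invented, and it assumes masked-out elements are left unchanged, which the text does not spell out.

```python
def vector_cmp(xs, ys):
    """One CR-like record per element, mirroring what a scalar cmp produces."""
    return [{"lt": x < y, "gt": x > y, "eq": x == y} for x, y in zip(xs, ys)]


def predicated_addi(xs, crs, bit, imm):
    """Add imm only to elements whose CR has the selected bit set.

    Elements whose predicate bit is clear are passed through unchanged
    (an assumption made for this sketch).
    """
    return [x + imm if cr[bit] else x for x, cr in zip(xs, crs)]
```

With `x = [10, 2, 7, 7]` and `y = [3, 9, 7, 1]`, `predicated_addi(x, vector_cmp(x, y), "gt", -5)` only touches the elements where x > y, which is the per-element equivalent of the scalar `if x > y: x -= 5`.
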
> 
> anyway.  back to numbering.
> 
> 
> > Now, the problem with "shuffled" register ordering is that the controls GCC
> > uses to tell how modes and registers relate are TARGET_HARD_REGNO_MODE_OK,
> > that tells whether a quantity in a given machine mode can be held in a given
> > register, and TARGET_HARD_REGNO_NREGS, that tells how many *consecutive*
> > registers are needed to hold that mode, starting at a given register.
> 
> right.  and MVL, which is *very specifically only allowed to be set
> statically by an immediate*, defines that quantity.
> 
> it's changeable on a "per-setmvli" basis, but it *is* a static compile-time
> quantity.
> 
> thus the compiler may decide, at the simplest crudest level, "screw it, MVL
> is hardcoded to 4 or 8 or 16 and that's the end of it", which effectively
> turns SV into a type of brain-dead but functional predication-capable SIMD
> ISA, or it may be a bit more intelligent about it and decide on a
> per-function basis what the best allocation is, to help avoid register spill.
> 
>  
> > In order for wider-than-register modes to be held in a set of registers,
> > those registers *have* to be contiguous in GCC's internal notion of the
> > register file. 
> 
> this is why i described the conceptual numbering for CRs as "being viewable
> as contiguous if you don't mind interspersing the scalar CRs every 16th
> index"
> 
> however, Alexandre, just a heads-up: REMAP *COMPLETELY* obliterates the
> expectation of linear numbering, by design.
> 
> this is something that is supported in NEON as hard-coded in LDST, called
> "Structure Packing", and it is also now in RVV 0.9.  typical uses include
> Matrix Multiplies and for getting all the RRRRR and GGGGG and BBBBB into
> contiguous registers where data was actually in RGBRGBRGB.
> 
> just so you know: REMAP can be applied to the *entire* ISA.  any arithmetic
> vector op, any MV, any LDST.
> 
> 
> > It is sometimes the case that the contiguity is not relevant
> > for the architecture, e.g., if there isn't any opcode that operates on pairs
> > of registers holding a double-word value, but these often appear when a pair
> > of consecutive registers holds a double-precision floating-point value, or a
> > widening multiply necessarily sets a pair of neighbor registers.
> 
> yes, there are a number of instruction examples in many ISAs that support
> this double-op SIMD and widen/narrow, it is no surprise then that gcc has
> had to understand this.
> 
> it will become particularly interesting, a long LONG way down the line, how
> SV's polymorphic elwidth overrides end up being implemented, ultimately.
> 
> intermediary steps there will clearly have to involve avoiding different
> src-dest overrides on arithmetic operations initially, and using
> (inefficient) patterns of MVs that "mirror" the same widening-narrowing
> explicit opcodes typically added to SIMD architectures.
> 
> that will give breathing space to allow a full research investigation into
> how to add polymorphic elwidth overrides to arithmetic ops.
> 
> i mean in a generic fashion, rather than as special-cased for certain
> specific instructions.
> 
> this btw is going to happen rather a lot: the "abstraction" of SV means that
> the compromises taken by most ISAs (only certain ops have saturation, only
> certain ops have widen/narrow) *do not have to be taken*
> 
> 
> >  When this
> > happens, the order of registers in the abstract register file in the
> > compiler has to match the order and the grouping required by the machine,
> > otherwise the allocation won't get things right.
> 
> understood.  this instinctively is why i really do not like the vertical
> stratification.  scalar registers are no longer accessible in a contiguous
> block.
> 
> 
> also this is one of the things that is making me slightly nervous about the
> CRn.MM numbering: it doesn't match precisely one-for-one with the INT/FP
> arrangement.
> 
> as in: yes you can rearrange the naming so that it *looks* contiguous, but
> try accessing them as scalar and it all goes sideways.  literally.
> 
> when treated as Vectors-of-results that generate corresponding
> Vectors-of-CRs the numbering matches.  the names are weird but the numbering
> matches.
> 
> however the moment you try to access those values as *scalar* despite the
> fact that they were just produced by an instruction just before, all hell
> breaks loose.
> 
> not only can you not *get* access to CR3.15 directly for example (you have
> to insert a predicated MV operation to copy it to CR3 or CR3.8 for example)
> you have to run a calculation to work out the FP/INT reg it's associated
> with.  something like:
> 
>      (idx&0b1111)<<3 | (idx&0b1110000)>>4
> 
> that's the relationship between CR numbering and INT/FP numbering.
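
The formula above can be checked directly. The sketch below assumes `idx` is the 7-bit linear CR index with n in the top 3 bits and MM in the low 4 bits (so CRn.MM is n*16+MM); `cr_to_reg` is the formula as written, while `reg_to_cr` is an inverse derived here for illustration and does not appear in the thread.

```python
def cr_to_reg(idx):
    """Formula from the text: MM (low 4 bits) moves up 3, n (top 3 bits) moves down 4."""
    return (idx & 0b1111) << 3 | (idx & 0b1110000) >> 4


def reg_to_cr(reg):
    """Derived inverse (an assumption of this sketch): low 3 bits of the
    INT/FP register number become n, the remaining high bits become MM."""
    return (reg & 0b111) << 4 | reg >> 3


def cr_index(n, mm):
    """Linear index of CRn.MM under the every-16th numbering."""
    return n * 16 + mm
```

For instance `cr_to_reg(cr_index(4, 1))` gives 12, and `cr_to_reg(cr_index(3, 15))` gives 123, so CR3.15 pairs with register 123 under this reading.
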
> 
> (no Jacob, just to emphasise again: making all INT/FP/CR numbering the same
> by applying the same N.MM remapping isn't ok, unfortunately, because the
> entire 18 months of hardware work would need to be abandoned and rethought)
> 
> 
> > When it comes to vectors of gprs and fprs, we didn't have the problem I'm
> > concerned about: the vector modes can just require N contiguous registers,
> > and since they appear as neighbors in the abstract register file, that works
> > just fine. 
> 
> it's "obvious" in other words.  and, in addition, once a Vector of INT/FP
> results is computed, if access to those is required explicitly by scalar
> then as long as the Vector was kept to the lower half of the regfile they
> are also accessible directly *and accessible linearly as well*.
> 
> * Vector add may start at r0, r4, r8, ...
>    r120, r124 and progresses linearly.
>    vector at r0 progresses r0 r1 r2 r3 r4
> 
> * Scalar access may be at any of r0-r63
>   so it is only the upper range of Vectors
>   that cannot be accessed.
>   (without a Vector mv, that is)
> 
> CRs on the other hand:
> 
> * Vector CRs may start at CR0.0 CR0.8
>   CR1.0 CR1.8 ... CR7.0 CR7.8
>   vectors progress CR0.0 CR0.1 CR0.2
> 
> * Scalar access MAY NOT even refer AT ALL
>   to CR1.1, CR1.2 throughout the FULL
>   RANGE of the regfile, ALL the way to
>   CR7.2 and CR7.15.
> 
> this discontiguity is why the "slightly weird" algorithm of treating scalar
> numbering differently from Vector was added.
> 
> > Unlike other wide types, the WORDS_BIG_ENDIAN predicate doesn't
> > affect the expected significance of partial values split across multiple
> > registers in vector types, so we're fine in this regard.
> 
> ok.
> 
> > However, if there are opcodes that require different groupings or orderings
> > of CRs, there will be a representation problem.  E.g., if we need CR12 to be
> > right next to CR4 because of some opcode that takes a pair of CRs by naming
> > CR4 and affecting CR4 and CR12 as a V2CC quantity, they'd have to be
> > neighbors for this V2CC allocation to be possible. 
> 
> right.  well, the only "fly in the ointment" is mfcr and the fxm version
> when used with multiple bits (which i think i'm right in saying you're not
> supposed to do but all hardware supports it).
> 
> the example involving CR4 and CR12 is actually realistic when CR12 is
> "renamed" correctly to CR4.1 (4+8=12)
> 
> accessing CR4 and CR4.1 *can* be done under a Vector op. they are contiguous.
> 
> they can NOT repeat NOT be accessed sequentially via a SCALAR op.  CR4.1 is
> not even accessible AT ALL.
> 
> CR4 and CR4.8 *would* be accessible contiguously via scalar, but not CR4.1
> 
> 
> 
> > But if in other
> > circumstances we use say a V8CC quantity starting at CR0 to refer to
> > CR0..7's 32 bits, then those 8 CRs would have to be consecutive in the
> > register file, without room for CR12 after CR4.
> 
> the idea is - was - i have already convinced myself it's a bad idea - that a
> V8CC would be "CR0.0 CR0.1 ... CR0.7"
> 
> 
>  
> > So please be careful with creative register ordering, to avoid creating
> > configurations that may end up impossible to represent without major surgery
> > in the compiler.
> 
> by going through it, above, i've basically convinced myself it's not just a
> bad idea to do vertical sequencing, it's a *really* bad idea.
>  
> > Also, keep in mind that, even if some configurations might be possible to
> > represent with the knobs I mentioned above, the rs6000/powerpc port has a
> > huge legacy of variants, so whatever we come up with sort of has to fit in
> > with *all* that legacy.  E.g., IIRC 32-bit ppc variants have long used
> > consecutive 32-bit FPRs for (float+float) double-precision-ish values, and
> > consecutive 32-bit GPRs to hold 64-bit values.  There were ABI requirements
> > to that effect, that required the abstract register file in the compiler,
> 
> wait.... whuuu???
> 
> oh god... this is the VSX/Altivec thing isn't it? 

Nope.

> where the INT/FP regfiles
> are combined then "recast" to a 32-entry 128 bit SIMD regfile, something
> like that?

VSX doesn't do that.

Back to ABI stuff...
Successive floating-point registers are used to store IBM's special (super
annoying) double-double form of long double (which I think should be relegated
permanently to the history books and 128-bit IEEE float used instead, but we
need to support what legacy programs expect...). They are also used for
float/double complex numbers IIRC, where the first register stores the real
half and the second register stores the imaginary half.

Just think of what mess can be achieved with a complex double-double number...
XD.

I think they're also used for passing by-value structs with 2 float/double
fields. Similarly, int regs are used for structs with 2 integer (one of char,
int, long, etc.) fields.
