[Libre-soc-bugs] [Bug 558] gcc SV intrinsics concept
bugzilla-daemon at libre-soc.org
Thu Jan 14 19:47:09 GMT 2021
https://bugs.libre-soc.org/show_bug.cgi?id=558
--- Comment #61 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #60)
> preamble 1: this is so long that i am going to need to do a followup summary.
> this alone may take me half an hour. patience appreciated.
>
> preamble 2
>
> found this:
> https://gcc.gnu.org/onlinedocs/gccint/Condition-Code.html
>
> which is not hugely informative but at least gives hints
>
> and this:
> https://gcc.gnu.org/onlinedocs/gccint/Machine-Independent-Predicates.html#Machine-Independent-Predicates
>
> which shows that lt/gt/le... etc (BO Branch style CR tests) are
> representable in CRs.
>
> preamble 3: the following is very relevant, and important to note that where
> RVV gcc work leads, we will have an easier (less expensive) time following.
>
> https://www.embecosm.com/2018/09/09/supporting-the-risc-v-vector-extension-in-gcc-and-llvm/
> https://gcc.gnu.org/legacy-ml/gcc/2018-09/msg00037.html
>
> BUT... it may be the case that gcc is simply "left behind" with the primary
> focus being on LLVM. instead, we may have to be the ones that take the
> initiative.
>
>
>
> (In reply to Alexandre Oliva from comment #58)
> > I haven't been able to keep up with this in detail (sorry, my attention has
> > been temporarily diverted), but I'm a little worried about how to represent
> > a "shuffled" CR register file map, if I get the right idea of what's being
> > proposed.
>
> [appreciated your concerns. for context here: we need your input to
> determine, rather quickly, if the CR remap is viable. i am USD 8,000 in
> debt on credit cards and NLnet does not donate for time, only for results.
> i apologise that this puts us under pressure, i.e. we need to make fast
> pragmatic decisions, even if they turn out mid-to-long-term to have been
> wrong.]
>
> providing some perspective: it is and it isn't shuffled. one way to imagine
> it is:
>
> * thing indexed 0 has 2 names: CR0 and CR0.0
> * thing indexed 1 has 1 name: CR0.1
> * ...
> * thing indexed 15 has 1 name: CR0.15
> * thing indexed 16 has 2 names: CR1 and CR1.0
> * thing indexed 17 has 1 name: CR1.1
> * ...
> * thing indexed 127 has 1 name: CR7.15
>
> thus it is still sequential and linear, it's just that the scalar v3.0B
> registers have been placed into positions that are now every 16th in the
> sequence.
>
> if on the other hand you think of it as being grouped linearly by CRn.0 as
> being sequentially numbered, followed by all CRn.1 etc CR0.0 CR1.0 ... CR7.0
> as indexed 0-7 then CR0.1 CR1.1 ... as indexed 8-15 and so on *then* it is
> discontiguous and yes it would cause misunderstandings and a lot of trouble
> (and, you point out below: unworkable for Vectors)
>
> it was 9 days before i was able to grasp this difference fully and that
> alone is of serious concern.
>
> if the conceptual numbering of scalar CRs can indeed be simply shifted up 4
> bits, given a different name format (CRn.MM) that binutils recognises and
> sorts out, would that be viable?
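A minimal Python sketch of the two numbering views described above, assuming 8 CR groups of 16 entries each as in the list (CR0.0 up to CR7.15). The helper names are hypothetical and do not come from binutils or gcc.

```python
# Hypothetical helpers illustrating the two ways of numbering CRn.MM.
CR_GROUPS = 8         # CR0 .. CR7
SUBS_PER_GROUP = 16   # CRn.0 .. CRn.15


def linear_index(n, mm):
    """Linear view: CRn.MM sits at n*16 + MM, so scalar CRn is every 16th slot."""
    return n * SUBS_PER_GROUP + mm


def grouped_index(n, mm):
    """Grouped view: all CRn.0 first (0-7), then all CRn.1 (8-15), and so on."""
    return mm * CR_GROUPS + n


def name(idx):
    """Name of the entry at a given linear index."""
    n, mm = divmod(idx, SUBS_PER_GROUP)
    return f"CR{n}.{mm}"


# A 4-element vector starting at CR1.0 under each view.
vec = [(1, mm) for mm in range(4)]
linear = [linear_index(n, mm) for n, mm in vec]    # consecutive indices
grouped = [grouped_index(n, mm) for n, mm in vec]  # stride-8 indices
```

Under the linear view the vector occupies consecutive indices; under the grouped view the same four entries are 8 apart, which is the discontiguity the quoted text warns about.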
>
>
> > The key concepts that GCC deals with for purposes of register allocation are
> > requirements of instructions (constraints, as in extended asms) and modes
> > (closely related with types).
> >
> > CRs and flags in general are dealt with without caring about their internal
> > representation.
> > [ ... ]
>
> interesting. and that's why they don't end up in the frontend because of the
> abstraction. you'd need to bring out *CCmodes* into the frontend in order
> to apply __attribute__ to them, and that's clearly not going to be happening.
>
> > It's all modelled abstractly, as if the condition code register held the
> > result of the compare rather than whatever bits the underlying hardware
> > uses,
>
> sounds sensible to me.
>
>
> > So you won't see anything in GCC that cares that CRs are 4-bits wide and use
> > one bit for EQ, one for LT, one for GT, and one for UN, in whatever order
> > that is. This solves some potential problems for us, because endianness of
> > those bits is not an issue.
>
> also elwidth overrides are meaningless so don't enter into the conversation
> at all.
>
> > There's nothing in the IR that enables reinterpretation of CR bits as an
> > integral quantity, or vice-versa.
>
> *click*... this may cause problems for what i called "crweird" instructions.
>
> ah wait... there is precedent: isel and setb. these interact to select/set
> INT regs based on a CCode (CR).
>
> to give some context: the crweird instructions are a way to transfer
> CRs-as-predicates into scalar INTs and vice-versa. we need these so as not
> to have to add duplicate instructions (same functionality, one on CRs one on
> INTs)
>
> mtcrweird: RA, BB, mask.mode
>
> reg = (RA|0)
> lsb = reg[63] # MSB0 numbering sigh
> n0 = mask[0] & (mode[0] == lsb)
> n1 = mask[1] & (mode[1] == lsb)
> n2 = mask[2] & (mode[2] == lsb)
> n3 = mask[3] & (mode[3] == lsb)
> CR{BB}.eq = n0
> CR{BB}.lt = n1
> CR{BB}.gt = n2
> CR{BB}.ov = n3
>
> you can see if that is Vectorised it is intended to put arithmetic bit 1 of
> RA into CR{BB+1} etc etc etc.
>
> hence the name "weird".
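A rough Python model of the mtcrweird pseudocode quoted above, as a sketch only. It assumes mask and mode are passed as sequences of four 0/1 values (sidestepping the bit-ordering question), that the (RA|0) special case is ignored, and that the Vectorised form takes arithmetic bit i of RA for element i, as the quoted text suggests. The function names are hypothetical.

```python
def mtcrweird(ra_value, mask, mode):
    """Scalar form: derive one 4-bit CR field from the LSB of RA."""
    lsb = ra_value & 1  # reg[63] in MSB0 numbering is the least significant bit
    n = [mask[i] & int(mode[i] == lsb) for i in range(4)]
    return {"eq": n[0], "lt": n[1], "gt": n[2], "ov": n[3]}


def mtcrweird_vector(ra_value, bb, mask, mode, vl):
    """Assumed Vectorised form: arithmetic bit i of RA goes to CR{BB+i}."""
    crs = {}
    for i in range(vl):
        bit = (ra_value >> i) & 1
        crs[bb + i] = mtcrweird(bit, mask, mode)
    return crs


# Example: copy bits 0 and 1 of the integer 0b10 into the eq bit of two CRs.
result = mtcrweird_vector(0b10, 3, [1, 0, 0, 0], [1, 1, 1, 1], 2)
```

In this example only the eq bit is enabled by the mask, so result[3]["eq"] is 0 and result[4]["eq"] is 1, mirroring bits 0 and 1 of the source integer.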
>
> now, if this *cannot be represented* in gcc we are in a bit of... um...
> schtuck.
>
> one potential conceptual route is to internally "typecast" the CCodes into
> predication bits (including potentially the transformation process)
>
> another would be for predication to just take a range of CR-vectors, say to
> the register allocator, "MINE! hands off!" and the CCodes side never talks
> to the predication side.
>
>
>
> > Indeed, CCmodes generally do not pass the
> > TARGET_MODES_TIEABLE_P predicate with other modes, meaning you cannot
> > reinterpret a CCmode "quantity" in a register as another mode, as you often
> > can reinterpret a wide integral mode as a narrow one, and vice-versa, when
> > the machine, the ABI and the compiler keep them extended under uniform
> > conventions.
>
> ah. small diversion needed, come back to numbering in a minute.
>
> we do really need the ability to consider CRs as predicate bits, otherwise
> if we have to use only integers we lose a huge amount of capability, and the
> hardware becomes either unmanageably complex or severely
> performance-compromised.
>
> now, is that done by having CCModes double as predicates in gcc, or not?
>
> question: can "CCodes-with-compares" at least be "typecast" to a new kind of
> CCode, a "predicate" CCode? or to an underlying existing predicate type in
> gcc?
>
> i assumed that this would be possible, at least in some fashion, even if it
> requires some hoops to jump through.
>
> the concept behind CRs-as-predicates i copied from how the Branch BO field
> works, because Branches test CCodes in exactly the same way to do if/else in
> scalar that you use predicates to perform vector "variants" of if/else:
>
>     if x > y: # cmp creates CR
>         # branch on CR with BO created here
>         x -= 5
>
> vector version would be:
>
> VectorCMP x,y # creates vector of CRs
> svp.mask=CR,BO=gt addi x.v, -5
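The scalar/vector correspondence above can be modelled in a few lines of Python. This is only an illustrative sketch: vector_cmp and predicated_addi are hypothetical names, and each CR is reduced to a dict of lt/gt/eq flags rather than a real 4-bit field.

```python
def vector_cmp(xs, ys):
    """Produce one CR-like record per element pair (only lt/gt/eq modelled)."""
    return [{"lt": x < y, "gt": x > y, "eq": x == y} for x, y in zip(xs, ys)]


def predicated_addi(xs, imm, crs, bit):
    """Add imm only to elements whose CR has the selected bit set."""
    return [x + imm if cr[bit] else x for x, cr in zip(xs, crs)]


xs = [10, 1, 7]
ys = [3, 4, 7]
crs = vector_cmp(xs, ys)                 # vector of CRs
out = predicated_addi(xs, -5, crs, "gt")  # only elements with x > y change
```

Only the first element satisfies x > y, so out is [5, 1, 7]: the predicate bit plays the role that the branch condition plays in the scalar if statement.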
>
> anyway. back to numbering.
>
>
> > Now, the problem with "shuffled" register ordering is that the controls GCC
> > uses to tell how modes and registers related are TARGET_HARD_REGNO_MODE_OK,
> > that tells whether a quantity in a given machine mode can be held in a given
> > register, and TARGET_HARD_REGNO_NREGS, that tells how many *consecutive*
> > registers are needed to hold that mode, starting at a given register.
>
> right. and MVL, which is *very specifically only allowed to be set
> statically by an immediate*, defines that quantity.
>
> it's changeable on a "per-setmvli" basis, but it *is* a static compile-time
> quantity.
>
> thus the compiler may decide, at the simplest crudest level, "screw it, MVL
> is hardcoded to 4 or 8 or 16 and that's the end of it", which effectively
> turns SV into a type of brain-dead but functional predication-capable SIMD
> ISA, or it may be a bit more intelligent about it and decide on a
> per-function basis what the best allocation is, to help avoid register spill.
>
>
> > In order for wider-than-register modes to be held in a set of registers,
> > those registers *have* to be contiguous in GCC's internal notion of the
> > register file.
>
> this is why i described the conceptual numbering for CRs as "being viewable
> as contiguous if you don't mind interspersing the scalar CRs every 16th
> index"
>
> however, Alexandre, just a heads-up: REMAP *COMPLETELY* obliterates the
> expectation of linear numbering, by design.
>
> this is something that is supported in NEON as hard-coded in LDST, called
> "Structure Packing", and it is also now in RVV 0.9. typical uses include
> Matrix Multiplies and for getting all the RRRRR and GGGGG and BBBBB into
> contiguous registers where data was actually in RGBRGBRGB.
>
> just so you know: REMAP can be applied to the *entire* ISA. any arithmetic
> vector op, any MV, any LDST.
>
>
> > It is sometimes the case that the contiguity is not relevant
> > for the architecture, e.g., if there isn't any opcode that operates on pairs
> > of registers holding a double-word value, but these often appear when a pair
> > of consecutive registers holds a double-precision floating-point value, or a
> > widening multiply necessarily sets a pair of neighbor registers.
>
> yes, there are a number of instruction examples in many ISAs that support
> this double-op SIMD and widen/narrow, it is no surprise then that gcc has
> had to understand this.
>
> it will become particularly interesting, a long LONG way down the line, how
> SV's polymorphic elwidth overrides end up being implemented, ultimately.
>
> intermediary steps there will clearly have to involve avoiding different
> src-dest overrides on arithmetic operations initially, and using
> (inefficient) patterns of MVs that "mirror" the same widening-narrowing
> explicit opcodes typically added to SIMD architectures.
>
> that will give breathing space to allow a full research investigation into
> how to add polymorphic elwidth overrides to arithmetic ops.
>
> i mean in a generic fashion, rather than as special-cased for certain
> specific instructions.
>
> this btw is going to happen rather a lot: the "abstraction" of SV means that
> the compromises taken by most ISAs (only certain ops have saturation, only
> certain ops have widen/narrow) *do not have to be taken*
>
>
> > When this
> > happens, the order of registers in the abstract register file in the
> > compiler has to match the order and the grouping required by the machine,
> > otherwise the allocation won't get things right.
>
> understood. this instinctively is why i really do not like the vertical
> stratification. scalar registers are no longer accessible in a contiguous
> block.
>
>
> also this is one of the things that is making me slightly nervous about the
> CRn.MM numbering: it doesn't match precisely one-for-one with the INT/FP
> arrangement.
>
> as in: yes you can rearrange the naming so that it *looks* contiguous, but
> try accessing them as scalar and it all goes sideways. literally.
>
> when treated as Vectors-of-results that generate corresponding
> Vectors-of-CRs the numbering matches. the names are weird but the numbering
> matches.
>
> however the moment you try to access those values as *scalar* despite the
> fact that they were just produced by an instruction just before, all hell
> breaks loose.
>
> not only can you not *get* access to CR3.15 directly for example (you have
> to insert a predicated MV operation to copy it to CR3 or CR3.8 for example)
> you have to run a calculation to work out the FP/INT reg it's associated
> with. something like:
>
> (idx&0b1111)<<3 | (idx&0b1110000)>>4
>
> that's the relationship between CR numbering and INT/FP numbering.
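A small Python sketch of the expression above, assuming idx is the 7-bit linear CR index with n in the upper three bits and MM in the lower four (so CR3.15 is idx 63). The function name is hypothetical; the arithmetic is copied from the quoted formula.

```python
def cr_to_reg(idx):
    """Swap the 4-bit MM field and the 3-bit n field of a 7-bit CR index."""
    return (idx & 0b1111) << 3 | (idx & 0b1110000) >> 4


# Assumed encoding: CR3.15 has linear index 3*16 + 15 = 63.
cr3_15 = cr_to_reg(3 * 16 + 15)

# The mapping is a permutation of 0..127: every CR index maps to a distinct number.
is_permutation = sorted(cr_to_reg(i) for i in range(128)) == list(range(128))
```

With that encoding, CR3.15 maps to 123, and consecutive CR indices within one group land 8 apart, which is why the result has to be computed rather than read off the name.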
>
> (no Jacob, just to emphasise again: making all INT/FP/CR numbering the same
> by applying the same N.MM remapping isn't ok, unfortunately, because the
> entire hardware of 18 months needs to be abandoned and rethought)
>
>
> > When it comes to vectors of gprs and fprs, we didn't have the problem I'm
> > concerned about: the vector modes can just require N contiguous registers,
> > and since they appear as neighbors in the abstract register file, that works
> > just fine.
>
> it's "obvious" in other words. and, in addition, once a Vector of INT/FP
> results is computed, if access to those is required explicitly by scalar
> then as long as the Vector was kept to the lower half of the regfile they
> are also accessible directly *and accessible linearly as well*.
>
> * Vector add may start at r0, r4, r8, ...
> r120, r124 and progresses linearly.
> vector at r0 progresses r0 r1 r2 r3 r4
>
> * Scalar access may be at any of r0-r63
> so it is only the upper range of Vectors
> that cannot be accessed.
> (without a Vector mv, that is)
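As a sketch of the INT/FP case just described, assuming a 128-entry regfile (implied by the r124 start), vector starts on multiples of 4, and scalar access limited to r0-r63. The names below are hypothetical.

```python
SCALAR_REGS = 64   # scalar ops can name r0..r63
TOTAL_REGS = 128   # assumed size of the extended regfile


def vector_regs(start, vl):
    """Registers covered by a vector of length vl starting at r{start}."""
    if start % 4 != 0 or start < 0 or start + vl > TOTAL_REGS:
        raise ValueError("vector must start at r0, r4, ... and fit in the regfile")
    return list(range(start, start + vl))  # progresses linearly


def scalar_accessible(regs):
    """Subset of those registers that a scalar op can name directly."""
    return [r for r in regs if r < SCALAR_REGS]


regs = vector_regs(60, 8)
direct = scalar_accessible(regs)
```

Here a vector starting at r60 covers r60 to r67, and only r60 to r63 remain directly reachable by scalar instructions; the rest would need a Vector mv.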
>
> CRs on the other hand:
>
> * Vector CRs may start at CR0.0 CR0.8
> CR1.0 CR1.8 ... CR7.0 CR7.8
> vectors progress CR0.0 CR0.1 CR0.2
>
> * Scalar access MAY NOT even refer AT ALL
> to CR1.1, CR1.2 throughout the FULL
> RANGE of the regfile, ALL the way to
> CR7.2 and CR7.15.
>
> this discontiguity is why the "slightly weird" algorithm of treating scalar
> numbering differently from Vector was added.
>
> > Unlike other wide types, the WORDS_BIG_ENDIAN predicate doesn't
> > affect the expected significance of partial values split across multiple
> > registers in vector types, so we're fine in this regard.
>
> ok.
>
> > However, if there are opcodes that require different groupings or orderings
> > of CRs, there will be a representation problem. E.g., if we need CR12 to be
> > right next to CR4 because of some opcode that takes a pair of CRs by naming
> > CR4 and affecting CR4 and CR12 as a V2CC quantity, they'd have to be
> > neighbors for this V2CC allocation to be possible.
>
> right. well, the only "fly in the ointment" is mfcr and the fxm version
> when used with multiple bits (which i think i'm right in saying you're not
> supposed to do but all hardware supports it).
>
> the example involving CR4 and CR12 is actually realistic when CR12 is
> "renamed" correctly to CR4.1 (4+8=12)
>
> accessing CR4 and CR4.1 *can* be done under a Vector op. they are contiguous.
>
> they can NOT repeat NOT be accessed sequentially via a SCALAR op. CR4.1 is
> not even accessible AT ALL.
>
> CR4 and CR4.8 *would* be accessible contiguously via scalar, but not CR4.1
>
>
>
> > But if in other
> > circumstances we use say a V8CC quantity starting at CR0 to refer to
> > CR0..7's 32 bits, then those 8 CRs would have to be consecutive in the
> > register file, without room for CR12 after CR4.
>
> the idea is - was - i have already convinced myself it's a bad idea - that a
> V8CC would be "CR0.0 CR0.1 ... CR0.7"
>
>
>
> > So please be careful with creative register ordering, to avoid creating
> > configurations that may end up impossible to represent without major surgery
> > in the compiler.
>
> by going through it, above, i've basically convinced myself it's not just a
> bad idea to do vertical sequencing, it's a *really* bad idea.
>
> > Also, keep in mind that, even if some configurations might be possible to
> > represent with the knobs I mentioned above, the rs6000/powerpc port has a
> > huge legacy of variants, so whatever we come up with sort of has to fit in
> > with *all* that legacy. E.g., IIRC 32-bit ppc variants have long used
> > consecutive 32-bit FPRs for (float+float) double-precision-ish values, and
> > consecutive 32-bit GPRs to hold 64-bit values. There were ABI requirements
> > to that effect, that required the abstract register file in the compiler,
>
> wait.... whuuu???
>
> oh god... this is the VSX/Altivec thing isn't it?
Nope.
> where the INT/FP regfiles
> are combined then "recast" to a 32-entry 128 bit SIMD regfile, something
> like that?
VSX doesn't do that.

Back to ABI stuff...

Successive floating-point registers are used to store IBM's special (super
annoying) double-double form of long double (which I think should be relegated
permanently to the history books and 128-bit IEEE float used instead, but we
need to support what legacy programs expect...). They are also used for
float/double complex numbers IIRC, where the first register stores the real
half and the second register stores the imaginary half.

Just think of what mess can be achieved with a complex double-double number...
XD.

I think they're also used for passing by-value structs with 2 float/double
fields. The same applies, using int regs, to structs with 2 integer (one of
char, int, long, etc.) fields.