[Libre-soc-isa] [Bug 569] svp64 register predicates vs BE arrays of bits

Wed Feb 9 10:17:04 GMT 2022

https://bugs.libre-soc.org/show_bug.cgi?id=569

--- Comment #10 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #6)

> Because that assumption is baked into LLVM, probably spread throughout the
> code, making it quite difficult to split out bit order as independent from
> byte order, we will probably want to take the path of least resistance and
> change SVP64 to have bitmasks be MSB0 in BE, and LSB0 in LE.

unfortunately, if i understand correctly, it is quite insane and deeply
problematic to follow this assumption.  let's walk through an example:

* CRs are used as predicates
* BE mode is set
* the predicate is set to say use the EQ bit
* the operation to be executed is on a vectorised "crand"
  instruction where the expectation is to combine the results
  of the crand instruction for further use in predicate masks
* VL is set to 3

what happens here, if i understand correctly, is:

* the predicate mask comprising bits CR0.EQ, CR1.EQ, and CR2.EQ
  is constructed in **REVERSED** order because this is what LLVM
  expects
* the crand operation extracts (say)
     CR8.SO  CR9.SO  CR10.SO  and ANDs them with
     CR16.GT CR17.GT CR18.GT  applying the **REVERSED** predicate mask
     CR2.EQ  CR1.EQ  CR0.EQ   storing the result in
     CR32.LT CR33.LT CR34.LT

then the next instruction, a cror, which is expecting to then use
the CR32-CR34 results as its incoming predicate, must first bit-reverse
them?  but even before that, the CR0-CR2 had to be bit-reversed.
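
the walkthrough above can be sketched in python. this is only a toy
model under stated assumptions, not the SVP64 spec: each CR field is
a dict of its four named bits, and make_crs, gather_mask, pred_crand
and run are hypothetical helper names invented for illustration.

```python
# toy model (assumption): a CR field is a dict of its four named bits
def make_crs(n=64):
    return [{"LT": 0, "GT": 0, "EQ": 0, "SO": 0} for _ in range(n)]

def gather_mask(crs, first, bit, vl, reverse=False):
    """collect one predicate bit per element, starting at CR field `first`"""
    mask = [crs[first + i][bit] for i in range(vl)]
    return mask[::-1] if reverse else mask

def pred_crand(crs, mask, dst, src_a, src_b, vl):
    """element-wise AND of CR bits; element i is skipped when mask[i] is 0"""
    (d, dbit), (a, abit), (b, bbit) = dst, src_a, src_b
    for i in range(vl):
        if mask[i]:
            crs[d + i][dbit] = crs[a + i][abit] & crs[b + i][bbit]

def run(reverse):
    crs = make_crs()
    # predicate: elements 0 and 1 enabled, element 2 disabled
    crs[0]["EQ"], crs[1]["EQ"], crs[2]["EQ"] = 1, 1, 0
    for i in range(3):
        crs[8 + i]["SO"] = 1
        crs[16 + i]["GT"] = 1
    mask = gather_mask(crs, 0, "EQ", 3, reverse)
    pred_crand(crs, mask, (32, "LT"), (8, "SO"), (16, "GT"), 3)
    return [crs[32 + i]["LT"] for i in range(3)]

print(run(False))  # mask in element order: [1, 1, 0]
print(run(True))   # mask reversed:         [0, 1, 1]
```

with the reversed mask the wrong elements are enabled, so both the
incoming predicate and the CR32-CR34 results would need an explicit
bit-reverse before a following instruction could use them.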

now, should we instead "fix" this by inverting the ordering of the
Vectors so that in BE mode they go VL-1..0 by default whereas in
LE mode they go 0..VL-1 by default? this will do people's heads in.

and, strictly, should we re-order the definition of the bit-numbering
SO GT EQ LT in BE mode so that it now becomes inverted?

this would indeed meet the strict definition required by LLVM.

but such a definition then creates insanity at the GPR/FPR level:
it's not so much any one single operation that is problematic,
it's the interaction *between* operations where things become
deeply problematic, and flipping the elwidths half way through
introduces a whole new dimension of complexity, even just
to consider let alone implement.

overall it is just easier to say "LE and BE apply to memory *ONLY*,
the GPR/FPR and CR regfile contents are strictly off limits:
CRs are already defined and do not change; GPR/FPR is strictly
defined as a LE-byte-addressable SRAM at ALL times"

i.e. as far as the hardware is concerned, the only presence of
BE byte-swapping is in the LD/ST operations, hooking an XOR
gate into ldbrx.
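
a minimal python sketch of that rule, assuming a regfile modelled as
one flat little-endian bytearray and a single XOR deciding the byte
order seen in memory. the names (REG_BYTES, mem_is_big, load, store)
are hypothetical and the model ignores everything except byte order.

```python
REG_BYTES = 8  # one 64-bit register (assumption for this sketch)

def mem_is_big(be_mode, byte_reversed):
    # the "XOR gate": a byte-reversed load/store flips whatever the mode says
    return bool(be_mode) ^ bool(byte_reversed)

def load(regfile, rt, mem, addr, width, be_mode, byte_reversed=False):
    order = "big" if mem_is_big(be_mode, byte_reversed) else "little"
    value = int.from_bytes(mem[addr:addr + width], order)
    # the regfile itself is always written little-endian
    base = rt * REG_BYTES
    regfile[base:base + REG_BYTES] = value.to_bytes(REG_BYTES, "little")

def store(regfile, rs, mem, addr, width, be_mode, byte_reversed=False):
    order = "big" if mem_is_big(be_mode, byte_reversed) else "little"
    base = rs * REG_BYTES
    value = int.from_bytes(regfile[base:base + width], "little")
    mem[addr:addr + width] = value.to_bytes(width, order)

mem = bytearray.fromhex("11223344")
regfile = bytearray(32 * REG_BYTES)
load(regfile, 1, mem, 0, 4, be_mode=True)
print(regfile[8:12].hex())  # 44332211
```

every operation other than load and store sees the same LE layout
regardless of mode, which is the point being argued here.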

byte-reversing here, byte-reversing there, byte-reversing everywhere
is just too much. it will be literally months to review.

if we had completely separate Vector Register files and completely
separate Vector Predicate Mask register files i would say "yes, no
problem".  [but, as you are aware, that then requires a whole stack
of MV/copy instructions]

however because of the retro-fitting on top of an *existing* scalar
regfile (similar to the original MMX) it's just too much.

my feeling is that when it comes to adding LLVM support to SVP64
it is going to be radically different and yet radically simpler
from every other Vector ISA, because of the for-looping.

i fully expect the "for-looping-on-scalars" concept to hit LLVM
in the exact same surprisingly-elegant way that it has in hardware,
drastically simplifying how it is added.
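
as a rough illustration of "for-looping-on-scalars", here is a python
sketch in which one generic loop wraps any scalar operation. it is a
sketch under assumptions, not how LLVM or the hardware does it:
sv_loop is a hypothetical name, the GPR file is a plain list, and bit
i of the mask (counted from the least significant end) is assumed to
control element i.

```python
import operator

MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def sv_loop(scalar_op, gpr, rt, ra, rb, vl, mask=None):
    """apply one scalar op to VL consecutive registers, honouring a predicate"""
    for i in range(vl):
        if mask is None or (mask >> i) & 1:
            gpr[rt + i] = scalar_op(gpr[ra + i], gpr[rb + i]) & MASK64

gpr = list(range(32))
sv_loop(operator.add, gpr, 16, 0, 8, 3, mask=0b101)
print(gpr[16:19])  # [8, 17, 12]
```

the same wrapper works unchanged for operator.sub, operator.and_ and
so on: the vector behaviour lives in the loop, not in a separate
definition per operation.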

and if SVP64 is damaged by being made to fit with how SIMD and
other Vector ISAs have been done (with their explicit intrinsics),
that job will be made far harder.

remember: if we follow how things are done for other Vector ISAs
in LLVM, we have ONE AND A HALF MMMILLLLION vector intrinsics.

auto-generating a header file with 1.5 million intrinsics is flat-out
insane.

therefore we *have* to go back to first principles in LLVM (and gcc)
and hit them with a lower-level-conceptual rethink, propagating the
"for-looping-on-scalars" right the way down to the IR representation.
ultimately i expect that to also drastically simplify the competing
SIMD and Vector ISA implementations but that's not our problem/focus.
