[Libre-soc-dev] [llvm-dev] [RFC] Vector/SIMD ISA Context Abstraction

Mon Aug 9 16:46:43 BST 2021

again, apologies, a follow-up: i'd like to keep the conversation going (with everyone).

a reminder / summary of the proposal:

     all basic *scalar* LLVM intrinsics are extended with *optional* arguments
     that provide Vector / SIMD Augmentation Context.

the benefit being that the number of intrinsics needed in LLVM, now and in the future, is dramatically reduced.
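
to illustrate the reduction, here is a minimal python sketch (a model only: it is not LLVM IR and not an existing API).  a single scalar operation gains an optional vector-length argument, instead of needing separate scalar and vector entry points.  the names `intrinsic` and `vl` are hypothetical, chosen purely for illustration.

```python
import operator

def intrinsic(op, a, b, *, vl=None):
    """model of one base operation with optional vector context.

    vl is a hypothetical optional argument: when it is omitted the
    operation is plain scalar, when given it runs element-wise over
    the first vl elements of a and b.
    """
    if vl is None:
        return op(a, b)                      # no context: scalar behaviour
    return [op(a[i], b[i]) for i in range(vl)]

print(intrinsic(operator.add, 2, 3))                        # 5
print(intrinsic(operator.add, [1, 2, 3], [10, 20, 30], vl=3))  # [11, 22, 33]
```

further context (predicate masks, saturation and so on) would, under this assumption, be added as further keyword arguments to the same function rather than as new functions.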

first, a clarification: Renato, you asked if the shuffle capability of LLVM SVE was sufficient: i replied slightly flippantly asking if shuffle-{any-arith-op} existed as a concept (apologies for that).

SVP64 does not have shuffle-{any-arith-op}; however, being targeted at 3D and Video, it does have Swizzle and a new concept: REMAP.  Swizzle can be applied through prefixing to all source registers.  it is well-known in the GPU world, as is its importance, and does not need describing.

REMAP is a completely new concept.  an algorithmic "remapping" is applied to the normally sequentially-incrementing Vector Element indices.  useful limited easy-to-implement "remappings" are being developed, such as Matrix Schedules (0 3 6 1 4 7 2 5 8) and RADIX-2 FFT/DCT Butterfly Schedules.
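
as a rough python sketch of the idea (a model under my own assumptions, not the SVP64 REMAP specification): the Matrix Schedule quoted above is what you get by walking a row-major 3x3 matrix column by column.  `matrix_schedule` and `remapped_add` are hypothetical names.

```python
def matrix_schedule(rows, cols):
    """yield element indices that walk a row-major rows x cols matrix
    column by column, in place of the usual 0, 1, 2, ... sequence."""
    for i in range(rows * cols):
        yield (i % rows) * cols + i // rows

def remapped_add(src_a, src_b, schedule):
    """element-wise add where src_a is read through the remapped
    indices and src_b is read sequentially."""
    return [src_a[idx] + src_b[i] for i, idx in enumerate(schedule)]

print(list(matrix_schedule(3, 3)))   # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```

the point of the model is that the arithmetic operation itself is unchanged: only the order in which element indices are produced differs.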

normally Shuffle is limited to either memory operations or to register MV operations, and both are inherently supported by SVP64 through Vectorisation of base scalar operations: Indexed LD/ST for example.

my point is that whilst SVP64 supports the "normal" type of Shuffle Operations expected of Vector ISAs (Vector-Indexed-LD, Indexed-Reg-MV) it also has GPU style Swizzle (a limited type of shuffle for short vectors up to length 4) and REMAP.

thus, there is a case even for adding shuffle-augmentation to base LLVM intrinsics as optional arguments.

the one that *is* much more general purpose but was not mentioned except in passing was VGATHER-VSCATTER.

in all other Vector ISAs these are usually either memory-only or Reg-MV operations (or both).  it's usually done with Predicate Masks. In SVP64, surprise: both VGATHER and VSCATTER are abstracted-out concepts that can apply to almost every operation.  this is not possible to do all the time, but when *both* are applied (VGATHER to the source regs or memory, VSCATTER to the dest), we call that "Twin Predication".
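
a minimal python model of Twin Predication, assuming bit i of each mask enables element i, and that masked-out source and destination elements are skipped independently of each other.  this is a sketch for discussion, not the SVP64 pseudocode, and `twin_predicated` is a hypothetical name.

```python
def twin_predicated(op, src, dest, src_mask, dest_mask):
    """apply a one-operand op, gathering from the enabled src elements
    and scattering into the enabled dest elements.  dest elements
    that are masked out keep their old values."""
    out = list(dest)
    si = di = 0
    while True:
        # skip source elements whose predicate bit is clear (gather)
        while si < len(src) and not (src_mask >> si) & 1:
            si += 1
        # skip destination elements whose predicate bit is clear (scatter)
        while di < len(out) and not (dest_mask >> di) & 1:
            di += 1
        if si >= len(src) or di >= len(out):
            break
        out[di] = op(src[si])
        si += 1
        di += 1
    return out

# gather elements 1 and 3, pack them into elements 0 and 1
print(twin_predicated(lambda x: x, [10, 20, 30, 40], [0, 0, 0, 0], 0b1010, 0b0011))
# [20, 40, 0, 0]
```

with a source mask only, this behaves as a gather (compress); with a destination mask only, as a scatter (expand).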

thus, again, we would propose adding *both* a source predicate mask *and* destination predicate mask to base llvm intrinsics, as optional arguments.

the other concept is slightly odd: element-width overrides even on operations where the source registers are specified at a fixed width already.  this one i am slightly uncertain about.

we have a Mode in SVP64 called "Saturate" which has sub-options Signed and Unsigned.  the rules for this took us some time to derive: eventually we realised that the rule has to be that the arithmetic operation appears to take place at *infinite* precision, followed by clamping to the min/max of the output bitwidth.

all other definitions turned out to be problematic in some way (particularly for multiply or power).
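
the saturation rule above is easy to model in python, because python integers are already arbitrary-precision: compute the exact result first, then clamp it to the range of the output width.  `saturate` and `sat_op` are hypothetical helper names, and this is an illustrative sketch rather than the SVP64 definition.

```python
import operator

def saturate(value, bits, signed):
    """clamp an exact integer result to the range of a bits-wide
    signed or unsigned output."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, value))

def sat_op(op, a, b, bits, signed):
    """perform op at (effectively) infinite precision, then saturate."""
    return saturate(op(a, b), bits, signed)

print(sat_op(operator.mul, 100, 100, 8, True))    # 127
print(sat_op(operator.sub, 5, 10, 8, False))      # 0
print(sat_op(operator.add, 200, 100, 8, False))   # 255
```

because the clamp is applied only once, to the exact result, the same rule works unchanged for multiply and power, where intermediate wrap-around would otherwise give wrong answers.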

what i am not certain about is whether it is perfectly sufficient to use standard base LLVM intrinsics, and count on source register type and return type as the SVP64 src width and dest width, and simply add optional arguments for signed/unsigned saturation.

however what is clear to me is that there is very little conceptual limit as to what can be added as optional arguments to base intrinsics.  it would be up to ISA Maintainers to define what they can provide in hardware.

i would very much love to hear from other ISA Maintainers as to whether the ISA they are responsible for could benefit from this approach, both in the 3D GPU World as well as standard non-GPU: ARM SVE2, x86, AMDGPU, MIPS, ppc64, SX-Aurora, everyone.

SIMD ISAs would have an optional argument specifying the (fixed) length.  Cray-style Scalar Vector ISAs would have an optional argument specifying that the length was variable.

the invitation is therefore to see if this idea, of adding optional Vectorisation Context to base llvm intrinsics, has merit across the entire LLVM community, and, if it does, what would it look like?

key question: what impact would a large number of optional arguments to LLVM base intrinsics have, on performance and memory consumption? would it be beneficial or adverse? i honestly have no idea.

another question: if a given ISA does not provide a particular hardware feature (saturation let us say) then should this be declared in some fashion such that LLVM avoids emitting llvm.add(args, sat=signed)

OR

should the functionality be provided anyway by way of soft-passes behind the scenes? i.e. the lack of hardware saturation would result in IR being emitted that ultimately performed the saturation using multiple assembly operations.
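
the two options can be contrasted in a small python sketch.  everything in it is hypothetical: the "saturate" feature name, the `emulate` switch and the pseudo-instruction names are made up for illustration, and do not correspond to any existing LLVM API or target.

```python
def lower_add(sat, target_features, emulate):
    """return a list of made-up pseudo-instruction names for an add
    carrying an optional saturation context (sat is None, "signed"
    or "unsigned")."""
    if sat is None:
        return ["add"]                        # plain scalar add
    if "saturate" in target_features:
        return ["add.sat." + sat]             # hardware does it directly
    if emulate:
        # second option: a soft-pass expands into several operations
        return ["widen", "add", "clamp." + sat, "narrow"]
    # first option: the target declared no support, so this form
    # should never have been emitted for it
    raise NotImplementedError("target declares no saturation support")

print(lower_add("signed", {"saturate"}, emulate=False))  # ['add.sat.signed']
print(lower_add("signed", set(), emulate=True))
# ['widen', 'add', 'clamp.signed', 'narrow']
```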

given that this latter approach would effectively imply that *all* LLVM IR backends "supported" SIMD and Vectorisation (emulated through IR passes for non-Vector non-SIMD hardware) it would need some serious thought.

l.