[Libre-soc-dev] [RFC] Vector/SIMD ISA Context Abstraction

Luke Kenneth Casson Leighton lkcl at lkcl.net
Sat Jul 31 00:27:10 BST 2021


(please cc me i am subscribed digest)

i have an idea which i have been meaning to float for some time.  as
context: i am the lead author of the Draft SVP64 Cray-like Vector
Extensions for the Power ISA, which is being designed for Hybrid CPU,
VPU and 3D GPU workloads.

SVP64 is similar to Broadcom VideoCore IV's "repeat" feature and to
x86 "REP" but with Vectorisation Context. unlike x86 REP which simply
repeats the following instruction, SVP64 *augments* the following
instruction to:

* change any one of the src and dest registers to scalar or vector,
* add both src *and dest* predication in some cases,
* override the element width of src and, independently, of dest
registers (8/16/32/64-bit integer or FP16/BF16/FP32/FP64), and
* add several modes including saturation, fail-first, iteration and
reduction, plus other modes never seen in any commercial ISA.

there are also two modes of operation:

* Vertical-First, which requires explicit incrementing of the Vector
Element offset (effectively turning the register file into an
indexable SRAM)
* Horizontal-First, which is equivalent to the original Cray Vectors and to RVV.

Vertical-First may be permitted to execute an arbitrary number of
elements in parallel "batches": interestingly, when those batches are
chosen at runtime to be equal to the Maximum Vector Length, that
effectively executes *all* element operations horizontally and is,
incidentally, directly equivalent to Cray-style Vector execution.
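to make the two modes concrete, here is a minimal sketch (not SVP64
semantics verbatim: the register-file-as-SRAM model, the operand
layout and the function names are illustrative assumptions only):

```python
# Horizontal-First (Cray-style): one instruction completes ALL of its
# elements before the next instruction runs.
def horizontal_first(op, regs, rd, ra, rb, vl):
    for i in range(vl):
        regs[rd + i] = op(regs[ra + i], regs[rb + i])

# Vertical-First: an explicit element-offset counter steps through the
# elements; each step executes ONE element of EVERY instruction in the
# loop body before the offset is (explicitly) incremented.
def vertical_first(ops, regs, vl):
    for offset in range(vl):            # explicit element-offset increment
        for (op, rd, ra, rb) in ops:    # the "loop body" of instructions
            regs[rd + offset] = op(regs[ra + offset], regs[rb + offset])
```

for independent element operations the two schedules produce identical
results, which is exactly the observation about Vertical-First batches
of size Maximum Vector Length degenerating into Horizontal execution.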

here's the problem:

where the Scalar Power ISA for the SFFS Compliancy subset is 214
instructions, SVP64 Context is 24 bits and consequently multiplies
those 214 instructions to well north of a QUARTER OF A MILLION ISA
Intrinsics.

adding in GPU-style Swizzle context and the Draft REMAP looping (for
Matrix Multiply, FFT, DCT, iterative reduction and other modes), it
could well be several MILLION intrinsics.
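a rough back-of-envelope count illustrates the explosion. 214 is the
SFFS instruction count from above; the per-instruction multipliers
below are illustrative assumptions, NOT the exact SVP64 encoding:

```python
# Back-of-envelope sketch of the NxM intrinsics explosion.
SFFS_INSTRUCTIONS = 214  # Scalar Power ISA, SFFS Compliancy subset
SRC_WIDTHS  = 4          # 8/16/32/64-bit (or FP16/BF16/FP32/FP64)
DEST_WIDTHS = 4          # dest width overridden independently
SRC_KIND    = 2          # each operand: scalar or vector
DEST_KIND   = 2
MODES       = 19         # assumed count of predication/saturation/
                         # fail-first/etc. combinations (illustrative)

per_instruction = SRC_WIDTHS * DEST_WIDTHS * SRC_KIND * DEST_KIND * MODES
total = SFFS_INSTRUCTIONS * per_instruction
print(total)   # 260224: north of a quarter of a million, and this is
               # before Swizzle or REMAP contexts are even considered
```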

the standard approach of autogenerating intrinsics with scripts,
making them all available in a flat header file or as c++ templates,
works extremely well for all other ISAs, but is therefore absolutely
out of the question here.

if however, instead of an NxM problem, this were turned into an N+M
one, by separating out "scalar base" from "augmentation" throughout
the IR, the problem disappears entirely.
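a minimal sketch of what N+M separation could look like in an IR (all
names and fields here are hypothetical, for illustration only: the
point is the shape, a scalar base opcode carrying an orthogonal,
optional augmentation record, not N*M pre-expanded intrinsics):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SVContext:
    # the "M" side: augmentation only, no operation semantics
    src_width: int = 64
    dest_width: int = 64
    src_is_vector: bool = False
    dest_is_vector: bool = False
    predicate: Optional[str] = None
    mode: Optional[str] = None   # e.g. "saturate", "fail-first"

@dataclass(frozen=True)
class Op:
    # the "N" side: plain scalar semantics
    opcode: str                       # e.g. "add"
    ctx: Optional[SVContext] = None   # None => ordinary scalar op

# N base opcodes + M context shapes, composed at the point of use:
vec_add = Op("add", SVContext(src_is_vector=True, dest_is_vector=True))
```

the scalar back-ends never need to know about SVContext at all, which
is why the same separation would also tidy up existing SIMD ISAs.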

the nice thing about that approach is that it also tidies up other
ISAs as well, including SIMD ones.  very few ISAs have intrinsics
which are only inherently meaningful in a Vector context (a cross
product instruction would be a perfect illustrative exception to that
rule).

even permute / shuffle Vector/SIMD operations are separable into
"base" and "abstract Vector Concept": the "base" operation in that
case being "MV.X" (scalar indexed register copy, reg[RT] =
reg[reg[RA]], with an immediate variant reg[RT] = reg[RA+imm])
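as a sketch of that decomposition (register numbers and function names
are illustrative only), a vector permute is just the scalar MV.X base
operation with the abstract Vector context looping it over elements:

```python
def mv_x(regs, rt, ra):
    # scalar "base" operation: indexed register copy,
    # reg[RT] = reg[reg[RA]]
    regs[rt] = regs[regs[ra]]

def vector_permute(regs, rt, ra, vl):
    # the "abstract Vector Concept": nothing more than the base
    # operation applied across vl consecutive elements
    for i in range(vl):
        mv_x(regs, rt + i, ra + i)
```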

the issue is that this is a massive intrusive change, effectively a
low-level redesign of LLVM IR internals for every single back-end.

on the other hand, as we make progress over the next few years with
SVP64, if there is resistance to this concept, trying to shoe-horn
SVP64 into an NxM intrinsics model is guaranteed to limit SVP64's
full capabilities or, worse, to run people's machines out of resources
during compilation, ultimately causing complaints about LLVM's
performance.

i have no idea where to go with this, and wanted to open up the floor
to alternatives as well as present an opportunity for discussion of the
ramifications, advantages and disadvantages of separating out
parallelism / vectorisation as an abstract concept from scalar "base"
intrinsics, and what that would look like in practice.

also, i have to ask: has anything like this ever been considered before?

l.
