[Libre-soc-dev] [llvm-dev] [RFC] Vector/SIMD ISA Context Abstraction

Luke Kenneth Casson Leighton lkcl at lkcl.net
Tue Aug 3 18:32:29 BST 2021


(renato thank you for cc'ing, due to digest subscription at the moment)

On Tue, Aug 3, 2021 at 3:25 PM Renato Golin <rengolin at gmail.com> wrote:
>
> On Sat, 31 Jul 2021 at 00:33, Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> if however instead of an NxM problem this was turned into  N+M,
>> separating out "scalar base" from "augmentation" throughout the IR,
>> the problem disappears entirely.
>
>
> Hi Luke,
>
> It's not entirely clear to me what you are suggesting here.

it's a nebulous but fundamentally low-level concept that may take some
time to sink in, and whose significance may take longer still to appreciate.

some background: over the past 3+ years i have made a comprehensive
comparative study of ISAs (both SIMD and Vector), the latter having only
recently been revived thanks to RVV bringing Cray-style Vectors back to
the forefront of computing research.

here is a quick summary of that:

* Packed SIMD.  the worst kind of ISA.  the following article puts
  it diplomatically:
  https://www.sigarch.org/simd-instructions-considered-harmful/
  i have no such compunction or affiliation, and as an engineer can speak
  freely and plainly.  gloves off, statement of fact: Packed SIMD
  is the worst thing to become ubiquitous in computer science, bar none.
  frankly, the sooner Packed SIMD is shot and buried as an embarrassing
  and incredibly expensive footnote in the history of computing, the better
  [there's always exceptions: for e.g. embedded DSP Audio, Packed SIMD
  is perfect].

* Predicated SIMD: by total contrast, this is actually not half bad. SVE2,
  AVX-512, GPU ISAs, anything that takes masks per element, the masks
  completely eliminate Packed SIMD Hell (except for ISAs that still have
  SIMD alignment for memory LD/STs even for Predicated LD/STs,
  but hey, nothing's perfect)

* Horizontal-First Vectors. known best from the Cray-1, also in modern ISAs
  such as NEC's SX-Aurora and now RVV, horizontal vectors are of the
  form "for i in range(VL): operation(vec_src_reg[i], vec_dest_reg[i])"

* Vertical-First Vectors.  these are *NOT* very well-known, and the only
  two ISAs that i know of (to date) are SVP64 and Mitch Alsup's MyISA 66000.

btw, here's the [terse] page on SVP64's format:
    https://libre-soc.org/openpower/sv/svp64/
the overview is much more useful for understanding:
    https://libre-soc.org/openpower/sv/overview/

additional relevant ISAs:

* Broadcom VideoCore IV which has a Vector-form of "REP", where the
  operation may be repeated 2, 4, 8, 16, 32 or "the count taken from
  scalar r0".

* the Mill Architecture, which is a "tagged" ISA.  here, there is no "ADD32",
  "ADD16", "ADD64", "ADD8", there is *ONLY* "ADD".  the width of the
  operation is taken *FROM THE LOAD OPERATION* which is the ONLY
  place where the register operand width is specified.  operations are
  DEDUCED, statically, at compile-time, based on the "tags" associated
  with source registers as they cascade through.  two strategically
  important instructions are included: WIDEN and NARROW.

the significance of mentioning the Mill is that the ISA has a closer
match to the simple (basic) LLVM intrinsics than most other ISAs.
however even there (unless they've done drastic and extensive
changes) they will be limited to trying to fit a flexible (tagged) ISA
into an inflexible IR that was never designed with "context" in mind.
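
to make the "tag" idea concrete, here is a minimal sketch in python
(illustrative only: the names and exact semantics here are assumptions
on my part, not the actual Mill definition):

    # width "tags" travel with values from the LOAD onwards, so one
    # generic "add" suffices: no ADD8/ADD16/ADD32/ADD64 variants.
    class Tagged:
        def __init__(self, value, width):
            self.value, self.width = value, width

    def load(mem, addr, width):  # the ONLY place width is specified
        return Tagged(mem[addr] & ((1 << width) - 1), width)

    def add(a, b):               # width deduced from the operand tags
        assert a.width == b.width, "WIDEN/NARROW required first"
        return Tagged((a.value + b.value) & ((1 << a.width) - 1), a.width)

    def widen(a, width):         # explicit WIDEN instruction
        assert width >= a.width
        return Tagged(a.value, width)

    mem = {0: 200, 4: 100}
    r = add(widen(load(mem, 0, 8), 16), widen(load(mem, 4, 8), 16))
    print(r.value, r.width)      # 300 16: widened first, so no overflow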

> For context:
>  * Historically, we have tried to keep as many instructions as native IR as possible to avoid the explosion of intrinsics, as you describe.

a crucially important goal that gets a big thumbs-up from me.

>  * However, traditionally, intrinsics reduce the number of instructions in a basic block instead of increasing them, so there's always the balance.

the flip side of that being that the CISC-ness of a given new
intrinsic could itself impact ISAs that don't support that feature
natively, forcing them to emit rather more assembly instructions
than at first appears.

>  * For example, some reduction intrinsics were added to address bloat, but no target is forced to use them.

excellent.  iteration and reduction (including fixed schedule
paralleliseable reduction) is one of the intrinsics being added to
SVP64.  it's good to hear that that, as a concept, has been added.  if
i may, i will use that as an example, later.

>  * If you can represent the operation as a series of native IR instructions, by all means, you should do so.

this assumes (perfectly reasonably, mind you) that the (hypothetical)
ISA itself is not capable of directly expressing a given IR operation,
and consequently it has to be done as a series of passes, substituting
for the lack of native (direct) support of a given operation with some
(faster?) operations that *do* (ultimately) exist as actual assembler.

in some architectures a particular native IR instruction might
actually exist, but the native assembler variant is so horribly slow
at the hardware level that alternatives are actually *demanded* by
users.  AVX's native Horizontal Reduction instructions would be a good
example.

> I get it that a lot of intrinsics are repeated patterns over all variations and that most targets don't have that many, so it's "ok".
>
> I also get it that most SIMD vector operations aren't intrinsically vector,

[indeed.  i have spent considerable time recently on the wikipedia
Vector_processor page, and associated nearby related pages
(SIMD, GPUs, etc), correcting that unfortunate meme that "SIMD equals
vectors".  this is unfortunately where Corporate Marketing has badly
interfered with actual Computer Science.  sigh.]

> but expansions of scalar operations for the benefit of vectorisation
>(plus predication, to avoid undefined behaviour and to allow "funny" patterns, etc).

yes.  and this perspective is where Mitch Alsup's MyISA 66000, and SVP64's
"Vertical-First" Mode come into play: the instructions in both are effectively
executed *in scalar form ONLY* (as far as Program Order is concerned),
and, at the end of a loop/branch, you *EXPLICITLY* increment the element
index, such that all *SCALAR* operations in the loop now execute on
element *one* of the Vector Register.  end of loop, explicit increment of
the element index to 2, loop back, and execute on element *two*.  repeat
until the loop-termination condition is met.
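
a minimal sketch in python of the difference between the two execution
orders (illustrative pseudocode, not real SVP64 or MyISA 66000
semantics):

    VL = 4
    a = [1, 2, 3, 4]; b = [10, 20, 30, 40]

    # horizontal-first: each instruction sweeps all VL elements
    # before the next instruction starts
    t   = [a[i] + b[i] for i in range(VL)]  # one vector add
    out = [t[i] * 2    for i in range(VL)]  # one vector mul

    # vertical-first: the whole loop body runs on ONE element, then
    # the element index is EXPLICITLY stepped and the body re-run
    out2, i = [0] * VL, 0
    while True:
        t_i = a[i] + b[i]   # scalar add, element i
        out2[i] = t_i * 2   # scalar mul, the SAME element i
        i += 1              # explicit element-index increment
        if i == VL:         # loop-termination condition
            break
    assert out == out2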

the challenge ahead for Libre-SOC (and for MyISA 66000) will be to introduce
this entirely new concept to compilers.  however given that it's effectively
scalar rather than Vector, the task should actually be a *lot* easier than it
is for Horizontal-First ISAs such as SVE and RVV.

> But it's not clear to me what the "augmentation" part would be in other targets.

the proposal is - at its heart - to replace all IR of the form:

    llvm.masked.load.v16f32.predicatespec(arguments)
    llvm.masked.load.v2f64.predicatespec(arguments)

and so on

with just:

     llvm.load(mask=x, arguments).

where there *is* no llvm.masked.load, there is *only* an optional mask
argument, handled when llvm itself runs (i.e. at compile time, from the
program's perspective), so that rather than expanding out explicitly /
statically into dozens of special IR intrinsics, *there is only one*:

     llvm.load.

additional optional arguments also then specify whether this operation
is twin-predicated by having a *second* predicate mask (yes, SVP64
can apply one predicate mask to the source, and another to the destination.
conceptually this is equivalent to back-to-back VGATHER-VSCATTER).

additional optional arguments also then specify whether there is SWIZZLE
applied, or Sub-Vectors, or any other types of "augmentation".
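
to illustrate the mask and twin-predication parts, here is a
behavioural sketch in python of the *assumed* semantics (not actual
LLVM IR): one generic load with an optional per-element mask, plus
twin predication, where a source mask selects which elements are read
and a destination mask selects where the survivors land:

    def sv_load(mem, base, vl, mask=None):
        """llvm.load with an optional per-element predicate mask."""
        if mask is None:
            mask = [True] * vl       # no mask: plain (full) load
        return [mem[base + i] if mask[i] else 0 for i in range(vl)]

    def twin_pred(op, src, src_mask, dest, dest_mask):
        """gather active source elements, apply the scalar op, then
        scatter the results to the active destination slots in order."""
        gathered = [op(x) for x, m in zip(src, src_mask) if m]
        out, j = list(dest), 0
        for i, m in enumerate(dest_mask):
            if m and j < len(gathered):
                out[i] = gathered[j]; j += 1
        return out

    print(twin_pred(lambda x: -x, [5, 6, 7, 8],
                    [1, 0, 1, 0], [0] * 4, [0, 1, 0, 1]))
    # [0, -5, 0, -7]: elements 0 and 2 gathered, written to slots 1 and 3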

now, here's the kicker:

what we need to support SVP64 is for *all llvm basic intrinsics to support
all possible optional augmentations of all possible types*.

yes, really, that's not a typo or a mis-statement.  we *genuinely* need
a sign-extended twin-predicated intrinsic:

     llvm.sext(source_mask=source_pred, dest_mask=dest_pred, source_argument)

>> even permute / shuffle Vector/SIMD operations are separateable into
>> "base" and "abstract Vector Concept": the "base" operation in that
>> case being "MV.X" (scalar register copy, indexable - reg[RT] =
>> reg[reg[RA]] and immediate variant reg[RT] = reg[RA+imm])
>
>
> Shuffles are already represented as IR instructions (insert/extract vector), so I'm not sure this clarifies much.

ok, so is it possible to do shuffle-sign-extend, shuffle-fptrunc,
shuffle-fabs, shuffle-sqrt, shuffle-log, and any other single-src
single-dest operation?

this is where the "augmentation" - the separation of PREFIX-SUFFIX
comes into play.

SVP64 has the ability to set up "SWIZZLE" contexts as well as certain
kinds of "REMAP" Schedules (triple-loop butterfly schedules) -
PREFIXes - that can be *applied* to base operations (SUFFIXes), and
which, if we were to expand all those possibilities out, would
literally create several MILLION intrinsics.
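
as a flavour of what one such prefix does, a tiny python sketch of a
SWIZZLE context applied to a base unary operation (illustrative only:
the real SVP64 swizzle encoding is not shown here):

    import math

    def swizzled_unary(op, vec4, swizzle="XYZW"):
        """apply 'op' to a 4-element sub-vector, lanes reordered
        according to the swizzle string."""
        lane = {"X": 0, "Y": 1, "Z": 2, "W": 3}
        return [op(vec4[lane[s]]) for s in swizzle]

    print(swizzled_unary(math.sqrt, [1.0, 4.0, 9.0, 16.0], "WZZX"))
    # [4.0, 3.0, 3.0, 1.0]

    # every (swizzle x base-op) pair declared as its own named
    # intrinsic is exactly the several-MILLION explosion above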

> Have you looked at the current scalable vector implementation?

briefly, yes.  i also helped with some review insights when RVV was
being added, although that was a brief glimpse into a massive world
where i was (and still am) constrained, unfortunately, by time and
resources, much as i would love that to be otherwise.

> It allows a set of operations on open-ended vectors that are controlled by a predicate, which is possibly the "augmentation" that you're looking for?

no.  what is happening there is that it is a reflection of the
limitations of the current ISAs.  i can say with 100% certainty that
the SVE implementation will not have been designed to take SVP64 into
consideration.

the reason is actually very simple and straightforward: at the time
LLVM SVE was added, SVP64 did not even exist.

so for example, let us take the new feature added in LLVM SVE: reduction.

most Vector ISAs add *explicit* reduction operations.  NEC SX-Aurora
for example has reduce-add, reduce-multiply, reduce-OR, reduce-AND,
reduce-XOR, and that's about it.

SVP64 has:

   * reduction as a fixed (paralleliseable) schedule
   * base operation.

you can LITERALLY apply reduction to... the "llvm.maximum" scalar
operation, or to divide or subtract (or any other non-commutative
operation) if you really really want to, and the ISA will go, "ok,
done.  next?".

    sv.fmax/MR FRT.v, FRA.v, FRB.v # MR means "map-reduce mode"

you can apply parallel reduction to Power ISA v3.0 Condition Register
operations, "crand" or "cror" or "crnor".

    sv.crand/MR BT.v, BA.v, BC.v

you can even apply parallel reduction to single-argument instructions
if you really, really want to: we're not going to stop that from
happening, because somebody might find it useful, given that the
parallel reduction is on a fixed power-of-2-halving Schedule that
could have practical uses, and the hardware is *required* to write out
all intermediate values into a *vector* result.
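
a python sketch of that fixed power-of-2-halving schedule (assumed
semantics: the schedule is identical whatever base operation is
plugged in, and the intermediates of each step stay visible):

    def tree_reduce(op, vec):
        """parallelisable reduction: combine pairs, halve the stride
        count, repeat.  returns the whole working vector so the
        intermediate values remain observable."""
        v, n = list(vec), len(vec)
        step = 1
        while step < n:
            for i in range(0, n - step, 2 * step):
                v[i] = op(v[i], v[i + step])  # all pairs independent
            step *= 2
        return v    # v[0] holds the reduction; the rest, intermediates

    print(tree_reduce(max, [3, 1, 4, 1, 5, 9, 2, 6]))
    # [9, 1, 4, 1, 9, 9, 6, 6]: v[0] == 9 == max of the input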

you can even apply parallel reduction Schedules to triple-argument
instructions (FMA), however there it gets tricky and complicated (and
i haven't thought it through, fully, what it actually means, i.e.
whether it's useful).  certainly if the MUL register argument is
considered scalar and the others Vector, that is actually useful
(performs repeated cumulative multiply as part of the Schedule).

does this help illustrate what i mean by "augmentation"?  there is a
"base" (scalar) operation, you "augment" it, and it *becomes*
SIMD-like, *becomes* Vector-like, *becomes* predicated, *becomes*
Swizzled, *becomes* reduced.

the development of LLVM SVE would not have taken this possibility into
account, because, put simply, it is plain common sense in something as
complex as LLVM not to waste time writing code for something that does
not have a real-world use-case.


>> the issue is that this is a massive intrusive change, effectively a
>> low-level redesign of LLVM IR internals for every single back-end.
>
>
> Not necessarily.

this would be fantastic (and a huge relief) if it can be so arranged.  one of my
biggest concerns is that what i am advocating is at such a fundamental level
that it could, if done incorrectly, be extremely disruptive.

however even now as i think about it on-the-fly: if the proposal is as
simple as adding "optional named arguments" (in the style of python
keyword arguments) to (all) base scalar LLVM intrinsics, then i think
that would work extremely well.  it would have zero impact on other
ISAs, which is a huge plus.

> For example, scalable vectors are being introduced in a way that non-scalable back-ends (mostly) won't notice.
> And it's not just adding a few intrinsics, the very concept of vectors was changed.
> There could be a (set of) construct(s) for your particular back-end that is invisible to others.

the problem is that all other Vector ISAs have constrained themselves
to 32-bit instructions (or, for GPU ISAs, often 48 or 64): they add
*explicit* O(N) opcodes.  RVV adds 192 *explicit* opcodes, embedded
into a MAJOR 32-bit opcode specifically dedicated for use by RVV, and
that was part of its original design.

ARM, likewise, will have done something similar, with SVE and SVE2.

the problem with that approach is that it is extremely limiting in the possible
permutations / combinations of *potential* instructions that *could* exist,
were there not such a limit from trying to cram everything into a 32-bit space.

[it does have to be said, however, that there are some serious practical
benefits to limiting the possibilities of an ISA: validation and verification
of silicon before spending USD 16 million on 7nm masks is a Damn Good
Reason :) and it is one that we are going to have to give some serious
thought to: how to verify the hardware for an ISA with literally several
MILLION instructions]

we have left the entirety of the Scalar Power v3.0B ISA alone (which
is 32-bit), called that "base", and added a full 32-bit Prefix (called
SVP64) which contains the Vectorisation Context.

SVP64 is - fundamentally - an O(NxM) ISA, where N ~= 250
and M is ~= 1,000 to 8,000.

actually, it's O(NxMxOxPxQ) - multiplied out in the sketch below - where:

* N~=250 is the base scalar Power v3.0B ISA
* M~=1,000-8,000 is the Vectorisation Context
* O~=2^20 (guessing here) is REMAP Schedules and
* P~=2^(3*12) is SWIZZLE Contexts for GPUs (XXYZ, WZZX)
* Q=64 (Vector Length, VL)
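
multiplying those estimates out (plain python arithmetic, using the
guesses above as-is):

    N = 250            # base scalar Power v3.0B ops
    M = 4_000          # Vectorisation Contexts (midpoint of 1,000-8,000)
    O = 2 ** 20        # REMAP Schedules (a guess)
    P = 2 ** (3 * 12)  # SWIZZLE Contexts
    Q = 64             # Vector Length (VL)

    print(f"{N * M:,}")                # 1,000,000: N*M alone is already
                                       # "several MILLION" territory
    print(f"{N * M * O * P * Q:.1e}")  # ~4.6e24: no flat intrinsic
                                       # list could ever declare this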

thus for example with Twin-Predication applied to e.g. llvm.sext or llvm.cos we
have implicit back-to-back VGATHER-VSCATTER behaviour *WITHOUT*
needing a pair of LD/ST operations or register MV operations before and
after the Vectorised operation.

we anticipate some extremely powerful compact representations, and to
be honest it may literally take several years for the full implications
of SVP64's power and flexibility to sink in.

in-register paralleliseable DCT can be done in 9 instructions, and
paralleliseable in-register-file (small) insertion-sort likely in
around 11 instructions, thanks to the Data-Dependent
Fail-on-First-Condition Mode.

we can even implement a (small) Vectorised quick-sort, in-register, fully
paralleliseable, in probably about... 20 instructions.  it's on my TODO list
to investigate.


> Of course, the more invisible things, the harder it is to validate and change intersections of code, so the change must really be worth the extra hassle.

appreciated.

> With both Arm and RISCV implementing scalable extensions, that change was deemed worthy and work is progressing.
> So, if you could leverage the existing code to your advantage, you'd avoid having to convince a huge community to implement a large breaking change.

the possibility that occurred to me above, while writing this - adding
optional arguments (containing the Vector Augmentation Context) to base
scalar llvm intrinsics - would, i believe, achieve exactly that.

if any other ISA vendors wanted to use that, they could, as a first pass, map
e.g. llvm.load(optional_predicate=xxx) *onto* llvm.masked.load(....) and thus
avoid huge disruption, and carry that out in an incremental fashion.  or not.
at their choice.
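
a hypothetical sketch in python of that first-pass mapping (the
function and naming are invented purely for illustration; this is not
an existing LLVM API):

    def lower_generic_load(elem_type, nelts, optional_predicate=None):
        """map the single generic load onto whatever the backend
        already understands, so adoption can be incremental."""
        if optional_predicate is None:
            return f"llvm.load.v{nelts}{elem_type}"     # plain load
        return f"llvm.masked.load.v{nelts}{elem_type}"  # legacy path

    print(lower_generic_load("f32", 16))           # llvm.load.v16f32
    print(lower_generic_load("f32", 16, "%mask"))  # llvm.masked.load.v16f32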

there are several variants on this theme of optional arguments of some
description to the base:

    llvm.add<optional_vector_context>(normal_arguments)

where optional_vector_context is an object of some type that
itself contains optional "augmentation" features.  i would advocate
something like this:

    llvm.add(normal_arguments,
             source_override_width=<8/16/32/64>,
             dest_override_width=<8/16/32/64>,
             saturation_mode=<signed/unsigned>,
             source_pred=xxx,
             dest_pred=yyyy,
             fail_first_mode,
             swizzle_src=<XYZW>,
             REMAP_schedules=<RA,RB,RC,RT,RS>,
             scalar_or_vector_regs=<RA=v/s, RB=v/s, RC=v/s, RT=v/s, RS=v/s>)

can you imagine expanding all of those out into a declared (flat) list
of intrinsics?  what that would do to LLVM SVE if we tried?

the "augmentation" list is absolutely massive, and starts to give some
idea of why LLVM SVE, as designed, simply won't cope, and why we have
to think about this differently.

the thing is: from the study i've made of other ISAs, i can say with
near-100% certainty that there *will* be a direct map to all of the
existing LLVM SVE intrinsics recently added, *and to those of all SIMD
ISAs as well*.

and what i expect to happen is that, instead of a massive list of thousands
of SIMD intrinsics for e.g. x86, it will reduce down to a fraction of what
is in the LLVM x86 backend right now.  in fact, i expect the exact same
reduction to occur for *all* Packed and Predicated SIMD ISAs supported by LLVM.

that will bring both a reduction in maintenance burden and, in
theeoorry, reduced compile times as well.  in theory.  in practical
terms it depends what the impact of the "optional" arguments is.

hmmm, that will need some thought.

even the Mill, i believe, could benefit, from being able to map much
more closely to the actual underlying ISA, which *only* has "ADD" (not
ADD8/16/32/64): they could potentially add an "auto" or "implicit"
option to the source-width / dest-width arguments, which would be much
more in line with how the actual ISA itself works (implicitly tagged -
polymorphic - registers)
https://millcomputing.com/docs/compiler/

> And you'd also give us one more reason for the scalable extension to exist. :)

:)

as i mentioned at the start, with that list of ISAs, there _do_ exist
other Vector ISAs, and actual hardware implementations, out there:
the NEC SX line has been shipping for decades now - first
implementations were April 1983!

   https://en.wikipedia.org/wiki/NEC_SX

here's some background:

    https://sx-aurora.github.io/

and yes, they do have a Vector Extension variant of llvm:

    https://sx-aurora.github.io/posts/llvm-ve-rv/

i do hope at some point that they come out of the woodwork and participate
in LLVM SVE, and that the product continues to ship: it's pretty incredible,
and i am delighted that NEC has had a strong enough customer base to
keep on selling and maintaining SX-Aurora.

Mitch Alsup's MyISA 66000 will need gcc and llvm at some point, and it is
another ISA with a form of Scalable Vectors - one that has been specially
designed to "thunk" down to simple scalar hardware.

thank you, Renato, for responding, it's given me the opportunity to explain
a bit more in-depth.  feel free to cc libre-soc-dev in future, we don't mind
[relevant!] cross-posts.

warmest,

l.


