[Libre-soc-dev] 3D MESA Driver

Sat Aug 8 16:04:27 BST 2020

hi vivek, welcome to libresoc, apologies for the (new) list issues.  when
replying (this applies to everyone) please always cc vivek on this thread,
as he is set up in "digest" mode.

vivek i thought it might be helpful to give a rundown of the libresoc
vector system, SimpleV, which is to be "applied" to the POWER9 ISA.

the reason is that, given your experience and desire to help, your input is
valuable as to what instructions actually go into the hardware.

normally, as a compiler writer, you would be told by the hardware engineers
what was needed: we *do not* want to repeat that.  we therefore would like
your active input in the actual hardware, to make it efficient and
effective.

so this is a really nice opportunity.

SV is effectively a hardware sub-loop (subcontext) on the standard Program
Counter.  it really is no more complex to describe than that.  details
however take 7 hours to describe in full (which I did with Alain back in
february).

the sub-loop, which runs from 0 to VL-1 (VL is Vector Length), effectively
pauses the PC and issues *multiple* scalar instructions.

example:

* addi r5, r5, 2 (VL=4)

which in POWER9 is "add the immediate 2 to r5" will actually issue:

* addi r5, r5, 2
* addi r6, r6, 2
* addi r7, r7, 2
* addi r8, r8, 2

and that basically really is all there is to it.  at no time do we have
"vector opcodes".  with few exceptions (branch, trap) all *scalar* opcodes
*become* vectorised inherently.
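
as a rough illustration only (not the actual hardware or simulator code),
the sub-loop described above can be sketched in C.  the names regs and
sv_addi are made up for this example, the flat 128-entry regfile is an
assumption, and the special handling that the real addi gives to RA=0 is
ignored:

```c
#include <stdint.h>

#define NUM_REGS 128             /* assumption: size of the integer regfile */

static uint64_t regs[NUM_REGS];  /* hypothetical flat scalar regfile */

/* what one scalar "addi RT, RA, imm" turns into when VL is set:
 * the PC is held still and the same scalar operation is issued VL
 * times, with the register numbers stepping up by one on each pass.
 * the caller must keep rt+vl and ra+vl within NUM_REGS. */
static void sv_addi(unsigned rt, unsigned ra, int64_t imm, unsigned vl)
{
    for (unsigned i = 0; i < vl; i++)
        regs[rt + i] = regs[ra + i] + (uint64_t)imm;
}
```

calling sv_addi(5, 5, 2, 4) performs the four addi operations on r5 to r8
listed in the example.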

in hardware terms, at the instruction issue phase, what will actually
happen is that the issue engine will notice the 4 (or 8 etc) VL loop, and
will "batch" these scalar instructions into SIMD groups.

however this will be done *automatically* by the hardware, precisely so
that you, as a compiler writer, do not have to massively complicate the
compiler (see "SIMD considered harmful" article).

additional details:

* SUBVL
* Predication (including a novel concept "twin predication")
* element width overrides.

# elwidth

if we think of the register file as a byte level SRAM, element width
overrides are equivalent to this:

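#include <stdint.h>

typedef union {
    uint64_t actual_reg; // scalar
    uint8_t  b[8];       // 8-bit elements
    uint16_t h[4];       // 16-bit elements
    uint32_t w[2];       // 32-bit elements
    uint64_t d[1];       // 64-bit elements
} regentry;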

regentry int_regfile[128];

here we are relying on the fact that each "actual_reg" is packed and
contiguous: the arrays "overrun" into the following registers, and this
helps us to understand conceptually what is actually going on in the
hardware.

so for scalar operations (when VL is not used) the hardware will read/write
to

    int_regfile[RA].actual_reg

and when VL is active, the b, h, w or d array is accessed instead
(deliberately overrunning into other parts of the regfile), on that for-loop
from 0 to VL-1.  which array is used depends on whether the 64 bit (or 32
bit) instruction has had the "elwidth" override set to 8, 16, 32 or
"default".

(default will use the behaviour of the *instruction*.  some instructions in
the v3.0B PowerISA manual actually say they are 32 bit rather than 64.  or,
they take only 32 bits from the operands, more like).

so in this way, we do not have to invent vector instructions add8i, add16i,
add32i etc. we can simply use "addi" for all of them by setting elwidth.
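
a minimal sketch of that idea, treating the regfile literally as bytes.
everything here is an assumption made for illustration: the helper names
(get_elem, set_elem, sv_addi_ew), the little-endian element order inside
the regfile, passing elwidth as a byte count, and results simply wrapping
at the element width:

```c
#include <stdint.h>

#define NUM_REGS 128

/* the regfile as byte level SRAM: 128 registers of 8 bytes, contiguous */
static uint8_t regfile_bytes[NUM_REGS * 8];

/* read element i (ew bytes wide: 1, 2, 4 or 8) of the vector that
 * starts at register reg.  i * ew may exceed 8, in which case the
 * access deliberately runs on into the following registers. */
static uint64_t get_elem(unsigned reg, unsigned i, unsigned ew)
{
    uint64_t v = 0;
    for (unsigned b = 0; b < ew; b++)
        v |= (uint64_t)regfile_bytes[reg * 8 + i * ew + b] << (8 * b);
    return v;
}

/* write element i, truncating v to ew bytes */
static void set_elem(unsigned reg, unsigned i, unsigned ew, uint64_t v)
{
    for (unsigned b = 0; b < ew; b++)
        regfile_bytes[reg * 8 + i * ew + b] = (uint8_t)(v >> (8 * b));
}

/* one "addi" covers every element width: only ew changes */
static void sv_addi_ew(unsigned rt, unsigned ra, int64_t imm,
                       unsigned vl, unsigned ew)
{
    for (unsigned i = 0; i < vl; i++)
        set_elem(rt, i, ew, get_elem(ra, i, ew) + (uint64_t)imm);
}
```

with ew=2 and VL=4 all four elements sit inside a single 64 bit register,
whereas with ew=4 and VL=4 elements 2 and 3 land in the next register up.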

# predication

predication is done through tagging.  one scalar register is "tagged" as
being the predicate.  each bit is then fed to the issue engine.  if the
predicate was 0b0110 for the above example using addi, then only the r6 and
r7 addi instructions would be done.

in reality, in the hardware, the 4-wide SIMD backend will receive 4 bit
"chunks" of the predicate, and this will enable/disable parts of that SIMD
operation.

again, there is no need for you as a compiler writer to do that: you set up
the predicate (up to 64 bits at a time) and issue a single operation.
hardware takes care of the details.
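
sketched in C, and again only as an illustration: sv_addi_pred and
pred_chunk are hypothetical names, the flat regfile is an assumption, and
so is the choice that masked-out elements are left untouched rather than
zeroed:

```c
#include <stdint.h>

#define NUM_REGS 128

static uint64_t regs[NUM_REGS];  /* hypothetical flat scalar regfile */

/* predicated version of the VL loop: bit i of pred enables element i.
 * vl is at most 64, matching "up to 64 bits at a time". */
static void sv_addi_pred(unsigned rt, unsigned ra, int64_t imm,
                         unsigned vl, uint64_t pred)
{
    for (unsigned i = 0; i < vl; i++) {
        if (!((pred >> i) & 1))
            continue;            /* masked out: skipped (assumption) */
        regs[rt + i] = regs[ra + i] + (uint64_t)imm;
    }
}

/* the 4 bit "chunk" of the predicate that a 4-wide SIMD backend would
 * receive for group number `group` (elements 4*group .. 4*group+3) */
static unsigned pred_chunk(uint64_t pred, unsigned group)
{
    return (unsigned)((pred >> (4 * group)) & 0xF);
}
```

sv_addi_pred(5, 5, 2, 4, 0x6) is the 0b0110 case from the text: only r6
and r7 are modified.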

# SUBVL

sometimes you want to do loops on vectors of vec2/3/4.  this is what SUBVL
is for, and effectively it is a sub-sub-loop on PC, intuitively as might be
expected.

however one key thing: predicate bits do *not* extend down individually to
SUBVL.  they apply to the *whole* vec2/3/4.

this saves a lot of bits when setting up predicates.  otherwise it would be
necessary to do bit level mask manipulation in order to expand 0b0110 into
0000 1111 1111 0000 for an array of vec4, for example, and that is costly.
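
a sketch of the sub-sub-loop, under stated assumptions: the names
sv_addi_subvl and expand_pred are invented for this example, VL is taken
to count whole vec2/3/4 groups, and the elements of each group are assumed
to sit in consecutive registers:

```c
#include <stdint.h>

#define NUM_REGS 128

static uint64_t regs[NUM_REGS];  /* hypothetical flat scalar regfile */

/* outer loop over VL groups, inner loop over the SUBVL elements of
 * each group.  one predicate bit covers a *whole* group. */
static void sv_addi_subvl(unsigned rt, unsigned ra, int64_t imm,
                          unsigned vl, unsigned subvl, uint64_t pred)
{
    for (unsigned i = 0; i < vl; i++) {
        if (!((pred >> i) & 1))
            continue;                    /* skips all subvl elements */
        for (unsigned j = 0; j < subvl; j++) {
            unsigned e = i * subvl + j;  /* flat element index */
            regs[rt + e] = regs[ra + e] + (uint64_t)imm;
        }
    }
}

/* the per-element mask expansion that per-group predication avoids.
 * requires vl * subvl <= 64. */
static uint64_t expand_pred(uint64_t pred, unsigned vl, unsigned subvl)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < vl; i++)
        if ((pred >> i) & 1)
            out |= ((1ULL << subvl) - 1) << (i * subvl);
    return out;
}
```

expand_pred(0x6, 4, 4) gives 0x0FF0, the 0000 1111 1111 0000 pattern
mentioned above, which the compiler never has to build because
sv_addi_subvl takes the 4 bit predicate directly.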

so those are the basics.  they are sufficient to turn any standard scalar
ISA into a vectorised one without actually ever having had to design a
boatload of vector instructions.

more involved however are things like NORMALISE opcodes, CORDIC, and
CROSSPRODUCT.  these very definitely are actual vector instructions and
consequently are defined in terms of vec2/3/4 as appropriate.

whilst the preliminary work has started here on these vector ISA
operations, your input would be particularly welcome (when we are not under
time pressure for the Oct 2020 deadline).

as for scalar IEEE754 opcodes such as SIN, COS, LOG1P: it should be
naturally obvious that adding these to the POWER9 ISA means that
automatically they will also become vectorised through SV.

in essence SV is about massively reducing the complexity of the work needed
by everyone: binutils, compiler writers, simulators and hardware.

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68