[Libre-soc-dev] new svp64 page

Luke Kenneth Casson Leighton lkcl at lkcl.net
Thu Dec 10 22:00:11 GMT 2020

On 12/10/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
> On Thu, Dec 10, 2020, 12:54 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>> dest elwidth overrides has secondary purposes such as providing
>> multiply widening.
> Wouldn't mul-high provide nearly all of what we need?

no, interestingly.

* mul takes N input bits and (internally) produces an N*2-bit result
* mulhi takes the *top* half of that result (throwing away the lower)
* mul-widen *specifically* allows the dest to be twice the size of the
  src and consequently receives the *full* N*2-bit result.

the only way to get a mul-widen without dest elwidth override is to
perform macro-op fusion of:

* a vec2.X mul-low
* a vec2.Y mul-hi

doable but messy, and i would suggest that we reserve this only for
doing 64x64-to-128 multiply.
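to make the distinction between the three concrete, here is a plain
python sketch (purely illustrative: the function names are mine, not
SV mnemonics, and this is arithmetic only, not the actual pipeline):

```python
# sketch of the three multiply variants for N-bit elements.
# all three compute the same full N*2-bit product internally;
# they differ only in how much of it reaches the destination.

N = 32
MASK = (1 << N) - 1

def mul_low(a, b):
    # standard mul: only the low N bits are written to the dest
    return (a * b) & MASK

def mul_hi(a, b):
    # mulhi: the top N bits are kept, the lower half thrown away
    return ((a * b) >> N) & MASK

def mul_widen(a, b):
    # dest-elwidth-override widening mul: dest element is 2N bits
    # wide, so the *entire* product is retained in one element
    return (a * b) & ((1 << 2 * N) - 1)

# the widened result is exactly hi:lo concatenated, which is why
# macro-op fusion of a vec2 mul-low/mul-hi pair can emulate it
a, b = 0xFFFF_FFFF, 0xFFFF_FFFF
assert mul_widen(a, b) == (mul_hi(a, b) << N) | mul_low(a, b)
```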

>> one of the "disadvantages" of the RISC approach where pre/post
>> processing is controlled uniformly by N bits is: some combinations
>> just do not make sense.
>> for example: saturation bits on logical ops are totally meaningless.
> used to express average, as mentioned below

... hang on, you missed the logical reasoning giving a case for why
that specific chosen example (avg) should be added as a v3.N Scalar op.

the argument *per se* is good (save opcodes).

> likewise, elwidth overrides except to truncate or zero-extend.

and sign-extend (in some cases)

> you still need elwidth overrides to get the right element size otherwise
> instructions might process too-few/too-many registers causing issues.

you've lost me here.  it is our role and responsibility to make it
absolutely clear so that no losses occur.

> There might also be microarchitectural reasons to express the elwidth and
> subvector length, even though they appear redundant for logical ops, since
> some future microarchitecture could have a separate 16/8 bit datapath from
> 32/64 or something like that, and that would allow the compiler to specify
> which datapath to use.

i am slightly confused, and also wary of anything that would require
the *compiler* to control datapaths.

i went through a microarchitecture scoreboard and associated FU design
in early 2019 that did exactly that.

it did *not* involve or need a compiler-specified datapath.  the OoO
engine was perfectly capable of detecting that the operation was VL=3
with 8-bit elwidth and passing it through to the 32-bit SIMD backend

... completely transparently.  there's a diagram somewhere of how it works.

if you were thinking along the lines of "64-bit logical operations
exist, therefore why on earth would we have elwidth overrides?  surely
that's daft: just use 64-bit operations, forget VL, forget SV" - this
is a very hazardous approach.

VL is set at a global level.  if some 8-bit vector arithmetic needs to
be done followed by some 8-bit logical ops in the same loop then not
only would an extra sv.setvl instruction be needed, that extra
instruction would also have to compute VL//8 in order to fit the 8-bit
elements into 64-bit logical operations, 8 at a time... yeah, you see
where that's going?
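a purely illustrative python sketch of that book-keeping (the names
here are mine, hypothetical, not SV mnemonics):

```python
# book-keeping forced on a loop mixing 8-bit arithmetic with 8-bit
# logical ops IF logical ops had no elwidth override and had to run
# as 64-bit operations on packed bytes.

n = 20                    # elements to process per loop iteration

# with 8-bit elwidth overrides on both op types: one VL, set once,
# serves arithmetic and logical alike
vl = n                    # e.g. sv.setvl VL=20

# without: logical ops pack 8 bytes per 64-bit register, so a
# second setvl computing ceil(VL/8) is needed mid-loop...
vl_packed = -(-n // 8)    # 3 sixty-four-bit logical operations

# ...plus manual masking of the final, partially-filled register
tail_bytes = n % 8        # 4 live bytes in the last register
tail_mask = (1 << (tail_bytes * 8)) - 1
```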

also you have to bear in mind that a nonaligned VL (VL=5, VL=7) issues
an auto-predication mask that blocks out the elements past VL in the
last result register, so that the remaining parts of that 64-bit
register do not get destroyed.

remember: we are NOT destroying the upper parts of the regfile.  they
are left UNALTERED by providing byte-level WEN lines on the regfile,
and mapping these directly to predicate mask bits.

lastly, predicates also apply at the *byte* level on 8bit elwidth
overridden operations... *including logical operations*.

the predicates apply to *elements not 64 bit register file numbers*.

the fact that we have to shove 8 bits of predicate at a time into an
8x8 SIMD ALU is *our* microarchitectural implementation choice.

other implementors may choose something different although i cannot think what.
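as a sketch of what that choice means in practice (plain python, not
the actual HDL; write_elements is a hypothetical name):

```python
# mapping per-element predicate bits to byte-level regfile
# write-enable (WEN) lines for 8-bit elwidth.  a 64-bit register
# holds 8 such elements; only bytes whose predicate bit is set get
# written, the rest are left UNALTERED.

def write_elements(reg, elements, pred, elwidth=8):
    """reg: 64-bit int; elements: list of new elwidth-bit values;
       pred: per-element predicate mask (bit i gates element i)."""
    el_mask = (1 << elwidth) - 1
    for i, el in enumerate(elements):
        if pred & (1 << i):              # predicate bit -> WEN line
            shift = i * elwidth
            reg = (reg & ~(el_mask << shift)) | ((el & el_mask) << shift)
    return reg

# VL=5 on 8-bit elements: the auto-predicate 0b11111 writes the low
# five bytes and leaves the top three bytes untouched
old = 0xAABB_CCDD_EEFF_1122
new = write_elements(old, [0x11, 0x22, 0x33, 0x44, 0x55], 0b11111)
assert new == 0xAABB_CC55_4433_2211
```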

> Except scalar avg wouldn't be that common and SV prefix can be totally
> scalar.

indeed... except SVP64 is now a 64-bit ISA, which i am pissed about.
that's a huge step backwards. sigh.

anyway: i would greatly prefer that the SV prefix be a "qualifier of
behaviour" i.e. that the decoder *not* need to analyse the SV prefix
bits to determine the operation *type*.

using the sat bits to choose the *operation* (turning logical into
add) sets a precedent that will have detrimental effects on the
decoder complexity.

it's an option, yes, however i feel that it is an option that we
should reserve for "last resort drastic pressure on opcode space".

>> it puzzles me that there's all this wonderful powerful SIMD ops yet
>> the scalar ops, absolutely crucial to do cleanup of non-aligned
>> multiples of the SIMD size, are left without corresponding ops!
> VSX has scalar avg -- after all, that's what the S in VSX stands for:
> Scalar.

ah interesting.  so they did think about it.  i'll see if i can find it.

