[Libre-soc-dev] dsld/dsrd

Tue Oct 25 12:43:37 BST 2022

On Tue, Oct 25, 2022 at 12:00 PM Jacob Lifshay <programmerjake at gmail.com> wrote:

> do you mean this hmac?

no, poly1305.  couldn't remember its name.

> you're forgetting that dsld/dsrd limit the shift amount to xlen,

64 -> 6-bit - across 2x 64-bit regs
32 -> 5-bit - across 2x 32-bit regs
16 -> 4-bit - across 2x 16-bit regs
 8 -> 3-bit - across 2x 8-bit regs

behaviour is as expected.
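
a quick sanity-check of the table above, as a minimal python sketch.
the function names are made up for illustration, and taking the shift
amount modulo xlen is an assumption (borrowed from the "sh = RB % 64"
pseudocode quoted further down), not a statement of what the spec says:

```python
def shift_amount_bits(xlen):
    """Number of bits needed to express a shift of 0..xlen-1."""
    # xlen must be a power of two (8, 16, 32, 64)
    assert xlen > 0 and xlen & (xlen - 1) == 0
    return xlen.bit_length() - 1


def effective_shift(rb, xlen):
    """ASSUMPTION: the shift amount is simply truncated to log2(xlen) bits."""
    return rb & (xlen - 1)


for xlen in (64, 32, 16, 8):
    print(xlen, "->", shift_amount_bits(xlen), "bit")
```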

> also, shorter than 64-bit shifts (or adds or multiplies or ...)
> are horribly inefficient on hardware without a wide simd backend
> where each element usually takes its own clock cycle (think in-order embedded core):

you're forgetting that the dynamic partitioned SIMD ALUs will take
care of that, along with a micro-coding engine spotting aliases,
widening back up at Issue-time, spotting that the last part is
non-8-byte, and issuing the required partition mask to knock out
the other 7 bytes.
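
to illustrate the idea only (this is not the actual Issue logic, and
issue_partitioned is a hypothetical name): a sketch that groups narrow
front-end elements into 8-byte back-end operations and hands each one
a byte-enable partition mask, assuming elements are packed contiguously:

```python
def issue_partitioned(vl, elwidth_bytes=1, backend_bytes=8):
    """Group vl front-end elements into wide back-end ops.

    Returns one byte-enable mask per back-end op
    (bit n set = byte lane n is active).
    """
    total = vl * elwidth_bytes          # total bytes of work
    masks = []
    while total > 0:
        active = min(total, backend_bytes)
        masks.append((1 << active) - 1)  # low 'active' lanes enabled
        total -= active
    return masks


# 32 8-bit elements: four full-width back-end ops
print([hex(m) for m in issue_partitioned(32)])
# 33 8-bit elements: four full ops, then one with 7 lanes knocked out
print([hex(m) for m in issue_partitioned(33)])
```

with vl=33 the last mask is 0x01, i.e. the "other 7 bytes" are masked
out of the final 64-bit operation.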

> a 256-bit shift takes 4 64-bit ops, but it takes 32(!) 8-bit ops,
> hence why bigint ops should always be 64-bit ops.

you've conflated (fallen into the trap of assuming parity) front-end
ISA with back-end SIMD implementation, there. easy enough to
do.

>> liked the principle: the example shows it can be done automatically,
>> no need to add extra assembler flags.  the only problem being
>> [for maddedu] it destroys (fights) with the other use for the 9th bit:
>> selection of RS=RT+1 [RS=RT+MAXVL, scalar has MAXVL=1]
>
>
> but you've effectively decided we always have RS=RC by the
> changes you made to ls003...

that's incorrect.

> so the ninth bit is unused now....

that's also incorrect. i had to *swap* the defaults.  the initial thought
was, "RS=RC will be useful as the default for scalar, RS=RT+MAXVL
will be useful as the default for SVP64".

turns out that's false: RS=RC is most useful as the default for
*both*.

the assembler-notation hasn't been created yet, to specify the
swapover.

> 4-operand dsld/dsrd is a 3-in 1-out instruction,

which is fine except we've already been through why that
can't work.

> unless we want it to do carry chaining like maddedu or divmod2du
> in bigint / word mode where it's 3-in 2-out

ahh.. it wouldn't surprise me in the least if it was 3-in 2-out
(what with maddedu and divmod2du turning out to be 3-in 2-out as well).

> (imho carry chaining is a good idea for symmetry with the other bigint ops
> and not needing a scalar register prefix/suffix on the vector to be shifted,
> reducing register requirements and making compilers' jobs much easier,
> though it has a down-side of creating a dependency chain).

well that happens anyway, and it can be solved with micro-coding
just like maddedu can perform redirection to a bigger (wider)
SIMD ALU with big carry-look-aheads.

> dsrd suffix requirement demo (signed right shift of 512-bit bigint):
>
> # input vector in r8..r15, little endian
> li r3, 23 # shift amount
> setvl VL=8
> # set up suffix with copy of sign bit, must be allocated right
> # after end of input vector, which is a pain to do in a compiler
> # when it's deciding where to allocate the input vector.
> sradi r16, r15, 63

meh.

> # 3-in 1-out version so you can see what happens
> # reminder: RT = (((RA << 64) | RB) >> RC) & (2 ** 64 - 1)
> sv.dsrd/mrr *r8, *r9, *r8, r3

thoughts:

* works if EXTRA2-3-2-2 (RT=2,RA=3,RB=2,RC=2) is allowed
  as a (new) EXTRA-encoding. making me jittery, that one.
* doesn't need /mr, *does* need /mrr to get reverse-gear.
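
for reference, the quoted 3-in 1-out formula can be checked against
python's own bigint arithmetic. this is a minimal sketch, not a model
of SVP64: the helper names are made up, and it deliberately reads every
input from a snapshot, so it says nothing about element ordering
(/mr vs /mrr) or about overwriting registers in-place:

```python
MASK64 = (1 << 64) - 1


def dsrd(ra, rb, rc):
    """RT = (((RA << 64) | RB) >> RC) & (2 ** 64 - 1), as quoted."""
    return (((ra << 64) | rb) >> rc) & MASK64


def to_limbs(x, n):
    """Two's-complement value -> n little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]


def from_limbs(limbs):
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def bigint_sra(limbs, sh):
    """Signed right shift of a little-endian bigint by 0..63 bits."""
    # suffix = copy of the sign bit (what 'sradi r16, r15, 63' produces)
    suffix = MASK64 if limbs[-1] >> 63 else 0
    src = list(limbs) + [suffix]  # snapshot: results never feed back in
    return [dsrd(src[i + 1], src[i], sh) for i in range(len(limbs))]


x = -(1 << 500) + 12345  # a negative 512-bit value
out = bigint_sra(to_limbs(x, 8), 23)
assert from_limbs(out) == (x >> 23) & ((1 << 512) - 1)
```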

> the carry chaining version could be (there are a few alternative definitions):
> dsld RT, RA, RB, RC
> v = RA
> sh = RB % 64
> v <<= sh
> mask = (1 << sh) - 1
> v |= RC & mask
> RT = v & (2 ** 64 - 1)
> RS = v >> 64

erm... ermermerm.... it's looking good, isn't it? RC can
then be even-numbered, fitting into standard EXTRA2.
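
the quoted carry-chaining pseudocode, transcribed into runnable python
and chained across a 512-bit bigint. the dsld body follows the quoted
definition line-for-line; the bigint_shl loop (feeding each RS into the
next element's RC, least-significant limb first) and the helper names
are assumptions about how it would be used, not settled semantics:

```python
MASK64 = (1 << 64) - 1


def dsld(ra, rb, rc):
    """Carry-chaining dsld as quoted above: returns (RT, RS)."""
    v = ra
    sh = rb % 64
    v <<= sh
    mask = (1 << sh) - 1
    v |= rc & mask          # carry-in fills the vacated low bits
    rt = v & MASK64
    rs = v >> 64            # carry-out: bits shifted off the top
    return rt, rs


def bigint_shl(limbs, sh):
    """Left-shift little-endian 64-bit limbs by sh, chaining RS -> RC."""
    carry = 0
    out = []
    for limb in limbs:      # least-significant limb first
        rt, carry = dsld(limb, sh, carry)
        out.append(rt)
    return out, carry


x = (0x0123456789ABCDEF << 448) | 0xFEDCBA9876543210
limbs = [(x >> (64 * i)) & MASK64 for i in range(8)]
out, carry = bigint_shl(limbs, 23)
result = sum(limb << (64 * i) for i, limb in enumerate(out))
assert result == (x << 23) & ((1 << 512) - 1)
assert carry == x >> (512 - 23)
```

no scalar prefix/suffix register is needed here: the only extra state
is the carry passed from one element to the next.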

> the carry chaining version of dsld as-written also works well for prefix-code encode.

branch time. let's see how it goes.  (i don't normally recommend branches)

> oh, also, texture ops are something like 6-in 2-out (just counting
> 64-bit regs, not elements), so have fun with that!

urrr tell me about it.  got some ideas on that, for another time,
involving "tagging", as was done in Snitch.  it's about the only
sane solution (the ISA WG will still freak out about the concepts
but i'd expect them to calm down after the alternative - 6in2out -
is presented).

l.