[Libre-soc-dev] dsld/dsrd

Tue Oct 25 12:00:25 BST 2022

On Tue, Oct 25, 2022, 02:40 lkcl <luke.leighton at gmail.com> wrote:

> On October 25, 2022 2:39:18 AM GMT+01:00, Jacob Lifshay
> <programmerjake at gmail.com> wrote:
>
> >I would regard elwidth overrides as much less important for dsld/dsrd,
> >they are mostly only useful for full 64-bit, sld/srd can be easily used
> >when smaller elwidths are needed.
>
> initially i thought so too, but then realised that sld/srd are
> also single-reg value shift, where dsrd/ew=32 is still twin-regs
> and can still be used to do bigint-shifting.
>
> which doesn't "seem" to matter until you look at chacha20 hmac
>

do you mean this hmac?
https://en.wikipedia.org/wiki/Hmac
if you don't and you're just talking about chacha20 without specifically
that cryptographic-hash-based message authentication code algorithm, please
don't refer to it as chacha20 hmac, since chacha20 is quite useful without
hmac, e.g. in chacha20-poly1305.

> which uses 17 byte arithmetic.  everyone else rounds that up
> to the nearest 32, 64 (or even 128-bit) integer arithmetic.
>

you're forgetting that dsld/dsrd limit the shift amount to xlen, they
(without additional dynamic register offset) can't do a bigint shift by
more than xlen even if you tried removing that limit because they don't
receive the input bits from more than 1 xlen-word away. so using elwid=8
VL=17 limits you to max shift amount of 7.
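
a minimal python sketch of that limit, not the actual spec: it assumes the
3-in 1-out form RT = (((RA << w) | RB) >> sh) & (2 ** w - 1) at element
width w, and assumes the shift amount is reduced modulo w. the helper names
(dsrd_elem, shr_one_pass, to_elems, from_elems) are made up for
illustration.

```python
def dsrd_elem(ra, rb, sh, width):
    # 3-in 1-out double-width right shift at one element width.
    # assumption: the shift amount is reduced modulo the element width.
    mask = (1 << width) - 1
    sh %= width
    return ((((ra & mask) << width) | (rb & mask)) >> sh) & mask


def shr_one_pass(elems, sh, width):
    # one pass over a little-endian vector: each output element only sees
    # its own input element and the next-higher one (zero suffix at the top)
    padded = elems + [0]
    return [dsrd_elem(padded[i + 1], padded[i], sh, width)
            for i in range(len(elems))]


def to_elems(value, count, width):
    # split an integer into little-endian elements of the given width
    return [(value >> (width * i)) & ((1 << width) - 1) for i in range(count)]


def from_elems(elems, width):
    # inverse of to_elems
    return sum(e << (width * i) for i, e in enumerate(elems))


value = 0x0123456789ABCDEF0123456789ABCDEF01  # 17 bytes
elems = to_elems(value, 17, 8)
# shift amounts up to 7 work in a single pass at elwid=8
assert from_elems(shr_one_pass(elems, 7, 8), 8) == value >> 7
# asking for 9 only shifts by 9 % 8 == 1 in this model
assert from_elems(shr_one_pass(elems, 9, 8), 8) == value >> 1
```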

also, shorter than 64-bit shifts (or adds or multiplies or ...) are
horribly inefficient on hardware without a wide simd backend where each
element usually takes its own clock cycle (think in-order embedded core):
a 256-bit shift takes 4 64-bit ops, but it takes 32(!) 8-bit ops, hence why
bigint ops should always be 64-bit ops.
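
the op counts above are just a ceiling division. a tiny sketch, assuming one
element op per clock cycle as on the in-order core described (element_ops is
a made-up helper name):

```python
def element_ops(total_bits, elwidth):
    # one op per element, rounding up to cover a partial top element
    return -(-total_bits // elwidth)


# a 256-bit bigint shift at each element width
for elwidth in (64, 32, 16, 8):
    print(elwidth, element_ops(256, elwidth))
```

the same arithmetic gives 17 ops for a 17-byte value at elwid=8 but only 3
at 64-bit.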

>
> >> or svoffset) which begs the question: is it worthwhile to have
> >> some form of special (non-orthogonal) behaviour involving
> >> RC and the 9th bit of EXTRA which is free in EXTRA2 4-operand
> >> form?
> >
> >how about just defining the 9th bit to instead make RB (or RA) be EXTRA3
> >form for all 4-operand instructions? that is also useful for maddedu
> >where RB (or RA) is the scalar multiplier and you need to specify
> >successive registers in the vector/bigint you're multiplying by:
>
> liked the principle: the example shows it can be done automatically,
> no need to add extra assembler flags.  the only problem being
> [for maddedu] it destroys (fights) with the other use for the 9th bit:
> selection of RS=RT+1 [RS=RT+MAXVL, scalar has MAXVL=1]
>

but you've effectively decided we always have RS=RC by the changes you made
to ls003...
so the ninth bit is unused now....
https://bugs.libre-soc.org/show_bug.cgi?id=960#c2

>
> >> or, to attempt 3-operand EXTRA3 with 4 operands, treating the
> >> shift source as mandatory scalar, for example?
> >
> >no, vector shift source is specifically needed for prefix-code
> >encoding.
>
> yeah i worked that out afterwards (doh), retrospectively.
>
> my thoughts are here for prefix-code, just use sm=2 variant and
> go into the loop with a copy of the shift-amount.  then follow up
> with a separate OR-reduction instruction.
>
> (we are 100% categorically not going to be adding 4-in 1-out 64-bit
> reg instructions, we had that conversation already.  3-in 2-out is
> pushing the limit as it is, and is only justifiable because of RTp,
> RTa, the very existence of VSX, and the LD-ST-update instructions).
>

4-operand dsld/dsrd is a 3-in 1-out instruction, unless we want it to do
carry chaining like maddedu or divmod2du in bigint / word mode where it's
3-in 2-out (imho carry chaining is a good idea for symmetry with the other
bigint ops and not needing a scalar register prefix/suffix on the vector to
be shifted, reducing register requirements and making compilers' jobs much
easier, though it has a down-side of creating a dependency chain).

dsrd suffix requirement demo (signed right shift of 512-bit bigint):

# input vector in r8..r15, little endian
li r3, 23 # shift amount
setvl VL=8
# set up suffix with copy of sign bit, must be allocated right
# after end of input vector, which is a pain to do in a compiler
# when it's deciding where to allocate the input vector.
sradi r16, r15, 63
# 3-in 1-out version so you can see what happens
# reminder: RT = (((RA << 64) | RB) >> RC) & (2 ** 64 - 1)
sv.dsrd/mrr *r8, *r9, *r8, r3
# output in r8..r15, little endian
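
a python model of what the demo computes, as a sketch only: it uses the
RT formula from the reminder comment, models the sradi suffix as an extra
ninth word, and reads every input from a snapshot list, so it does not
model the in-place element ordering of the real instruction. the names
dsrd_3in, signed_shr_512, to_words and from_words are made up for
illustration.

```python
MASK64 = (1 << 64) - 1


def dsrd_3in(ra, rb, rc):
    # RT = (((RA << 64) | RB) >> RC) & (2 ** 64 - 1)
    return (((ra << 64) | rb) >> rc) & MASK64


def signed_shr_512(words, sh):
    # words: 8 little-endian 64-bit words (r8..r15 in the demo)
    assert len(words) == 8 and 0 <= sh < 64
    # models "sradi r16, r15, 63": all ones if negative, else zero
    suffix = MASK64 if words[7] >> 63 else 0
    src = words + [suffix]
    return [dsrd_3in(src[i + 1], src[i], sh) for i in range(8)]


def to_words(value, count):
    # two's complement split into little-endian 64-bit words
    return [(value >> (64 * i)) & MASK64 for i in range(count)]


def from_words(words):
    # unsigned reassembly of little-endian 64-bit words
    return sum(w << (64 * i) for i, w in enumerate(words))


value = -(1 << 500) + 12345
shifted = signed_shr_512(to_words(value, 8), 23)
# matches python's arithmetic right shift, viewed as 512-bit two's complement
assert from_words(shifted) == (value >> 23) & ((1 << 512) - 1)
```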

the carry chaining version could be (there are a few alternative
definitions):
dsld RT, RA, RB, RC
v = RA
sh = RB % 64
v <<= sh
mask = (1 << sh) - 1
v |= RC & mask
RT = v & (2 ** 64 - 1)
RS = v >> 64

dsrd RT, RA, RB, RC
v = RA << 64
sh = RB % 64
v >>= sh
RS = v & (2 ** 64 - 1)
mask = ~((2 ** 64 - 1) >> sh)
v >>= 64
v |= RC & mask
RT = v

the carry chaining version of dsld as-written also works well for
prefix-code encode.
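
a runnable python transcription of the two definitions above, plus a sketch
of how the carry would chain across a little-endian vector of 64-bit words.
the chaining loops (bigint_shl, bigint_shr), the fill parameter and the
word helpers are my illustration of the idea, not a settled definition.

```python
MASK64 = (1 << 64) - 1


def dsld(ra, rb, rc):
    # returns (RT, RS); RC supplies the low sh bits, RS is the carry out
    sh = rb % 64
    v = (ra << sh) | (rc & ((1 << sh) - 1))
    return v & MASK64, v >> 64


def dsrd(ra, rb, rc):
    # returns (RT, RS); RC supplies the top sh bits, RS is the carry out
    sh = rb % 64
    v = (ra << 64) >> sh
    rs = v & MASK64
    mask = ~(MASK64 >> sh) & MASK64  # top sh bits
    return (v >> 64) | (rc & mask), rs


def bigint_shl(words, sh):
    # carry runs from least to most significant word
    out, carry = [], 0
    for w in words:
        rt, carry = dsld(w, sh, carry)
        out.append(rt)
    return out, carry


def bigint_shr(words, sh, fill=0):
    # carry runs from most to least significant word; fill=MASK64 gives a
    # signed shift of a negative value with no separate suffix register
    out, carry = [0] * len(words), fill
    for i in reversed(range(len(words))):
        out[i], carry = dsrd(words[i], sh, carry)
    return out


def to_words(value, count):
    return [(value >> (64 * i)) & MASK64 for i in range(count)]


def from_words(words):
    return sum(w << (64 * i) for i, w in enumerate(words))


value = 0xFEDCBA98765432100123456789ABCDEFDEADBEEFCAFEF00D8000000000000001
words = to_words(value, 4)
shl, carry_out = bigint_shl(words, 13)
assert from_words(shl) == (value << 13) & ((1 << 256) - 1)
assert carry_out == value >> (256 - 13)
assert from_words(bigint_shr(words, 13)) == value >> 13
```

the loops also show the dependency chain mentioned earlier: each element
needs the RS of the previous one before it can complete.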

oh, also, texture ops are something like 6-in 2-out (just counting 64-bit
regs, not elements), so have fun with that!

Jacob