[Libre-soc-dev] dsld/dsrd

Thu Oct 27 00:33:07 BST 2022

On Tue, Oct 25, 2022, 04:43 lkcl <luke.leighton at gmail.com> wrote:

> On Tue, Oct 25, 2022 at 12:00 PM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > do you mean this hmac?
>
> no, poly1305.  couldn't remember its name.
>
> > you're forgetting that dsld/dsrd limit the shift amount to xlen,
>
> 64-> 6-bit - across 2x 64-bit regs
> 32-> 5-bit - across 2x 32-bit regs
> 16-> 4-bit - across 2x 16-bit regs
> 8-> 3-bit - across 2x 8-bit regs
>
> behaviour is as expected.
>
> > also, shorter than 64-bit shifts (or adds or multiplies or ...)
> > are horribly inefficient on hardware without a wide simd backend
> > where each element usually takes its own clock cycle (think in-order
> > embedded core):
>
> you're forgetting the dynamic partitioned simd ALUs

I specifically am talking about hardware without simd ALUs

> will take
> care of that, and a micro-coding engine spotting aliases,

a tiny core won't have that -- too complex.

> widening
> back up at Issue-time, spotting that the last part is non-8-byte,
> and issuing the required partition mask to knock out the other
> 7 bytes.
>
> > a 256-bit shift takes 4 64-bit ops, but it takes 32(!) 8-bit ops,
> > hence why bigint ops should always be 64-bit ops.
>
> you've conflated (fallen into the trap of assuming parity) front-end
> ISA with back-end SIMD implementation, there. easy enough to
> do.
>

nope, i am specifically talking about a cpu without a simd backend, you
just missed where i said that.
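
to put numbers on the 4-vs-32 claim quoted above, here is a minimal sketch. it assumes the case i described (no simd backend, one element op issued per step); the function name is made up for illustration:

```python
def ops_for_bigint_shift(total_bits, elwidth):
    """Number of element-width ops needed to cover a bigint of total_bits,
    assuming one op per element (no SIMD backend to merge them)."""
    # ceiling division, so a partial trailing element still costs one op
    return -(-total_bits // elwidth)


if __name__ == "__main__":
    for elwidth in (64, 32, 16, 8):
        print(elwidth, ops_for_bigint_shift(256, elwidth))
```

on such a core the op count scales inversely with element width, which is the whole argument for keeping bigint ops at 64-bit.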

>
> >> liked the principle: the example shows it can be done automatically,
> >> no need to add extra assembler flags.  the only problem being
> >> [for maddedu] it destroys (fights) with the other use for the 9th bit:
> >> selection of RS=RT+1 [RS=RT+MAXVL, scalar has MAXVL=1]
> >
> >
> > but you've effectively decided we always have RS=RC by the
> > changes you made to ls003...
>
> that's incorrect.
>
> > so the ninth bit is unused now....
>
> that's also incorrect. i had to *swap* the defaults.

well, as i stated on the bug report, it would be much easier to figure that
out had you originally discussed that on the mailing list (or even
responded to my comment there -- iirc you didn't till now).

> the initial thought
> was, "RS=RC will be useful as the default for scalar, RS=RT+MAXVL
> will be useful as the default for SVP64".
>
> turns out that's false: RS=RC is most useful as the default for
> *both*.
>
> the assembler-notation hasn't been created yet, to specify the
> swapover.
>
> > 4-operand dsld/dsrd is a 3-in 1-out instruction,
>
> which is fine except we've already been through why that
> can't work.
>
> > unless we want it to do carry chaining like maddedu or divmod2du
> > in bigint / word mode where it's 3-in 2-out
>
> ahh.. it wouldn't surprise me in the least if it was 3-in 2-out
> (what with maddedu and divmod2du turning out to be).
>
> > (imho carry chaining is a good idea for symmetry with the other bigint
> > ops and not needing a scalar register prefix/suffix on the vector to be
> > shifted, reducing register requirements and making compilers' jobs much
> > easier, though it has a down-side of creating a dependency chain).
>
> well that happens anyway, and it can be solved with micro-coding
> just like maddedu can perform redirection to a bigger (wider)
> SIMD ALU with big carry-look-aheads.
>
> > dsrd suffix requirement demo (signed right shift of 512-bit bigint):
> >
> > # input vector in r8..r15, little endian
> > li r3, 23 # shift amount
> > setvl VL=8
> > # set up suffix with copy of sign bit, must be allocated right
> > # after end of input vector, which is a pain to do in a compiler
> > # when it's deciding where to allocate the input vector.
> > sradi r16, r15, 63
>
> meh.
>
> > # 3-in 1-out version so you can see what happens
> > # reminder: RT = (((RA << 64) | RB) >> RC) & (2 ** 64 - 1)
> > sv.dsrd/mrr *r8, *r9, *r8, r3
>
> thoughts:
>
> * works if EXTRA2-3-2-2 (RT=2,RA=3,RB=2,RC=2) is allowed
>   as a (new) EXTRA-encoding. making me jittery, that one.
> * doesn't need /mr, *does* need /mrr to get reverse-gear.
>
i wrote /mrr...
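
for anyone following along, the 3-in 1-out formula from the demo above (RT = (((RA << 64) | RB) >> RC) & (2 ** 64 - 1)) can be modelled in plain python. this is only a sketch: it works out-of-place on a list of limbs, so it says nothing about /mr or /mrr element ordering or register allocation, and the helper names (to_limbs, from_limbs, bigint_sra) are made up for illustration. it assumes 0 <= sh < 64.

```python
MASK64 = 2**64 - 1


def dsrd(ra, rb, rc):
    """3-in 1-out form: RT = (((RA << 64) | RB) >> RC) & (2 ** 64 - 1)."""
    return (((ra << 64) | rb) >> rc) & MASK64


def to_limbs(x, n):
    """Split x (two's complement if negative) into n little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]


def from_limbs(limbs):
    """Reassemble little-endian 64-bit limbs into an unsigned integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def bigint_sra(limbs, sh):
    """Signed right shift of a little-endian bigint, assuming 0 <= sh < 64."""
    # suffix word holding copies of the sign bit (the sradi r16, r15, 63 step)
    suffix = MASK64 if limbs[-1] >> 63 else 0
    ext = limbs + [suffix]
    # out-of-place: each result limb pairs a limb with the one above it
    return [dsrd(ext[i + 1], ext[i], sh) for i in range(len(limbs))]
```

the suffix word is what forces the extra register right after the input vector in the 3-in 1-out form.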

>
> > the carry chaining version could be (there are a few alternative
> > definitions):
> > dsld RT, RA, RB, RC
> > v = RA
> > sh = RB % 64
> > v <<= sh
> > mask = (1 << sh) - 1
> > v |= RC & mask
> > RT = v & (2 ** 64 - 1)
> > RS = v >> 64
>
> erm... ermermerm.... it's looking good, isn't it? RC can
> then be even-numbered, fitting into standard EXTRA2.
>
> > the carry chaining version of dsld as-written also works well for
> > prefix-code encode.
>
> branch time. let's see how it goes.  (i don't normally recommend branches)
>

idk how you're recommending we use branches...using vectorized dsld with
or-reduction seems entirely obvious to me.
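
as a sanity check, the carry-chaining dsld pseudocode quoted above translates directly to python. the chaining itself (RS of one element fed in as RC of the next, starting from zero) is my assumption about how the vectorized form would be wired up, and the helper names are made up for illustration:

```python
MASK64 = 2**64 - 1


def dsld(ra, rb, rc):
    """Carry-chaining dsld as quoted: returns (RT, RS)."""
    v = ra
    sh = rb % 64
    v <<= sh
    mask = (1 << sh) - 1
    v |= rc & mask          # bring in the low bits carried from the limb below
    rt = v & MASK64
    rs = v >> 64            # bits shifted out the top, carried to the limb above
    return rt, rs


def to_limbs(x, n):
    """Split unsigned x into n little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]


def from_limbs(limbs):
    """Reassemble little-endian 64-bit limbs into an unsigned integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def bigint_shl(limbs, sh):
    """Left shift of a little-endian bigint; assumes RS chains into the next RC."""
    carry = 0
    out = []
    for limb in limbs:      # least significant limb first
        rt, carry = dsld(limb, sh, carry)
        out.append(rt)
    return out, carry       # final carry holds the bits shifted off the top
```

note there is no scalar prefix/suffix register here: the carry travels through RS/RC instead, at the cost of the dependency chain mentioned earlier.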

>
> > oh, also, texture ops are something like 6-in 2-out (just counting
> > 64-bit regs, not elements), so have fun with that!
>
> urrr tell me about it.  got some ideas on that (for another time),
> involving "tagging", as was done in Snitch.  it's about the only
> sane solution (the ISA WG will still freak out about the concepts
> but i'd expect them to calm down after the alternative - 6in2out -
> is presented).
>

nah, imho we *really need* that many inputs: almost all of them are
different for every texture op, so putting them in sprs or something
doesn't really work. it can be done with a 4-arg instruction where all 4
args are register pairs (some of those are really f32x4 or i32x4).

for comparison, AMD's texture instructions have like 8 input vector
registers.

The ISA WG can just calm down: if a gpu implements the texture
instructions, it really needs all those inputs in order to be compliant with
the opengl/vulkan specs, so that many inputs is appropriate. trying to reduce
input count by spreading them across multiple instructions just makes
texture ops that much slower and isn't helpful imho. if the cpu doesn't
want to implement the texture ops because it thinks they're too expensive,
you know it's not seriously trying to be a gpu and that's fine.

Jacob