[Libre-soc-dev] SVP64 Vectorised add-carry => big int add

Jacob Lifshay programmerjake at gmail.com
Tue Apr 19 20:12:05 BST 2022


On Tue, Apr 19, 2022, 04:43 lkcl <luke.leighton at gmail.com> wrote:

>
>
> On April 19, 2022 10:12:44 AM UTC, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >it occurred to me that, assuming we want less than 4-in 2-out,
>
> to stop the OPF ISA WG freaking out, yes.
>
> >mule's pseudocode:
> >mule RT, RA, RB:
> >prod = RA * RB + CARRY # 64-bit * 64-bit + 64-bit -> 128-bit
> >RT = LOW_HALF(prod)
> >CARRY = HIGH_HALF(prod)
>
> modified to RT, RA, RB, RC, i love it.
>
> [i explained already, SPRs add to context-switch cost and are not
> acceptable, GF being an exception because the number of operands needed
> otherwise becomes unworkable, plus it is the entire group of GF ops that
> need a modulo]
>
> rewriting as 4-arg:
>
> mule RT, RA, RB, RC
>      prod = RA*RB+RC
>      RT = LO(prod)
>      RC = HI(prod)
>
> interestingly by allowing *RC* to be Vectorised the possibilities open up
> even more.  in previous discussions on these MULADDs we agreed that RC, as
> an accumulator, was less appropriate to prioritise as a Vector register.
> however in the bigint math case it is the *multiplicand* (RB) that is the
> scalar.
>
>
> >the div inner loop would end up as:
> ># vn[] is in r32, qhat is in r3, un[] is in r64
> >li r0, 0
> >mtspr CARRY, r0 # clear carry for multiplication
> >subfc r0, r0, r0 # set CY for subtraction
> >sv.mule r96.v, r32.v, r3.s  # r96... = r32... * r3
> >sv.sube r64.v, r64.v, r96.v # r64... = r64... - r96...
>
> did you mean subfe here? (subfe: RT = ~RA + RB + CA)
>

yes, subfe a, b, c is sube a, c, b ... just like subf a, b, c is sub a, c, b

> how would the fixup condition be detected?


assuming the loop is for(i = 0; i <= n; i++), then the fixup condition is
simply !CY: the add-back step is needed exactly when the subtraction borrows
out of the top limb, which on Power shows up as CA = 0 at the end of the
subfe chain.

> can you add it to the c source, i'm not going to be able to cope with the
> carry-analysis over HI/LO boundaries
>

sure.
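
in the meantime, here's a rough scalar Rust model of what that pair of ops
computes per row (the function name is made up for illustration, and I'm
assuming vn[] carries an extra zero top limb so the n+1-limb product lines
up with the n+1 limbs of un[] being subtracted from -- this is my reading
of the pseudocode above, not the final C):

/// scalar model of the sv.mule / sv.subfe pair in the div inner loop;
/// returns true when the fixup (add-back) is needed, i.e. !CY at the end
fn mulsub_needs_fixup(un: &mut [u64], vn: &[u64], qhat: u64) -> bool {
    debug_assert_eq!(un.len(), vn.len());
    let mut mul_carry: u64 = 0; // CARRY cleared by the mtspr above
    let mut ca: bool = true;    // CA set by the subfc above (1 = no borrow)
    for i in 0..vn.len() {
        // sv.mule: prod = vn[i]*qhat + carry; LO -> product limb, HI -> carry
        let prod = vn[i] as u128 * qhat as u128 + mul_carry as u128;
        mul_carry = (prod >> 64) as u64;
        // sv.subfe: un[i] = un[i] - LO(prod) - borrow, borrow chained via CA
        let (d, b1) = un[i].overflowing_sub(prod as u64);
        let (d, b2) = d.overflowing_sub(!ca as u64);
        un[i] = d;
        ca = !(b1 || b2); // CA = 1 means "no borrow out"
    }
    !ca // fixup needed exactly when the subtraction borrowed out of the top
}

the mul inner loop would be the same shape, just with an adde-style
add-with-carry chain in place of the subfe chain.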

>
> >the mul inner loop would be similar: a sv.mule followed by sv.adde.
> >
> >because of how it's defined, sv.mule can benefit from the same 256-bit
> >*
> >64-bit -> 320-bit multiplier optimization, also, because it only has
> >the
> >one output vector (unlike mulx), it can be much more easily fused with
> >sv.adde/sv.subfe if desired.
>
> niice.
>
> two output vectors is expensive in terms of register allocation.  a scalar
> accumulator is really elegant.
>
> ah.  i know. you can get both by having separate Scalar/Vector markers on
> RC, just like in sv.ldu
>
> * 2-bit EXTRA2, means 4 operands can be marked
> * EXTRA IDX0: d:RT - RT as destination
> * EXTRA IDX1: d:RC - RC as a destination
> * EXTRA IDX2: s:RA - RA as a source
> * EXTRA IDX3: s:RC - RC as a source
>

I never liked having separate EXTRA2/3 markers for both source and dest on
the same register field... it makes register allocation a pain -- I'm very
inclined to just have llvm always generate the exact same register for both.

imho letting RB be a vector is more important, since it allows using mule
for vectorized 128-bit arithmetic as a pair of 64-bit vectors, which Rust's
project-portable-simd wants to support but has disabled for now due to bugs
in llvm's ppc backend.

example of sv.mule with vector RB:

use std::hint::unreachable_unchecked;
use std::simd::prelude::*; // nightly portable-simd; u128x16 is hypothetical

/// calculates `(a * b + c) / d` using exact arithmetic
///
/// Safety: every element of the result must fit in a `u64` and
/// `d` must not have any zero elements
unsafe fn madd128div(a: u64x16, b: u64x16, c: u64x16, d: u64x16) -> u64x16 {
    // doesn't work yet: 128-bit elements (u128x16) are not currently
    // supported by std::simd, we hope to re-enable it at some point
    if d.simd_eq(Simd::splat(0)).any() {
        // tell llvm it doesn't have to care about div-by-zero
        unreachable_unchecked()
    }
    let mut v: u128x16 = a.cast();
    v *= b.cast();
    v += c.cast();
    v /= d.cast();
    if v.simd_gt(Simd::splat(u64::MAX as u128)).any() {
        // tell llvm it doesn't have to care about overflow
        unreachable_unchecked()
    }
    v.cast::<u64>()
}

assembly:
madd128div:
# assume pointer to return value is in r3
# assume pointers to a, b, c, and d are in r4, r5, r6, and r7 respectively
# assume none of r32..r95 or VL/MVL are callee-saved
setvli 16
sv.ld r32.v, (r4.s) # load a
sv.ld r48.v, (r5.s) # load b
sv.ld r64.v, (r6.s) # load c
sv.ld r80.v, (r7.s) # load d
sv.mule r32.v, r32.v, r48.v, r64.v # r32 is low half, r64 is high half
# use proposed 128-bit / 64-bit -> 64-bit div
sv.divu128_to_64 r32.v, r64.v, r32.v, r80.v # r32 = ((r64 << 64) + r32) / r80
sv.std r32.v, (r3.s) # store result
blr # return


One other thing we'd want is an unsigned * signed version of mule (muluse?)
for bigint multiplication by a signed 64-bit number: RB would be signed, and
RA and RT would be unsigned. I'd have to think about what RC would need to
be for the carrying to work correctly.
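
One possibility (just a sketch in scalar Rust, not a settled answer -- the
function names are made up): keep the carry (RC) *signed* and sign-extend it
into the 128-bit intermediate, so the final carry out becomes the correctly
signed top limb of the result.

/// hypothetical scalar model of a muluse-style limb step: unsigned limb
/// (RA) times signed scalar (RB), with a signed 64-bit carry (RC) in/out;
/// returns the unsigned low limb (RT)
fn muluse_step(a: u64, b: i64, carry: &mut i64) -> u64 {
    // widen: a zero-extends, b and carry sign-extend into i128
    let prod = a as i128 * b as i128 + *carry as i128;
    *carry = (prod >> 64) as i64; // arithmetic shift keeps the sign
    prod as u64 // low 64 bits
}

/// multiply an unsigned bigint (little-endian limbs) by a signed 64-bit
/// number; the final signed carry is the (two's-complement) top limb
fn bigint_mul_signed(a: &[u64], b: i64) -> Vec<u64> {
    let mut out = Vec::with_capacity(a.len() + 1);
    let mut carry = 0i64;
    for &limb in a {
        out.push(muluse_step(limb, b, &mut carry));
    }
    out.push(carry as u64);
    out
}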

Jacob

