[Libre-soc-dev] SVP64 Vectorised add-carry => big int add

Jacob Lifshay programmerjake at gmail.com
Tue Apr 19 20:12:05 BST 2022


On Tue, Apr 19, 2022, 04:43 lkcl <luke.leighton at gmail.com> wrote:

>
>
> On April 19, 2022 10:12:44 AM UTC, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >it occurred to me that, assuming we want less than 4-in 2-out,
>
> to stop the OPF ISA WG freaking out, yes.
>
> >mule's pseudocode:
> >mule RT, RA, RB:
> >prod = RA * RB + CARRY # 64-bit * 64-bit + 64-bit -> 128-bit
> >RT = LOW_HALF(prod)
> >CARRY = HIGH_HALF(prod)
>
> modified to RT, RA, RB, RC, i love it.
>
> [i explained already, SPRs add to context-switch cost and are not
> acceptable, GF being an exception because the number of operands needed
> otherwise becomes unworkable, plus it is the entire group of GF ops that
> need a modulo]
>
> rewriting as 4-arg:
>
> mule RT, RA, RB, RC
>      prod = RA*RB+RC
>      RT = LO(prod)
>      RC = HI(prod)
>
> interestingly by allowing *RC* to be Vectorised the possibilities open up
> even more.  in previous discussions on these MULADDs we agreed that RC, as
> an accumulator, was less appropriate to prioritise as a Vector register.
> however in the bigint math case it is the *multiplicand* (RB) that is the
> scalar.
>
>
> >the div inner loop would end up as:
> ># vn[] is in r32, qhat is in r3, un[] is in r64
> >li r0, 0
> >mtspr CARRY, r0 # clear carry for multiplication
> >subfc r0, r0, r0 # set CY for subtraction
> >sv.mule r96.v, r32.v, r3.s  # r96... = r32... * r3
> >sv.sube r64.v, r64.v, r96.v # r64... = r64... - r96...
>
> did you mean subfe here? (subfe: RT = ~RA + RB + CA)
>

yes, subfe a, b, c is sube a, c, b ... just like subf a, b, c is sub a, c, b

> how would the fixup condition be detected?


assuming the loop is for(i = 0; i <= n; i++), then the fixup condition is
simply !CY: the add-back step is needed exactly when the subtraction borrows
out of the top limb, which on Power shows up as CA = 0 at the end of the
subfe chain.

> can you add it to the c source, i'm not going to be able to cope with the
> carry-analysis over HI/LO boundaries
>

sure.
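
in the meantime, here's a rough scalar Rust model of what that pair of ops
computes per row (the function name is made up for illustration, and I'm
assuming vn[] carries an extra zero top limb so the n+1-limb product lines
up with the n+1 limbs of un[] being subtracted from -- this is my reading
of the pseudocode above, not the final C):

/// scalar model of the sv.mule / sv.subfe pair in the div inner loop;
/// returns true when the fixup (add-back) is needed, i.e. !CY at the end
fn mulsub_needs_fixup(un: &mut [u64], vn: &[u64], qhat: u64) -> bool {
    debug_assert_eq!(un.len(), vn.len());
    let mut mul_carry: u64 = 0; // CARRY cleared by the mtspr above
    let mut ca: bool = true;    // CA set by the subfc above (1 = no borrow)
    for i in 0..vn.len() {
        // sv.mule: prod = vn[i]*qhat + carry; LO -> product limb, HI -> carry
        let prod = vn[i] as u128 * qhat as u128 + mul_carry as u128;
        mul_carry = (prod >> 64) as u64;
        // sv.subfe: un[i] = un[i] - LO(prod) - borrow, borrow chained via CA
        let (d, b1) = un[i].overflowing_sub(prod as u64);
        let (d, b2) = d.overflowing_sub(!ca as u64);
        un[i] = d;
        ca = !(b1 || b2); // CA = 1 means "no borrow out"
    }
    !ca // fixup needed exactly when the subtraction borrowed out of the top
}

the mul inner loop would be the same shape, just with an adde-style
add-with-carry chain in place of the subfe chain.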

>
> >the mul inner loop would be similar: a sv.mule followed by sv.adde.
> >
> >because of how it's defined, sv.mule can benefit from the same 256-bit
> >*
> >64-bit -> 320-bit multiplier optimization, also, because it only has
> >the
> >one output vector (unlike mulx), it can be much more easily fused with
> >sv.adde/sv.subfe if desired.
>
> niice.
>
> two output vectors is expensive in terms of register allocation.  a scalar
> accumulator is really elegant.
>
> ah.  i know. you can get both by having separate Scalar/Vector markers on
> RC, just like in sv.ldu
>
> * 2-bit EXTRA2, means 4 operands can be marked
> * EXTRA IDX0: d:RT - RT as destination
> * EXTRA IDX1: d:RC - RC as a destination
> * EXTRA IDX2: s:RA - RA as a source
> * EXTRA IDX3: s:RC - RC as a source
>

I never liked having separate EXTRA2/3 markers for both source and dest on
the same register field... it makes register allocation a pain -- I'm very
inclined to just have llvm always generate the exact same register for both.

imho letting RB be a vector is more important, since it allows using mule
for vectorized 128-bit arithmetic as a pair of 64-bit vectors, which Rust's
project-portable-simd wants to support but has disabled for now due to bugs
in llvm's ppc backend.

example of sv.mule with vector RB:

use std::hint::unreachable_unchecked;
use std::simd::prelude::*; // nightly portable-simd; u128x16 is hypothetical

/// calculates `(a * b + c) / d` using exact arithmetic
///
/// Safety: every element of the result must fit in a `u64` and
/// `d` must not have any zero elements
unsafe fn madd128div(a: u64x16, b: u64x16, c: u64x16, d: u64x16) -> u64x16 {
    // doesn't work yet: 128-bit elements (u128x16) are not currently
    // supported by std::simd, we hope to re-enable it at some point
    if d.simd_eq(Simd::splat(0)).any() {
        // tell llvm it doesn't have to care about div-by-zero
        unreachable_unchecked()
    }
    let mut v: u128x16 = a.cast();
    v *= b.cast();
    v += c.cast();
    v /= d.cast();
    if v.simd_gt(Simd::splat(u64::MAX as u128)).any() {
        // tell llvm it doesn't have to care about overflow
        unreachable_unchecked()
    }
    v.cast::<u64>()
}

assembly:
madd128div:
# assume pointer to return value is in r3
# assume pointers to a, b, c, and d are in r4, r5, r6, and r7 respectively
# assume none of r32..r95 or VL/MVL are callee-saved
setvli 16
sv.ld r32.v, (r4.s) # load a
sv.ld r48.v, (r5.s) # load b
sv.ld r64.v, (r6.s) # load c
sv.ld r80.v, (r7.s) # load d
sv.mule r32.v, r32.v, r48.v, r64.v # r32 is low half, r64 is high half
# use proposed 128-bit / 64-bit -> 64-bit div
sv.divu128_to_64 r32.v, r64.v, r32.v, r80.v # r32 = ((r64 << 64) + r32) / r80
sv.std r32.v, (r3.s) # store result
blr # return


One other thing we'd want is an unsigned * signed version of mule (muluse?)
for bigint multiplication by a signed 64-bit number: RB would be signed, and
RA and RT would be unsigned. I'd have to think about what RC would need to
be for the carrying to work correctly.
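
One possibility (just a sketch in scalar Rust, not a settled answer -- the
function names are made up): keep the carry (RC) *signed* and sign-extend it
into the 128-bit intermediate, so the final carry out becomes the correctly
signed top limb of the result.

/// hypothetical scalar model of a muluse-style limb step: unsigned limb
/// (RA) times signed scalar (RB), with a signed 64-bit carry (RC) in/out;
/// returns the unsigned low limb (RT)
fn muluse_step(a: u64, b: i64, carry: &mut i64) -> u64 {
    // widen: a zero-extends, b and carry sign-extend into i128
    let prod = a as i128 * b as i128 + *carry as i128;
    *carry = (prod >> 64) as i64; // arithmetic shift keeps the sign
    prod as u64 // low 64 bits
}

/// multiply an unsigned bigint (little-endian limbs) by a signed 64-bit
/// number; the final signed carry is the (two's-complement) top limb
fn bigint_mul_signed(a: &[u64], b: i64) -> Vec<u64> {
    let mut out = Vec::with_capacity(a.len() + 1);
    let mut carry = 0i64;
    for &limb in a {
        out.push(muluse_step(limb, b, &mut carry));
    }
    out.push(carry as u64);
    out
}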

Jacob

