[Libre-soc-dev] SVP64 Vectorised add-carry => big int add

Sun Apr 17 06:06:54 BST 2022

On Fri, Apr 15, 2022, 22:54 lkcl <luke.leighton at gmail.com> wrote:

>
>
> On April 16, 2022 12:26:38 AM UTC, lkcl <luke.leighton at gmail.com> wrote:
>
> >            uint64_t v = (uint64_t)q[i] * d[j] + carry;
> >            carry = v >> 32;
> >            v = (uint32_t)v;
>
> rright. ok.  i have a bit more of a handle on this.
>
> both halves are needed, but normally in scalar mul you can do macro op
> fusion:
>
> * mullo r3, r10, r11
> * mulhi r4, r10, r11
>
> ==>
>
> * OP_MULLOHI r3&4, r10, r11
>
> when SVP64 Vectorised the element ops are split up unless actually doing
> the same fusion trick on the vector ops *before* putting into element
> execution.
>
> question is, is it worth adding a mulx?

maybe? a small vertical sv loop should work, but it might be better to have
a dedicated instruction which, *like carry look ahead* for sv.adde, would
allow some microarchitectures to have a 64x256->320-bit multiplier that
handles 4 mul-add-carry instructions at once. (that would normally be a
4-wide simd 64x64->128 multiplier, but the lanes can be pretty easily merged
at nearly no additional gate cost and still be dynamically split into
smaller multipliers, like the simd multiplier I wrote. according to quick
testing with yosys, 1 64x64->128 mul is 25494 cells and a 64x256->320 mul
is 99197 cells, so it is actually slightly smaller than 4x 64x64->128-bit
multipliers!)

> and if so, is it worth trying to overload OE=1 on say "sv.madd" rather than
> add a new opcode?
>

imho we probably want the new instruction: OE=1 is needed for overflow
detection, which is not to be confused with carry-out. overflow detection is
used quite a lot in Rust and also JavaScript -- JS engines probably use it to
detect when a number doesn't fit in i32 and needs f64 instead.

>
> (madd is RT=RA*RB+RC, maddo would be {RT,RT+1}=RA*RB+RC

definitely not...afaict that would break existing scalar code using maddo
if maddo was changed to write 2 regs instead of 1. maddo already exists
iirc (but with a slightly different mnemonic...maddldo? i'd have to check).

> and sv.maddo would be {RT,RT+VL}=RA*RB+RC)

if we were to add carrying functionality to madd, we'd want a 64-bit CARRY
spr so the instructions can have semantics like:
maddcarry RT, RA, RB, RC
# for big_a * word_b + big_c
# unsigned 64x64->128 mul-add-add
result <- RA * RB + RC + CARRY
CARRY <- HIGH_HALF(result)
RT <- LOW_HALF(result)
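
a quick C model of those maddcarry semantics, as a sketch only: it assumes a
compiler with the `unsigned __int128` extension (gcc/clang on 64-bit targets),
and the global `CARRY` variable and the function names are just stand-ins for
the proposed spr and instruction. the useful thing it shows is that the sum
can never overflow 128 bits: with every input at 2^64-1 the result is exactly
2^128-1.

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical stand-in for the proposed 64-bit CARRY spr */
static uint64_t CARRY;

/* model of: result <- RA * RB + RC + CARRY
 *           CARRY  <- HIGH_HALF(result)
 *           RT     <- LOW_HALF(result)   (returned) */
static uint64_t maddcarry(uint64_t ra, uint64_t rb, uint64_t rc)
{
    unsigned __int128 result = (unsigned __int128)ra * rb + rc + CARRY;
    CARRY = (uint64_t)(result >> 64);
    return (uint64_t)result;
}

/* worst case: all inputs are 2^64-1, so
 * (2^64-1)^2 + 2*(2^64-1) = 2^128-1 and both halves are all-ones */
static void check_maddcarry_worst_case(void)
{
    CARRY = UINT64_MAX;
    uint64_t rt = maddcarry(UINT64_MAX, UINT64_MAX, UINT64_MAX);
    assert(rt == UINT64_MAX);
    assert(CARRY == UINT64_MAX);
}
```

so a single 64-bit CARRY is enough to chain the instruction across limbs
without losing anything.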

we'd also want a msubcarry:
# for big_a * word_b - big_c
# + ~RC rather than - RC so carry is correctly handled
result <- RA * RB + ~RC + CARRY
CARRY <- HIGH_HALF(result)
RT <- LOW_HALF(result)
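
the same kind of C sketch for msubcarry, chained over several limbs. same
assumptions as before: `unsigned __int128` is a gcc/clang extension, and
`CARRY`, `msubcarry` and `big_msub` are hypothetical names for illustration.
as far as i can tell the chain works out if CARRY starts at 1: after each
limb CARRY holds the true (possibly -1) carry plus 1, which is exactly what
the next limb's `+ ~RC` needs to complete the two's complement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical stand-in for the proposed 64-bit CARRY spr */
static uint64_t CARRY;

/* model of: result <- RA * RB + ~RC + CARRY
 *           CARRY  <- HIGH_HALF(result)
 *           RT     <- LOW_HALF(result)   (returned)
 * ~rc is inverted as a 64-bit value before being widened */
static uint64_t msubcarry(uint64_t ra, uint64_t rb, uint64_t rc)
{
    unsigned __int128 result =
        (unsigned __int128)ra * rb + (uint64_t)~rc + CARRY;
    CARRY = (uint64_t)(result >> 64);
    return (uint64_t)result;
}

/* out = a * b - c over len limbs, least-significant limb first */
static void big_msub(uint64_t *out, const uint64_t *a, uint64_t b,
                     const uint64_t *c, size_t len)
{
    CARRY = 1; /* carry-in of 1, as for an ordinary subtract */
    for (size_t i = 0; i < len; i++)
        out[i] = msubcarry(a[i], b, c[i]);
}

/* (1 * 2^64) * 2 - 1 = 2^64 + (2^64 - 1) */
static void check_big_msub(void)
{
    const uint64_t a[2] = {0, 1}, c[2] = {1, 0};
    uint64_t out[2];
    big_msub(out, a, 2, c, 2);
    assert(out[0] == UINT64_MAX);
    assert(out[1] == 1);
}
```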

and a mrsubcarry (the one actually needed by bigint division):
# for big_c - big_a * word_b
# this expression is wrong, needs further thought
result <- RC + ~(RA * RB) + CARRY
CARRY <- HIGH_HALF(result)
RT <- LOW_HALF(result)

so the inner loop in the bigint division algorithm would end up being
(assuming n, d, and q all fit in registers):
li r3, 1 # carry in for subtraction
mtspr CARRY, r3 # init carry spr
setvl loop_count
sv.mrsubcarry rn.v, rd.v, rq.s, rn.v
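
for reference, this is roughly the scalar work that single vector instruction
would be standing in for: the usual multiply-and-subtract step of schoolbook
bigint division, written in C with 32-bit limbs like the snippet quoted at the
top. it is only a sketch of the conventional software loop, with a separate
multiply carry and subtract borrow, and is *not* a definition of mrsubcarry;
`mul_sub_limbs` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* n[0..len-1] -= d[0..len-1] * q, least-significant limb first.
 * returns the amount still to be subtracted from the next limb of n. */
static uint32_t mul_sub_limbs(uint32_t *n, const uint32_t *d,
                              size_t len, uint32_t q)
{
    uint32_t mul_carry = 0; /* high half of the previous product */
    uint32_t borrow = 0;    /* borrow out of the previous subtract */
    for (size_t i = 0; i < len; i++) {
        uint64_t p = (uint64_t)q * d[i] + mul_carry;
        mul_carry = (uint32_t)(p >> 32);
        uint64_t diff = (uint64_t)n[i] - (uint32_t)p - borrow;
        n[i] = (uint32_t)diff;
        borrow = (uint32_t)(diff >> 63); /* 1 if diff wrapped below zero */
    }
    return mul_carry + borrow;
}

/* (10 * 2^32 + 0) - 2 * (2 * 2^32 + 3) = 5 * 2^32 + (2^32 - 6) */
static void check_mul_sub_limbs(void)
{
    uint32_t n[2] = {0, 10};
    const uint32_t d[2] = {3, 2};
    uint32_t rest = mul_sub_limbs(n, d, 2, 2);
    assert(n[0] == 0xFFFFFFFAu);
    assert(n[1] == 5);
    assert(rest == 0);
}
```

the two loop-carried values (mul_carry and borrow) are what a single CARRY spr
would have to represent between elements.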

Jacob