[Libre-soc-dev] SVP64 Vectorised add-carry => big int add

Tue Apr 19 11:12:44 BST 2022

it occurred to me that, if we want fewer operands than 4-in 2-out, we could
have a 3-in 2-out instruction that is just bigint times word, instead of
bigint times word plus bigint.

so, what we'd end up with:
bigint + bigint -> bigint:
sv.adde

bigint - bigint -> bigint:
sv.subfe

bigint * word -> bigint:
sv.mule

mule's pseudocode:
mule RT, RA, RB:
prod = RA * RB + CARRY # 64-bit * 64-bit + 64-bit -> 128-bit
RT = LOW_HALF(prod)
CARRY = HIGH_HALF(prod)
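
a minimal python sketch of the pseudocode above, assuming 64-bit registers and
that CARRY is a full 64-bit value. the names mule, sv_mule and MASK64 are only
for illustration; sv_mule models the vector * scalar form, with each element's
CARRY out feeding the next element.

```python
MASK64 = (1 << 64) - 1

def mule(ra, rb, carry):
    """Scalar model: returns (RT, new CARRY)."""
    prod = ra * rb + carry           # 64-bit * 64-bit + 64-bit fits in 128 bits
    return prod & MASK64, prod >> 64  # (LOW_HALF, HIGH_HALF)

def sv_mule(va, rb, carry=0):
    """Vector * scalar model: CARRY chains from element i to element i+1."""
    out = []
    for ra in va:
        rt, carry = mule(ra, rb, carry)
        out.append(rt)
    return out, carry
```

the worst case (all three inputs 2**64 - 1) gives exactly (2**64 - 1) * 2**64,
so the result never overflows 128 bits.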

the div inner loop would end up as:
# vn[] is in r32, qhat is in r3, un[] is in r64
li r0, 0
mtspr CARRY, r0 # clear carry for multiplication
subfc r0, r0, r0 # set CA = 1 (no borrow) for subtraction
sv.mule r96.v, r32.v, r3.s  # r96... = r32... * r3
sv.subfe r64.v, r96.v, r64.v # r64... = r64... - r96...
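
a rough python model of that multiply-and-subtract step, assuming 64-bit limbs
stored least-significant first and Power ISA subfe semantics
(RT = ~RA + RB + CA) for the subtraction. the helper names (sv_mule, sv_subfe,
mul_sub_step) are made up for this sketch. it also returns the final CARRY and
CA, since the top limb of the product is left in CARRY and still has to be
handled by the caller.

```python
MASK64 = (1 << 64) - 1

def sv_mule(va, rb, carry=0):
    """Vector * scalar multiply, CARRY chained across elements."""
    out = []
    for ra in va:
        prod = ra * rb + carry
        out.append(prod & MASK64)
        carry = prod >> 64
    return out, carry

def sv_subfe(va, vb, ca=1):
    """Per element RT = ~RA + RB + CA, i.e. vb - va with CA chained."""
    out = []
    for ra, rb in zip(va, vb):
        s = (~ra & MASK64) + rb + ca
        out.append(s & MASK64)
        ca = s >> 64
    return out, ca

def mul_sub_step(un, vn, qhat):
    """un - qhat * vn over len(vn) limbs; returns (diff, CARRY, CA)."""
    prod, carry = sv_mule(vn, qhat, 0)  # CARRY cleared before the multiply
    diff, ca = sv_subfe(prod, un, 1)    # CA = 1 means "no borrow" on entry
    return diff, carry, ca
```

CA = 0 on exit means the low limbs borrowed, so the caller has to subtract
both that borrow and the leftover CARRY from the next limb of un.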

the mul inner loop would be similar: a sv.mule followed by sv.adde.
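
as a sketch of what that could look like, here is a python model of one row of
a schoolbook multiply, acc += a * word, assuming 64-bit limbs stored
least-significant first. sv_mule, sv_adde and mul_add_step are illustrative
names, not part of any spec.

```python
MASK64 = (1 << 64) - 1

def sv_mule(va, rb, carry=0):
    """Vector * scalar multiply, CARRY chained across elements."""
    out = []
    for ra in va:
        prod = ra * rb + carry
        out.append(prod & MASK64)
        carry = prod >> 64
    return out, carry

def sv_adde(va, vb, ca=0):
    """Per element RT = RA + RB + CA, with CA chained across elements."""
    out = []
    for ra, rb in zip(va, vb):
        s = ra + rb + ca
        out.append(s & MASK64)
        ca = s >> 64
    return out, ca

def mul_add_step(acc, a, word):
    """acc + a * word; returns (low limbs, top limb)."""
    prod, carry = sv_mule(a, word, 0)
    total, ca = sv_adde(acc, prod, 0)
    return total, carry + ca  # carry <= 2**64 - 2, so adding ca cannot overflow
```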

because of how it's defined, sv.mule can benefit from the same 256-bit *
64-bit -> 320-bit multiplier optimization. also, because it has only the
one output vector (unlike mulx), it can be fused with sv.adde/sv.subfe
much more easily, if desired.

Jacob