lkcl luke.leighton at gmail.com
Sun Apr 17 18:26:30 BST 2022

```On Sun, Apr 17, 2022 at 2:08 PM lkcl <luke.leighton at gmail.com> wrote:

> see end of:
>
> https://libre-soc.org/openpower/sv/bitmanip/appendix/
>
> how about storing the 128-bit mul-add in a *pair* of vectors, 3-in 2-out just like DCT/FFT: RT and RT+VL
>
> a second followup instruction can perform the carry-adds with corrections.

could i ask people to check the math here? i wrote it out at:
https://libre-soc.org/openpower/sv/bitmanip/appendix/

i started from this:

# for big_c - big_a * word_b
result <- RC + ~(RA * RB) + CARRY
result_high <- HIGH_HALF(result)
if CARRY <= 1 then # unsigned comparison
    result_high <- result_high + 1
end
CARRY <- result_high
RT <- LOW_HALF(result)

and, assuming the above is inserted into a SVP64 Vector for-loop,
performed a code-morph where {result} is separated out
into its own SVP64 Vector for-loop, storing a *pair* of 64-bit
result vectors into {RT} and {RS=RT+VL}

i then noted that

    result <- RC + ~(RA * RB) + CARRY
=>  result <- RC + ~(RA * RB) + 1 - 1 + CARRY
=>  result <- RC - (RA * RB) + CARRY - 1
=>  product <- RC - (RA * RB) and
    result <- product + CARRY - 1
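
# a minimal python sketch for checking the rewrite above, assuming 64-bit
# RA/RB/RC/CARRY and a 128-bit {result} that wraps modulo 2**128.
# the helper names are illustrative only, they are not taken from the appendix.
import random

MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def result_original(ra, rb, rc, carry):
    # result <- RC + ~(RA * RB) + CARRY, with ~ as a 128-bit bitwise NOT
    return (rc + (~(ra * rb) & MASK128) + carry) & MASK128

def result_split(ra, rb, rc, carry):
    # first the plain mul-subtract, then the CARRY - 1 correction on top of it
    product = (rc - ra * rb) & MASK128
    return (product + carry - 1) & MASK128

rng = random.Random(0)  # fixed seed so the check is repeatable
samples = [(0, 0, 0, 0), (MASK64, MASK64, MASK64, MASK64)]
samples += [tuple(rng.getrandbits(64) for _ in range(4)) for _ in range(1000)]

# collect every input where the two forms disagree; an empty list means the
# two forms agreed on all sampled inputs (a spot check, not a proof)
mismatches = [s for s in samples if result_original(*s) != result_split(*s)]
print(len(mismatches))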

thus, all {products} can be separated into a standard mul-subtract
where top and bottom half of {products} are split into vectors
starting at {RT} and {RT+VL} - aka {RT} and {RS}

prod[0:127] = (RA) * (RB)
sub[0:127] = EXTZ(RC) - prod
RT <- sub[64:127]
RS <- sub[0:63]
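
# a hedged python sketch of the split above, assuming Power ISA MSB0 bit
# numbering (so sub[64:127] is the low 64 bits and sub[0:63] the high 64 bits).
# msub_pair is a made-up name for illustration, not an instruction mnemonic.
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def msub_pair(ra, rb, rc):
    prod = ra * rb               # prod[0:127] = (RA) * (RB)
    sub = (rc - prod) & MASK128  # sub[0:127] = EXTZ(RC) - prod, wrapping at 128 bits
    rt = sub & MASK64            # RT <- sub[64:127], the low half
    rs = sub >> 64               # RS <- sub[0:63], the high half
    return rt, rs

# e.g. RC=5, RA=2, RB=3: sub is -1 in 128 bits, so both halves are all-ones
rt, rs = msub_pair(2, 3, 5)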

a *second* instruction, slightly modified from jacob's original
to now include the "+1", performs the very-weird adds

cat[0:127] = (RB) || (RS)
sum[0:127] = cat + EXTZ(RA) + [1]*128
rhi[0:63] = sum[0:63]
if (RA) <= 1 then rhi = rhi + ([0]*63 || 1)
RA = rhi
RT = sum[64:127]

where this one uses (RA) as the CARRY from jacob's original, where
RA is an input *and* implicit output (like LD-ST-with-update), and
some minor weirdness has to be done on the register numbering
to use the intermediate results correctly

# RS=RT+VL, assume VL=8, therefore RS starts at r8.v
# q       : r16
# dividend: r24.v
# divisor : r32.v
# carry   : r40
li r40, 0
sv.msubx r0.v, r16, r24.v, r32.v