lkcl luke.leighton at gmail.com
Tue Apr 19 12:43:23 BST 2022

```
On April 19, 2022 10:12:44 AM UTC, Jacob Lifshay <programmerjake at gmail.com> wrote:
>it occurred to me that, assuming we want less than 4-in 2-out,

to stop the OPF ISA WG freaking out, yes.

>mule's pseudocode:
>mule RT, RA, RB:
>prod = RA * RB + CARRY # 64-bit * 64-bit + 64-bit -> 128-bit
>RT = LOW_HALF(prod)
>CARRY = HIGH_HALF(prod)

modified to RT, RA, RB, RC, i love it.

[i explained already: SPRs add to context-switch state and are not acceptable. GF is an exception because the number of operands needed otherwise becomes unworkable, plus it is the entire group of GF ops that need a modulo]

rewriting as 4-arg:

mule RT, RA, RB, RC
prod = RA*RB+RC
RT = LO(prod)
RC = HI(prod)

interestingly by allowing *RC* to be Vectorised the possibilities open up even more.  in previous discussions on these MULADDs we agreed that RC, as an accumulator, was less appropriate to prioritise as a Vector register.  however in the bigint math case it is the *multiplicand* (RB) that is the scalar.

>the div inner loop would end up as:
># vn[] is in r32, qhat is in r3, un[] is in r64
>li r0, 0
>mtspr CARRY, r0 # clear carry for multiplication
>subfc r0, r0, r0 # set CY for subtraction
>sv.mule r96.v, r32.v, r3.s  # r96... = r32... * r3
>sv.sube r64.v, r64.v, r96.v # r64... = r64... - r96...

did you mean subfe here? (subfe: RT = ~RA + RB + CA)

or did you mean the 3-in 2-out subxd?
(see https://libre-soc.org/openpower/sv/bitmanip/appendix/)

**subxd RT, RA, RB** (RS=RB+VL for SVP64, RS=RB+1 for scalar)

cat[0:127] = (RS) || (RB)
sum[0:127] = cat - EXTS(RA)
RA = ~sum[0:63] + 1
RT = sum[64:127]

if you meant subfe that's *great* because it is standard 2-in 1-out (plus carry in/out)

how would the fixup condition be detected?  can you add it to the c source? i'm not going to be able to cope with the carry-analysis over HI/LO boundaries

>the mul inner loop would be similar: a sv.mule followed by sv.adde.
>
>because of how it's defined, sv.mule can benefit from the same 256-bit *
>64-bit -> 320-bit multiplier optimization, also, because it only has the
>one output vector (unlike mulx), it can be much more easily fused with

niice.

two output vectors are expensive in terms of register allocation.  a scalar accumulator is really elegant.

ah.  i know. you can get both by having separate Scalar/Vector markers on RC, just like in sv.ldu

* 2-bit EXTRA2, means 4 operands can be marked
* EXTRA IDX0: d:RT - RT as destination
* EXTRA IDX1: d:RC - RC as a destination
* EXTRA IDX2: s:RA - RA as a source
* EXTRA IDX3: s:RC - RC as a source

RB would always be a scalar (and always r0-r31)

fascinatingly this provides the functionality of *both* mule *and* mulx.

mule:

* d:RT=v
* d:RC=s
* s:RA=v
* s:RC=s

mulx:

* d:RT=v
* d:RC=v
* s:RA=v
* s:RC=v with a different EXTRA2 bit to target an alternative location from d:RC.

btw i am starting to wonder if this is what is in Power ISA 3.1 MMA.

l.

```
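
a minimal python sketch of the 4-arg mule pseudocode above, plus a carry-chained subfe for the multiply-and-subtract step of the div inner loop. this is an illustrative model only, not the Libre-SOC simulator: the function names (`mule`, `sv_mule`, `sv_subfe`, `to_limbs`, `from_limbs`) are made up here, limbs are assumed to be 64-bit with element 0 least significant, RC is assumed to be a scalar chained from one element to the next, and the subtract is assumed to be subfe (RT = ~RA + RB + CA) rather than subxd.

```python
MASK64 = (1 << 64) - 1

def mule(ra, rb, rc):
    """4-arg mule model: prod = RA*RB + RC, returns (RT, new RC)."""
    prod = ra * rb + rc              # 64x64 + 64 always fits in 128 bits
    return prod & MASK64, prod >> 64 # RT = LO(prod), RC = HI(prod)

def sv_mule(vec_ra, rb, rc=0):
    """Vector RA, scalar RB, scalar RC carried between elements."""
    out = []
    for ra in vec_ra:
        rt, rc = mule(ra, rb, rc)
        out.append(rt)
    return out, rc                   # final RC is the top limb

def sv_subfe(vec_ra, vec_rb, ca=1):
    """Element-wise RT = ~RA + RB + CA (RB - RA), CA chained."""
    out = []
    for ra, rb in zip(vec_ra, vec_rb):
        s = (~ra & MASK64) + rb + ca
        out.append(s & MASK64)
        ca = s >> 64                 # CA=1 on exit means no borrow
    return out, ca

def to_limbs(x, n):
    """Split x into n 64-bit limbs, least significant first."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]

def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(l << (64 * i) for i, l in enumerate(limbs))
```

usage: `sv_mule(to_limbs(vn, 4), qhat)` gives the product limbs plus the high limb left in RC, and `sv_subfe(prod_limbs, to_limbs(un, 4))` then gives un - vn*qhat with CA=0 on exit signalling a borrow out of the top limb.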