lkcl luke.leighton at gmail.com
Mon Apr 18 23:30:20 BST 2022

```
On April 18, 2022 6:27:22 PM UTC, Jacob Lifshay <programmerjake at gmail.com> wrote:
>On Mon, Apr 18, 2022, 09:53 lkcl <luke.leighton at gmail.com> wrote:
>
>> but then how to calculate t?  i have no idea, i need help here.
>
>
>I changed the code to have the original loop, a sub-mul-borrow loop
>(basically what the original loop does, but with different var names
>and
>the vn[j + n] stuff converted to a final iteration), and the mrsubcarry
>algorithm I gave.

brilliant, thank you.  the "fixup" variable also helped.

you saw i added variants which split into two 3-in 2-out instructions, they are quite fascinating esp. when compared to their mul-add counterparts (below)

>afaict mrsubcarry is likely the most efficient in hardware, as
>explained
>earlier.

well, that can go in an implementation note. an ISA needs to be implementable by a variety of designers. 6 operands is just too much, for multiple reasons.

i added a big-mul an hour ago, again, Knuth Algorithm M, and using a mul-add-twin (mulx) it came out as:

for (i = 0; i < m; i++) {
unsigned product = u[i]*v[j] + w[i + j];
phi[i] = product>>16;
plo[i] = product;
}
for (i = 0; i < m; i++) {
t = (phi[i]<<16) | plo[i] + k;
w[i + j] = t;          // (I.e., t & 0xFFFF).
k = t >> 16;
}

(the original was 16bit, it is obvious how to expand to 32 or 64)

what is fascinating is, the borrow version you negate all additions, including k. (k=-(t>>16)) and it is exactly the same

that makes for a symmetrical pattern which makes the case for 3in/2out stronger when presenting to the ISA WG: "these near-identical instructions come as a set".

i can also see this being used with RA as a Vector:

addxd RT, RA, RB (RS=RB+VL for SVP64, RS=RB+1 for scalar)

cat[0:127] = (RS) || (RB)
sum[0:127] = cat + EXTZ(RA)
RA = sum[0:63]
RT = sum[64:127]

also, the subxd if Rc=1 tests RA:

subxd RT, RA, RB (RS=RB+VL for SVP64, RS=RB+1 for scalar)

cat[0:127] = (RS) || (RB)
sum[0:127] = cat - EXTS(RA)
RA = ~sum[0:63] + 1
RT = sum[64:127]

Rc=1 on RA (not RT), that is basically the fixup test (if RA != 0)

when expressed in this way it is intuitively obvious: "oh, the carry is nonzero on the last element, no surprise something has to be done"

whereas with the RSUB version "fixup = carry !=1" i have absolutely no clue why that is not a test against zero.

the other thing, sigh, that has to be taken into consideration: how to do MASSIVE numbers (524288 digits), which will involve VL loops.  i *think* it's not a problem, carry simply, well, carries over each time round len/VL iterations.

l.

```