[Libre-soc-dev] SVP64 Vectorised add-carry => big int add

Sun Apr 17 11:15:05 BST 2022

On April 17, 2022 5:06:54 AM UTC, Jacob Lifshay <programmerjake at gmail.com> wrote:
>On Fri, Apr 15, 2022, 22:54 lkcl <luke.leighton at gmail.com> wrote:
>
>>
>>
>> On April 16, 2022 12:26:38 AM UTC, lkcl <luke.leighton at gmail.com> wrote:
>>
>> >            uint64_t v = (uint64_t)q[i] * d[j] + carry;
>> >            carry = v >> 32;
>> >            v = (uint32_t)v;
>>
>> rright. ok.  i have a bit more of a handle on this.
>>
>> both halves are needed, but normally in scalar mul you can do macro op
>> fusion:
>>
>> * mullo r3, r10, r11
>> * mulhi r4, r10, r11
>>
>> ==>
>>
>> * OP_MULLOHI r3&4, r10, r11
>>
>> when SVP64 Vectorised the element ops are split up unless actually doing
>> the same fusion trick on the vector ops *before* putting into element
>> execution.
>>
>> question is, is it worth adding a mulx?
>
>
>maybe? a small vertical sv loop should work,

yes, vertical would be great, macroop fusion would kick in fine there.
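
As a rough illustration of why both halves of each product are needed, here is a minimal C sketch of a big-integer times single-word multiply, using 32-bit digits as in the quoted snippet. The function name `bigmul_word` and the little-endian digit layout are assumptions made for this example, not code from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out[0..n] = a[0..n-1] * b, little-endian 32-bit digits. */
void bigmul_word(uint32_t *out, const uint32_t *a, size_t n, uint32_t b)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 32x32->64 product plus incoming carry; this cannot overflow 64 bits */
        uint64_t v = (uint64_t)a[i] * b + carry;
        out[i] = (uint32_t)v;          /* low half goes to the result digit */
        carry = (uint32_t)(v >> 32);   /* high half feeds the next digit */
    }
    out[n] = carry;
}
```

Each iteration consumes the low and the high half of the same product, which is the pair a fused mullo/mulhi would deliver in one operation.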

> but it might be better to have
>a dedicated instruction which, *like carry look ahead* for sv.adde, would
>allow some microarchitectures to have a 64x256->320-bit multiplier that
>handles 4 mul-add-carry instructions at once (it would normally be a
>4-wide simd 64x64->128 multiplier, but they can be pretty easily merged at
>nearly no additional gate cost and be dynamically split into
>smaller multipliers like the simd multiplier I wrote). according to quick
>testing with yosys, 1 64x64->128 mul is 25494 cells and a 64x256->320 mul
>is 99197 cells, so it is actually slightly smaller than 4x 64x64->128-bit
>multipliers!

ooOo :)
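
To make the 64x256->320 idea concrete, the following C sketch models it in software as four chained 64x64->128 multiply-add-carry steps. It only shows the arithmetic a wide multiplier would have to reproduce, not how the hardware would be built. The names `mul64x64` and `mul_64x256` are hypothetical, and unsigned operands are assumed.

```c
#include <stdint.h>

/* Portable 64x64->128 multiply: returns the low 64 bits, stores the high 64 in *hi. */
static uint64_t mul64x64(uint64_t a, uint64_t b, uint64_t *hi)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    /* sum of three 32-bit quantities: fits easily in 64 bits */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)p0;
}

/* out[0..4] = a * b[0..3], little-endian 64-bit words (a 320-bit result). */
void mul_64x256(uint64_t out[5], uint64_t a, const uint64_t b[4])
{
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t hi;
        uint64_t lo = mul64x64(a, b[i], &hi);
        lo += carry;
        hi += (lo < carry);   /* propagate the carry; hi is at most 2^64-2, so no wrap */
        out[i] = lo;
        carry = hi;
    }
    out[4] = carry;
}
```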

>> and if so, is it worth trying to overload OE=1 on say "sv.madd" rather than
>> add a new opcode?
>>
>
>imho we probably want the new instruction, OE=1 is needed for overflow
>detection, not to be confused with carry-out.

ok.  this is all starting to make sense

> overflow detection is used
>quite a lot in Rust and also JavaScript -- JS engines probably use it to
>detect when numbers don't fit in i32 and it needs f64 instead.

i took a look again at https://libre-soc.org/openpower/isa/fixedarith/

that would explain why mulld has OE=1 and Rc=1 where mulhd just has Rc=1.
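
The difference between carry-out and signed overflow can be sketched in C as below. This is an illustrative model on 32-bit values. The function names are hypothetical, and the mapping to XER.CA and XER.OV is only by analogy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Carry-out: the unsigned sum did not fit in 32 bits (analogous to CA). */
bool add_u32_carry_out(uint32_t a, uint32_t b, uint32_t *sum)
{
    *sum = a + b;          /* wraps modulo 2^32 */
    return *sum < a;
}

/* Signed overflow: the signed sum is outside the int32_t range (analogous to OV). */
bool add_i32_overflows(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;
    return wide < INT32_MIN || wide > INT32_MAX;
}

/* Signed multiply overflow: the kind of check a JS engine might make
   before falling back from i32 to f64. */
bool mul_i32_overflows(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return wide < INT32_MIN || wide > INT32_MAX;
}
```

Adding 0xFFFFFFFF and 1 produces a carry-out but no signed overflow (it is -1 + 1), while adding 0x7FFFFFFF and 1 produces a signed overflow but no carry-out, so the two conditions cannot share one flag.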

>>
>> (madd is RT=RA*RB+RC, maddo would be {RT,RT+1}=RA*RB+RC
>
>
>definitely not...afaict that would break existing scalar code using maddo
>if maddo was changed to write 2 regs instead of 1. maddo already exists
>iirc (but with a slightly different mnemonic...maddldo? i'd have to check).

it doesn't

    Multiply-Add Low Doubleword
    VA-Form

    maddld RT,RA,RB,RC

    Pseudo-code:

    prod[0:(XLEN*2)-1] <- MULS((RA), (RB))
    sum[0:(XLEN*2)-1] <- prod + EXTS(RC)
    RT <- sum[XLEN:(XLEN*2)-1]

    Special Registers Altered: None

as a result there's no Rc=1 or OE=1 variant. the bits are available iirc, let's check...

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/fields.text;h=d4b5075f2b3c16252c6686163c0147d2546e1971;hb=a2a3dfad9e681563d5f44116ed2bfd6e7fc1f9fe#l199

 197 # 1.6.21 VA-FORM
 198    |0      |6     |11     |16     |21|22 |26   |31|
 199    | PO    |  RT  |   RA  |   RB  |   RC |   XO   |

hmmm... nope, checked the PDF, PO=EXT04, bits 26-31 are used up with XO.  they're expensive operations.

basically, those operations are so expensive in terms of bits used that i'd be very reluctant to go down the path of taking up additional EXT04 space for example, given that EXT04 is used heavily for Packed SIMD.
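
The VA-Form layout above can be checked with a small field-extraction sketch in C. The struct and function names are hypothetical. Bit numbering follows the Power ISA convention used in the table, where bit 0 is the most significant bit of the 32-bit word.

```c
#include <stdint.h>

/* Hypothetical container for the six VA-Form fields. */
typedef struct {
    unsigned po, rt, ra, rb, rc, xo;
} va_form;

/* Extract bits first..last (inclusive), with bit 0 being the MSB. */
static uint32_t bits(uint32_t insn, unsigned first, unsigned last)
{
    unsigned width = last - first + 1;
    return (insn >> (31 - last)) & ((1u << width) - 1u);
}

va_form decode_va(uint32_t insn)
{
    va_form f;
    f.po = bits(insn, 0, 5);    /* primary opcode, 6 bits */
    f.rt = bits(insn, 6, 10);   /* 5 bits */
    f.ra = bits(insn, 11, 15);  /* 5 bits */
    f.rb = bits(insn, 16, 20);  /* 5 bits */
    f.rc = bits(insn, 21, 25);  /* 5 bits */
    f.xo = bits(insn, 26, 31);  /* extended opcode, 6 bits */
    return f;
}
```

The field widths add up to 6+5+5+5+5+6 = 32, so four register fields leave no spare bit for OE or Rc.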

mul without add on the other hand are all XO-Form and all PO=EXT31:

 167 # 1.6.16 XO-FORM
 168    |0     |6   |11   |16     |21 |22    |31  |
 169    | PO   |  RT|   RA|   RB  |OE |   XO |Rc  |
 170    | PO   |  RT|   RA|   RB  |  /|   XO |Rc  |
 171    | PO   |  RT|   RA|   RB  |  /|   XO |  / |

with XO being 10 bit there is plenty of space in EXT31 to drop a few more opcodes in it [mindful that OPF ISA WG approval is the final word here].

so, this is pointing towards a mulx, targeting 2 regs. will write it up.
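
For comparison, here is a C sketch of what a two-register result would add over maddld. The exact semantics of the proposed instruction are not defined in this message, so `mul_add_2reg` is purely a hypothetical model: it assumes unsigned operands and an added third operand, as in the {RT,RT+1}=RA*RB+RC idea quoted above. It relies on `unsigned __int128`, a GCC/Clang extension.

```c
#include <stdint.h>

/* maddld as it exists: only the low 64 bits of RA*RB+RC are kept.
   Unsigned wrap-around yields the same low bits as the signed pseudo-code. */
uint64_t maddld_low(uint64_t ra, uint64_t rb, uint64_t rc)
{
    return ra * rb + rc;
}

/* Hypothetical register pair: lo would land in RT, hi in RT+1 (assumption). */
typedef struct {
    uint64_t lo, hi;
} reg_pair;

/* Hypothetical two-register multiply-add; not a defined instruction. */
reg_pair mul_add_2reg(uint64_t ra, uint64_t rb, uint64_t rc)
{
    /* (2^64-1)^2 + (2^64-1) = 2^128 - 2^64, so the sum always fits in 128 bits */
    unsigned __int128 wide = (unsigned __int128)ra * rb + rc;
    reg_pair r = { (uint64_t)wide, (uint64_t)(wide >> 64) };
    return r;
}
```

Because the full sum can never exceed 128 bits, the high word can be fed straight back in as the next step's addend without a separate carry flag.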

l.