[Libre-soc-dev] SVP64 Vectorised add-carry => big int add

Jacob Lifshay programmerjake at gmail.com
Wed Apr 13 05:25:17 BST 2022


On Tue, Apr 12, 2022, 20:43 lkcl <luke.leighton at gmail.com> wrote:

> On Wed, Apr 13, 2022 at 2:25 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > sorry, that doesn't work with div and mul, because at the low-size
> > end of the hierarchy of multiplication/division algorithms,
> > the algorithms are O(n^2) operations,
>
> ok yes one of those being because there are twice the number
> of rows, the other being twice the number of columns.
>
> briefly, because it's currently 4:30am here, there's a couple things:
>
> 1) if the hardware is the same then that means that twice the number
>     of 32-bit ops can be issued to the same Dynamic-SIMD back-end
>     which gets one of the factors of 2x back.
>
>     (i.e. you issue 2x32-bit ops to the 64-bit-wide SIMD-back-end
>     ALUs where in the 64-bit case you can only issue 1x 64-bit
>     wide to the *same* hardware. total number of bits per clock
>     cycle is exactly the same)
>

yup, that's exactly what I was describing.
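To make the arithmetic above concrete, here's a hedged sketch (not Libre-SOC code) of schoolbook bignum multiplication over fixed-width limbs, counting partial products. It shows why halving the limb width doubles both the rows and the columns, 4x the multiplies in total, of which issuing 2x 32-bit ops per cycle to the same-width SIMD back-end wins one factor of 2 back:

```python
# Illustrative sketch: schoolbook O(n^2) multiply over limbs of a
# chosen width, counting the number of elementary multiplies needed.

def to_limbs(x, limb_bits, n):
    mask = (1 << limb_bits) - 1
    return [(x >> (i * limb_bits)) & mask for i in range(n)]

def from_limbs(limbs, limb_bits):
    return sum(l << (i * limb_bits) for i, l in enumerate(limbs))

def schoolbook_mul(a, b, limb_bits):
    """Multiply two equal-length limb vectors; returns (limbs, mul count)."""
    mask = (1 << limb_bits) - 1
    n = len(a)
    result = [0] * (2 * n)
    muls = 0
    for i in range(n):
        carry = 0
        for j in range(n):
            muls += 1
            acc = result[i + j] + a[i] * b[j] + carry
            result[i + j] = acc & mask
            carry = acc >> limb_bits
        result[i + n] = carry
    return result, muls

# 512-bit operands: 8x 64-bit limbs vs 16x 32-bit limbs
x, y = (1 << 511) - 12345, (1 << 509) + 67891
r64, muls64 = schoolbook_mul(to_limbs(x, 64, 8), to_limbs(y, 64, 8), 64)
r32, muls32 = schoolbook_mul(to_limbs(x, 32, 16), to_limbs(y, 32, 16), 32)
print(muls64, muls32)  # 64 vs 256: twice the rows and twice the columns
```

With 64-bit limbs the 512x512 product takes 8x8 = 64 elementary multiplies; with 32-bit limbs it takes 16x16 = 256, i.e. 4x as many, matching the "twice the rows, twice the columns" point above.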

>
> 2) if the hardware for 64-bit MUL is implemented as an FSM


I'm assuming 64-bit mul is implemented using a pipelined 64x64->128-bit
multiplier, since that is the general assumption on fast cpus.

using
>     a) 4 clock cycles (by having only a single 32-bit mul block, using
>         it 4x and doing adds behind the scenes)  OR
>     b) 2 clock cycles (by having a pair of 32-bit mul blocks, using it
>         2x and doing adds behind the scenes) THEN
>     c) the amount of gates is the same AND
>     d) actually the 512x512 mul using such hardware would actually
>         be on-par or 2x slower (!)
>

it's true that it could be on par, but I think "2x slower" for using
64-bit ops instead of 32-bit, on hardware with an FSM over a 32x32->64-bit
multiplier, is incorrect: you're forgetting that 64-bit ops would
drastically reduce the total instruction count, which reduces the time
needed whenever we're not bottlenecked on the integer mul FSM.
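A small back-of-envelope sketch of that cycle accounting, under the assumptions quoted above (a 64-bit mul taking 4 cycles on an FSM wrapping a single 32x32->64 block, versus native 32-bit muls at 1 cycle each on the same block). The numbers are illustrative, not measurements of any real core:

```python
# Hypothetical cycle accounting for a 512x512-bit schoolbook multiply:
# 64-bit limbs with a 4-cycle FSM mul, vs 32-bit limbs at 1 cycle each.
OPERAND_BITS = 512

def mul_block_cycles(limb_bits, fsm_cycles_per_mul):
    n = OPERAND_BITS // limb_bits      # limbs per operand
    partial_products = n * n           # schoolbook is O(n^2)
    return partial_products, partial_products * fsm_cycles_per_mul

pp64, cyc64 = mul_block_cycles(64, 4)  # 64 muls x 4 cycles
pp32, cyc32 = mul_block_cycles(32, 1)  # 256 muls x 1 cycle
print(pp64, cyc64, pp32, cyc32)
```

Both come out at 256 raw multiplier-block cycles, i.e. on par in mul-block time, but the 64-bit version issues 4x fewer mul instructions (64 vs 256), so the fetch/issue/add overhead is much lower when the FSM isn't the bottleneck.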

>
> it's quite clear from what Konstantinos said about mulx on x86
> processors from 2012 that it is implemented in micro-code as
> only a 32-bit multiplier in hardware plus a batch of 64-bit adds,
> which would account for those 4-10 cycles of latency.
>

actually, afaict Konstantinos was simply wrong about mulx's latency.
according to Agner Fog's instruction tables (the de-facto reference on
x86 instruction timings):
https://www.agner.org/optimize/instruction_tables.pdf

on Skylake (iirc one of the cpu archs he mentioned) the latency for 64-bit
reg-reg mulx is actually 4 cycles with a throughput of 1 per cycle (i.e. a
4-stage pipeline). that's 1 cycle more than the standard x86 64x64->128-bit
multiply `mul reg`, which has a 3-cycle latency and a throughput of 1 per
cycle.

it turns out mulx's latency and throughput are identical on Haswell, afaict
the first Intel cpu to implement mulx.

mulx is similar on AMD Zen1-3.

so, no, mulx is not microcoded on those cpus, and apparently it never was
(which is to be expected, since it's just a standard multiply that skips
setting the flags).
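For reference, a minimal Python model of what mulx does: an ordinary unsigned 64x64->128-bit multiply that writes the high and low halves to two destinations and leaves the flags untouched (it also takes one implicit operand in RDX). The function name and shape here are illustrative, not any real API:

```python
MASK64 = (1 << 64) - 1

def mulx(rdx, src):
    """Model of x86 BMI2 MULX: unsigned 64x64 -> 128-bit multiply of
    RDX by src, returning (high, low) 64-bit halves. Unlike `mul`,
    the real instruction does not modify the flags (not modeled here)."""
    product = (rdx & MASK64) * (src & MASK64)
    return (product >> 64) & MASK64, product & MASK64

print(mulx(2, 3))            # small case: (0, 6)
print(mulx(MASK64, MASK64))  # full-width case
```

Not touching the flags is exactly what makes mulx useful in bignum inner loops: it can be interleaved with add-carry chains (adc/adcx/adox) without clobbering the carry state.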

Jacob

