[Libre-soc-dev] SVP64 Vectorised add-carry => big int add

lkcl luke.leighton at gmail.com
Wed Apr 13 04:43:41 BST 2022


On Wed, Apr 13, 2022 at 2:25 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> sorry, that doesn't work with div and mul, because at the low-size
> end of the hierarchy of multiplication/division algorithms,
> the algorithms are O(n^2) operations,

ok yes: one of those factors of 2x comes from there being twice the
number of rows of partial products, the other from twice the number
of columns.
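
(a quick sanity-check of that 4x, as a throwaway python sketch, not
project code, just counting limb-by-limb partial products for the
512x512 case used further down:)

BITS = 512

limbs_32 = BITS // 32         # 16 limbs: 16 rows x 16 columns
limbs_64 = BITS // 64         #  8 limbs:  8 rows x  8 columns

pp_32 = limbs_32 * limbs_32   # 256 partial products (32x32->64 each)
pp_64 = limbs_64 * limbs_64   #  64 partial products (64x64->128 each)

print(pp_32, pp_64, pp_32 // pp_64)   # 256 64 4: the 4x = 2x rows * 2x cols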

briefly, because it's currently 4:30am here, there are a couple of things:

1) if the hardware is the same then twice the number of 32-bit ops
    can be issued to the same Dynamic-SIMD back-end, which gets one
    of the factors of 2x back (see the sketch after point 2).

    (i.e. you issue 2x 32-bit ops to the 64-bit-wide SIMD back-end
    ALUs, where in the 64-bit case you can only issue 1x 64-bit-wide
    op to the *same* hardware. the total number of bits per clock
    cycle is exactly the same)

2) if the hardware for 64-bit MUL is implemented as an FSM using
    a) 4 clock cycles (by having only a single 32-bit mul block, using
        it 4x and doing adds behind the scenes)  OR
    b) 2 clock cycles (by having a pair of 32-bit mul blocks, using
        them 2x and doing adds behind the scenes) THEN
    c) the number of gates is the same AND
    d) the 512x512 mul using such hardware would actually be
        on-par or 2x slower (!)
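
to make the arithmetic behind (1) and (2d) concrete, a rough python
sketch (illustrative only: the split into four 32x32 partial products
and the per-variant clock counts are the assumptions stated above,
not measured hardware figures):

# what the FSM in (2) computes: a 64x64->128 multiply built from
# four 32x32->64 partial products plus shifted adds
def mul64_from_32(a, b):
    mask = (1 << 32) - 1
    a_lo, a_hi = a & mask, a >> 32
    b_lo, b_hi = b & mask, b >> 32
    # four 32x32 muls: one per clock in variant (a), two per clock in (b)
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    return ll + ((lh + hl) << 32) + (hh << 64)

assert mul64_from_32(0xDEADBEEFCAFEF00D, 0x123456789ABCDEF0) == \
       0xDEADBEEFCAFEF00D * 0x123456789ABCDEF0

# rough mul-issue clock counts for a 512x512 schoolbook multiply
# (adds/carries ignored, 64-bit-wide dynamic-SIMD back-end assumed):
clocks_simd_32 = 256 // 2   # 256 32-bit muls at 2 per clock -> 128
clocks_fsm_a   = 64 * 4     # 64 64-bit muls, 4 clocks each  -> 256 (2x slower)
clocks_fsm_b   = 64 * 2     # 64 64-bit muls, 2 clocks each  -> 128 (on par)
print(clocks_simd_32, clocks_fsm_a, clocks_fsm_b)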

it's quite clear from what konstantinos said about mulx on x86
processors from 2012 that it is implemented in micro-code with
only a 32-bit multiplier in hardware plus a batch of 64-bit adds,
which would account for those 4-10 cycles of latency.
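
for reference, a sketch of mulx's semantics (from memory of the BMI2
spec, so treat as an assumption to double-check): it's a plain
unsigned 64x64->128 multiply that leaves the flags alone, i.e.
roughly this in python:

def mulx64(rdx, src):
    # unsigned 64x64 -> 128-bit product, split into (high, low) 64-bit
    # halves; unlike legacy MUL, no flags are modified
    p = (rdx & ((1 << 64) - 1)) * (src & ((1 << 64) - 1))
    return p >> 64, p & ((1 << 64) - 1)

if a core only has a 32-bit multiplier array, that single instruction
expands internally into the same four-partial-product dance sketched
above, which would fit the observed latency.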

more tomorrow when i'm awake. zz...

l.


