[Libre-soc-dev] SVP64 Vectorised add-carry => big int add

Wed Apr 13 02:25:22 BST 2022

On Tue, Apr 12, 2022, 17:20 lkcl <luke.leighton at gmail.com> wrote:

>
>
> On April 12, 2022 4:00:25 AM UTC, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> >See table 2 in:
> >
> http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
> >for an example of why two separate carry chains are needed...the
> >example
> >algorithm they use is 512-bit multiplication.
>
> making sure there some notes from the meeting tonight: it occurred to me
> that for these bigint computations, div and mul, it is a false assumption
> that it is absolutely necessary to have and to use 64 bit to create or
> synthesise 128 bit from src or dest by messing about with multiple
> instruction sequences.
>

sorry, that doesn't work with div and mul, because at the low-size end of
the hierarchy of multiplication/division algorithms, the algorithms are
O(n^2) operations, so if you make each operation handle half as many bits
but have twice the throughput (what you'd get with vectors of 32-bit ops
rather than 64-bit assuming simd units are still the same total size) you
still run twice as slow because you run 4x as many ops at just 2x the
throughput. 512-bits is small enough that the most efficient division
algorithm is probably Knuth's Algorithm D (with modifications from the
exercises section of his book) which is O(n^2) and uses 2N-by-N-bit
div/rem, which is why I wanted us to add a 128-bit-by-64-bit div op (would
be 3-in 1-out) -- i consider that a 64-bit op.

an oversimplified version of the knuth algorithm d with 32-bit words is:
void div(uint32_t *n, uint32_t *d, uint32_t* q, int n_bytes, int d_bytes) {
    // assumes d[0] != 0, also n, d, and q have their most-significant-word
in index 0
    int q_bytes = n_bytes - d_bytes;
    for(int i = 0; i < q_bytes / sizeof(n[0]); i++) {
        // calculate guess for quotient word
        q[i] = (((uint64_t)n[i] << 32) + n[i + 1]) / d[0];
        // n -= q[i] * d
        uint32_t carry = 0, carry2 = 0;
        for(int j = d_bytes / sizeof(d[0]) - 1; j >= 0; j--) {
            uint64_t v = (uint64_t)q[i] * d[j] + carry;
            carry = v >> 32;
            v = (uint32_t)v;
            v = n[i + j] - v + carry2;
            carry2 = v >> 32; // either ~0 or 0
            n[i + j] = v;
        }
        // fixup if carry2 != 0
    }
    // now remainder is in n and quotient is in q
}

if you switch to 64-bit words, both inner and outer loops loop half as many
times, giving 2x speedup assuming 64-bit ops are twice as slow as 32-bit
ops. having special carry logic just makes the inner loop not change its
speed, the outer loop still changes to be twice as slow with 32-bit ops
rather than 64-bit since it needs twice as many iterations since each word
division only gives half as many bits in the result, needing twice as many
word divisions to get the full quotient.

Jacob