[Libre-soc-dev] big-int goldschmidt div
lkcl
luke.leighton at gmail.com
Sun Apr 24 12:23:01 BST 2022
just looking at what it would take to implement a 128/64 divide, there already exists (scalar) divdeu:
dividend[0:(XLEN*2)-1] <- (RA) || [0]*XLEN
divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)
result <- dividend / divisor
RT <- result[XLEN:(XLEN*2)-1]
extraneous stuff like overflow taken out, there, for clarity. RA is shifted up by XLEN (64), RB is zero-extended to 128 bits, and RT takes the lower 64 bits of the 128-bit quotient.
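as a quick sanity-check, the divdeu semantics above can be modelled in a few lines of Python (the function name and style are mine, purely illustrative):

```python
# Model of divdeu semantics for XLEN=64, with overflow handling
# stripped out exactly as in the pseudocode above.
XLEN = 64
MASK = (1 << XLEN) - 1

def divdeu(ra, rb):
    dividend = ra << XLEN         # (RA) || [0]*XLEN
    divisor = rb                  # [0]*XLEN || (RB)
    result = dividend // divisor
    return result & MASK          # RT <- low 64 bits of the quotient
```

so e.g. divdeu(1, 2) gives 2**63: the dividend is 1.0 shifted up into the top half, halved.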
in theory this instruction could be used to get halfway towards the qhat estimate (using the top two digits) but you still have to get the rest of the way.
a new 128/64 scalar divide instruction would be
dividend[0:(XLEN*2)-1] <- (RA) || (RS)
where RS=RA+1 just like lq and stq.
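sketched out the same way (hypothetical instruction, names mine), the only difference from divdeu is that RS supplies the low 64 bits of the dividend instead of zeros:

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def div_128_64(ra, rs, rb):
    # dividend[0:(XLEN*2)-1] <- (RA) || (RS): full 128-bit dividend
    dividend = (ra << XLEN) | rs
    return (dividend // rb) & MASK   # RT <- low 64 bits of the quotient
```
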
using Goldschmidt this would get us a scalar estimate within a few clock cycles, *not* 128. bear in mind that because the estimate is needed by the big-int multiply that follows, the *entire* back-end of SIMD mul engines is held up waiting for that estimate.
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;h=b6c4b52e763b87ce842ac214e9901b7065c54092;hb=HEAD#l127
BUT... this is the kicker:
that hold-up still occurs even if Goldschmidt is utilised for the qhat estimate. even if Goldschmidt produces an answer in 1 clock cycle there is still the loop afterwards (line 132) doing the inc/dec while-loop adjustment.
which had me thinking:
"hang on a minute, why are we not implementing Big-Integer Goldschmidt"??
(Goldschmidt is basically: you keep multiplying both dividend and divisor by an estimate F, which keeps the divisor between 0.5 and 1.0, until the divisor has reached 1.0, at which point the dividend has *become* your answer. each new estimate is F = 2.0 - divisor. i say 0.5, 1.0 and 2.0, i mean the Fixed-Point equivalents)
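the scalar loop can be sketched in fixed point like this (Python ints standing in for 64-bit registers; the Q-format choice and names are mine):

```python
# Fixed-point Goldschmidt division sketch (illustrative, not RTL).
FRAC = 64                    # fractional bits: "1.0" is 1 << 64
ONE = 1 << FRAC
TWO = 2 << FRAC

def goldschmidt_div(n, d, iters=6):
    """approximate n/d in fixed point. caller pre-normalises d so
    that 0.5 <= d < 1.0, i.e. ONE // 2 <= d < ONE."""
    for _ in range(iters):
        f = TWO - d            # new estimate F = 2.0 - divisor
        n = (n * f) >> FRAC    # dividend *= F
        d = (d * f) >> FRAC    # divisor *= F, converges towards 1.0
    return n                   # dividend has become the quotient
```

convergence is quadratic (the error in the divisor squares every iteration), which is why only a handful of iterations are needed.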
in a Big-Int Goldschmidt, for the initial estimate a 64/64 divide would do, using the top MSBs. the internal loop is a pair of Big-Int multiplies and a Big-Int subtract (2.0 - D) to compute the new estimate, and then if the divisor is close enough to 1.0 (in Fixed-Point) the loop may be terminated early.
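a hedged sketch of that Big-Int version, with Python ints standing in for the multi-word values the SIMD back-end would handle (all names and the guard-bit choice are mine):

```python
def bigint_goldschmidt_div(n, d, extra=8):
    """floor(n/d) for arbitrary-size n >= 0, d > 0, via Goldschmidt.
    'extra' guard bits absorb fixed-point truncation error."""
    frac = max(n.bit_length(), d.bit_length()) + extra
    one, two = 1 << frac, 2 << frac
    shift = frac - d.bit_length()
    nf, df = n << shift, d << shift     # same shift on both: n/d is
                                        # unchanged, df lands in [0.5, 1.0)
    for _ in range(frac.bit_length() + 2):  # quadratic-convergence bound
        f = two - df                    # the Big-Int subtract: 2.0 - D
        nf = (nf * f) >> frac           # the pair of Big-Int multiplies
        df = (df * f) >> frac           # divisor converges towards 1.0
        if df >= one - 1:               # close enough: terminate early
            break
    q = nf >> frac
    # final correction absorbs truncation error, exact floor guaranteed
    while q * d > n:
        q -= 1
    while (q + 1) * d <= n:
        q += 1
    return q
```

the final correction plays the role of Algorithm D's add-back step, but operates on the whole quotient once rather than per-digit inside the hot loop.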
compared to Knuth Algorithm D the "downtime" of the SIMD Mul backend is *really* minimal. yes there are 2 big-int multiplies per iteration but guess what, there will be *fewer iterations* and certainly less qhat/rhat increment adjustment (i.e. none).
bottom line here is: adding a 128/64 div isn't going to completely stop the idle time caused by the Algorithm D qhat estimate. Goldschmidt Big-Int Div would not *need* a 128/64 div and would best Algorithm D hands-down on performance anyway.
l.