[Libre-soc-dev] big-int goldschmidt div
lkcl
luke.leighton at gmail.com
Sun Apr 24 12:23:01 BST 2022
just looking at what it would take to implement a 128/64 divide, there already exists (scalar) divdeu:
dividend[0:(XLEN*2)-1] <- (RA) || [0]*XLEN
divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)
result <- dividend / divisor
RT <- result[XLEN:(XLEN*2)-1]
extraneous stuff like overflow taken out, there, for clarity. RA is shifted up by XLEN (64), RB is zero-extended to 128 bits, and RT takes the lower 64 bits of the 128-bit quotient.
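as a quick sanity-check, the divdeu semantics above can be modelled in a few lines of Python (the function name and style are mine, purely illustrative):

```python
# Model of divdeu semantics for XLEN=64, with overflow handling
# stripped out exactly as in the pseudocode above.
XLEN = 64
MASK = (1 << XLEN) - 1

def divdeu(ra, rb):
    dividend = ra << XLEN         # (RA) || [0]*XLEN
    divisor = rb                  # [0]*XLEN || (RB)
    result = dividend // divisor
    return result & MASK          # RT <- low 64 bits of the quotient
```

so e.g. divdeu(1, 2) gives 2**63: the dividend is 1.0 shifted up into the top half, halved.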
in theory this instruction could be used to get halfway towards the qhat estimate (using the top two digits) but you still have to get the rest of the way.
a new 128/64 scalar divide instruction would be
dividend[0:(XLEN*2)-1] <- (RA) || (RS)
where RS=RA+1 just like lq and stq.
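sketched out the same way (hypothetical instruction, names mine), the only difference from divdeu is that RS supplies the low 64 bits of the dividend instead of zeros:

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def div_128_64(ra, rs, rb):
    # dividend[0:(XLEN*2)-1] <- (RA) || (RS): full 128-bit dividend
    dividend = (ra << XLEN) | rs
    return (dividend // rb) & MASK   # RT <- low 64 bits of the quotient
```
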
using Goldschmidt this would get us a scalar estimate within a few clock cycles, *not* 128. bear in mind that because the estimate is needed by the big-int multiply that follows, the *entire* back-end of SIMD mul engines is held up waiting for that estimate.
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;h=b6c4b52e763b87ce842ac214e9901b7065c54092;hb=HEAD#l127
BUT... this is the kicker:
that hold-up still occurs even if Goldschmidt is utilised for the qhat estimate. even if Goldschmidt produces an answer in 1 clock cycle there is still the loop afterwards (line 132) doing the inc/dec while-loop adjustment.
which had me thinking:
"hang on a minute, why are we not implementing Big-Integer Goldschmidt"??
(Goldschmidt is basically: you keep multiplying both dividend and divisor by an estimate F, which keeps the divisor between 0.5 and 1.0, until the divisor has reached 1.0, at which point the dividend has *become* your answer. each new estimate is F = 2.0 - divisor. i say 0.5, 1.0 and 2.0, i mean the Fixed-Point equivalents)
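the scalar loop can be sketched in fixed point like this (Python ints standing in for 64-bit registers; the Q-format choice and names are mine):

```python
# Fixed-point Goldschmidt division sketch (illustrative, not RTL).
FRAC = 64                    # fractional bits: "1.0" is 1 << 64
ONE = 1 << FRAC
TWO = 2 << FRAC

def goldschmidt_div(n, d, iters=6):
    """approximate n/d in fixed point. caller pre-normalises d so
    that 0.5 <= d < 1.0, i.e. ONE // 2 <= d < ONE."""
    for _ in range(iters):
        f = TWO - d            # new estimate F = 2.0 - divisor
        n = (n * f) >> FRAC    # dividend *= F
        d = (d * f) >> FRAC    # divisor *= F, converges towards 1.0
    return n                   # dividend has become the quotient
```

convergence is quadratic (the error in the divisor squares every iteration), which is why only a handful of iterations are needed.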
in a Big-Int Goldschmidt, for the initial estimate a 64/64 divide would do, using the top MSBs. the internal loop is a pair of Big-Int multiplies and a Big-Int subtract (2.0 - D) to compute the new estimate, and then if the divisor is close enough to 1.0 (in Fixed-Point) the loop may be terminated early.
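a hedged sketch of that Big-Int version, with Python ints standing in for the multi-word values the SIMD back-end would handle (all names and the guard-bit choice are mine):

```python
def bigint_goldschmidt_div(n, d, extra=8):
    """floor(n/d) for arbitrary-size n >= 0, d > 0, via Goldschmidt.
    'extra' guard bits absorb fixed-point truncation error."""
    frac = max(n.bit_length(), d.bit_length()) + extra
    one, two = 1 << frac, 2 << frac
    shift = frac - d.bit_length()
    nf, df = n << shift, d << shift     # same shift on both: n/d is
                                        # unchanged, df lands in [0.5, 1.0)
    for _ in range(frac.bit_length() + 2):  # quadratic-convergence bound
        f = two - df                    # the Big-Int subtract: 2.0 - D
        nf = (nf * f) >> frac           # the pair of Big-Int multiplies
        df = (df * f) >> frac           # divisor converges towards 1.0
        if df >= one - 1:               # close enough: terminate early
            break
    q = nf >> frac
    # final correction absorbs truncation error, exact floor guaranteed
    while q * d > n:
        q -= 1
    while (q + 1) * d <= n:
        q += 1
    return q
```

the final correction plays the role of Algorithm D's add-back step, but operates on the whole quotient once rather than per-digit inside the hot loop.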
compared to Knuth Algorithm D the "downtime" of the SIMD Mul backend is *really* minimal. yes there are 2 big-int multiplies per iteration but guess what, there will be *fewer iterations* and certainly less qhat/rhat increment adjustment (i.e. none).
bottom line here is: adding a 128/64 div isn't going to completely stop the idle time caused by the Algorithm D qhat estimate. Goldschmidt Big-Int Div would not *need* a 128/64 div and would best Algorithm D hands-down on performance anyway.
l.