[libre-riscv-dev] div/mod algorithm written in python

Mon Jul 22 22:49:58 BST 2019

On Tuesday, July 23, 2019, Jacob Lifshay <programmerjake at gmail.com> wrote:

> we were going to have 2 radix-8 stages per pipeline stage, right?

Radix 8 is a hell of a lot of gates, compared to the OTF conversion and the
redundant representation algorithms.

>  if I
> recall correctly, the plan was that the div pipeline would be long enough
> for 32-bit but 64-bit would need to go through the pipeline twice (once it
> was modified to allow that).

No, I don't recall seeing that, however it makes sense.

Running twice sounds complicated enough to warrant a separate milestone.

Or to use the FSM version. I am not in the least bit concerned about 64 bit
performance.

8 and 16-bit only need to go through once.

I'd like separate short pipelines where practical, to keep latency down.
Blocking FU Reservation Stations is a bad idea and its avoidance needs
prioritising, because the ENTIRE issue engine has to come to a complete
total hard stop until an RS for the desired operation becomes free.

However if 16 bit can be used for 8 bit that would work.

>
> I wouldn't say having a 96-bit (f32)/192-bit (f64) intermediate is too
> long,

The Jon Dawson FSM needs just double the mantissa. So 53*2 for 64 bit. And
only 2 of those are needed. It's 106 cycles long however it is extremely
gate efficient.
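
A minimal sketch of why a bit-serial divider costs one cycle per quotient
bit, assuming a plain restoring shift-and-subtract scheme. This is NOT Jon
Dawson's actual FSM: serial_divide and the example mantissas are hypothetical
names and values chosen for illustration. With a 53-bit f64 mantissa and the
dividend widened to 2*53 bits, the loop runs 106 times:

```python
def serial_divide(dividend, divisor, width):
    """Restoring shift-and-subtract division, one quotient bit per 'cycle'.

    dividend: unsigned integer of at most `width` bits
    divisor:  nonzero unsigned integer
    Returns (quotient, remainder, cycles).
    """
    assert divisor > 0
    quotient = 0
    remainder = 0
    cycles = 0
    for i in reversed(range(width)):
        # shift the next dividend bit (MSB first) into the partial remainder
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:
            # the subtraction "fits": keep it and record a 1 quotient bit
            remainder -= divisor
            quotient |= 1
        cycles += 1
    return quotient, remainder, cycles


MANTISSA_BITS = 53               # f64 mantissa including the hidden bit
WIDTH = 2 * MANTISSA_BITS        # "double the mantissa"

# hypothetical mantissas, hidden 1 in bit 52
a = (1 << 52) | 0x123456789ABCD
b = (1 << 52) | 0xFEDCBA987654

q, r, cycles = serial_divide(a << MANTISSA_BITS, b, WIDTH)
print(cycles)  # 106
```

The only wide state in this sketch is the quotient and the partial
remainder, which is where the gate efficiency of a serial design comes from;
the price is the cycle count.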

>
> fmadd needs something close to that long anyway and nmigen can reduce
> the width a lot for most of the signals using the wreduce pass.

I really do not want to be relying on yosys optimisation passes. It is bad
practice, the graphviz diagrams are getting out of hand as it is, gtkwave
debugging is also adversely visually compromised, and it means extra
attention has to be paid during layout, inspecting post passes to make sure
yosys got it right.

And if it didn't, we have to go back, redesign the code, then re-run EVERY
unit test for all subsystems that rely on that module, and re-run every
functional and FPGA test for the ENTIRE system.

That could be days if not weeks of testing CPU time, given how massive this
is getting.

>
> for f32/u32/i32 divpipecoreconfig's bit_width needs to be 32-bits (in order
> to do integer ops),

OK, as a separate milestone. I want to get FPDIV declared "finished"
first.

>  fract_width should be 23 /* number of fractional bits
> in f32's mantissa */ + 1 /*guard*/ + 1 /*round*/;
> sticky is remainder.bool()
>
> the fp inputs (for div) should be put in the range [1, 2). the output is
> wide enough to handle all combinations of output.

I'd like it to be reduced to what is actually needed.

Debugging 173 bit numbers in binary is flat out impractical in gtkwave.

>
> read the comments on DivPipeCoreInputData to find out the number of
> fractional bits that each input has, shift each fp input so it has the
> correct number of fractional bits to match what DivPipeCore is expecting:

I did, I don't understand it. I just guessed and put in different
combinations of parameters, watched the output in gtkwave, and after enough
changes the output got better and eventually passed.

This is pretty much how I do all code development, which is why good unit
tests are absolutely critical, not just "critical because generally you
need them anyway".

I literally cannot do anything this complex without them.

That and good code comments.

> dividend_fract_width = core_config.fract_width * 2
> core_input.dividend.eq(n_mantissa << (dividend_fract_width -
> fp_format.fractional_width))
> core_input.divisor_radix.eq(d_mantissa << (core_config.fract_width -
> fp_format.fractional_width))

Not making a lot of sense. An ASCII art illustration, in the code, would
help.
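
A rough sketch of the kind of illustration meant here, using plain Python
integers instead of nmigen signals. The constants are hypothetical stand-ins
for the f32 case (fp_format.fractional_width = 23, core_config.fract_width =
23 + guard + round), and align is an illustrative helper, not a function from
the code base. It assumes the core computes dividend / divisor as a
fixed-point division, so the quotient's fraction bits are the dividend's
minus the divisor's:

```python
# Hypothetical stand-ins for the f32 configuration discussed above.
FP_FRACT = 23                    # fp_format.fractional_width
CORE_FRACT = FP_FRACT + 2        # core_config.fract_width (+guard +round)
DIVIDEND_FRACT = CORE_FRACT * 2  # dividend_fract_width

# Layout ('.' = binary point, f = fraction bit, 0 = zero padding):
#
#   n_mantissa  1.fff...f                   23 fraction bits
#   dividend    1.fff...f000........0       50 fraction bits (padded by 27)
#
#   d_mantissa  1.fff...f                   23 fraction bits
#   divisor     1.fff...f00                 25 fraction bits (padded by 2)
#
#   quotient  = dividend // divisor         50 - 25 = 25 fraction bits


def align(n_mantissa, d_mantissa):
    """Shift both mantissas to the fraction widths assumed above."""
    dividend = n_mantissa << (DIVIDEND_FRACT - FP_FRACT)
    divisor = d_mantissa << (CORE_FRACT - FP_FRACT)
    return dividend, divisor


n = (1 << FP_FRACT) | (1 << (FP_FRACT - 1))  # 1.5
d = 1 << FP_FRACT                            # 1.0
dividend, divisor = align(n, d)
quotient = dividend // divisor
print(quotient / (1 << CORE_FRACT))  # 1.5
```

Under that assumption the two operands are deliberately different sizes: the
dividend carries twice the fraction bits so that the quotient comes out with
exactly CORE_FRACT of them.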

I don't understand why I had to put a into the top bits of the divisor yet
b filled all of the dividend. That makes absolutely no sense.

They should be the same size. fractional_width I feel should not exist,
except I saw that Vulkan has fixed width FP with weird sizes.

> at the output:
> # see DivPipeCoreOutputData for fractional widths
> sticky = core_output.remainder.bool()
> round = core_output.quotient_root[0]
> guard = core_output.quotient_root[1]

I included one extra bit from the quotient to be included in sticky.

This is due to having to shift the mantissa AFTER the final phase, in cases
where the MSB is zero.

The MSB is zero because the range can be 0.5 to 1.9999999998.

That is because a and b are in the range 0.5 to 0.9999999999.
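
To make the range argument concrete, here is a small sketch with a
deliberately tiny, hypothetical fraction width (FRACT = 8) so the numbers
stay readable. fixed_div and normalise are illustrative helpers, not code
from div0.py or div2.py. It checks that operands in [0.5, 1.0) give a
quotient strictly between 0.5 and 2.0, and shows the one-place left shift
(with exponent adjustment) needed when the quotient's MSB is zero:

```python
from fractions import Fraction

FRACT = 8  # hypothetical small fraction width


def fixed_div(a, b):
    """a, b: integers with FRACT fraction bits; result has FRACT too."""
    return (a << FRACT) // b


def normalise(quotient, exponent):
    """quotient has 1 integer bit (bit FRACT) and FRACT fraction bits."""
    msb = (quotient >> FRACT) & 1
    if msb == 0:
        # result is in [0.5, 1.0): shift up one place, adjust exponent
        quotient <<= 1
        exponent -= 1
    return quotient, exponent


# range check on exact values
lo = Fraction(1, 2)
hi = Fraction((1 << FRACT) - 1, 1 << FRACT)  # largest value below 1.0
smallest = lo / hi   # just above 0.5
largest = hi / lo    # just below 2.0

# smallest / largest operand pairs as FRACT-bit fixed-point integers
q_small = fixed_div(128, 255)   # 0.5 / 0.996...  -> MSB is zero
q_large = fixed_div(255, 128)   # 0.996... / 0.5  -> MSB is one
print(normalise(q_small, 0))    # (256, -1)
print(normalise(q_large, 0))    # (510, 0)
```

Note that the left shift brings a zero into the LSB, so any precision wanted
in that position has to be computed before the shift.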

> # quotient_mantissa is fpformat.width - 2 bits wide with
> # fpformat.fractional_bits fractional bits
> quotient_mantissa = core_output.quotient_root[2:]
>

Can you take a look at div2.py? It is pretty close, except for the extra 3rd
bit, so format minus 3 bits and qr[3:].

It's done by lengthening quotient_root by 1 so that, when taking the last
bits, the one extra LSB goes into the sticky.
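
The two bit layouts can be compared side by side in a short sketch.
split_quotient is a hypothetical helper working on plain integers, not the
code in div2.py; extra_bits=0 follows the quoted snippet (round = q[0],
guard = q[1], mantissa = q[2:]) and extra_bits=1 follows the variant
described here, where one additional LSB is ORed into sticky and the
mantissa starts at q[3:]:

```python
def split_quotient(quotient_root, remainder, extra_bits=0):
    """Split a raw quotient into (mantissa, guard, round, sticky)."""
    # sticky always includes "remainder is nonzero"
    sticky = remainder != 0
    if extra_bits:
        # fold the extra low bits of the quotient into sticky as well
        low = quotient_root & ((1 << extra_bits) - 1)
        sticky = sticky or low != 0
        quotient_root >>= extra_bits
    round_bit = quotient_root & 1
    guard = (quotient_root >> 1) & 1
    mantissa = quotient_root >> 2
    return mantissa, guard, round_bit, sticky


q = 0b10110101
print(split_quotient(q, 0, extra_bits=0))  # (45, 0, 1, False)
print(split_quotient(q, 0, extra_bits=1))  # (22, 1, 0, True)
```

With the extra bit, the same raw quotient yields a mantissa one bit shorter,
which matches the "format minus 3 bits" remark.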

Also the MSB is stripped off in div2.py.

A 1 was placed in the MSB of a and b back in div0.py.

Then ANOTHER MSB was added (a zero). That gives 2 numbers in the range 0.5 to
0.999999, with room for a 1.9999998 result.

Really need you to take a look at this.

L.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68