[Libre-soc-bugs] [Bug 413] DIV "trial" blocks are too large

Fri Jul 3 17:09:43 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=413

--- Comment #16 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #13)
> (In reply to Luke Kenneth Casson Leighton from comment #10)
> > (In reply to Jacob Lifshay from comment #8)
> > > (In reply to Luke Kenneth Casson Leighton from comment #7)
> > > > >>> 434*434
> > > > 188356
> > > > 
> > > > down from 500,000, but it is still going to be several hours on placement alone.
> > > > 
> > > > each core section also looks too large, containing rr multiply that
> > > > is not needed.  will try cutting that.
> > > 
> > > all the multiplies should be multiplying by small constants, which should
> > > convert to a few adds.
> > 
> > adds that are 192 bits long.  this results in absolutely massive adders
> > by the time it is converted to gates.  likewise for the trial_comparison
> > (the greater-than-equal)
> > 
> > this results in a 450k VST file because it is literally around 2,000
> > cells to do the compare @ 192 bit long
> > 
> > if one of those compares can be cut out (because the PriorityEncoder
> > will always select at least the lowest flag) then that literally
> > halves the number of gates when radix=1.
> > 
> > 
> > can you help investigate by using yosys and installing coriolis2 and
> > compiling the code so that you can see what is going on.
> > 
> > you need to understand exactly what is going on otherwise guessing what
> > *might* work is going to be a waste of time and we do not have time to
> > waste.
> > 
> > you need the feedback loop which you are entirely missing at the moment
> > by not running the unit tests
> 
> That's simply because I didn't yet get around to working on the unit tests,

sorry, i'm a bit stressed.  we still have mul to do

> I've been distracted by improving power-instruction-analyzer to allow using
> the tested-to-be-correct Rust instruction models directly from our python
> unit tests by adding Python bindings. I didn't push yet because I'm in the
> middle of a refactor and the code doesn't yet compile.

ah ok.  do keep us informed how that goes.

> I've been assuming that yosys synth is pretty representative, since it
> converts to all the gates that are used in the final circuit. If wiring is
> taking up much more space than would be expected from gate count, I can
> figure out how to install coriolis2.

it's that things are not obvious from yosys.  even without letting it proceed
to routing, just creating the VST files is informative (alliance takes the
BLIF and translates modules into VHDL, using a subset of its syntax): the size
of those VST files is a more accurate indication of area.

> > 
> > > if the div pipe is flattened, there is probably a lot more that can be
> > > shared between all the different parts, such as every stage multiplying the
> > > divisor by the same constants.
> > 
> > constants are simply converted to pulling locally to VSS or VDD at the
> > point they are needed: they take up no space at all.
> 
> true, except that each stage has its own instance of `divisor * N` for
> example, which gets converted to some adders, rather than a multiplier and a
> constant (assuming yosys isn't stupid).

it's good but not perfect.  it also tends to optimise for reducing latency
rather than gate count.

which is why ARM has always done 2 numbers for its cores: size optimised and
speed optimised
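
to illustrate the "multiply by a small constant becomes a few adds" point, here
is a minimal python sketch.  it is only a model of plain shift-and-add
decomposition: the function names are made up for illustration, and a real
synthesis tool may well pick a different decomposition (for example using
subtraction, or trading gate count against latency as noted above).

```python
def shift_add_terms(n):
    """Shift amounts s such that the sum of (x << s) equals x * n, for n >= 0."""
    return [s for s in range(n.bit_length()) if (n >> s) & 1]


def mul_by_const(x, n):
    """Multiply x by the constant n using only shifts and adds."""
    return sum(x << s for s in shift_add_terms(n))


def adders_needed(n):
    """Adders in this naive decomposition: one fewer than the number of set bits."""
    return max(len(shift_add_terms(n)) - 1, 0)


if __name__ == "__main__":
    for n in range(8):
        print(n, shift_add_terms(n), adders_needed(n))
```

each of those adders is as wide as the operand, so if every stage carries its
own copy of `divisor * N` the cost is repeated once per stage.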

> If that's replaced with propagating
> the pre-multiplied values through each stage, it would increase the d-ff
> count but reduce the adder count.

with 192 bits in rr and so on that's probably not a good idea.

> Additionally, if the wreduce yosys pass is run, it reduces the width of
> arithmetic operations when it can prove that a smaller bit-width will
> suffice.

interesting.
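
as a rough illustration of the kind of width reasoning involved, the sketch
below works out how many bits an unsigned multiply-by-constant really needs.
this is NOT how yosys implements wreduce: it is just the arithmetic, and the
function name and the 64-bit operand width are assumptions picked for the
example.

```python
def product_width(operand_bits, constant):
    """Bits needed to hold (any unsigned operand_bits-wide value) * constant."""
    max_operand = (1 << operand_bits) - 1
    return (max_operand * constant).bit_length()


if __name__ == "__main__":
    operand_bits = 64  # hypothetical width, for illustration only
    for constant in (1, 2, 3, 4):
        # naive sizing: sum of the operand width and the constant's width
        naive = operand_bits + constant.bit_length()
        print(constant, naive, product_width(operand_bits, constant))
```

the naive width is never smaller than the exact one, and for some constants
(powers of two, for instance) it is one bit too wide.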

> 
> DivPipeCore is really designed assuming it will be flattened and yosys will
> then have a chance to const-propagate past pipeline registers and convert
> multiple identical ops to use a single op.

yehh we can't make that assumption.  it may be necessary to do partial layout
of subblocks (each pipeline stage separately) but the way things are done right
now there is no infrastructure in place for partial flattening.

i have asked whitequark about a possible nmigen const propagation pass, which
would achieve the same thing.

jean-paul has had to put some hacks in place because there are quite a few
dangling signals both in and out on modules.
