[Libre-soc-bugs] [Bug 413] DIV "trial" blocks are too large

Fri Jul 3 22:08:48 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=413

--- Comment #20 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #18)
> > (In reply to Jacob Lifshay from comment #17)
> > > your pipeline loopback solution is waay more complex than I was imagining:
> > 
> > i've been thinking about how to do this for some considerable time, attempted
> > what you suggest below, and found that when trying to add mask cancellation
> > it became so complex that i couldn't actually work out how to do it.
> 
> I still think you're waay overcomplicating it.

i'm really not, and it's really not that complicated.  5 extra main lines of
code in the base class of ReservationStations and one extra constructor
parameter.

> I'll post a more detailed
> explanation a little later when I have time.

there's no need: i completely get it, because i already implemented it last
year, and i implemented it exactly, and i mean literally exactly, as you
described in the ASCII art diagram.

it *didn't work*, jacob.  a combinatorial lockup occurred under certain
conditions that i had to stop investigating because it was taking up far too
much time.

this has me seriously concerned that you will also become embroiled in trying
to work out a solution and become similarly trapped in the task...

.... when i *already have a simple solution* that only needs around 30 extra
lines of code to adapt *existing* classes for this purpose.

> > 
> > mask cancellation is absolutely essential because it's how the speculative
> > execution gets.. well... cancelled.
> 
> aren't the mask cancellation signals just broadcast to all pipeline stages?

yes.  except you can't broadcast data-erasure to an nmigen FIFO.

> all that's needed is to just mark the instruction as canceled

no, that does not work, if by that you mean that it should continue to
propagate through the pipelines.

it *has* to be actually entirely erased from existence throughout *all* data
structures because if that is not done, that muxid cannot and must not be
re-used.

the confusion between a cancelled still-in-flight muxid and a new one with the
same muxid will result in data corruption.

if by "mark" you mean "erase from existence" then yes. and that means all
register latches, all stages, all buffers, everything.  it has to be _gone_ as
if it never existed, was never issued.
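
to illustrate the muxid aliasing described above, here is a minimal toy model
in plain python (not nmigen, and not the actual ReservationStations code).
it assumes, purely for illustration, that results are matched back to their
reservation station by muxid alone.  the function name and the "stale" /
"fresh" payloads are hypothetical.

```python
from collections import deque


def first_result_for(muxid, erase_on_cancel):
    """Return the payload that reaches the station owning `muxid` first.

    Toy model: the pipeline is a queue of (muxid, payload) entries and
    results are matched to a reservation station by muxid alone
    (an assumption made for this sketch).
    """
    in_flight = deque()
    in_flight.append((muxid, "stale"))   # speculative op, about to be cancelled

    if erase_on_cancel:
        # Cancellation removes every trace of the entry from the pipeline.
        in_flight = deque(e for e in in_flight if e[0] != muxid)
    # Otherwise the cancelled entry keeps propagating through the stages.

    # The station is free again, so the same muxid is issued a second time.
    in_flight.append((muxid, "fresh"))

    # The first entry carrying this muxid is what the station receives.
    for mid, payload in in_flight:
        if mid == muxid:
            return payload
    return None


print(first_result_for(0, erase_on_cancel=True))
print(first_result_for(0, erase_on_cancel=False))
```

in this model, leaving the cancelled entry in flight means the station that
re-used the muxid receives the stale payload first, which is the data
corruption case.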

> and have the
> loop footer stop looping early.

i don't understand this, sorry.

> 
> Note that the loop header and footer are special blocks (not just a mux)
> that always prioritize non-canceled instructions that are looping back over
> new instructions, stalling all stages before the pipeline header when an
> instruction is looping back.

stalling at the header is not the main problem.  it _is_ a problem, but not the
main one.

think it through for the case where all, and i mean all, stages have data in
them.

stalling interferes with the propagation of data at the split and join points
in ways that cause the entire pipeline to become completely ineffective: 50%
utilisation with 2x loopback, 33% with 3x loopback, 25% with 4x loopback.
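
the percentages above can be reproduced with a small cycle-by-cycle model,
sketched here in plain python rather than nmigen.  it is an idealised
best case under stated assumptions: a rigid pipeline of `depth` stages that
never stalls internally, a header that always gives priority to an item
looping back, and a new instruction always waiting at the input.  the names
`new_issue_rate`, `depth` and `passes` are made up for this sketch.  it only
shows the ceiling on how often a new instruction can enter; it does not
model the combinatorial lockup.

```python
def new_issue_rate(depth, passes, cycles):
    """Fraction of cycles in which the header can accept a NEW instruction.

    Each stage holds None (empty) or the number of passes the item in it
    still has to make, counting the pass it is currently on.
    """
    stages = [None] * depth
    accepted = 0
    for _ in range(cycles):
        leaving = stages[-1]              # item at the loop footer this cycle
        looping_back = None
        if leaving is not None and leaving > 1:
            looping_back = leaving - 1    # it must go round again
        # Every item advances by one stage; slot 0 becomes free.
        stages = [None] + stages[:-1]
        if looping_back is not None:
            stages[0] = looping_back      # loopback has priority at the header
        else:
            stages[0] = passes            # header is free: accept a new item
            accepted += 1
    return accepted / cycles


for n in (1, 2, 3, 4):
    # cycles chosen as a whole number of (depth * passes) periods
    print(n, new_issue_rate(depth=4, passes=n, cycles=4 * n * 100))
```

in this model the header is occupied by looping-back items for all but one
in every `passes` cycles, so the rate at which new instructions enter is
1/passes regardless of `depth`.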

think of it like a traffic jam.  the mux-in junctions are incapable of allowing
the flow of twice the number of "cars", because the output path is only "1
wide". meaning it is *guaranteed* a 50% flow rate through either of the mux-in
paths.

and one of those is *connected to the mux-out*.

this *guarantees* that the mux-out (loopback) path can only take data 50% of
the time.

in turn that *guarantees* that the pipeline *has* to stall 50% of the time.

it's gridlock, basically, and that i think was why the code i wrote went into a
combinatorial lockup.

remember, jacob, i have *already tried implementing* EXACTLY what you are
proposing, and found that it *does not work*.

> That way, no FIFOs or any horribly complex
> stuff are needed.

i did not advocate the use of a FIFO. i only described its need to illustrate
overcoming the problem that you have not yet understood exists.

> The loop header block would have a pipeline register as
> part of it.

that just delays the inevitable stalling by one cycle only.

and is effectively equivalent to a FIFO of depth 1.

a FIFO of depth 1 is not adequate to stop the bottleneck problem when all
stages have data in them.

> Additional advantages of the loop header/footer are that they are
> composable, you can easily build a doubly-nested pipeline loop if we ever
> need that.

do you mean twin parallel pipelines, such that the system can cope with
twice the throughput?

or do you mean loopback three, four or eight times?

or, do you mean more complex micro-coding (such as using the int div pipeline
shared with FP)?

> You can think of them as the pipeline equivalent of a do-while loop.

sort-of, except it is more like a roundabout with cars on it, with one road in
and one road out (very close to each other but with the exit road *after* the
entry road), and every car must go round the roundabout by 360 (actually appx
375) degrees.

in this situation, traffic flow unfortunately becomes completely ineffective
during peak hours (when the roundabout is entirely full of cars) and can even
achieve gridlock, which a roundabout is not supposed to do!
