[Libre-soc-bugs] [Bug 413] DIV "trial" blocks are too large

Fri Jul 3 20:25:25 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=413

--- Comment #18 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #17)
> your pipeline loopback solution is waay more complex then I was imagining:

i've been thinking about how to do this for some considerable time, attempted
what you suggest below, and found that when trying to add mask cancellation
it became so complex that i couldn't actually work out how to do it.

mask cancellation is absolutely essential because it's how the speculative
execution gets.. well... cancelled.

also, you have to have stall propagation in the solution that you outline
below.  i'll illustrate where.

also, *because* of that stall propagation, and how it interacts at the
merge-points, the pipeline utilisation is significantly sub-par.

> The pipeline would have either 7 stages (for 1x radix 3 compute stage per
> pipeline stage and 1 extra stage at each end for leftovers) or about 3 or 4
> (for 2x radix 3 compute stages per pipeline stage).
> I think we should only have 10 or so reservation stations for the non-SIMD
> case, since we don't need much more than the number of instructions that can
> be simultaneously executing. 20 or 30 is excessive.

it really isn't.  remember: only 10 of those are connected to (10)
ComputationUnits, whilst the remainder *store the partial results*.

this solves the problem of stalling.

> There would be a 2-bit Signal tracking the loop number -- totally separate
> from anything in ctx. fiddling with the id seems like a horrible mess that
> can be easily avoided.

you're missing the bit where i tried already to do this (exactly as outlined
below), and tried adding in the necessary mask-cancellation and that *was*
an awful mess.

by contrast, modifying the ctx.mux_id is a very simple solution that i will
have completed in about another 30-40 minutes.

> We would build a custom loop header and footer
> pipeline control stages, where they are constructed together so can send
> signals from footer to header for the loop backedge. 

yes.  this is what i tried.  the code already exists: MultiOutControlBase
and MultiInControlBase.  or, more to the point, CombMultiOutPipeline and
CombMultiInPipeline.

> This would also be
> useful for fsm pipelines since we could implement that using only a
> combinatorial stage in between the loop header and footer.
> 
> 
> So, the pipeline would look kinda like:

nice ASCII art, btw :)

> RS0    RS1 ... RS9
>  |      v      |
>  | +----+----+ |
>  +>+Prio. Sel+<+
>    +----+----+
>         |
>         v
>    +----+----+
>    |  setup  |
>    +----+----+
>         |
>         v
>    +----+----+
> +->+ loop hdr|
> |  +----+----+

this is where the problems start - right here.  when both "setup" and
"loop ftr" have data, *and* compute5 has data:

1) you can't stop compute5 from sending its data (pausing) because
   it's a pipeline.  it *HAS* to send its data to "loop ftr" on the
   next cycle.

2) therefore you MUST - without fail - prioritise "loop ftr" on the
   MUX-IN

3) therefore you must THROW AWAY the data from "setup".

you see how it doesn't work?  .... unless we have stall propagation
right the way through the entire pipeline.  which 

now, if you could *store* the partial result, somehow, in a buffer,
there would be no problem, would there?
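
to make the merge-point conflict above concrete, here is a minimal python
sketch (a behavioural model only, not the actual pipeline code).  the name
`merge_cycle` and its arguments are hypothetical: it models one clock cycle of
a two-input mux at the loop header, assuming the loopback input always wins.

```python
def merge_cycle(setup_item, loopback_item, can_hold_setup):
    """Model one cycle of a 2-input priority mux at the loop header.

    Hypothetical model: the loopback input cannot be paused, so it always
    has priority.  Returns (selected, held, dropped).
    """
    if loopback_item is None:
        # no conflict: whatever "setup" offers (possibly nothing) goes in
        return setup_item, None, None
    if setup_item is None:
        return loopback_item, None, None
    if can_hold_setup:
        # a stall or a buffer keeps the setup data for a later cycle
        return loopback_item, setup_item, None
    # no stall and no buffer: the setup data is lost
    return loopback_item, None, setup_item


print(merge_cycle("new-op", "partial", can_hold_setup=False))
print(merge_cycle("new-op", "partial", can_hold_setup=True))
```

the first call shows the data from "setup" being dropped, the second shows it
being held, which is only possible with stalling or with somewhere to store it.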

and how many would actually be needed?  the number needed can easily
be computed: it's the number of current ReservationStations (N)
because regardless of what's in-flight in the pipelines, the RSes
will *only* stop trying to push further data into the pipeline when
all ReservationStations have outstanding data.

with that being the condition when we can *guarantee* that no further
data will be pushed into the pipe, that in turn means that we *need*
exactly N more buffers to store the data.

what name could we give those buffers... i know, let's call them
ReservationStationN*2!

it turns out then that the exact same purpose for which ReservationStations
0 to N-1 are used (to prevent stalling and to keep in-flight data)
applies equally well to the partial results.

however: here is another solution:

        |
        v
   +----+----+
+->+ loop hdr|
|  +----+----+
|       |
FIFO    v
|  +----+----+
|  | compute0|
|  +----+----+

where that FIFO is large enough to store a *full* batch of
partial results.  in the case where we have 10 Reservation Stations
and the plan is to have a 4-way "loop", that FIFO *must* have 30
(40 - 10) entries.

anything less than that and it is *guaranteed* that data will be lost
(or stalling will be needed).
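
the sizing arithmetic above can be written out as a tiny python sketch.  the
function name `extra_entries_needed` is hypothetical, and it assumes one
partial result per RS per loop iteration, with the RSes themselves already
accounting for one iteration's worth.

```python
def extra_entries_needed(num_rs, loop_ways):
    """Extra storage entries needed beyond the RSes themselves.

    Assumption: up to num_rs * loop_ways partial results can be in flight,
    and num_rs of them are already covered by the existing RSes.
    """
    return num_rs * loop_ways - num_rs


print(extra_entries_needed(10, 4))  # 40 - 10 = 30
```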

and those extra 30 FIFO entries?  they serve the exact same purpose
as the extra 30 RSes, with the advantage being that i'm almost done
modifying the code to include the feedback loop (it's taking approximately
30 lines of code) whereas the solution using a FIFO is a lot more code
(a lot more objects).

but there's more.

here's the thing, jacob: both solutions *appear* directly equivalent to
each other... but they're actually not.

the problem is that the muxid must be globally unique.  we absolutely
cannot have multiple in-flight results with the exact same muxid.  if
we did, then it is *guaranteed* that data corruption would occur.

and if there are say only 10 RSes, and only 10 muxids, but there are
40 possible partial in-flight pieces of data, there will be 4 with
muxid=0, 4 with muxid=1 .... 4 with muxid=9.

this becomes... worrying.  however what's different about those 4
with the same muxid?  they have *different* 2-bit Signal tracking
IDs.

you see how that's exactly the same as adding an extra 2 bits on the
muxid?
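
a minimal python sketch of that equivalence follows.  the bit widths and the
choice to put the loop number in the upper bits are assumptions made purely
for illustration, and `extend_muxid` / `split_muxid` are hypothetical names,
not existing functions.

```python
MUXID_BITS = 4  # assumption: enough bits to number 10 RSes
LOOP_BITS = 2   # the 2-bit loop number from the quoted proposal


def extend_muxid(muxid, loop_num):
    """Pack the loop number above the original muxid bits."""
    assert 0 <= muxid < (1 << MUXID_BITS)
    assert 0 <= loop_num < (1 << LOOP_BITS)
    return (loop_num << MUXID_BITS) | muxid


def split_muxid(ext_id):
    """Recover (muxid, loop_num) from an extended id."""
    return ext_id & ((1 << MUXID_BITS) - 1), ext_id >> MUXID_BITS


# 10 RSes times 4 loop iterations: every in-flight item gets a distinct id
ids = {extend_muxid(m, n) for n in range(4) for m in range(10)}
print(len(ids))
```

whether the two bits live in a separate Signal or inside the muxid, the pair
(muxid, loop number) is what has to be unique.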

now, it doesn't actually matter if Signal tracking IDs are used or
if there are 2 extra bits on the muxid.  it's perfectly fine to have
both.

but... there's *still* more!

when it comes to mask cancellation, the data that's in the FIFO would
need to be destroyed.  that means *intrusively* getting inside the
FIFO, and interfering with the data structure which is *specifically*
designed not to be cancellable (it's a round-robin SRAM).
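
for contrast, here is a rough python sketch of why cancellation is easy when
partial results sit in individually addressable slots (as RSes are) rather
than in a FIFO.  the slot list, the bitmask encoding and the name
`cancel_slots` are all assumptions for illustration only.

```python
def cancel_slots(slots, cancel_mask):
    """Clear every slot whose bit is set in cancel_mask.

    slots: list indexed by slot number, None meaning "empty".
    cancel_mask: integer bitmask, bit i cancels slot i.
    """
    return [None if (cancel_mask >> i) & 1 else item
            for i, item in enumerate(slots)]


slots = ["div0", "div1", None, "div3"]
after = cancel_slots(slots, 0b1010)  # cancel slots 1 and 3
print(after)
```

each slot is cleared independently of the others.  a FIFO that only exposes
its head and tail has no equivalent operation without reaching inside it.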

overall, then, when everything is taken into consideration, the proposal
to alter the muxid is way *way* simpler and more effective:

* no pipeline stalling needed
* no interaction at the mux-in/out
* very little code needs to be written
* mask cancellation is not adversely impacted (it's already functional)
