[Libre-soc-dev] Reservation Stations. Was [Libre-soc-bugs] [Bug 782] add galois field bitmanip instructions

Wed Mar 9 11:56:37 GMT 2022

On Wed, Mar 9, 2022 at 8:29 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> > any un-managed dependencies absolutely have to be met
> > with a stall at issue time.
> >
>
> i never challenged that. i know, and completely get it.

ok good.  [see how not acknowledging things can lead to confusion?
if i don't truly know that you know something, i have to repeat it
again and again]

> with a FSM, which, for the sake of argument, I'm assuming only executes 1
> instruction at a time, no matter how many RSes you decide you need, only 1
> RS will be executing at any time,

so this is why - and i apologise for not emphasising it enough - i have always,
in the examples (and in the diagram) said, "30 FSMs, 30 RSes, 30 DM Rows"

when you do that, assuming that the FSM is [roughly] equivalent in gates to a
single pipeline stage, the actual total resources in gates are [roughly]
equivalent to that of a 30-stage pipeline.

the exception to that is (as you said in your other message) when there are
more complex stages, and that was the focus of the diagram
https://libre-soc.org/3d_gpu/pipeline_vs_fsms.jpg

on the right, there are *5* small-stage pipelines with 2-out muxes.

> and then issue will stall, even with 3 or 30 or 500 RSes. the stalling
> occurs because issue runs faster than the FSM, not because you don't have
> enough RSes.

this will be because, in reality, 29 of the RSes will not be connected to
anything.

look again at the diagram: each RS is connected *directly* to each FSM.

the circumstances that you are describing are the worst possible (i have
to say dumbest) design: connecting 30 RSes through a 10,000-wire
MUX (assuming 300+ wires per RS), to a single FSM that can only handle
1 single result every 30 clock cycles?

this would be, scuse me for saying it, so _incredibly_ dumb that it never
even occurred to me that you would consider it, even hypothetically :)

it would have the worst of both worlds: a 10,000-wire fan-in MUX,
a 1,920-wire fan-out MUX, all of which would be a waste of gates,
and performance would be s***
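
as a rough sanity-check of those two figures, here is a minimal sketch of
the arithmetic.  the 300 wires per RS comes from the text above; the 64-bit
result width is an assumption (it is what makes 30 RSes come to 1,920), and
`mux_wires` is a made-up helper name, not part of any real design.

```python
# back-of-envelope wire counts for 30 RSes funnelled into one shared FSM.
WIRES_PER_RS = 300   # "300+ wires per RS" from the text (a lower bound)
RESULT_BITS = 64     # ASSUMPTION: 64-bit result bus per RS

def mux_wires(num_rs, wires_per_rs=WIRES_PER_RS, result_bits=RESULT_BITS):
    """Return (fan_in, fan_out) wire counts for num_rs RSes sharing one unit."""
    fan_in = num_rs * wires_per_rs    # operands from every RS into the MUX
    fan_out = num_rs * result_bits    # result routed back out to every RS
    return fan_in, fan_out

fan_in, fan_out = mux_wires(30)
print(fan_in, fan_out)  # 9000 1920
```

with "300+" rather than exactly 300 wires per RS, the fan-in rounds up to
roughly the 10,000 quoted above.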

> i just said i'm assuming it can only execute 1 instruction at a time:
> "can't run multiple instructions simultaneously"

yes.  that's why you drop 30 of them down.

this actually has a significant advantage over a 30-stage pipeline
when it comes to Multi-Issue execution.

* a 30-stage pipeline can *only* accept 1 input per clock cycle,
  no matter how many RSes you have.
* 30 copies of a FSM fronted by 30 RSes can accept up to 30 operations
  in a single clock cycle

of course, the 30 *RSes* in front of the [one] pipeline can accept up
to 30 operations in a single clock cycle, but those operations then
have to be issued to the pipeline sequentially, one at a time, introducing
an extra N cycles of delay onto the completion of operations,
where N is the number of instructions that can be issued per clock
in a multi-issue design.
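
the difference can be sketched with a toy timing model.  this is only an
illustration under simplifying assumptions: all N operations arrive in the
same clock, their operands are already available, and the function names
are hypothetical.

```python
LATENCY = 30  # cycles per FSM, or number of pipeline stages (from the text)

def last_done_fsms(n_ops, n_fsms=30, latency=LATENCY):
    """Cycle at which the last op finishes when each RS feeds its own FSM."""
    if n_ops > n_fsms:
        raise ValueError("more operations than FSMs: issue would have to stall")
    # every FSM starts in cycle 0, so they all finish together
    return latency

def last_done_pipeline(n_ops, latency=LATENCY):
    """Cycle at which the last op finishes when all RSes share one pipeline."""
    # op k can only enter the pipeline in cycle k (one input per clock)
    return (n_ops - 1) + latency

for n in (1, 8, 30):
    print(n, last_done_fsms(n), last_done_pipeline(n))
```

in this model 8 operations issued together all finish at cycle 30 on the
FSMs, whereas the last of them leaves the single pipeline at cycle 37.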

we discussed this previously (last week? 3 weeks ago?)

to "fix" that problem we will have to lay down at least N pipelines where
N is the issue width.  then we have the rather interesting problem of
having to have a 30-to-N-way MUX-in and N-to-30-way MUX-out.

which given the 300+ wires is quite, quite insane.  8-way multi-issue
would be a completely mad 80,000-wire fan-in to what would probably
be half a MILLION gates just in Muxes.

whereas all of that would be 100% eliminated with QTY30 FSMs.

this goes a long way towards explaining why Newton-Raphson FPDIV
tends to get preferred over pipelined FPDIVs, because the iterative
units can be real short.

> > if they take 30 cycles per FSM, and you want 30 such operations
> > to be in-flight, it is ABSOLUTELY required that there correspondingly
> > be [minimum] 30 RSes.
> >
>
> yes, but only because you want 30 operations in-flight, not because the FSM
> takes 30 cycles.

[the QTY30 FSMs take 30 cycles, or the 1x pipeline takes 30 cycles
and again, like the 30 FSMs, can process up to 30 operations
simultaneously]

the two are effectively synonymous, or "go together". apologies for
not making it clearer.

you need 30 RSes, 30 FU-Regs rows, 30 FU-FU rows, in order
to keep [either] QTY 30 of FSMs or QTY 1 of 30-stage pipeline occupied
100% without stall [assuming single-issue]
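
a minimal sketch of why the RS count has to match the latency, assuming
single-issue (one new operation attempted per clock) and assuming an RS is
held from issue until its result comes out.  `stall_cycles` is a
hypothetical helper for illustration only.

```python
def stall_cycles(num_rs, latency, cycles):
    """Count clocks in which issue stalls because every RS is still occupied."""
    busy_until = []   # for each occupied RS, the cycle at which it frees up
    stalls = 0
    for now in range(cycles):
        # release any RS whose operation has completed by this cycle
        busy_until = [t for t in busy_until if t > now]
        if len(busy_until) < num_rs:
            busy_until.append(now + latency)   # issue into a free RS
        else:
            stalls += 1                        # no free RS: issue stalls
    return stalls

print(stall_cycles(30, 30, 300))  # 0
print(stall_cycles(29, 30, 300))  # 10
```

the model does not care whether the 30 cycles are spent in one of 30 FSMs
or in a 30-stage pipeline: with one RS fewer than the latency, issue stalls
once every 30 clocks.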

apologies, i assumed you'd understand that the driving force is to avoid
stalling.

> if the FSM takes 5 cycles and you want 64 instructions in-flight, then you
> need 64 RSes to hold them all. if the FSM takes 500 cycles and you want 64
> instructions in-flight, then you need only 64 RSes to hold them all, not
> 500.

ah - again, not quite.  like how a Convolution has a build-up, middle-point,
and a "tail" effect, a 500-cycle pipeline - or FSM - has a down-wind effect
on any operations that depend on the results of the 500-cycle Function Unit.

if the result is not going to be *available* for 500 cycles, and you do not want
stalls to occur, you better have 500 *additional* RSes - for every other
operation (adds, muls, whatever) down-stream of that one.
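
a worst-case sketch of that down-wind effect, assuming every operation
issued after the long one depends on its result and so has to park in an RS
until the result is broadcast.  the function name and the "everything
depends on it" workload are assumptions for illustration.

```python
def peak_waiting(latency, issue_width=1):
    """Peak number of dependent ops parked in RSes behind one long operation."""
    waiting = 0
    peak = 0
    # the long op issues in cycle 0; dependents issue from cycle 1 onwards
    for cycle in range(1, latency + 1):
        if cycle == latency:
            waiting = 0               # result broadcast: dependents proceed
        else:
            waiting += issue_width    # another dependent parks in an RS
        peak = max(peak, waiting)
    return peak

print(peak_waiting(500))  # 499
print(peak_waiting(5))    # 4
```

under these assumptions the number of extra RSes needed to avoid a stall
tracks the latency of the slow unit, not the number of operations you
wanted in-flight inside it.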

the rest of what you wrote (unfortunately) looks like it's invalid because you
cover the [dumb] case: single-FSM, and huge (wasted) MUXes.

l.