[Libre-soc-dev] Reservation Stations. Was [Libre-soc-bugs] [Bug 782] add galois field bitmanip instructions

Wed Mar 9 08:29:36 GMT 2022

On Tue, Mar 8, 2022, 23:42 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

>
> > why in the world would you need 30 RSes?
>
> this is the absolute absolute inviolate rule: i repeat it again until you
> have accepted it.
>
> you CANNOT have un-managed data.
>

> i will repeat it again.
>
> you CANNOT have un-managed dependencies.
>
> any un-managed dependencies absolutely have to be met
> with a stall at issue time.
>

i never challenged that. i know, and completely get it.

> therefore the moment you run out of RSes, you MUST stall.
>
> therefore, any conditions where you are expecting there to
> be tight loops that do not stall, you MUST have sufficient
> RSes.
>
> > if the FSM starts 1 instruction,
> > executes for 30 cycles,
>
> [which is no different from having a pipeline of depth 30...]
>

it's totally different...with a pipeline, you can have 30 instructions
executing simultaneously, all of those instructions need a RS to track
them. you can have 1 instruction completing execution every clock cycle.
you can have 1 instruction starting execution every clock cycle.

with a FSM, which, for the sake of argument, I'm assuming only executes 1
instruction at a time, no matter how many RSes you decide you need, only 1
RS will be executing at any time, all the rest will be waiting to execute
or doing other misc. stuff. If you have the FSM running 100% of the time,
it will only execute 1 instruction every 30 cycles. if you have a loop that
tries to run a bunch of instructions through the FSM, the RSes will fill up
and then issue will stall, even with 3 or 30 or 500 RSes. the stalling
occurs because issue runs faster than the FSM, not because you don't have
enough RSes.
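
To make the point above concrete, here is a minimal cycle-level sketch in Python. Everything in it is an assumption for illustration, not anything from the Libre-SOC code: the function name `simulate_fsm`, issue attempting 1 instruction per cycle, a single FSM that executes 1 instruction at a time, and an RS being held from issue until the FSM finishes that instruction.

```python
def simulate_fsm(num_rs, latency, num_instrs):
    """Toy model (assumed, not Libre-SOC's): one non-pipelined FSM fed by RSes.

    Returns (total_cycles, stall_cycles) for a stream of num_instrs
    instructions that all need the same FSM.
    """
    waiting = 0      # instructions sitting in RSes, not yet executing
    busy_left = 0    # cycles left for the instruction in the FSM (0 = idle)
    issued = 0
    completed = 0
    stalls = 0
    cycle = 0
    while completed < num_instrs:
        cycle += 1
        # Advance the instruction currently executing in the FSM.
        if busy_left > 0:
            busy_left -= 1
            if busy_left == 0:
                completed += 1          # result done, its RS is freed
        # An idle FSM picks up the next waiting instruction, if any.
        if busy_left == 0 and waiting > 0:
            waiting -= 1
            busy_left = latency
        # Issue tries to allocate one RS per cycle and stalls if none is free.
        occupied = waiting + (1 if busy_left > 0 else 0)
        if issued < num_instrs:
            if occupied < num_rs:
                waiting += 1
                issued += 1
            else:
                stalls += 1
    return cycle, stalls


if __name__ == "__main__":
    for rs in (3, 30, 500):
        print(rs, simulate_fsm(num_rs=rs, latency=30, num_instrs=1000))
```

Under these assumptions, 3, 30 and 500 RSes all finish the 1000-instruction stream in the same number of cycles, and issue stalls in every case; more RSes only delay the point at which the stalling starts.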

>
> > and then can start the next instruction (i'm
> > assuming it can't run multiple instructions simultaneously in the FSM),
> > there only needs to be enough RSes to ensure that it can always start
> > executing the next instruction immediately when it finishes the previous
> > instruction.
>
> uh-huhn? yes?  think it through.  how many operations of type
> handled-by-the-FSM do you want to be executing simultaneously?
>

i just said i'm assuming it can only execute 1 instruction at a time:
"can't run multiple instructions simultaneously"

>
> if they take 30 cycles per FSM, and you want 30 such operations
> to be in-flight, it is ABSOLUTELY required that there correspondingly
> be [minimum] 30 RSes.
>

yes, but only because you want 30 operations in-flight, not because the FSM
takes 30 cycles.

if the FSM takes 5 cycles and you want 64 instructions in-flight, then you
need 64 RSes to hold them all. if the FSM takes 500 cycles and you want 64
instructions in-flight, then you need only 64 RSes to hold them all, not
500.

>
> note i said "if you want 30 such operations to be in-flight"
> which for e.g. FPDIV tight-inner-loops you already explained
> to me 18 months ago is absolutely critical.
>

3D stuff generally cares about throughput, not latency. RS count would be
affected mostly by the largest VL you want to support, to avoid having one
sv.fdiv stall later independent instructions, not by the latency of a FSM.

>
> > for a FSM that slow, you could probably get away with only 2
> > RSes,
>
> then on the 3rd such operation issued to those FSMs, you MUST
> stall the entire processor issue.  if three such instructions were issued
> in quick succession, that's an entire *28* cycles of stall.
>
> to prevent that from happening, you *MUST* be able to allocate
> to Dependency Matrices, you *MUST* allocate to RSes, and therefore
> you MUST allocate 30 RSes.
>
> again, i repeat: it is absolutely no different from having a pipeline
> of depth 30.
>

i refer to my previous explanation in this email, where i explain how a
pipeline is different than the style of FSM i'm assuming we're talking
about.

> loop:
>          500op RT, RA, RB ; does not matter if it is a FSM or a pipeline,
>                                        ; it takes 500 cycles to complete
>          bc loop
>
> that assembly code, if you want it not to stall, had better have 500+
> RSes.
>

assuming the loop has a high iteration count, that assembly loop *will
always* stall for the FSM, since stalling is the only mechanism that stops
issue from running at 1 instruction every cycle or two, which is way faster
than the FSM. the FSM has a max throughput of 1 instruction every 500
cycles. issue will stall until it is going at the same throughput as the
FSM (unless the loop is short enough that our RSes can cover it). This is
true whether you have 3 RSes or 7000.

a pipeline has a max throughput of 1 per cycle, so no stalling is necessary
since it can keep up with issue's max rate for that loop (i'm assuming the
fetch/decode pipe can only handle 1 taken branch per clock cycle). For a
pipeline, you only need as many RSes as required to track the instructions
executing in various stages of the pipeline, as well as a few extras for
misc. RS processing. This is because the pipeline has high enough
throughput to keep up with issue. if the pipeline can't keep up with issue,
then it becomes more like the FSM case, where you need as many RSes as the
number of instructions you want in-flight at a time, which should be at
least the number of stages in the pipeline.
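
For comparison, the pipeline case can be sketched the same way. As before this is a toy model with assumed names (`simulate_pipeline` is hypothetical): the pipeline accepts 1 new instruction per cycle, each instruction holds its RS for `depth` cycles, and the RS is freed in the cycle the result is ready. The few extra RSes for misc. processing are not modelled.

```python
from collections import deque


def simulate_pipeline(num_rs, depth, num_instrs):
    """Toy model (assumed): fully pipelined unit, 1 issue attempt per cycle.

    Returns (total_cycles, stall_cycles).
    """
    in_flight = deque()   # completion cycle of each instruction holding an RS
    issued = 0
    stalls = 0
    cycle = 0
    while issued < num_instrs or in_flight:
        cycle += 1
        # Retire finished instructions; each one frees its RS.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
        # Issue one instruction if an RS is free, otherwise stall.
        if issued < num_instrs:
            if len(in_flight) < num_rs:
                in_flight.append(cycle + depth)
                issued += 1
            else:
                stalls += 1
    return cycle, stalls


if __name__ == "__main__":
    for rs in (2, 30, 500):
        print(rs, simulate_pipeline(num_rs=rs, depth=30, num_instrs=100))
```

In this model, once the RS count reaches the pipeline depth, issue never stalls and adding more RSes changes nothing. With fewer RSes than stages, issue stalls because of the shortage of RSes, not because the pipeline itself is too slow.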

Jacob