[Libre-soc-dev] Reservation Stations. Was [Libre-soc-bugs] [Bug 782] add galois field bitmanip instructions

Tue Mar 8 12:46:17 GMT 2022

this is so fundamental that it's worth doing a special mailing list
post to cover it.  for background,
see https://libre-soc.org/3d_gpu/architecture/6600scoreboard/

in particular, see this diagram:
https://libre-soc.org/3d_gpu/multiple_function_units_write.svg

although it is for the original 6600 the concept has not changed:
there are simply far more Function Units in the design that we are
doing, far more register files (6), and each Function Unit requires
far more registers (up to 6 in some cases) and therefore far more
entries per Dependency Matrix Cell.

the other relevant diagram is this:
https://libre-soc.org/3d_gpu/concurrent_comp_unit.png

but do not be confused or deceived by the pipeline being shared
into thinking that just because there is a pipeline being shared
by multiple RSes, somehow this destroys or abrogates the
responsibility for tracking of Dependencies.

to reiterate what i wrote here, in its original context
https://bugs.libre-soc.org/show_bug.cgi?id=782#c44

    the absolute inviolate rule of OoO design is that you
    absolutely cannot have an untracked in-flight computation.
    every register involved in every computation has to be tracked,
    and every RS with an outstanding computation has to be
    frozen (marked busy, not accepting a new computation)
    until the result hits a regfile.

unless you are happy with stalling, of course.

On March 8, 2022 8:02:53 AM UTC, bugzilla-daemon--- via libre-soc-bugs
<libre-soc-bugs at lists.libre-riscv.org> wrote:
>--- Comment #43 from Jacob Lifshay
>well, if we're using a pipeline for gfbinv (which we totally want to if
>we want
>AES to be usably fast without needing 7 zillion gfbinv FSMs), it's much
>harder
>to tack another several stages on the end of a pipeline (unless we want
>to not
>share it with gfmul), unless we want to do microcoding.

the logic already exists, conceptually, to do exactly that. there is
no problem at all in using the ReservationStations2 class,
which automatically creates the tracking, fan-in, feeding-to-pipeline,
fan-out, and ready/valid handling for even FSMs to utilise.

if every FSM already has ready/valid signalling, it is almost laughably
trivial to wire up the FSM to a shared instance of a ReservationStations2
which manages the pipeline.

the question then becomes, is the resources involved in the massive
fan-in fan-out worth the cost of duplicating the "complex" block?

>> btw do not consider pipeline designs to be "the peak epitomy
>> of absolute performance", it is misleading.
>>
>> multiple simpler FSMs with multiple RSes can be more efficient
>> and take advantage of early-out and jump-ahead opportunities
>> that pipelines cannot.  this makes them *less* efficient rather
>> than more, particularly for long-running ops.
>
>that depends...if the pipeline allows us to specialize stages to
>specific parts
>of the algorithm it can be faster (e.g. fpmul with one stage for
>unpacking/denormal-shifting, one stage for partial product computation
>and
>first part of carry-save adders, another stage of carry-save adders,
>final
>stage of carry-save adders and final carry-propagating add, stage
>rounding&packing&denormal handling) since each stage only needs to
>implement
>the parts it handles, making the stages run faster. if it were instead
>several
>totally separate FSMs for the same throughput, each FSM would need the
>logic
>for all stages of the algorithm,

this may turn out to be better (lower cost) than the alternative:
massive fan-in / fan-outs.

one thing though: by the time you get down to that level, the number
of clocks needed to perform such a smaller (intermediary) computation
is, obviously, much smaller, and therefore you can have *multiple*
such shared pipelines, one per *group* of FSMs, with much smaller
fan-in / fan-out on each.

FSMs that take say 30 cycles (and therefore require a minimum
of 30 Reservation Stations) and require only 3 cycles of "complex"
(large) logic, could be subdivided into 10 groups of 3, each group
sharing a 3-stage pipeline, fronted by a ReservationStations2 class
that makes that pipeline *look* like there are 3 copies (Mitch Alsup
calls this "Concurrent Computation Units")

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/concurrentunit.py;h=e802ed6a93b01d68ee44eb9f76690a41edcfcf25;hb=HEAD#l174

the calculations on gate resources that we as architects then need to do is:

    what is the optimum balance between number of 3-stage
    pipelines and the associated fan-ins plus fan-outs

in other words if those 3-stage partial-processing pipelines are hugely
complex, then 10 duplicates would easily dwarf the size of the associated
3-in 3-out fan-in/out Muxes needed.

in such circumstances it could be better to say only have 5 groups of
6, with 6-way fan-in/out Muxes to the 5 ReservationStation2 instances
providing access to the 5 3-stage pipelines.

in extreme circumstances even that might not be enough, and you'd
need say 2 groups of 15, at which point the 15-way fan-in/out Muxes
would almost certainly begin to dominate

bear in mind, depending on what you are feeding in and out, that could
be as high as 240-bit wide data on 15-way fan-in/out!

for the FPDIV pipeline for example it could well even exceed that,
because we have to have 192-bit partial quotients (3x the 64-bit
inputs) and there is a massive remainder as well!

> making the data path much more complex
>and
>possibly taking more time due to the complexity, reducing max clock
>speed.

if you design something that takes more time, you're doing it wrong,
and eliminate that from consideration.

>>
>> [yes a FSM can be done which processes multiple chained combinatorial
>> stages]
>>
>> RSes need to farm-in and fan-out to pipelines (which have
>> to be BIG muxes), where FSMs can wire directly. each RS has
>> its own FSM, each FSM stores its result in the RS, no
>> massive fan-in/out.
>
>that's kinda deceptive, since the muxes with the massive fan-out/in are
>now in
>the scheduler logic and the busses for carrying intermediate register
>values
>around (i temporarily forgot the name) rather than in the pipeline
>fan-in
>fan-out.

no, it's not [deceptive], and they're not [in the scheduler logic].
you probably mean PriorityPickers and CDBs (see further below)
in which case you've conflated two different mux-buses

the fundamental rule is that you absolutely cannot have untracked
resources. period.  this is non-negotiable.

caveat: you can.... you just have to stall the entire processor,
clear out or wait for absolutely every in-flight piece of computation
to finish, *then* process the un-tracked computation, wait for
that to finish, and *then* and only then continue with dependency-
tracked computations.

given that having even one un-tracked computation could terminate
the possibility for being able to run a THOUSAND dependency-
tracked computations, common sense tells us to work *really hard*
to avoid having un-tracked computations.

so, before getting on to the paragraphs below, let me state:

1)  if you have a 10-stage pipeline or a 10-stage FSM, it is the
   *completion* time that matters (10 clocks) NOT whether it is
   a pipeline or a FSM, you ABSOLUTELY MUST have an
   absolute bare minimum of 10 Reservation Stations

   let us use the word "Function Unit" to cover both FSM and
   pipeline.

2) you absolutely must have a FU-FU Row covering each
    Reservation Station

3) you *CAN* share those Reservation Stations between
    Function Units, with 2 caveats:

    a) because the FU-Regs and FU-FU Dependency
        Matrices have to track all read and all
        write registers associated with the FU(s)
        that they manage, you have to have the
        superset of read and write regs used

     b) you have now introduced the need for FU-aware
         fan-in at the Reservation Stations:
         the RSes will need to know "does this computation
         go to RS1 that i manage or does it go to RS2"

a good example of (3) that Mitch Alsup gives in his book
chapters Scoreboard Mechanics is that of LD/ST units
also acting as ADD Function Units.

every LD/ST *has* to have an adder to compute the
Effective Address.  therefore, it is perfectly reasonable
to have idle LD/ST units do double-duty to perform
64-bit ADD!

this is a perfect example of being able to improve performance
"for free" because the adder is there, you might as well make
use of it.  with the register profiles matching exactly and even
the types of registers matching exactly, there is no additional
cost: no increase in the size of the RSes, no increase in the
size of the Dependency Matrices.

however in more complex examples with zero shared functionality
but resource-pressure on the number of Reservation Stations,
the next highest priority will be to match up Function Units with
similar Register Profiles.

[why do you think i went to all the trouble of creating pipe_data.py
PipeSpecs, with register profiles describing the in/out registers
for every Function Unit?]

if you fail to match up the register profiles, then the size of
each Dependency Matrix Cell goes up significantly.  this
could be Extremely Bad.

so for example here is the pipespec for input to ShiftRot:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/shift_rot/pipe_data.py;hb=HEAD

  12     @property
  13     def regspec(self):
  14         return [('INT', 'ra', self.intrange),  # RA
  15                ('INT', 'rb', self.intrange),  # RB/immediate
  16                ('INT', 'rc', self.intrange),  # RB/immediate
  17                ('XER', 'xer_so', '32'), # XER bit 32: SO
  18                ('XER', 'xer_ca', '34,45')] # XER bit 34/45: CA/CA32

and here is the one for Logical:

  13     @property
  14     def regspec(self):
  15         return [('INT', 'ra', self.intrange),  # RA
  16                ('INT', 'rb', self.intrange),  # RB/immediate
  17                ('XER', 'xer_so', '32'),    # bit0: so
  18                ]

and ALU:

  11     @property
  12     def regspec(self):
  13         return [('INT', 'ra', self.intrange),  # RA
  14                ('INT', 'rb', self.intrange),  # RB/immediate
  15                ('XER', 'xer_so', '32'),  # XER bit 32: SO
  16                ('XER', 'xer_ca', '34,45')]  # XER bit 34/45: CA/CA32

so then it would make sense to share Dependency Rows for
Logical and ALU, and it would even be fine to share with ShiftRot:
ShiftRot is the "superset" of inputs (bear in mind you have to
also analyse the outputs)

but it would make absolutely no sense WHATSOEVER to merge with
CRs
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/cr/pipe_data.py;hb=HEAD

  10     regspec = [('INT', 'ra', '0:63'),      # 64 bit range
  11                ('INT', 'rb', '0:63'),      # 64 bit range
  12                ('CR', 'full_cr', '0:31'), # 32 bit range
  13                ('CR', 'cr_a', '0:3'),     # 4 bit range
  14                ('CR', 'cr_b', '0:3'),     # 4 bit range
  15                ('CR', 'cr_c', '0:3')]     # 4 bit: for CR_OP partial update

if you were to merge CRs with ShiftRot, it would mean ***NINE***
entries in the Dependency Matrix Read side for each cell!

that's just the read side!  if you include the write-side it would end
up with something mad like 20-entry Dependency-tracking.

bottom line you have to be extra- extra- careful here about matching
up which RSes perform shared-tracking for which FU-types.

>for example, lets say the algorithm takes 10 cycles,

which will be the same regardless of whether a pipeline
or an FSM is used...

> and that every
>pipe/FSM
>needs 2 extra FUs in order to stay 100% utilized,

if we didn't spend enough resources on operand-forwarding,
or got the timing wrong, or a-whole-bunch-of-other-reasons,
yes, 2 extra FUs would be a bare minimum reasonable
addition.

bear in mind though that by *only* having that bare minimum,
you may run into stalling on run-ahead [speculative] execution

> and that there are 20
>other
>FUs in the whole cpu:
>pipeline:
>12 FUs -> pipeline
>
>12 -> 1 mux on input
>1 -> 12 mux on output
>(getting big)

yyep.  especially if that's 6 input registers, 3 of which are 64-bit.
plus operands, it could easily be 250+ bits on the fan-in.

times 12.

bear in mind that is **INDEPENDENT** of the *Register Port*
read-fan-in and write-fan-out!  those are ADDITIONAL
Muxes on the other side of the PriorityPickers that protect
each Register File Port.

do not confuse the fan-in/fan-out Muxes on the *OTHER*
side of the ReservationStation2 (leading to the register files)
with those that connect *TO* the Function Unit.

it goes like this:

   RegFileRead -> PriorityPicker ->
   ReservationStationIn -> FunctionUnit ->
   ReservationStationOut -> PriorityPicker ->
   RegFileWrite

where:

* PriorityPicker on RegRead protects the fan-out read "Broadcast Bus"
   (known in Tomasulo as a CDB - Common Data Bus) for
   a regfile port (e.g. named RA or RB or CR_A) that connects
   *ALL* Function Units with an "RA" register to that one
   RegRead Port designated "RA" *ONLY*.

   the degenerate case here is when there is only one FU,
   and only one RegRead port for that one register, in which
   case the PriorityPicker is a pass-thru and the fan-out
   becomes "a bunch of wires".

   this is very rare, but does happen (SPR or TRAP pipeline
   i think. maybe MMU)

* the ReservationStationIn has a COMPLETELY SEPARATE
   fan-in bus into the FunctionUnit, where if the FunctionUnit
   happens to be a repeated FSM block, that fan-in is just
   "a bunch of wires" (no gate logic, no muxes, at all)

you get the idea.  repeat for RS-out, repeat for RegFileWrite.

>something -> 32 mux to fill all cpu FUs from reg reads/operand
>forwarding
>(pretty big)

not necessarily, this is where you've misunderstood.  the regfile
ports get their own completely separate and independent PriorityPicker,
and there are 6 (six) separate and distinct register file types
(INT FAST_SPR SLOW_SPR CR STATE XER) soon to be 7 (seven)
when adding an FPU.

there are therefore six *independent* fan-in/fan-outs, and *TWELVE*
PriorityPickers (1-read, 1-write) because there is one each per
regfile, and *TWELVE* separate Broadcast-Buses (Common Data
Buses).

there will therefore need to be SIX operand-forwarding buses
although to be honest given the sheer numbers here i'm leaning
towards just having the operand-forwarding be an internal
characteristic of the Register File.  one for another time, that one.

when you run test_issuer.py, the information that you ignore,
jacob, will show you a print-out of the register-to-fu allocations
that are handled and created by core.py [not-so-subtle hint:
don't ignore it! :) ]

it does *NOT* attempt to Mux down every single output from
every single FU onto one single Common Data Bus, that makes
absolutely no sense because they're going to completely different
regfiles.

so i organised core.py to connect up to the appropriate regfiles.

now.

here's where there's a "problem".

if we do not have enough regfile ports (only a 1R1W rather than
say a 4R1W) then you run into the difficulty that *every single INT
regfile has to mux down to that one port*!

this is actually an option in issuer_verilog.py (reduce_regs).

what happens is that allll those RA, RB, RC Reg-Reads get merged
onto the one absolutely massive PriorityPicker, and a correspondingly
absolutely massive Common Data Bus gets created!

so we have 10 pipelines (in TestIssuer), ShiftRot has 3 INT reg reads,
MUL has 2, DIV has 2, ALU has 2, Logical has 2, SPR has 1, CR has
1, we're up to a 13-way Broadcast Bus and a 13-way PriorityPicker
already!

obviously this is a situation we would like to avoid, by having regfiles
that properly match up with what is needed.  4R1W would allow
us to have 4 *separate* PriorityPickers, and to start distributing
the load amongst them.

but even that might not be enough, at which point it becomes
sane / sensible to "stripe" the regfile into *multiple* 4R1W
"banks", just to get the size of those Common Data Buses
down.

[*this is why i have been trying to get through to you*, Jacob, for
some considerable time, that you cannot just arbitrarily state
"why not just reorganise the regfiles to be 128-bit, it'll be easy",
it damn well won't, because as you can see, the above is deeply,
deeply comprehensive and took a staggeringly-long time to work
out, i'm holding *ALL* of the above in my head, and so know
keenly well that "a simple request to do 128-bit" results in
throwing away 18+ months design work]

>10 FSMs (for same throughput):
>3 FUs -> 10-step FSM
>3 FUs -> 10-step FSM
>3 FUs -> 10-step FSM
>...
>3 FUs -> 10-step FSM

i don't know where you're getting "3 FUs -> 10-step FSM" from.
you might be referring to "3 registers".  or just be getting
very confused about what an FU is and what an FSM is.
an FSM *is* a type of FU (Function Unit).  you might
mean 3 RSes, that would make sense, in which case
see below

remember: you MUST repeat MUST repeat MUST allocate
EXACTLY one FU per "operation that must be tracked".

if there are 10 FSMs, then there is technically no difference
between the pipeline and the FSM case on the *Regfile*
side: the path "Regfile -> PriorityPicker -> ReservationStation"
remains *exactly* the same.

in the FSM case however - and this is the bit i was explaining
at the end of comment 42
https://bugs.libre-soc.org/show_bug.cgi?id=782#c42

in the FSM case, there are 10 RSes and 10 FSMs, each RS
wires **DIRECTLY** to each FSM, there **IS** no Fan-in
(or Fan-out) Mux.

thus you have SAVED gates, not added them.  given that
some of those are literally a 250-wide data path, it's not
something that can be ignored or considered "trivial enough
to be ignored".

>30 FUs
>
>10x 3 -> 1 muxes on input
>10x 1 -> 3 muxes on output (meh, tiny, who cares)

this exact same trick can be deployed on pipelines as it
can be on FSMs, because it is a trick that can be deployed
on *Function Units*, and both pipelines and FSMs are FUs.

but you need to think in terms of the *Reservation Stations*.

any given *Reservation Stations* (after having their Register Profiles
merged in a sane way) can "front" to multiple Function Units.

an RS could perfectly well perform a mux-in / mux-out to a
ShiftRot pipeline *and* also to a DIV FSM if you really really
wanted to, it really makes no odds.  the only thing to watch,
as i said, being the number of wires of the operands+operation
which could easily be 200+

[btw this is why i introduced Satellite PowerDecoders, to
 *re-de-code* the 32-bit operation *locally*, so as to avoid
  having to broadcast a massive 200+ bit set of wires to
  every single Function Unit, through massive muxes.
  32 wires is a hell of a lot less wires to mux than 200+]

>something bigger -> 50 mux to fill all cpu FUs from reg reads/operand
>forwarding
>(woah, huge!! aah, run!)

see above about the difference between the Regfile side and the
FU side, they are completely separate, and there is absolutely no
difference whatsoever.

bottom line is that it is the *time* (number of cycles) that matters.
whether it is a FSM or whether a pipeline makes no difference to
the total number of RSes needed, the only difference is the balance
of resources for the fan-in-fan-outs vs the size of the (many) FSMs
against the (single, or double, or greater) pipeline.

or, as hinted above, a mix-of-the-two: the FSMs have (one or more,
as "grouped") partial-stage pipelines.  the trick there will be to
ensure that the stages have far, far less registers as input and
output, so as to massively reduce the fan-in, fan-out.

DIV's "pre" stage would be a classic example of where *not* to
add such a trick, because the data paths have been expanded
out to something mad like 350+ wires (if including the operand
data as well).

however the "post" stage, which by that time (hopefully, assuming
here for the sake of illustration) only contains the 64-bit result, and
consequently it *would* make a good balance to have the
[assumed-complex] post-processing be shared across multiple
FSMs by a [assumed-complex] post-processing pipeline.

l.