[Libre-soc-dev] effect of more decode pipe stages on hardware requirements for execution resources for OoO processors

lkcl luke.leighton at gmail.com
Wed Feb 16 18:08:17 GMT 2022

On Wed, Feb 16, 2022 at 4:34 PM Jacob Lifshay <programmerjake at gmail.com> wrote:

> that depends on what you mean by pipeline depth...which pipeline(s)? how is it distributed? is it 2 cycles for every execution pipeline and 0 cycles in the fetch/decode pipeline? is it 1 cycle for every execution pipeline and 1 cycle in the fetch/decode pipeline? is it 2 cycles in the fetch/decode pipeline and 1 cycle in every execution pipeline?

let us say 1 for execution and 1 for decode, initially.

> that depends on what you mean by pipeline depth...which pipeline(s)? if it's the fetch/decode pipeline only that is increasing in depth, then exactly the same number of RSes are needed.

that is incorrect.  it is easy to show.  ok let us say that the
instructions are as follows:

* a Chain40 group
* an infinite sequence of NonChain instructions thereafter

and that the number of Reservation Stations is exactly 40. what
happens is as follows:

* all 40 of the Chain40 instructions are placed into the 40 Reservation Stations
* the remaining NonChain instructions cause an Issue Stall.

if the pipeline depth is 2, then 2 cycles later, one instruction will
drop out.  ONLY at that point can one of the NonChain instructions be
placed into the (now free) ReservationStation.  all other NonChain
instructions are still frozen.

two cycles later, the next Chain40 entry drops out.

slowly, every two cycles, the number of Chain40 entries that were
blocking up the RSes decreases, and the number of NonChain instructions
that can be dropped into RSes on each clock cycle correspondingly
increases.

now let us set the number of ReservationStations to only 20.

* only half of the Chain40 instructions are placed into the 20 RSes.
* the other half - and the NonChain instructions after it - are all
stalled at issue.

this is clearly a bad situation because none of the NonChain
instructions can "get in there".  only once just over half of the
Chain40 have cleared can the NonChain instructions even begin to be
dropped into RSes and executed.

even if there exists a FIFO decode buffer at decode phase it will soon
run out.  [if it does not run out, then increase Chain40 to Chain80 to
*make* it run out]
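
the two depth-2 scenarios above can be sketched as a toy cycle model.
this is only an illustration under stated assumptions, not a model of
any real core: the function name and parameters are hypothetical, issue
is in-order with unlimited width per cycle, each Chain link needs
pipeline_depth cycles after its predecessor completes, and an RS is
held until its instruction completes.  absolute cycle numbers depend on
these counting conventions.

```python
def nonchain_entry_cycles(num_rs, chain_len, n_nonchain, pipeline_depth):
    """Cycle at which each NonChain instruction obtains an RS (toy model)."""
    if num_rs < 1 or pipeline_depth < 1:
        raise ValueError("need at least one RS and a pipeline depth of at least 1")
    total = chain_len + n_nonchain
    entered = []          # cycle each instruction got its RS, in program order
    released = []         # cycle each instruction gives its RS back
    prev_chain_done = 0   # completion cycle of the previous chain link
    cycle = 0
    while len(entered) < total:
        busy = sum(1 for r in released if r > cycle)
        # in-order issue: fill every free RS with the next instructions in order
        for _ in range(num_rs - busy):
            if len(entered) == total:
                break
            is_chain = len(entered) < chain_len
            # a chain link cannot start before its predecessor has finished
            start = max(cycle, prev_chain_done) if is_chain else cycle
            done = start + pipeline_depth
            if is_chain:
                prev_chain_done = done
            entered.append(cycle)
            released.append(done)
        cycle += 1
    return entered[chain_len:]


# 40 RSes, Chain40, depth 2: the first NonChain gets in once one link drops out
print(nonchain_entry_cycles(40, 40, 3, 2))   # [2, 4, 4]
# 20 RSes: NonChain waits until just over half of the chain has cleared
print(nonchain_entry_cycles(20, 40, 1, 2))   # [42]
```

in this sketch, halving the RSes moves the first NonChain entry from
cycle 2 to cycle 42, i.e. after 21 of the 40 chain links have completed.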

now let us set the number of RSes back to 40, and set the decode
length to 9 and execution to 1

* after 1 cycle there are 40 Chain40 instructions in RSes
* however only after another 9 cycles can execution begin
* during those 9 cycles, issue is stalled because there is no space in the RSes.
  (you can drop them into a FIFO if you like but it doesn't help)

after 11 cycles the first of the Chain40 instructions finally gets
executed, freeing up one RS

the first of the NonChain instructions may now - and only now - either:
* be dropped into the free RS (from the FIFO you mentioned) *OR*
* begin the 9-long decode process which was previously stalled at issue phase

but the key point here is that compared to the case where the pipeline
depth was 2, there are at *least* an ADDITIONAL TEN CYCLES OF DELAY
where NonChain instructions cannot get into RSes and therefore cannot
begin execution.

it should be obvious that this can be "fixed" by increasing the number
of RSes from 40 to 50, at which point 10 NonChain instructions can
immediately begin execution WITHOUT being blocked by the Chain40
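
as back-of-envelope arithmetic, the 40 -> 50 figure can be reproduced
as follows.  this is a rough sketch only: the function name is
hypothetical, and the rate of one NonChain instruction arriving per
cycle is an assumption that is not stated above.

```python
def rs_needed(chain_len, blocked_cycles, nonchain_per_cycle=1):
    """RSes needed so NonChain instructions are not stalled behind the chain.

    chain_len          : RSes occupied by the dependent chain
    blocked_cycles     : extra cycles during which the chain frees no RS
    nonchain_per_cycle : assumed arrival rate of NonChain instructions
    """
    return chain_len + blocked_cycles * nonchain_per_cycle


print(rs_needed(40, 10))  # 50
```

a higher arrival rate, or a longer blocked window, raises the figure
proportionally.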

and THAT IS THE POINT.  the length of the pipelines *has* to be taken
into consideration as a factor to prevent blockage of NonChain
instructions by ChainNN instructions.

now, if the register information cannot be decoded within the 1st
phase of decode, that's a worse situation.

where i think you might be right is if the decode phase is say 9 long
and the register information for determining Hazards is only available
at the 9th (last) phase.  at that point, the decode-FIFO is
effectively completely decoupled from issue *and* hazard-reservation
*and* RS reservation and there is nothing that can be done about that.

if however there is sufficient information early on in the decoding to
make the required Reservations, *then* you had better have enough RSes
to cover from that point onwards, and *then* the length of the
subsequent pipeline phases matters.
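
the distinction in the last two paragraphs can be put as a small
calculation.  a hedged sketch: the function name is hypothetical, decode
stages are numbered from 1, and the RS is assumed to be reserved on
entry to reserve_stage and held until execution completes.

```python
def rs_hold_cycles(decode_depth, reserve_stage, exec_depth):
    """Cycles an RS stays reserved, given where in decode it is reserved."""
    if not 1 <= reserve_stage <= decode_depth:
        raise ValueError("reserve_stage must lie within the decode pipeline")
    remaining_decode = decode_depth - reserve_stage + 1
    return remaining_decode + exec_depth


print(rs_hold_cycles(1, 1, 1))  # 2: 1 decode + 1 execute
print(rs_hold_cycles(9, 9, 1))  # 2: reserved at the last decode stage
print(rs_hold_cycles(9, 1, 1))  # 10: reserved at the first decode stage
```

under these assumptions, a 9-long decode costs no extra RS hold time
when reservation only happens at the last stage, but costs 8 extra
cycles per RS when reservation happens at the first stage.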

