[Libre-soc-bugs] [Bug 751] idea for reducing dependency matrixes in 6600-derived architecture with register renaming

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Thu Dec 2 22:08:39 GMT 2021


https://bugs.libre-soc.org/show_bug.cgi?id=751

--- Comment #5 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #3)
> This idea is intended for a cpu where all micro-ops only write to one
> register each...

that's six separate Function Units in some cases for the Power ISA.
Load/Store would become five Function Units.  ShiftRot would be
three.  Condition Register CRops would become three.

remember that if you *don't* allocate enough FUs, the only option
is to stall.  so although the CR0 FU could be shared between different
FUs, there has to be enough to hold the entire in-flight Reservations
expected.

a large high-end (3+ GHz) multi-issue (8-issue) system normally has
a THOUSAND instructions in-flight at any one time.

you're talking about splitting each one up into between three and *six* operations,
which would be six **THOUSAND** Function Units with in-flight Reservations.
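
as a back-of-envelope check of that multiplication (a sketch only: the
figures are the ones quoted above, and the function name is made up for
illustration):

```python
# back-of-envelope only: the figures are the ones quoted in this comment
IN_FLIGHT = 1000             # instructions in flight (high-end 8-issue core)
SPLIT_MIN, SPLIT_MAX = 3, 6  # single-write micro-ops per instruction

def reservations_needed(in_flight, split):
    """one in-flight Reservation (one FU) per single-write micro-op"""
    return in_flight * split

low = reservations_needed(IN_FLIGHT, SPLIT_MIN)
high = reservations_needed(IN_FLIGHT, SPLIT_MAX)
print(f"{low} to {high} Function Units with in-flight Reservations")
```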

> e.g. add. would have an output field for RT and for CR0.

and another for XER.SO
and another for XER.CA
and another for XER.OV

that's five, not two.

yes, some instructions will not set XER.CA, or not set XER.OV, or
not set XER.SO: this is determined by the output itself, by the
pipeline itself.

the Reservation unfortunately still has to be made because the
Function Unit *might* write.
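
a minimal sketch of why the Reservation is needed up-front.  this is a toy
python model, not code from the soc repository: issue() and complete() are
hypothetical names, and the ok flags stand in for the per-output "ok"
signals that the pipeline produces.

```python
# every register an ALU op *might* write (list taken from this comment)
ALU_DESTS = ("RT", "CR0", "XER.SO", "XER.CA", "XER.OV")

def issue(possible_dests):
    """at issue time the result is unknown, so every dest the FU
    *might* write has to be reserved."""
    return set(possible_dests)

def complete(reserved, ok_flags):
    """only once the pipeline has produced its outputs do we learn which
    reservations turn into real writes (ok set) and which are dropped."""
    written = {d for d in reserved if ok_flags.get(d, False)}
    dropped = reserved - written
    return written, dropped

reserved = issue(ALU_DESTS)
# e.g. an "add." that turns out to write only RT and CR0
written, dropped = complete(reserved, {"RT": True, "CR0": True})
print(sorted(written), sorted(dropped))
```

all five reservations exist from issue until completion, even though
only two of them end up as writes.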

soc/fu/alu/output_stage.py:

        comb += oe.eq(op.oe.oe & op.oe.ok)
        with m.If(oe):
            # XXX see https://bugs.libre-soc.org/show_bug.cgi?id=319#c5
            comb += xer_so_o.data.eq(xer_so_i[0] | xer_ov_i[0]) # SO
            comb += xer_so_o.ok.eq(1)

this logic, tiny as it is, would need to move to an entirely separate
Function Unit.  the subsequent lines would move to another separate
Function Unit:

            comb += xer_ov_o.data.eq(xer_ov_i)
            comb += xer_ov_o.ok.eq(1) # OV/32 is to be set

that's for every ALU that has XER.SO/OV/CA, and there are several.
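
to show how tightly the two outputs are coupled, here is a plain-python
model of just the quoted lines.  it is a sketch, not the nmigen source:
alu_xer_outputs is a made-up name, the bit layout (OV in bit 0, OV32 in
bit 1 of xer_ov_i) is an assumption, and so is the "ok defaults to 0"
behaviour when oe is not set.

```python
def alu_xer_outputs(oe, oe_ok, xer_so_i, xer_ov_i):
    """returns ((so_data, so_ok), (ov_data, ov_ok)).
    assumption: xer_so_i is 1 bit, xer_ov_i is 2 bits (OV, OV32)."""
    enable = bool(oe and oe_ok)
    if not enable:
        # assumption: neither output is flagged as written
        return (0, 0), (0, 0)
    # SO is sticky: old SO OR'd with the new OV bit
    so_data = (xer_so_i & 1) | (xer_ov_i & 1)
    # OV and OV32 are passed straight through
    ov_data = xer_ov_i & 0b11
    return (so_data, 1), (ov_data, 1)

print(alu_xer_outputs(1, 1, 0, 0b01))
```

both outputs hang off the same oe condition and the same xer_ov_i input,
so splitting them means sending those inputs to two Function Units.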

i think you'll find that this results in an alarmingly-high number
of Reservation Stations and consequently absolutely massive Dependency
Matrices.

(In reply to Jacob Lifshay from comment #4)

> umm, can't a FU simultaneously depend on the outputs of several other FUs?
> you seem to have forgotten this...

through the FU-FU Matrix, yes.  and it's multi-read as well as multi-write
capable:

https://libre-soc.org/3d_gpu/fu_dep_cell_multi_6600.jpg

here's the source code:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/scoremulti/fu_fu_matrix.py;hb=HEAD

it outputs readable_o and writeable_o on a per-FU basis: readable_o is
true if there are no remaining write hazards, and writeable_o is true if
there are no remaining read hazards.
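
a toy behavioural model of those two outputs, to show one FU waiting on
several others at once.  this is NOT the code in fu_fu_matrix.py: the
class, its method names and the row/column convention are all assumptions
made for illustration.

```python
class FUFUModel:
    """toy FU-FU dependency model (hypothetical, for illustration only)"""
    def __init__(self, n_fus):
        # wr_pend[i][j]: FU i still waits on a write that FU j has pending
        self.wr_pend = [[False] * n_fus for _ in range(n_fus)]
        # rd_pend[i][j]: FU i still waits on a read that FU j has pending
        self.rd_pend = [[False] * n_fus for _ in range(n_fus)]

    def issue(self, fu, wr_deps=(), rd_deps=()):
        for other in wr_deps:
            self.wr_pend[fu][other] = True
        for other in rd_deps:
            self.rd_pend[fu][other] = True

    def go_write(self, fu):
        # FU has written: nobody waits on its write any more
        for row in self.wr_pend:
            row[fu] = False

    def go_read(self, fu):
        # FU has read: nobody waits on its read any more
        for row in self.rd_pend:
            row[fu] = False

    def readable(self, fu):
        return not any(self.wr_pend[fu])   # no remaining write hazards

    def writeable(self, fu):
        return not any(self.rd_pend[fu])   # no remaining read hazards

m = FUFUModel(3)
m.issue(2, wr_deps=[0, 1], rd_deps=[1])  # FU2 depends on FU0 *and* FU1
before = (m.readable(2), m.writeable(2))
m.go_write(0)
m.go_read(1)
middle = (m.readable(2), m.writeable(2))
m.go_write(1)
after = (m.readable(2), m.writeable(2))
print(before, middle, after)
```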

* read-hazards are tracked per src operand in the FU. for ALU that is:
  - RA
  - RB
  - CR0 (due to writing the SO bit)
  - XER.SO
  - XER.CA
  - XER.OV
* write-hazards are tracked per dest operand in the FU.  for ALU that is:
  - RT
  - CR0 (due to writing the SO bit)
  - XER.SO
  - XER.CA
  - XER.OV

the latches go HI at Issue time and remain HI until the Great-Big-Or-Gate
for Read-Reg-Deps and Write-Reg-Deps of the corresponding FU-REGs Row says
that all Read-deps are cleared or all Write-deps are cleared...
*on a per-port* basis.

you can see from the diagram that on the READ side there is an
OR gate per FU-FU-cell.  only when every source register latch in the
FU-FU-cell is cleared will the WRITE-WAIT signal go HI indicating that
this FU no longer blocks any other FUs from WRITING.

likewise for the WRITE side there is a corresponding OR gate to create
a READ-WAIT signal which only goes HI when all dest latches
(RT, CR0, XER.SO, XER.CA, XER.OV) go LOW, indicating that this FU no
longer blocks any other FUs from READING.
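
the per-port latch plus OR-gate behaviour can be sketched like this.  it
is a simplified python model under assumptions: PortLatches is a made-up
class, and it tracks only "is any latch still HI", not the exact polarity
of the WAIT signals in the diagram.

```python
ALU_SRCS = ("RA", "RB", "CR0", "XER.SO", "XER.CA", "XER.OV")
ALU_DESTS = ("RT", "CR0", "XER.SO", "XER.CA", "XER.OV")

class PortLatches:
    """one latch per port, with an OR gate across all of them"""
    def __init__(self, ports):
        self.latch = {p: False for p in ports}

    def issue(self):
        # all latches go HI at Issue time
        for p in self.latch:
            self.latch[p] = True

    def clear(self, port):
        # cleared individually, on a per-port basis
        self.latch[port] = False

    def pending(self):
        # the OR gate: HI while any single port is outstanding
        return any(self.latch.values())

src = PortLatches(ALU_SRCS)
src.issue()
for port in ALU_SRCS[:-1]:
    src.clear(port)
still_blocking = src.pending()   # XER.OV latch is still HI
src.clear("XER.OV")
released = not src.pending()     # every src latch is now cleared
print(still_blocking, released)
```

the same class works for the dest side by constructing it with ALU_DESTS.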

reducing that down to one write per FU *does not* make the need to
actually track that write go away.  all it does is: move that need to
somewhere else.

so where at the moment the FU-FU DM can track N write-dependencies
per FU, you are talking about having N times more FUs, each with only
single write-dep tracking (read-src tracking is still required).

also it does not take away the need for the READ side tracking
which must still be duplicated across all those FUs.

and given that FU-FU is an O(N^2) resource the effect on gate count
could be catastrophically high (several million gates)
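
a rough sketch of how that scales.  the numbers are illustrative only:
BASE_FUS is a hypothetical FU count, and "one latch per tracked operand
per FU-FU cell" is an assumption, not a measurement of the real matrix.

```python
N_SRC, N_DEST = 6, 5   # ALU src / dest operand counts listed earlier

def fu_fu_cost(n_fus, src_per_fu, dest_per_fu):
    """cells grow as N^2; each cell is assumed to hold one latch
    per tracked src and per tracked dest operand."""
    cells = n_fus * n_fus
    latches = cells * (src_per_fu + dest_per_fu)
    return cells, latches

BASE_FUS = 20   # hypothetical number of multi-write FUs

# today: each FU tracks all of its src and dest operands
multi = fu_fu_cost(BASE_FUS, N_SRC, N_DEST)
# proposal: one FU per dest, src tracking duplicated in each
single = fu_fu_cost(BASE_FUS * N_DEST, N_SRC, 1)

print("multi-write :", multi)
print("single-write:", single)
```

splitting by 5 multiplies the number of cells by 25, and the per-cell
saving on dest latches nowhere near makes up for it.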

there's something else that i can't quite put my finger on that's making
me... nervous / twitchy.  it could just be the numbers involved (the number
of RSes).  it took four *months* for me to implement Mitch Alsup's 2nd
book chapter idea, and when we went over it we found that the idea of
replacing the FU-FU Matrix with a bitvector was flawed.  if even Mitch
could have got it wrong, and it took that long to find out, then altering
such a critical low-level algorithm makes me quite nervous, mainly because
of the amount of time it takes to properly evaluate these things.
