[Libre-soc-dev] WIP demo of deficiency of 6600-derived architecture compared to register renaming

Tue Oct 27 12:56:26 GMT 2020


On Tue, Oct 27, 2020 at 5:10 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> I think I found a performance deficiency where the 6600-derived
> architecture has a bottleneck in the speed it can write to the
> register file when one register is repeatedly written -- it's limited
> to 1 write per clock, yet register renaming (since the writes are to
> different registers due to renaming) can support more than 1 write per
> clock cycle.

ok: it is a common misconception that 6600-style architectures do not
have register renaming: they have *nameless* registers because of the
1:1 correspondence between the CompUnit and its incoming (and, in our
case, outgoing) register latches.

therefore if you want more "renaming" you *increase the
number of CompUnits*.  note that "Concurrent CompUnits" does not mean
"increasing the number of pipelines".  see the ALU diagram on p38 of
Mitch's book, section 11.4.9.3, and also the LD/ST one around p43,
section 11.4.11.2.

note in the ALU one that there are *four* "issue, src1, src2, result"
and *four* go_rd and go_write signals, that happen to be funnelled
through the same 4-stage pipeline.  in the LD/ST case, see Mitch's
remark that there is no LD/ST pipeline: there is just a heck of a lot
of concurrency.

the important thing to note for ALUs, as i've documented: you have to
have at least as many front-end RSes as there are cycles of "delay"
through the pipeline that the RSes are fronting for... *excluding*
instruction decode and issue because that's nothing to do with the
pipeline.

so if the ALU pipeline latency is 5 cycles, you *must* have at least 5
front-end RSes.  you can have more than that: you just mustn't have
fewer.
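
to make the "RSes >= pipeline latency" rule concrete, here is a minimal
sketch in python.  it is a toy model, not the Libre-SOC code: the
function name and the assumptions (an RS is held from issue until
`latency` cycles later, all ops are independent, one issue per cycle)
are mine.

```python
def sustained_issue_rate(num_rs, latency, cycles=1000):
    """Fraction of cycles in which one independent op can be issued.

    Assumption (not from the original post): an RS is held from the
    cycle its op issues until `latency` cycles later, and at most one
    op is issued per cycle.
    """
    free_at = [0] * num_rs          # cycle at which each RS is free again
    issued = 0
    for cycle in range(cycles):
        for rs in range(num_rs):
            if free_at[rs] <= cycle:            # this RS is free: use it
                free_at[rs] = cycle + latency
                issued += 1
                break                           # single issue per cycle
    return issued / cycles


for n in (4, 5, 6):
    print(n, "RSes, latency 5:", sustained_issue_rate(n, 5))
```

under those assumptions 4 RSes in front of a 5-cycle pipeline sustain
only 0.8 issues per cycle, whereas 5 or 6 RSes sustain 1.0: extra RSes
do no harm, too few stall issue.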

> This is assuming that the 6600-derived architecture can
> forward values without needing to go through the register file
> (forwarding buses), otherwise it is waay slower.

yes.  the original CDC 6600's register file was "write on the falling
edge, read on the rising" and consequently could act as a forwarding
bus, which was why separate forwarding buses weren't added.  it also had
an insane number of register file ports (B was 5R2W or something mad)

we do happen to have per-port operand forwarding on the regfiles: any
write that comes in on any port will end up being "forwarded" out if
there happens to be a simultaneous read.
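
as a behavioural illustration of that write-through forwarding, here
is a minimal python sketch.  the class and method names are
hypothetical and this is not the actual regfile HDL: it only shows
that a read in the same cycle as a write to the same register sees the
new value.

```python
class ForwardingRegFile:
    """Toy register file with same-cycle write-to-read forwarding."""

    def __init__(self, nregs=32):
        self.regs = [0] * nregs

    def cycle(self, writes, reads):
        """Run one clock cycle.

        writes: dict of {regnum: value} arriving on the write ports
        reads:  list of regnums requested on the read ports
        Returns the list of values seen on the read ports.
        """
        # a simultaneous write is forwarded straight to the read port,
        # otherwise the stored value is returned
        out = [writes.get(r, self.regs[r]) for r in reads]
        for r, v in writes.items():     # then commit the writes
            self.regs[r] = v
        return out


rf = ForwardingRegFile()
print(rf.cycle({9: 123}, [9, 3]))   # r9 is written and read in one cycle
print(rf.cycle({}, [9]))            # next cycle: value comes from storage
```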

> I spent several hours coming up with a demo to show this, but ended up
> running out of time today.

yep - it takes time.  you just have to be very patient.

> The partially finished demo:
> https://libre-soc.org/3d_gpu/architecture/compared_to_register_renaming/

oof, i had to expand that to 1/4 screen, 1920x1080 :)

ok so the example you've laid out, i can pretty much deduce straight
away that you've only allocated one RS for the ALU (add) and... err..
i'm trying to work out if you've allocated 1 or 2 LDST CompUnits...
it might be 2 because you allow 1st LD to overlap with 2nd ST.  this
is only possible (determined at issue time) if there are enough RSes.

> Also, we should find a better markdown renderer, the above page looks
> waay better using VSCode's builtin markdown preview.

true... well, it's so large that i'm looking at the text file, thank
you for laying it out in ASCII, spaced out neatly.

really, a much better tool would be gtkwave, here.  although i don't
know any gtkwave editors.

ok. so there are 4 instructions in the loop:

.L2:
    ldu r9, 8(r3)
    addi r9, r9, 100
    std r9, 0(r3)
    bdnz .L2

* let us assume each instruction takes 4 cycles (or 5 if you prefer)
* let us also assume that there are 4 Reservation Stations on the
  front of the ADD pipeline, for a 4-way Concurrent CompUnit
* likewise let us assume that there are 4 LD/ST CompUnits.

this gives us 4x "nameless" (aka "renamed") registers for ADD and 4x for LD/ST

from there it *should* be obvious that - using the "renamed"
nomenclature so i don't have to refer to "the 1st use of r9, the 2nd
use of r9" etc:

* h2 will end up in the LDST's RS #1
* ADD's RS #1 will create a Read-after-Write hazard on h2 and
  ADD's output RS #1 will store h4
* the std's h4 will end up as another Read-after-Write hazard on the
  LDST's RS #2
* (there will be a CTR-related hazard on the branch-compare, i'm
  ignoring that for now)

next loop:

* h5 will end up in the LDST's RS #3 **AND THIS IS HOW THE 6600
RENAMING OCCURS**
* ADD's RS #2 ... h5
    ..... h7
* ... h7  ....... LDST's RS #4
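
the bookkeeping in the two lists above can be sketched in a few lines
of python.  this is only an illustration: the round-robin allocation
policy and all names are my assumptions, timing is ignored entirely
(no check that an RS is actually free), and the branch/CTR hazard is
left out as above.

```python
from itertools import count

# one loop iteration: (mnemonic, unit type); bdnz/CTR deliberately ignored
LOOP = [("ldu", "LDST"), ("addi", "ADD"), ("std", "LDST")]


def allocate(loops, num_rs=4):
    """Return [(loop, mnemonic, unit, rs_number)], RSes numbered from 1."""
    next_rs = {"LDST": count(), "ADD": count()}
    trace = []
    for loop in range(1, loops + 1):
        for mnemonic, unit in LOOP:
            # hypothetical round-robin choice of RS within each unit type
            rs = next(next_rs[unit]) % num_rs + 1
            trace.append((loop, mnemonic, unit, rs))
    return trace


for loop, mnemonic, unit, rs in allocate(2):
    print(f"loop {loop}: {mnemonic:4} -> {unit} RS #{rs}")
```

each dynamic use of r9 lands in a different RS latch (LDST #1 and #2
in the first loop, LDST #3 and #4 in the second), which is the
"nameless" renaming.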

and, fortunately, from the first loop (assuming 4 cycle completion)
the 1st LD/ST retires *just* before we start the 3rd loop.
interestingly, so does the ADD, which means that loop 3 can actually
re-use ADD RS #1.

now, if there was a *5* clock latency on each instruction then things
might actually stretch out to the point where the 5 LD/ST "nameless"
renaming latches (RSes) are not enough.  at that point we may have to
extend the number of LD/ST RSes to 6 or 8.

but, not the number of ADD RSes.

this is because we issue to LDSTs twice per loop (one for LD, one for
ST) and consequently have to have twice the number of RSes.
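
a rough way to see the 2:1 ratio is Little's law (RSes in use = issue
rate x time each RS is held).  the sketch below is an assumption-laden
estimate, not a sizing rule from this thread: the function name and
the 8-cycle hold time are hypothetical, and the same hold time is
assumed for both unit types.

```python
from math import ceil


def rs_needed(issues_per_loop, cycles_held, cycles_per_loop):
    """Little's law: RSes in use = issue rate x cycles each RS is held."""
    return ceil(issues_per_loop * cycles_held / cycles_per_loop)


CYCLES_PER_LOOP = 4   # 4 instructions, assuming one issue per cycle
CYCLES_HELD = 8       # hypothetical issue-to-retire time of one RS

print("ADD  RSes:", rs_needed(1, CYCLES_HELD, CYCLES_PER_LOOP))
print("LDST RSes:", rs_needed(2, CYCLES_HELD, CYCLES_PER_LOOP))
```

the absolute numbers depend entirely on the assumed hold time; the
only point is that two LD/ST issues per loop need twice the RSes of
one ADD issue per loop.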

note in turn that this increases both the number of rows in the
FU-REGs and also both the rows and columns of the FU-FU Dependency
Matrices.
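
to put rough numbers on that growth, here is a small python sketch.
it assumes one matrix row per RS/CompUnit, and a 32-entry register
file purely for illustration (the real matrices may cover more than
that); the function name is hypothetical.

```python
def matrix_cells(num_fus, num_regs=32):
    """Cell counts for the two Dependency Matrices.

    FU-REGs is (FUs x registers); FU-FU is (FUs x FUs).
    num_regs=32 is an illustrative assumption.
    """
    return {"FU-REGs": num_fus * num_regs, "FU-FU": num_fus * num_fus}


# 4 ADD RSes + 4 LD/ST units, versus 4 ADD RSes + 8 LD/ST units
for fus in (4 + 4, 4 + 8):
    print(fus, "FUs:", matrix_cells(fus))
```

going from 8 to 12 FUs grows FU-REGs linearly (256 to 384 cells) but
FU-FU quadratically (64 to 144 cells).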

consequently we do not want to have larger Matrices than are
*actually* required for given workloads, meaning we have a lot of
analysis to do similar to that which Mitch illustrates from section
11.3.8.1, p13.

l.