[Libre-soc-dev] avoiding huge combinatorial mux-messes

Thu Nov 5 01:10:02 GMT 2020

On 11/4/20, Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:

> Paul outlined that
> this has expanded to the point of

missing the button on the screen with fat fingers and hitting "send"
by mistake.  me, not Paul.

sorry :)

...to the point of interfering with synthesis tools, particularly now
that FP32 is in, and VSX is being developed (and developed as a
for-loop around the scalar ISA).

i went "hmmm" and thought it might be valuable to share some of the
design insights that went into LibreSOC.

the first thing is: we evaluated in-order and went, "no. not doing
it". every in-order microarchitecture has "stall" as the "solution",
and for every effort to optimise or work around the performance
penalty associated with that, i have not seen *one* good discussion
or justification.

for example it is "well-known" in the in-order world that you
absolutely do not do "early out" on pipelines, because the changes in
timing play merry hell with efforts to add "Operand Forwarding".

by contrast in an OoO design the dependency tracking goes "pffh, you
gave me an answer a bit early? yawwn" :)

so very early on we subdivided the *entire* ISA down into
"similar-looking register profiles" (which thanks to decode1.vhdl
after conversion to CSV files was very easy to do)

if you look at Mitch's diagram, on the left is how regfile writes are
wired up, and on the right is reads.

there are *12* completely separate and distinct Computation Units (in
the 6600 they are FSMs, think of them as pipelines). some are
duplicates which helped them get 3-4 times better performance than any
other computer on the planet at the time.

but look at A B and X, these are different regfiles.  think of them as
INT, CR and SPR for example.

the profiles are completely different for each pipeline.  LibreSOC's
MUL pipeline regfile profile is for example here:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/div/pipe_data.py;h=c8279f42ade1fa4b7932438d27cd8205db0a2dd3;hb=HEAD#l12

  regspec = [('INT', 'ra', '0:63'),  # RA
             ('INT', 'rb', '0:63'),  # RB/immediate
             ('XER', 'xer_so', '32'), ]  # XER

yes it is from DIV because it turns out that the read regfile spec for
MUL and DIV are exactly the same (excluding mac of course).

likewise for regfile writing.

whereas the ALU pipeline (OP_ADD and not very much else) has cr0 and
XER ov, ca and so, where yes, really, each of those is treated as a
completely separate register, the XER ones only 2 bits wide, and yes,
each will have its own Dep Matrix column!

likewise CRs are treated as individual 4 bit registers.

basically there is a data structure "Specification" for each thing
that needs connecting, exactly as outlined in Mitch's diagram.
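
as a minimal sketch of what such a "Specification" can drive (this is
not the LibreSOC code: the helper names regspec_width and
operand_latches are made up for illustration, and treating a bare '32'
as a single bit is an assumption), the latch widths fall straight out
of the tuples:

```python
# regspec tuples are (regfile, operand name, bit-range), as in the
# pipe_data.py excerpt quoted earlier in this mail.
MUL_READ_REGSPEC = [('INT', 'ra', '0:63'),
                    ('INT', 'rb', '0:63'),
                    ('XER', 'xer_so', '32')]


def regspec_width(bits):
    """Width in bits of a 'lo:hi' range.

    Assumption: a bare number such as '32' names one single bit.
    """
    if ':' in bits:
        lo, hi = (int(x) for x in bits.split(':'))
        return hi - lo + 1
    return 1


def operand_latches(regspec):
    """Map each operand name to the width of the latch it would need."""
    return {name: regspec_width(bits) for (_regfile, name, bits) in regspec}


print(operand_latches(MUL_READ_REGSPEC))
```

the point is only that everything downstream (latches, ports, buses)
can be derived from the list of tuples rather than written by hand.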

then comes the job of connecting them up :)

because we are using python we can analyse those specifications and
*on-the-fly generate* the HDL which wires up Pipelines to Regfiles
based on a particular connectivity strategy.

the takeaway from that is: rather than have hundreds of lines of HDL
which might not turn out to be a good strategy (after weeks of work)
we can change a few lines of python in a for-loop *and it generates
entirely new HDL* [as an Abstract Syntax Tree].

i have seen templates (python jinja, HTMLTMPL, zope, php *shudder*)
achieve this same effect but they are always hard to read (being
effectively 2 languages mixed together, one of which is the "template"
language).

the resultant code is not much better, being extremely dense
for-loops, but at least there are comments:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/core.py;hb=HEAD

the key functions are "get_byregfiles" and the corresponding connect
read/write ports.

here, from a dictionary of regfile specs, which are themselves lists
of tuples, we have, in get_byregfiles, turned things around so that
instead of knowing which pipelines use which regfiles, we now know
*which regfiles need to connect to which pipelines*.

and of course just as in Mitch's diagram, you need to know differently
for read and for write.

then having got this information, we now have a dictionary of
dictionaries of lists:

* 1st level by regfile name (INT, XER, CR, SPR, FAST)
* 2nd level by *operand* name (RA, RB, SO, CR0)
* 3rd level list of *pipelines*
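
the inversion can be sketched in a few lines of plain python.  this is
a simplified illustration, not the actual get_byregfiles: the pipeline
names, the example specs and the function name by_regfile are all
assumptions.

```python
from collections import defaultdict

# hypothetical per-pipeline read specs: pipeline name -> list of
# (regfile, operand, bit-range) tuples
READ_SPECS = {
    'alu': [('INT', 'ra', '0:63'), ('INT', 'rb', '0:63'),
            ('XER', 'xer_so', '32')],
    'mul': [('INT', 'ra', '0:63'), ('INT', 'rb', '0:63'),
            ('XER', 'xer_so', '32')],
    'cr':  [('INT', 'ra', '0:63'), ('CR', 'cr_a', '0:3')],
}


def by_regfile(specs):
    """Invert pipeline->regspec into regfile -> operand -> [pipelines]."""
    out = defaultdict(lambda: defaultdict(list))
    for pipeline, regspec in specs.items():
        for regfile, operand, _bits in regspec:
            out[regfile][operand].append(pipeline)
    # convert to plain dicts so that missing keys raise KeyError
    return {rf: dict(operands) for rf, operands in out.items()}


table = by_regfile(READ_SPECS)
print(table['INT']['ra'])   # every pipeline that needs RA from INT
```

the same function would be run once over the read specs and once over
the write specs.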

*now* we have the information necessary to create "Broadcast Buses" to
wire up every single pipeline that needs an "RA" Operand to be
directly wired up to INT Regfile Read Port #1.

etc. etc.

this ends up really rather similar to the original POWER RISC chip
which, as you kindly described to me, had those Broadcast Buses named
RA RB RC RS and RT.

now, here's the fascinating bit. actually 2.

1) if you look around line 335, search for the word "argh", you will
see some experimentation that i did to merge *all* INT regfile reads
onto one single INT regfile port.

this resulted in a whopping 17 way Broadcast Bus onto that one regfile
port.  i decided maybe that was a bit excessive, and went back to 4R
regfile ports.

[but note: it was a 5 line code change to DRASTICALLY alter the resultant HDL!]
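
to give a feel for why the change is so small, here is a hedged sketch
(not the core.py code: the operand table, the pipeline names and both
strategy functions are invented for illustration).  swapping one
strategy function for another changes how wide each Broadcast Bus is:

```python
# hypothetical: which pipelines read which INT operand
BY_OPERAND = {
    'ra': ['alu', 'mul', 'div', 'shift'],
    'rb': ['alu', 'mul', 'div', 'shift'],
    'rc': ['shift'],
}


def per_operand_ports(by_operand):
    """Strategy A: one regfile read port (and one bus) per operand name."""
    return {operand: list(pipes) for operand, pipes in by_operand.items()}


def single_port(by_operand):
    """Strategy B: merge every read onto one port, giving one big bus."""
    merged = [(operand, pipe)
              for operand, pipes in by_operand.items()
              for pipe in pipes]
    return {'port0': merged}


def fan_in(buses):
    """Number of requesters that each bus has to arbitrate between."""
    return {port: len(users) for port, users in buses.items()}


print(fan_in(per_operand_ports(BY_OPERAND)))
print(fan_in(single_port(BY_OPERAND)))
```

everything after the strategy function (arbitration, MUX generation)
stays the same, which is why the edit to the generator is tiny even
though the resulting HDL is not.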

2) choosing whether the regfile's access is unary or binary *we don't
care*!  the Register File cares, but the code which generates the HDL
doesn't give a monkeys.

each python base class used here "declares" itself as unary or binary
addressed and this is picked up by core.py to, again, generate
RADICALLY different HDL.

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;hb=HEAD

it is not quite that simple because unary to binary reg# conversion
needs a "map" function which is done here:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/decoder/power_regspec_map.py;hb=HEAD

one for read, one for write, and this gets used in the PowerDecoder2
(equivalent to decode2.vhdl) which starts to give you an idea of how
this all meshes together.
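
a minimal sketch of the "declares itself" idea, in plain python rather
than HDL.  the class names, the register counts and the reading of
"unary" as one-hot are assumptions made for the example, not the
regfiles.py API:

```python
class RegFileSketch:
    """Hypothetical base class: subclasses declare their addressing."""
    unary = False   # False: binary register number, True: one-hot

    def __init__(self, nregs):
        self.nregs = nregs

    def addr_width(self):
        """Number of address wires needed for this addressing style."""
        if self.unary:
            return self.nregs
        return max(1, (self.nregs - 1).bit_length())

    def encode(self, regnum):
        """Turn a register number into what goes on the address wires."""
        if not 0 <= regnum < self.nregs:
            raise ValueError("register number out of range")
        return (1 << regnum) if self.unary else regnum


class IntRegsSketch(RegFileSketch):
    unary = False

    def __init__(self):
        super().__init__(32)


class XerRegsSketch(RegFileSketch):
    unary = True

    def __init__(self):
        super().__init__(3)   # e.g. so, ca, ov as separate registers


for rf in (IntRegsSketch(), XerRegsSketch()):
    print(type(rf).__name__, rf.addr_width(), rf.encode(2))
```

the generator only ever calls addr_width and encode, so it never needs
to know which style a particular regfile picked.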

so that is how:

* we get away with being able to experiment with radically different
regfile layouts

* the MUXing (regfile Broadcast Buses) is isolated and *explicitly*
done using a python mapreduce tree technique

* the pipelines (or FSMs) become *dead simple* not in the slightest
bit caring about regfiles.
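
the mapreduce tree can be sketched with plain integers standing in for
signals.  this is an illustration only: the function names are made
up, and the assumption is that at most one requester is enabled at a
time, so ORing the gated values together acts as the MUX.

```python
from operator import or_


def treereduce(items, op):
    """Combine items pairwise, level by level, as a balanced tree."""
    items = list(items)
    if not items:
        raise ValueError("nothing to reduce")
    while len(items) > 1:
        nxt = [op(items[i], items[i + 1])
               for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:          # odd one out moves up a level
            nxt.append(items[-1])
        items = nxt
    return items[0]


def broadcast_select(requests):
    """requests: list of (enable, value) pairs, at most one enabled.

    map step: gate each value with its enable.
    reduce step: OR the gated values together in a tree.
    """
    gated = [value if enable else 0 for enable, value in requests]
    return treereduce(gated, or_)


print(broadcast_select([(0, 5), (1, 9), (0, 7)]))
```

a tree keeps the logic depth at roughly log2(N) levels for an N-way
bus, instead of a chain N deep.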

now, one piece of the puzzle i left out here but went over in the conf
call, Paul: Reservation Stations.

when i said in the last bullet above that the pipelines "do not care"
it is because there is:

a) operand latches and result latches created according to the
pipe_data.py regspecs mentioned right at the start

this "Management" job btw is done by something we call MultiCompUnit:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu_multi.py;hb=HEAD

b) *additional* latches called Reservation Stations which store
operands (and capture results) *and make sure those 2 do not get
separated*! :)

(some of these latches are "transparent" i.e. have a combinatorial
bypass on them so there are not 2 clocks of delay here on input and
another 2 on exit)

note that there is a MUX from RSes funneling into the pipeline, and a
corresponding fan-out from the end, to get the result back to its
associated RS.
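
one way to picture "do not get separated" is a tag that travels down
the pipeline with the data.  the sketch below is a toy model under
that assumption (run_pipeline and its arguments are hypothetical, and
the function is evaluated at pipeline entry purely to keep it short):

```python
from collections import deque


def run_pipeline(rs_operands, stages, fn):
    """Feed each RS's operands through a `stages`-deep pipeline.

    The RS index rides along with the data, so the fan-out at the
    end knows which RS gets each result.
    """
    pending = deque(enumerate(rs_operands))   # (rs_index, operands)
    pipe = deque([None] * stages)             # one slot per stage
    results = {}
    while pending or any(slot is not None for slot in pipe):
        done = pipe.pop()                     # item leaving last stage
        if done is not None:
            rs_index, value = done
            results[rs_index] = value         # fan-out back to its RS
        entry = None
        if pending:                           # fan-in: pick the next RS
            rs_index, operands = pending.popleft()
            entry = (rs_index, fn(*operands))
        pipe.appendleft(entry)
    return results


print(run_pipeline([(1, 2), (3, 4), (5, 6)], stages=3,
                   fn=lambda a, b: a + b))
```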

(aside note, critically important: if you have fewer RSes than
pipeline stages the pipeline can NEVER run 100% full.  ever).
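
a toy model makes the arithmetic concrete.  the assumptions here are
mine: one issue per cycle, and an RS stays busy from issue until its
result comes back `stages` cycles later.  under those assumptions the
utilisation works out as min(num_rs, stages) / stages:

```python
def utilisation(num_rs, stages, cycles=1000):
    """Fraction of cycles in which a new operation enters the pipeline."""
    free_at = [0] * num_rs      # cycle at which each RS is free again
    issued = 0
    for cycle in range(cycles):
        for i in range(num_rs):
            if free_at[i] <= cycle:
                # RS i issues now and is tied up until its result returns
                free_at[i] = cycle + stages
                issued += 1
                break           # at most one issue per cycle
    return issued / cycles


for n in (2, 3, 4, 5):
    print(n, "RSes, 4 stages:", utilisation(n, 4))
```

with 2 RSes on a 4-stage pipeline, half the stages are always empty,
and adding RSes beyond the pipeline depth buys nothing in this model.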

so to recap:

1) localised MUXes (fan-in, fan-out) are at the front and back of
every pipeline (and FSM)

2) there are a shed-load of different pipeline types. even TRAP
(exceptions, interrupts) is handled as a separate pipeline.  TRAP SPR
CR SHIFT MUL LOGICAL BRANCH DIV all completely different register
profiles.

3) after munging the regfile specs it is trivial (conceptually...) to
create dedicated Broadcast Buses (MUXes) per regfile port

and throughout all of this *at no time* does the actual implementation
of a given function (shift, mul, load-with-update) even know *or care*
that its two, or three, or (not kidding, *seven*) registers in some
cases are read / written in parallel via whatever-ports or
sequentially via 4R1W blah blah, because that's not their problem.

the summary, then, after all that (whew) Paul is this: abstract things
out to FSMs/pipelines with a common API which does not know or care
about the regfile allocation, having instead operand latches that
capture incoming srces and result latches that store outgoing results.
then, on top of that common API, have a "connector to regfiles system"
take care of wiring up the pipelines/FSMs to actual regfile ports
through those latches.  if you do, you *might* find that those messy
combinatorial blocks, now being localised and regularised, become
manageable or at least analysable.

or, because of the latches on the pipelines, in sane places, a lot
less difficult for the tools to manage.

note that the RSes trick is *not* critically dependent on implementing
an OoO engine.  it is convenient and helps manage things even for
in-order (and note here that we currently do not even have that! it is
currently an FSM that does not allow instructions to overlap, at all).

lastly: because of the stupidly large number of pipelines we had a
correspondingly absolutely massive fan-out of the PowerDecoder2 data
(decode2.vhdl).

up to 192 wires fan out to 12 pipelines!

i "fixed" that by creating 12 separate "subsetted" PowerDecoders
(surprise, called SubsetPowerDecoder) which (surprise) used the CSV
files (decode1.vhdl) to identify both

a) which rows were needed (by UNIT, e.g. LDST or ALU) to create the
hierarchy of switch statements (decode1.vhdl but now *per pipeline*)

b) which columns.  this creates *dynamic* Records which are identical
to decode2toexecute1type... except that fields not needed by the
target pipeline are *not included*.

consequently i only pass the 32 bit instruction (not 192 bits) and
there is a *local* subset Power Decoder right there at the front of
the pipeline.
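
the row-and-column subsetting can be sketched with the standard csv
module.  the CSV text, its column names and its rows below are made up
for the example (they are not the real decoder tables), and
subset_decoder_table is a hypothetical helper, not the
SubsetPowerDecoder API:

```python
import csv
import io

# made-up decoder table: one row per instruction
CSV_TEXT = """opcode,unit,internal_op,in1,in2,cr_out,ldst_len
266,ALU,OP_ADD,RA,RB,CR0,NONE
40,ALU,OP_ADD,RA,RB,CR0,NONE
23,LDST,OP_LOAD,RA_OR_ZERO,RB,NONE,is4B
235,MUL,OP_MUL_L64,RA,RB,CR0,NONE
"""


def subset_decoder_table(csv_text, unit, columns):
    """Keep only the rows for `unit` and only the requested columns."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{col: row[col] for col in columns}
            for row in rows if row['unit'] == unit]


# the ALU never needs ldst_len, so its Record simply omits the field
alu = subset_decoder_table(CSV_TEXT, 'ALU',
                           ['opcode', 'internal_op', 'in1', 'in2', 'cr_out'])
for row in alu:
    print(row)
```

each pipeline then gets a decoder built from its own small table, fed
by nothing more than the 32 bit instruction.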

but that is another story, which i appear to have already told, oh well :)

waah. 1am again.  enough.  good luck.

l.