[Libre-soc-dev] ternaryi, FUs, ALUs, and core.py YouTube video

Tue Nov 23 12:52:22 GMT 2021

On Tue, Nov 23, 2021 at 5:43 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Sat, Nov 20, 2021, 06:49 bugzilla-daemon--- via libre-soc-bugs <
> libre-soc-bugs at lists.libre-riscv.org> wrote:
> > https://bugs.libre-soc.org/show_bug.cgi?id=745
> >
> > --- Comment #16 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
> > i put together an explanatory video for you, i recommend viewing it
> > and core.py https://youtu.be/7Th1b-jq40k
>
> I finally got around to watching it.
>
> first, the reason I wanted the bitmanip pipeline to be separate from the
> shift pipeline was because we could support more instruction-level
> parallelism in the later OoO superscalar cpu. That out of the way, thoughts
> on the video:
>
> I think we should try to merge two ALUs and have one set of FUs for the
> merged ALUs, that would be done by inserting a 1 input to 2 output pipeline
> demux in front of the ALUs and a 2 input 1 output pipeline mux after the
> ALUs, thereby forming a merged ALU,

it's much more complicated than that: i've been thinking along the same
lines since first creating regspecs.

> If you think the combined-ALU idea shouldn't be added to our critical path,
> I can just start on adding the bitmanip instructions to the shiftrot and
> condition pipes instead.

yes, basically.  there's a hell of a lot involved, affecting some of the most
complex areas (data-structure-wise) of the design: core.py

> without needing absurd quantities of FUs, cuz if it has three three-stage
> ALUs, and FUs take 2 cycles to run an instruction again, then we can keep
> any one ALU 100% full with 5 FUs shared between the 3 ALUs, and all 3 ALUs
> full with 15 FUs (same as if they were totally separate, though shared FUs
> are still better cuz we get more parallelism when FUs are stuck waiting on
> something cuz the ALUs can still run using the other non-stuck FUs), or
> some intermediate number of FUs.

yes, basically.
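
the arithmetic in the quoted paragraph can be written down as a tiny sketch. the assumption (my reading of the numbers above, not something taken from core.py) is that an FU stays occupied for the pipeline depth plus its turnaround time, so that many FUs are needed to feed one ALU a new instruction every cycle. the function and parameter names are hypothetical.

```python
def fus_to_saturate(pipeline_stages, fu_turnaround_cycles, num_alus=1):
    """FUs needed so that num_alus ALUs each get one instruction per cycle.

    Assumption: an FU is busy for pipeline_stages + fu_turnaround_cycles
    cycles per instruction, so that many FUs cover one ALU.
    """
    per_alu = pipeline_stages + fu_turnaround_cycles
    return per_alu * num_alus


def alus_kept_full(num_fus, pipeline_stages, fu_turnaround_cycles, num_alus):
    """How many ALUs a shared pool of num_fus FUs can keep 100% busy."""
    per_alu = pipeline_stages + fu_turnaround_cycles
    return min(num_alus, num_fus // per_alu)


# numbers from the quoted text: three-stage ALUs, 2-cycle FU turnaround
one_alu = fus_to_saturate(3, 2)                 # 5 FUs for one ALU
three_alus = fus_to_saturate(3, 2, num_alus=3)  # 15 FUs for all three
intermediate = alus_kept_full(10, 3, 2, 3)      # 10 shared FUs: 2 ALUs full
```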

flexibility here is paramount, because the "normal" way to do this is to do
a hell of a lot of low-level simulations and analysis, then commit massive
resources to laboriously hand-crafting vast and detailed arrays of interconnect,
only to find (or be told by the primary Architect) that whoops, actually, the
original assessment was totally wrong (a new customer scenario has a new
workload which invalidates the entire regfile / regport / FU balance/allocation).

Mitch Alsup kindly went over this with me, once: he designed an architecture
which could do sustained 16-wide FMACs (16 FMACs per clock), with only 1R1W
regfiles, by striping them 4-at-a-time and having the FMAC pipeline be 3
stages.

corollary: if you *didn't* have that exact workload or had a 5-stage FMAC
pipeline, it all went to s***.

this is a typical allocation, from Mitch Alsup's book chapters, Scoreboard
Mechanics:

https://libre-soc.org/3d_gpu/integer_scoreboard_mitch.png

another trick that Mitch taught me about is that e.g. LDST has an ADD
unit, therefore, um, if you have 6 LDST units, then you have 6 additional
ADDers because Effective Address calculation is RA + RB.  you even
have RA+immediate so could do addi as well.

in our case, what we have is:

* one ALU with a certain regspec, e.g. RA RB CR0 as input
* another ALU with a similar but not identical regspec RA RB RC as input

and in addition, there is actual subset functionality where one ALU could
actually *really* do the job of another ALU (such as LDST being able to
do ADD, or in the Power ISA case, LDST can also do certain subsets of
ShiftRot as well - those that are byte-aligned).

this should be giving you that "uh-oh, it really isn't as simple as i initially
thought" feeling.

to explain:

for Reservation Stations at the moment, there is an extremely simple
"Does This RS Get This Instruction Given To It" logic test:

* is the CSV Function Unit (ALU, LOGICAL) equal to the ReservationStation
  Function Unit, YES or NO.

if YES, RS accepts the instruction.

dead simple, right?

here's the related code:

* line 274 of core.py: if member.value & fnunit
* lines 180, 188 etc. of compunits.py, for example:
     class ALUFunctionUnit(FunctionUnitBaseMulti):
         fnunit = Function.ALU  # <--- this gets tested in core.py line 274

URLs:
* https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/core.py;h=578ff655e1358960a56f158dcfde0d2de04c461e;hb=4c98cc88be5aba23807c6f4bf97e9de6ba13fd73#l274
* https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/compunits/compunits.py;h=be3d4e69c5806dcce1cf68642e0a514ac37249e2;hb=4c98cc88be5aba23807c6f4bf97e9de6ba13fd73#l180

so that's dead easy.  decode the instruction, get its function unit, AND
it with Function: if hit, engage RS.
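
as a plain-software illustration of that test (not the actual hardware description in core.py), here is a minimal sketch. the Function flag values and the ReservationStation class are hypothetical stand-ins: only the "AND the decoded function unit with the RS's fnunit" idea comes from the code linked above.

```python
from enum import IntFlag


class Function(IntFlag):
    # hypothetical stand-in: one bit per function unit, values illustrative
    NONE = 0
    ALU = 1 << 1
    LDST = 1 << 2
    SHIFT_ROT = 1 << 3
    LOGICAL = 1 << 4


class ReservationStation:
    # hypothetical software model of an RS, for illustration only
    def __init__(self, name, fnunit):
        self.name = name
        self.fnunit = fnunit

    def accepts(self, decoded_fnunit):
        # the whole test: bitwise AND, non-zero means "hit"
        return bool(self.fnunit & decoded_fnunit)


alu_rs = ReservationStation("alu0", Function.ALU)
logical_rs = ReservationStation("logical0", Function.LOGICAL)

# an instruction whose CSV entry says "ALU" engages only the ALU RS
hits = [rs.name for rs in (alu_rs, logical_rs) if rs.accepts(Function.ALU)]
```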

now, what happens when the ReservationStation can only deal with
a SUBSET of an FU's operations?

you cannot possibly use that simple trick.  you have to do:

    if fnunit == ALU and decoded OP_XXXX in {some subset of functions}...

or, in the case of LDST, you either have to rewrite LDST so it copes
with carry-in and carry-out, or do much more detailed fine-grain
analysis of the instruction's operands in order to determine if the
LDST unit can cope with that type of ADD.
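
to make the difference concrete, here is a rough sketch of what the subset test might look like in plain Python. everything in it is an assumption for illustration: the capability table, the MicrOp members, and the handles_carry / needs_carry flags are hypothetical stand-ins, not the real decoder or core.py structures.

```python
from enum import Enum, IntFlag


class Function(IntFlag):
    # hypothetical stand-in, values illustrative
    ALU = 1 << 1
    LDST = 1 << 2


class MicrOp(Enum):
    # hypothetical stand-in for the decoded OP_XXXX values
    OP_ADD = 1
    OP_CMP = 2
    OP_LOAD = 3


ALL_OPS = None  # marker: the RS handles every op of that function unit


class ReservationStation:
    def __init__(self, name, capabilities, handles_carry=True):
        self.name = name
        # maps Function -> set of MicrOps it can do, or ALL_OPS
        self.capabilities = capabilities
        self.handles_carry = handles_carry

    def accepts(self, fnunit, op, needs_carry=False):
        if fnunit not in self.capabilities:
            return False
        allowed = self.capabilities[fnunit]
        if allowed is not ALL_OPS and op not in allowed:
            return False
        # the "fine-grain analysis" step: reject what the unit cannot cope with
        return self.handles_carry or not needs_carry


alu = ReservationStation("alu0", {Function.ALU: ALL_OPS})
# an LDST unit that can also do plain ADD, but has no carry-in/carry-out
ldst = ReservationStation(
    "ldst0",
    {Function.LDST: ALL_OPS, Function.ALU: {MicrOp.OP_ADD}},
    handles_carry=False,
)
```

the point is that the single AND becomes a per-RS lookup plus per-instruction checks, and all of that has to be expressed in hardware rather than in a dict.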

the second part is the merging of the Register Specifications (regspecs).
the code in core.py is already massively complex: i am only just keeping
up with it, by staring at it regularly for hours at a time.
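
for a feel of what "merging regspecs" means at its very simplest, here is a toy sketch using the two input regspecs listed earlier (RA RB CR0 and RA RB RC). the (regfile, name) tuple layout and the merge_regspecs helper are assumptions for illustration, and this version hides almost all of the real difficulty (port widths, outputs, and wiring each port to the right ALU).

```python
def merge_regspecs(*regspecs):
    """Ordered union of register ports across several (hypothetical) regspecs."""
    merged = []
    for spec in regspecs:
        for port in spec:
            if port not in merged:
                merged.append(port)
    return merged


# the two input regspecs from the list above, as (regfile, name) tuples
alu_a_inputs = [("INT", "ra"), ("INT", "rb"), ("CR", "cr0")]
alu_b_inputs = [("INT", "ra"), ("INT", "rb"), ("INT", "rc")]

merged_inputs = merge_regspecs(alu_a_inputs, alu_b_inputs)

# ports a shared FU would carry but that each ALU would leave unused
unused_by_a = [p for p in merged_inputs if p not in alu_a_inputs]
unused_by_b = [p for p in merged_inputs if p not in alu_b_inputs]
```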

merging of ALUs to create other (virtual) ALUs is just too much right now.

later: yes.

now: no.

l.