[Libre-soc-dev] ternaryi, FUs, ALUs, and core.py YouTube video

Tue Nov 23 05:42:55 GMT 2021

On Sat, Nov 20, 2021, 06:49 bugzilla-daemon--- via libre-soc-bugs <
libre-soc-bugs at lists.libre-riscv.org> wrote:
> https://bugs.libre-soc.org/show_bug.cgi?id=745
>
> --- Comment #16 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
> i put together an explanatory video for you, i recommend viewing it
> and core.py https://youtu.be/7Th1b-jq40k

I finally got around to watching it.

First, the reason I wanted the bitmanip pipeline to be separate from the
shift pipeline is that it would let us support more instruction-level
parallelism in the later OoO superscalar cpu. With that out of the way,
thoughts on the video:

I think we should try to merge two ALUs and have one set of FUs for the
merged ALUs. That would be done by inserting a 1-input to 2-output pipeline
demux in front of the ALUs and a 2-input to 1-output pipeline mux after the
ALUs, thereby forming a merged ALU that fits into the ReservationStation2
just like a regular ALU. The mux/demux stages would obviously need logic to
correctly handle ready/valid, which-ALU selection (on the demux at the
input), and priority/stall logic (on the mux at the output). IIRC, we
already have the classes for the mux/demux stages, I just can't remember
where exactly they are. All that would be required is building a generic
combined-ALU class that just wraps the other ALUs and calls the mux/demux
classes. It'd just be wiring together already-existing code, so it should
be pretty easy to build.
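
To make the demux/mux idea concrete, here is a rough behavioural sketch in
plain Python. It is not nmigen and not the existing mux/demux classes:
ToyPipelinedALU, its ops table, and every method name below are made up for
illustration, and selecting the ALU by "first one that implements the
opcode" plus a fixed-priority output mux are assumptions, not decisions.

```python
class ToyPipelinedALU:
    """Hypothetical stand-in for a pipelined ALU (not a real nmigen class).

    ops maps an opcode string to a Python function; stages[0] is the input
    stage and stages[-1] is the output stage.
    """

    def __init__(self, ops, n_stages):
        self.ops = ops
        self.stages = [None] * n_stages

    def can_accept(self):
        # "ready" on the input side: the first stage must be empty
        return self.stages[0] is None

    def push(self, tag, opcode, a, b):
        assert self.can_accept()
        self.stages[0] = (tag, self.ops[opcode](a, b))

    def peek_output(self):
        # "valid" on the output side: None means nothing to deliver
        return self.stages[-1]

    def pop_output(self):
        out, self.stages[-1] = self.stages[-1], None
        return out

    def tick(self):
        # move results forward, back to front, so a full stage stalls
        # everything behind it
        for i in range(len(self.stages) - 1, 0, -1):
            if self.stages[i] is None:
                self.stages[i] = self.stages[i - 1]
                self.stages[i - 1] = None


class CombinedALU:
    """Wraps several ALUs so they look like one ALU from the outside."""

    def __init__(self, alus):
        self.alus = list(alus)

    def _select(self, opcode):
        # demux side: route to the first ALU that implements the opcode
        for alu in self.alus:
            if opcode in alu.ops:
                return alu
        raise ValueError(f"no ALU implements {opcode!r}")

    def can_accept(self, opcode):
        return self._select(opcode).can_accept()

    def push(self, tag, opcode, a, b):
        self._select(opcode).push(tag, opcode, a, b)

    def pop_output(self):
        # mux side: fixed priority, lowest index wins; an ALU that loses
        # keeps its result in its last stage and stalls
        for alu in self.alus:
            if alu.peek_output() is not None:
                return alu.pop_output()
        return None

    def tick(self):
        for alu in self.alus:
            alu.tick()


# toy stand-ins for the bitmanip and shiftrot pipelines
bitmanip = ToyPipelinedALU({"and": lambda a, b: a & b}, n_stages=1)
shiftrot = ToyPipelinedALU({"shl": lambda a, b: a << b}, n_stages=2)
alu = CombinedALU([bitmanip, shiftrot])
```

The tag passed to push() stands in for whichever FU issued the operation,
so the result coming out of the mux can be routed back to it. Unmerging is
then just a matter of constructing the wrapped ALUs on their own again.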

Assuming we get the combined-ALU idea to work, I think it might be better
to just use that to build the merged bitmanip/shift ALU (just by calling:
alu = CombinedALU([BitmanipALU(...), ShiftRotALU(...)])
), rather than manually merging them, since that way it'll be trivial to
unmerge them later when we need the extra parallelism for the OoO
superscalar cpu.

If you think the combined-ALU idea shouldn't be added to our critical path,
I can just start on adding the bitmanip instructions to the shiftrot and
condition pipes instead.

Later, for the OoO superscalar cpu, I think we should build an equivalent
to ReservationStation2, except that it can have multiple ALUs and dispatch
multiple instructions to the ALUs simultaneously (superscalar, rather than
making everything bottleneck on dispatching 1 instruction per clock for the
whole ReservationStation2). That should help us achieve high parallelism
without needing absurd quantities of FUs. If it has three three-stage ALUs,
and FUs take 2 cycles before they can run an instruction again, then we can
keep any one ALU 100% full with 5 FUs (3 stages + 2 cycles) shared between
the 3 ALUs, and all 3 ALUs full with 15 FUs, or pick some intermediate
number of FUs. 15 is the same count as if the ALUs were totally separate,
but shared FUs are still better: when some FUs are stuck waiting on
something, the ALUs can still run using the other non-stuck FUs, so we get
more parallelism.
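
The FU counts above can be checked with a few lines of Python. This is an
idealised steady-state model, not a simulation: it assumes each FU is tied
up for the ALU's stage count plus its turnaround cycles per instruction,
and it ignores stalls entirely. The function and parameter names are made
up for the example.

```python
def fus_per_alu(alu_stages, fu_turnaround):
    # one FU is busy for the whole trip through the pipeline, plus the
    # cycles it needs before it can issue again (assumed model)
    return alu_stages + fu_turnaround


def steady_state_ipc(n_fus, n_alus, alu_stages, fu_turnaround):
    # instructions per cycle across all ALUs: limited either by the number
    # of ALUs or by how often the pool of FUs can issue
    return min(n_alus, n_fus / fus_per_alu(alu_stages, fu_turnaround))


print(fus_per_alu(3, 2))              # FUs to keep one ALU full
print(steady_state_ipc(5, 3, 3, 2))   # 5 shared FUs
print(steady_state_ipc(10, 3, 3, 2))  # an intermediate number of FUs
print(steady_state_ipc(15, 3, 3, 2))  # enough FUs for all 3 ALUs
```

Under these assumptions, adding FUs beyond 15 buys nothing, since the three
ALUs are already accepting one instruction each per cycle.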

Jacob