[Libre-soc-isa] [Bug 552] single-predication has "splat" capability, needs review
bugzilla-daemon at libre-soc.org
Wed Dec 23 22:41:00 GMT 2020
https://bugs.libre-soc.org/show_bug.cgi?id=552
--- Comment #5 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #4)
> wouldn't it work to have the scalar op just have a whole pile of dest regs
> in the dependency matrix, and the data path can just use all 4 reg-file
> write buses enabled simultaneously, allowing 4 writes per clock cycle?
the DMs (Dependency Matrices) are so insanely large that i wanted to cut large
holes in them by not having any lane-crossing entries. this allows:
* every modulo-4 register group to effectively have its own mini DM (4 of them:
one for register numbers where regnum modulo 4 is 0, and separate ones for 1,
2 and 3)
* the top regfile numbers to become 4 separate batches of 4R1W, not an insane
12R10W.
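a rough sketch of the modulo-4 partitioning, in plain python. the register and
Function Unit counts (NUM_REGS, NUM_FUS) and the cell-count formula are
illustrative assumptions, not figures from the actual design; the function
names are made up for this example.

```python
# toy model only: NUM_REGS, NUM_FUS and the cell-count formula are
# illustrative assumptions, not figures from the Libre-SOC design.
NUM_LANES = 4
NUM_REGS = 128
NUM_FUS = 32

def lane_of(regnum):
    """lane (mini-DM index) that a register number belongs to."""
    return regnum % NUM_LANES

def crosses_lanes(src_regs, dest_reg):
    """True if any source sits in a different lane from the destination."""
    return any(lane_of(r) != lane_of(dest_reg) for r in src_regs)

def full_dm_cells(num_fus=NUM_FUS, num_regs=NUM_REGS):
    """one monolithic FU-by-register matrix."""
    return num_fus * num_regs

def partitioned_dm_cells(num_fus=NUM_FUS, num_regs=NUM_REGS, lanes=NUM_LANES):
    """one mini DM per lane, each seeing only its own FUs and registers."""
    return lanes * (num_fus // lanes) * (num_regs // lanes)
```

with these made-up numbers the monolithic matrix has 4096 cells and the four
mini DMs have 1024 between them. an op such as crosses_lanes([4, 9], 12) is
exactly the kind of entry that the partitioned scheme has no place for.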
writing to multiple destinations is therefore nowhere near as easy as it sounds.
"just" write to multiple destinations, when the output from MultiCompUnit is a
single result?
clearly this does not work.
what *would* work is:
* under micro-coding the result is written into the first element
* subsequent micro-coded operations are *mv* operations, using the "Whopping
Great Shift Register FSM", the one wot has 12 incoming and 12 outgoing
registers.
that FSM can then broadcast-splat the value across multiple lanes.
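the micro-coded sequence could be sketched roughly as below. this is only a
model of the issue order, assuming a hypothetical (opcode, dest, srcs) tuple
format and a made-up splat_microops helper; it says nothing about how the
shift register FSM actually moves the data.

```python
def splat_microops(op, srcs, dest_base, vl):
    """expand one scalar op with a splat destination into micro-ops.

    hypothetical format: each micro-op is (opcode, dest_reg, src_regs).
    the scalar op runs once, writing the first element; every other
    element is then filled by a mv from that first element.
    """
    if vl < 1:
        return []
    # step 1: compute the single result into the first destination element
    uops = [(op, dest_base, tuple(srcs))]
    # step 2: copy that result into each remaining destination element
    for i in range(1, vl):
        uops.append(("mv", dest_base + i, (dest_base,)))
    return uops
```

so splat_microops("add", [3, 4], 8, 4) gives one add into r8 followed by three
mv operations into r9, r10 and r11: the ALU is used once, not once per element.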
> It
> doesn't matter if we push the scalar op through the scalar ALU for as many
> clock cycles as needed, we don't have to have the scalar alu be used just
> once.
>
> All I wanted to avoid is the scalar ALU having 1 op per element, taking 4x
> more cycles than needed.
lane crossing is always going to be a pig.
the choices are:
* insane regfile porting
* insane crossbar routing
* cyclic shift registers with latency
* single bus with go-get-a-coffee latency
* s*** out of luck
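to put a number on the "cyclic shift registers with latency" option, here is a
toy model of a one-directional ring of 4 lanes. the ring direction, the lane
count and the function names are assumptions made for this sketch only.

```python
def ring_latency(src_lane, dst_lane, num_lanes=4):
    """shift steps needed to move a value from src_lane to dst_lane
    around a one-directional ring (assumed: one step per clock)."""
    return (dst_lane - src_lane) % num_lanes

def rotate(lanes, steps=1):
    """rotate the per-lane values by `steps` positions, so that the
    value in lane i ends up in lane (i + steps) modulo len(lanes)."""
    steps %= len(lanes)
    return lanes[-steps:] + lanes[:-steps]
```

in this model a value going from lane 0 to lane 3 needs 3 steps, and one
going from lane 3 back to lane 0 needs 1: cheap on wiring, paid for in cycles.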