[Libre-soc-isa] [Bug 552] single-predication has "splat" capability, needs review

Wed Dec 23 22:41:00 GMT 2020

https://bugs.libre-soc.org/show_bug.cgi?id=552

--- Comment #5 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #4)

> wouldn't it work to have the scalar op just have a whole pile of dest regs
> in the dependency matrix, and the data path can just use all 4 reg-file
> write buses enabled simultaneously, allowing 4 writes per clock cycle?

the DMs are so insanely large that i wanted to cut large holes in them by not
having any lane-crossing entries.  this allows:

* every modulo 4 DM group to effectively have its own mini DM (4 of them: one
for regs whose number modulo 4 is 0, and separate ones for 1, 2 and 3)

* the top regfile numbers become 4 separate batches of 4R1W.  not an insane
12R10W.
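
a rough sketch of the partitioning idea above, in python.  it is not Libre-SOC
code: `lane_of`, `is_lane_local`, `dm_cells` and `partitioned_dm_cells` are
made-up names, the 128-register / 32-function-unit figures are purely
illustrative, and it assumes the function units are split evenly across the 4
lanes.

```python
LANES = 4  # modulo-4 grouping of register numbers (assumption: 4 lanes)


def lane_of(reg):
    """Lane (mini-DM group) a register number falls into."""
    return reg % LANES


def is_lane_local(dest, srcs):
    """True if every source register sits in the same lane as the dest."""
    return all(lane_of(r) == lane_of(dest) for r in srcs)


def dm_cells(n_regs, n_fus):
    """Cell count of one flat register-by-function-unit matrix."""
    return n_regs * n_fus


def partitioned_dm_cells(n_regs, n_fus):
    """Cell count when the matrix is cut into LANES independent mini DMs.

    Assumes regs and function units divide evenly between the lanes.
    """
    assert n_regs % LANES == 0 and n_fus % LANES == 0
    return LANES * dm_cells(n_regs // LANES, n_fus // LANES)


# illustrative numbers only, not taken from any real design
flat = dm_cells(128, 32)
split = partitioned_dm_cells(128, 32)
```

with those made-up numbers `flat` is 4096 cells and `split` is 1024: the
"large holes" are every cell that would have crossed a lane.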

writing to multiple destinations is therefore nowhere near as easy as it sounds.
"just" write to multiple destinations, when the output from MultiCompUnit is a
single result?

clearly this does not work.

what *would* work is:

* under micro-coding the result is written into the first element
* each subsequent micro-coded operation is a *mv* operation, using the
"Whopping Great Shift Register FSM", the one wot has 12 incoming and 12
outgoing registers.

here, that can broadcast-splat the value across multiple lanes.
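
a minimal sketch of that micro-coding sequence, again in python and again
hypothetical: `expand_splat` and `run` are invented for illustration,
micro-ops are plain tuples, and only `add` and `mv` are modelled.  the real
MultiCompUnit and shift-register FSM are hardware, not function calls.

```python
def expand_splat(op, dests, srcs):
    """Expand one scalar op with several destinations into micro-ops.

    The first micro-op computes the result into dests[0]; every other
    destination gets a "mv" that copies from dests[0].
    """
    if not dests:
        return []
    first = dests[0]
    uops = [(op, first, tuple(srcs))]
    uops += [("mv", d, (first,)) for d in dests[1:]]
    return uops


def run(uops, regs):
    """Tiny interpreter for the two micro-ops modelled here."""
    for op, dest, srcs in uops:
        if op == "add":
            regs[dest] = regs[srcs[0]] + regs[srcs[1]]
        elif op == "mv":
            regs[dest] = regs[srcs[0]]
        else:
            raise ValueError(f"unmodelled micro-op: {op}")
    return regs


# r1 + r2, splatted into r8..r11 (one register in each modulo-4 lane)
regs = {r: 0 for r in range(16)}
regs[1], regs[2] = 5, 7
uops = expand_splat("add", [8, 9, 10, 11], [1, 2])
run(uops, regs)
```

the ALU result is produced exactly once; the other three writes are copies,
which is where the lane crossing (and its latency) ends up.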

> It
> doesn't matter if we push the scalar op through the scalar ALU for as many
> clock cycles as needed, we don't have to have the scalar alu be used just
> once.
> 
> All I wanted to avoid is the scalar ALU having 1 op per element, taking 4x
> more cycles than needed.

lane crossing is always going to be a pig.

the choices are:

* insane regfile porting
* insane crossbar routing
* cyclic shift registers with latency
* single bus with go-get-a-coffee latency
* s*** out of luck
