[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

Fri Oct 23 21:28:49 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #88 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
i keep thinking of small things here and need to record them :)

the issues with not breaking predicates down into "one per element" are:

1) the mask must be read (as a scalar) by a specially created Function Unit
that, like the Branch Unit, has a Shadow Matrix row that pulls die/pass for
each unit depending on each bit being 0 or 1 respectively.

2) for each in-flight instruction that we want predicated *there must also be a
corresponding predicate Function Unit*

3) the distribution of those bits to SIMD units gets particularly hairy.

contrast this with the situation where predicate bits come from a register *on
a per element basis*. (CRs happen to already exist in PowerISA and consequently
are a good match here)

1) predicated vector instructions may be issued element by element, each with:

* the source register element
* the dest register element **AND**
* the predicate bit register element

these may all be issued **DIRECTLY** to a Function Unit **WITHOUT** requiring
an intermediary Predicate Unit in the way whose role is to split out bits of a
larger register.

in other words a "non-predicated" scalar operation is one where by default the
predicate source is implicitly hardwired to an immediate "1" indicating "always
do this operation".

this is pretty trivial
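
a minimal sketch of the per-element issue described in 1), in python purely
for illustration.  the names `ElementOp`, `issue_elements` and `issue_scalar`
are hypothetical (not taken from the Libre-SOC source), as is the convention
that `pred=None` stands for "predicate hardwired to 1" and that element i
simply uses register base+i.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ElementOp:
    """one element-level operation handed directly to a Function Unit."""
    src: int             # source register number for this element
    dest: int            # destination register number for this element
    pred: Optional[int]  # predicate CR field number, None = hardwired "1"


def issue_elements(src_base: int, dest_base: int, vl: int,
                   pred_base: Optional[int] = None) -> List[ElementOp]:
    """expand one vector instruction into VL element operations.

    assumption: element i uses registers base+i and predicate CR field
    pred_base+i.  no intermediary Predicate Unit is modelled: each
    element op carries its own predicate source.
    """
    ops = []
    for i in range(vl):
        pred = None if pred_base is None else pred_base + i
        ops.append(ElementOp(src_base + i, dest_base + i, pred))
    return ops


def issue_scalar(src: int, dest: int) -> ElementOp:
    """a non-predicated scalar op: predicate implicitly hardwired to 1."""
    return ElementOp(src, dest, None)
```

the point of the sketch is only that scalar and predicated-vector operations
end up with the same shape, differing in where the predicate comes from.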

2) where SIMD is involved things are a little trickier but also reasonably
practical.

look at where we have had to add CR "full" read ports.  rather than force the
CR Pipeline to do 8 separate CR reads or writes (mtcr, mfcr) the *entire*
CR 0-7 is read/written via a special 32-bit-wide regfile port.

on detecting the situation where SIMD needs to be deployed the "full" CR port
may be read, giving 32 bits containing 8 CRs.

a) for 8x8bit SIMD all 8 CRs can be thrown at a single 64 bit SIMD FU.

one 32bit CR read will go along with one 64 bit source reg read, it is just
that VL jumps by 8 elements at a time.
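
a rough python sketch of case a): one "full" 32-bit CR read turned into an
8-bit lane mask for an 8x8bit SIMD FU.  `cr_fields` and `byte_lane_mask` are
hypothetical helper names.  which bit of each CR field acts as the predicate
is not stated here, so the choice of EQ as the default is purely an
assumption, as is the mapping "lane n comes from CRn".

```python
def cr_fields(cr32: int) -> list:
    """split a 32-bit CR value into eight 4-bit fields, CR0 first.

    CR0 occupies the most significant 4 bits of the 32-bit CR.
    """
    return [(cr32 >> (28 - 4 * i)) & 0xF for i in range(8)]


def byte_lane_mask(cr32: int, bit: int = 1) -> int:
    """build an 8-bit lane mask for an 8x8bit SIMD FU from one CR read.

    assumption: one bit per CR field is the predicate.  with a field
    written as (LT, GT, EQ, SO) from MSB to LSB, bit=1 selects EQ.
    lane n of the returned mask comes from CRn.
    """
    mask = 0
    for lane, field in enumerate(cr_fields(cr32)):
        mask |= ((field >> bit) & 1) << lane
    return mask
```

for example a CR value with EQ set only in CR0 and CR7 would enable lanes 0
and 7 of the 64 bit operation and leave the other six untouched.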

b) for 4x16bit SIMD this is slightly hairy in that the 1st 4 CRs (CR0-CR3) need
to be thrown at one FU (element n) and the 2nd 4 CRs (CR4-7) at another FU
(element n+1)

here VL will be jumping by 4 each time, and although the exact same 32bit CR
read is a Read Hazard for odd/even FUs we *may* be able to reduce the number of
regfile reads by "broadcasting" the read to 2 simultaneous CompUnits.
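
case b) sketched in python, assuming the same 32-bit CR value is broadcast to
two FUs and each one selects its own half.  `split_for_4x16` is a hypothetical
name, and this models only the data selection, not the Read Hazard tracking.

```python
def split_for_4x16(cr32: int):
    """return (CR fields for FU n, CR fields for FU n+1) from one CR read.

    the single 32-bit read is "broadcast": the first FU takes CR0-CR3,
    the second FU takes CR4-CR7.  CR0 is the most significant 4 bits.
    """
    fields = [(cr32 >> (28 - 4 * i)) & 0xF for i in range(8)]
    return fields[:4], fields[4:]
```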

c) for 2x32bit it is potentially better to just allow 2 single-CR predicate
registers per FU

OR

d) we considered splitting FUs down into 32 bit anyway (HI32 reg, LO32 reg, HI
and LO *collaborate* to do 64 bit calculations) and under these circumstances a
32 bit predicated vector operation would have *2 CR predicates anyway*: one for
HI32, one for LO32

however in all circumstances it is critical to note that a "special Predicate
Function Unit" neither exists nor is needed.

in other words the Issue Unit can calculate simply based on elwidth, VL, and
the current Sub-PC value (0 to VL-1) exactly what Dependency Requests to send
on to the DMs.
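
to illustrate how regular that calculation is, here is a small python sketch.
`DepRequest` and `dep_requests` are hypothetical names, and the layout is an
assumption: 64-bit registers packed with 64 // elwidth elements, one predicate
CR field per element, everything numbered consecutively, and no bounds check
on how many CR fields exist.

```python
from typing import List, NamedTuple


class DepRequest(NamedTuple):
    sub_pc: int    # first element index covered by this request
    reg: int       # 64-bit register holding those elements
    cr_first: int  # first predicate CR field for those elements
    cr_count: int  # number of predicate CR fields (one per element)


def dep_requests(elwidth: int, vl: int, reg_base: int = 0,
                 cr_base: int = 0) -> List[DepRequest]:
    """enumerate the requests for one predicated vector instruction.

    depends only on elwidth, VL and the Sub-PC (element index), which
    steps by the number of elements that fit in one 64-bit register.
    """
    if elwidth not in (8, 16, 32, 64):
        raise ValueError("unsupported elwidth")
    per_reg = 64 // elwidth
    reqs = []
    for sub_pc in range(0, vl, per_reg):
        count = min(per_reg, vl - sub_pc)  # last register may be partial
        reqs.append(DepRequest(sub_pc, reg_base + sub_pc // per_reg,
                               cr_base + sub_pc, count))
    return reqs
```

with elwidth=16 and VL=6 this gives two requests: elements 0-3 with CRs 0-3,
then elements 4-5 with CRs 4-5.  nothing in it needs to look at the predicate
*values*, only at their register numbers.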

this simplicity and predictable regularisation will become critically important
when it comes to doing multi-issue, which requires that transitive DepMatrix
relationships be set up between instructions in the same issue batch.

note that the above is also possible when using an int register as a
predicate, however the caveats (design complexity disadvantages) from
comment #86 then apply.  they *do not apply* when using CRs.
