[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed
bugzilla-daemon at libre-soc.org
Fri Oct 9 01:07:10 BST 2020
--- Comment #44 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #43)
> yes, but we *can* treat each bit or group of 2 or 4 bits like independent
> registers, avoiding the need for blocking exceptions during cmp and also
> allowing starting a masked operation before the cmp that produces the mask
> finishes executing -- the dependencies just operate on lanes rather than the
> whole mask.
bit tired (00:20 here), will go over the implications in a followup another time.
> The other benefit of using the integer registers for mask registers is that
> we can use all the weird and wonderful bitwise ops on them (popcount, find
> first set, or find last set) that aren't supported on CR registers
i remember now. that was one of the big advantages of SV-RV.
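as a rough illustration of the bitwise ops named in the quoted text (popcount, find first set, find last set), here is a minimal sketch that treats a plain python int as a predicate mask. the function names and the LSB-first lane numbering (bit 0 = element 0) are assumptions made for the example only; they are not taken from the SV spec or from any Power ISA instruction.

```python
def popcount(mask: int) -> int:
    """Count the active lanes (set bits) in a non-negative mask."""
    return bin(mask).count("1")


def find_first_set(mask: int) -> int:
    """Index of the lowest set bit, or -1 if the mask is empty."""
    # mask & -mask isolates the lowest set bit
    return (mask & -mask).bit_length() - 1


def find_last_set(mask: int) -> int:
    """Index of the highest set bit, or -1 if the mask is empty."""
    return mask.bit_length() - 1


mask = 0b0010_1100  # lanes 2, 3 and 5 active (assumed LSB-first numbering)
print(popcount(mask), find_first_set(mask), find_last_set(mask))
```

the point is only that once a mask lives in an ordinary integer, these are all cheap single operations on it.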
> as well
> as trivially having many more registers for storing multiple masks -- which
> is quite important on vectorized code with many different conditions that
> get combined together; we would have very high pressure on the CR registers
> due to only being able to fit a few masks in them since a 64-lane compare
> would overwrite all 4 bits in all registers.
arg. i really am not keen on pausing the execution of vector ops to read an
int reg. it's doable: a predicate "shadow" unit reads the int, and pulls "die"
or "release shadow" on its respective element once available.
that's pretty straightforward.
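a behavioural toy model of the "die" / "release shadow" idea above, assuming one predicate bit per element with bit 0 = element 0. `apply_under_shadow` is a hypothetical name; real hardware would do this per Function Unit with shadow signals, not in a loop.

```python
def apply_under_shadow(dest, results, predicate):
    """Commit speculative per-element results once the predicate int is read.

    dest:      current destination elements
    results:   values computed speculatively while the predicate was unknown
    predicate: integer whose bit i decides the fate of element i
    """
    out = list(dest)
    for i, value in enumerate(results):
        if (predicate >> i) & 1:
            out[i] = value  # "release shadow": the result may commit
        # otherwise "die": the speculative result is dropped, dest untouched
    return out


print(apply_under_shadow([0, 0, 0, 0], [10, 11, 12, 13], 0b0101))
```

the elements can start executing before the int reg is read; only the commit has to wait for it.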
writing individual bits out to an int reg from a batch of cmps is however a
very different matter.
micro-architecturally it would be better to extend the CRs to 128 (16x 64 bits)
other options: just expect people to reduce VL to saner sizes when issuing,
then use mfocrf, isel, something, anything to transfer the required CR bits to
an "int as predicate".
sort-of like the micro-op idea you mentioned a few comments back except using
an explicit instruction for it instead.
the micro-op route is how RISC gets turned into CISC if you're not careful.
with 128 CRs there is less pressure, plus bit-selection and transfer to INT GPRs
allows further processing (popcount), plus using it as a backup cache.
i need to mull this over.
> > if the answer to that is "yes", i didn't make a fuss about it but the OoO
> > scheduling for that is an absolute pig.
> if we split those 2 integer registers that you can write cmp results to (and
> not any others) into many small groups of bits that resolves the OoO
> scheduling concern.
no: it massively increases the size of the Dependency Matrices. we are already
at the "alarmingly large" size, to the point where we may have to do a PRF-ARF,
i.e. have register caches.
and when you have bit-divisions of a reg you need to simultaneously pull the
64 bit int DM column *and the bits as well*
but.. which bits? the real implications are that you need to have full bitlevel
DMs - as in 64x128 DM columns! times 20 FU rows!
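to put rough numbers on that, a back-of-envelope sketch. it assumes that "64x128" means 64 bits times 128 integer registers and that a Dependency Matrix needs one cell per (FU row, column) pair. both are readings of the text above, not figures from the actual design.

```python
BITS_PER_REG = 64   # assumed: width of one integer register
NUM_REGS = 128      # assumed: number of integer registers tracked
NUM_FU_ROWS = 20    # Function Unit rows, as stated above

# one column per whole register
reg_level_cells = NUM_REGS * NUM_FU_ROWS

# one column per individual bit of every register
bit_level_columns = BITS_PER_REG * NUM_REGS
bit_level_cells = bit_level_columns * NUM_FU_ROWS

print(reg_level_cells, bit_level_columns, bit_level_cells)
print(bit_level_cells // reg_level_cells)  # growth factor
```

under these assumptions, going bit-level multiplies the matrix size by the register width.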
which is where the PRF-ARF allocation comes in. intregs allocated as
predicates would be allocated the "64 bit-wise DM columns" (which is far too
many, so we'd need to drastically cut that back to e.g. 16 max)
now we need a register cache mapping from registers in the PRF to *parts*
(ranges of bits!) of the reg bitlevel cache!
you see how insanely complex that's getting?
for CRs it is (slightly) less complex because they're already subdivided into
> Instructions should be able to use more than 2 integer registers as mask
> inputs since read dependencies are not a concern.
yeah, they are. all dependencies have to be respected and analysed. they
absolutely cannot be ignored.
> > it basically means that batches of elements actually depend on the same
> > (integer) register, making it a *multi* targetted Write Hazard.
> > contrast this to each cmp independently targetting a completely independent
> > CR that has a completely independent Dependency Matrix column that in no way
> > overlaps with any other element, and it should be pretty clear that the use
> > of CRs is an absolutely massive design simplification.
> I'm advocating for those two integer registers being the targets for all
> vector instructions that produce masks because integer registers have the
> benefits of being an integer register without the drawbacks of the CR
> registers as explained above.
int regs as dest for mask really is much more complex than you currently
> ummm... you only need the 8 existing CRs if using the integer registers for
then how are Rc=1 operations treated?
do all vector INT ops try to all write to CR0, and all FP ops write to CR1?
last element (VL-1) writes, all other writes are destroyed?
this is where the idea of extending CR to at least 64 elements comes from -
nothing to do with predication.
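the Rc=1 question can be sketched like this: a toy vector add that records a (lt, gt, eq) result for each element, either all into one shared CR field or into one CR field per element. `vector_add_rc1` and `per_element_cr` are hypothetical names, and the SO bit of a real CR field is left out for brevity.

```python
def cr_field(result):
    """(lt, gt, eq) of a signed result compared against zero (SO omitted)."""
    return (result < 0, result > 0, result == 0)


def vector_add_rc1(a, b, per_element_cr):
    """Element-wise add that also records a CR field for each result."""
    results = [x + y for x, y in zip(a, b)]
    num_crs = len(results) if per_element_cr else 1
    crs = [None] * num_crs
    for i, r in enumerate(results):
        target = i if per_element_cr else 0
        crs[target] = cr_field(r)  # a shared target is overwritten each time
    return results, crs


a, b = [1, -2, 3], [-1, 1, 1]
print(vector_add_rc1(a, b, per_element_cr=False))
print(vector_add_rc1(a, b, per_element_cr=True))
```

with a single shared CR only the status of element VL-1 survives; with one CR per element nothing is lost, which is the motivation given above for extending the CRs.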
> no, it *is* basically the same. you don't generally need eq, gt, and lt all
> from the same compare
i can see exactly this being very useful, particularly when involving predicate
masks (on the crand/or as well) to e.g. generate a min-max range-selection or
other complex suite.
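a minimal sketch of the range-selection case, assuming each element gets its own lt/gt/eq bits from two compares (against the lower and upper bound) and that those bits are then combined in the style of cror and crand. the helper names are made up for the example and the result is packed LSB-first into an int.

```python
def cmp_bits(a, b):
    """lt/gt/eq bits of one compare, as a CR field would hold them."""
    return {"lt": a < b, "gt": a > b, "eq": a == b}


def in_range_mask(values, lo, hi):
    """Mask with bit i set when lo <= values[i] <= hi."""
    mask = 0
    for i, v in enumerate(values):
        c_lo = cmp_bits(v, lo)
        c_hi = cmp_bits(v, hi)
        ge_lo = c_lo["gt"] or c_lo["eq"]  # cror-style combine
        le_hi = c_hi["lt"] or c_hi["eq"]  # cror-style combine
        if ge_lo and le_hi:               # crand-style combine
            mask |= 1 << i
    return mask


print(bin(in_range_mask([1, 5, 10, 12, 7], 5, 10)))
```

here both eq and one of lt/gt from the same compare get used, which is the case where keeping all the bits of each compare pays off.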
> -- all you need is predication based on "did the
> corresponding lane of the vector compare spit out a true" and the
> corresponding "did it spit out a false" case, both of which can be achieved
> with a single mask by having a invert-mask bit in the SimpleV prefix for
> predicated instructions.
creating more complex compound masking (range exclusion being one example i
can think of initially) would, i think you'll find, need quite a few more
instructions that then have to move to the yes/no mask.
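for reference, the single-mask-plus-invert-bit scheme from the quoted text can be sketched as below. `predicated_op` and its `invert` parameter are hypothetical; the actual SimpleV prefix encoding is not shown in this thread. lanes are numbered LSB-first.

```python
def predicated_op(dest, src, mask, invert, op):
    """Apply op to src[i] only where the (optionally inverted) mask bit is set."""
    out = list(dest)
    for i, x in enumerate(src):
        active = bool((mask >> i) & 1) != invert
        if active:
            out[i] = op(x)
    return out


values = [3, -1, 4, -5]
neg_mask = 0b1010  # bit i set where values[i] < 0

# "then" branch: negate the lanes where the compare was true
step1 = predicated_op(values, values, neg_mask, False, lambda x: -x)
# "else" branch: double the remaining lanes, same mask with invert set
step2 = predicated_op(step1, values, neg_mask, True, lambda x: 2 * x)
print(step1, step2)
```

one mask covers a plain if/else pair this way; a compound condition such as range exclusion still needs extra steps to build that one mask first.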
where CRs have already been designed to cover quite a bit more than other RISC
that said, popcount, ffirst, bitpattern propagation etc. are not just
valuable, they're essential in vector ISAs.
it's going to need a lot more thought, however at the moment i am leaning
towards a hybrid that includes the best practical parts of both.