[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

Thu Oct 8 22:59:51 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #40 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #38)
> (In reply to Luke Kenneth Casson Leighton from comment #35)
> > (In reply to Jacob Lifshay from comment #34)
> > 
> > > the idea is that the compare would produce 1 bit per vector lane and
> > > essentially directly generate a predicate mask into an integer register. For
> > > that to work, the compare would need extra bits (normally in the branch
> > > instruction for scalar powerpc) to know which of lt, le, eq, ne, etc. it
> > > should use, those bits come from the prefix.
> > > 
> > > As long as it's one bit per lane, scalar integer ops are even better than cr
> > > ops for the required bit manipulations.
> > 
> > i came up with an architectural plan to implement the hidden bitsetting in
> > 6600 style OoO and to be honest it was a bit of a pig.
> > 
> > an exception in the middle required a very messy design.
> 
> Why would you ever need to handle exceptions in the middle of a cmp
> instruction? 

not in the middle of one cmp instruction: an exception in the middle of a
*vector* batch of up to *64* cmp instructions.

> cmp instructions won't ever generate their own exceptions and
> interrupts would generate a fake interrupt instruction which would go either
> before or after the sequence of cmp instructions.

if you forcibly mask out interrupts (exceptions) in the middle of a vector,
latency goes to s*** :)

if you decide to "throw away results and start again" but have written *any*
part of those results - including any bits of the hidden predicate - to any
regfile, now you have irrecoverable data corruption when "restarting from
element zero".

reasons why you would want to write partial results include that the
microarchitectural internal vector length (back-end SIMD in our case) may only
be 4-wide or 8-wide and the requested vector length is 16 or above.
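
one way the "restart from element zero" corruption can show up, as a rough
sketch rather than anything from the actual design: assume a 4-wide back-end,
VL=16, and (purely for illustration) an in-place operation whose destination
overlaps its source.  all names here (run_vector_op, interrupt_after) are
made up for the example.

```python
def run_vector_op(regs, vl, batch, start=0, interrupt_after=None):
    """In-place regs[i] += 1, done `batch` elements at a time.

    Returns the index of the first element not yet processed, so a
    caller can resume from there after an (assumed) interrupt.
    """
    i = start
    batches_done = 0
    while i < vl:
        end = min(i + batch, vl)
        for j in range(i, end):
            regs[j] += 1  # partial result is written to the regfile here
        i = end
        batches_done += 1
        if interrupt_after is not None and batches_done == interrupt_after:
            break  # interrupt taken between back-end batches
    return i


def restart_from_zero(vl=16, batch=4):
    regs = [0] * vl
    run_vector_op(regs, vl, batch, interrupt_after=1)  # first batch written
    run_vector_op(regs, vl, batch, start=0)            # naive restart
    return regs


def resume_from_element(vl=16, batch=4):
    regs = [0] * vl
    stopped_at = run_vector_op(regs, vl, batch, interrupt_after=1)
    run_vector_op(regs, vl, batch, start=stopped_at)   # continue where it stopped
    return regs
```

restart_from_zero() applies the operation twice to the first batch of
elements, whereas resume_from_element() leaves every element updated exactly
once.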

> All we need to do is design cmp to target only one of 2 integer registers

2 integer registers reserved as predicates for the whole of the vector
of cmps?

if the answer to that is "yes", i didn't make a fuss about it but the OoO
scheduling for that is an absolute pig.

it basically means that batches of elements actually depend on the same
(integer) register, making it a *multi* targeted Write Hazard.

contrast this to each cmp independently targeting a completely independent CR
that has a completely independent Dependency Matrix column that in no way
overlaps with any other element, and it should be pretty clear that the use of
CRs is an absolutely massive design simplification.
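
a small model of the two write patterns, as a sketch only: the CR base of 8,
the GPR number 3 and all function names are assumptions for illustration, and
the SO bit is simply modelled as 0.  it does not model the Dependency Matrices
themselves, only which register each element writes to.

```python
CR_BASE = 8  # assumed first CR field used for vector results


def cr_field(a, b):
    """4-bit CR field in LT, GT, EQ, SO order (SO modelled as 0)."""
    return (int(a < b) << 3) | (int(a > b) << 2) | (int(a == b) << 1)


def vector_cmp_to_crs(ra, rb, cr_base=CR_BASE):
    """Each element writes its own CR field: CR(cr_base + i)."""
    return {cr_base + i: cr_field(a, b) for i, (a, b) in enumerate(zip(ra, rb))}


def vector_cmp_to_int_mask(ra, rb):
    """Each element sets one bit (here: 'lt') in one shared integer register."""
    mask = 0
    for i, (a, b) in enumerate(zip(ra, rb)):
        mask |= int(a < b) << i  # read-modify-write of the same register
    return mask


def write_targets_cr(vl, cr_base=CR_BASE):
    """Destination register of each element when using CRs."""
    return [("CR", cr_base + i) for i in range(vl)]


def write_targets_int(vl, rt=3):
    """Destination register of each element when using one integer mask."""
    return [("GPR", rt)] * vl
```

with VL=4, write_targets_cr(4) gives four distinct destinations, whereas
write_targets_int(4) gives the same destination four times, which is the
multi-targeted write described above.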

> and tell the scheduler that they only write to their destination register
> and add individual bit write enables on those 2 integer registers or just
> treat each bit as a separate register. 

right.  this requires the creation of at least a 32-wide Dependency Matrix just
to cover individual bits of a register.

i mean - it _works_... but here's the thing: *that's exactly what's going to
have to be done for the Condition Registers*.

so in addition to (say) a minimum of 32-wide DM columns added for CRs, on top
of that you're proposing an ADDITIONAL 32 DM columns for covering single-bit
predication...

... when we can entirely skip that by using one of the bits *of* the CRs *as*
the very predicate source/target bit that you're proposing

and skip a whopping 32 x 20 (or so) extra DM entries.

> The rest of the destination register
> selection bits can be used to encode the compare condition, along with the
> two other reserved bits in the cmp[l] instructions.

ah there's spare bits?  ah!  well, we need to be very careful about using them,
in particular we need explicit approval from the OpenPOWER Foundation to do so
(even though we are doing this entirely behind a "libresoc modeswitch").

> 
> We could also keep the above encoding with cmp targeting one of 2 integer
> registers and have a internal non-ISA-visible register for mask accumulation
> that is copied to the destination integer register.

this idea i came up with for SV-RISC-V "branch", and it's quite dodgy.  doable,
but dodgy.

> The final copy could be
> included in the final lane compare instruction or split out as a separate
> micro-op.

if we didn't have any other better options (using CRs as-is for their intended
purpose in scalar world, just "vectorised") i'd say yes, let's do it, because i
know what you're referring to, and had to design it for SV-RISC-V "branch".

OpenPOWER ISA, due to the fact that CRs exist, has no need for this kind of
hard-hack.

> An entire int reg --- we have 128, losing 1 won't hurt much, especially
> since we'd need it for masking vector ops anyway (what compare results are
> usually used for).

some cross-over occurred here: i'm proposing that we use *CRs* for predicate
masking, not an int from the int regfile :)

as i mentioned above this results in a massive simplification of the
microarchitectural implementation, in particular it removes a thorny /
problematic area that i never really liked, which was that the predicated
vector operation is forced to stall until the INT regfile predicate mask has
been read.

worse than that: if the bits turn out to be mostly zero, you just wasted pretty
much the entire bandwidth of the CPU, maxed out the Reservation Stations *for
no good reason*.

using separate and distinct CRs (as in the pseudocode from comment #36) these
are *really easily* schedulable, and resolve incredibly easily as well, without
slowing down the entire OoO engine waiting for reading of one single integer
register.

> > PowerISA *specifically* has these 4bit CRs  and i feel we should go with the
> > flow on that rather than try to invent an alternative condition scheme that
> > does not mesh with what the original PowerISA designers envisaged (for
> > scalar)
> 
> There are other issues with the CRs: several of them are callee-save so any
> function using vectorized compare would usually need to save and restore the
> CRs.

callee-save... this is an ABI design issue?  that's solvable by avoiding
conflicting with CR0-7 i.e. vectorised-CR uses CR8 and above.

> > think of it this way: a single bit predicate of compares effectively throws
> > away the other 2 bits of the same op if using CR, doesn't it?
> > 
> > so to replicate that exact same behaviour it would be necessary to call at
> > least 3 vector compares (single bit predicate) and even use 3 separate int
> > regs to do so just to get what could have been done with only a single
> > vector CR based compare.
> 
> Except that you rarely need more than one compare result, so all the extra
> bits are usually ignored.

in standard scalar operations yes, however in predicated vector operations it's
a different story.

> Also, the isel instruction doesn't seem to have the right semantics:

yeah i can't seem to find something in the scalar ISA that "just operates on
transferring of bits from CR to GPR".  anything involving RA/RT works on
batches of 4 bits of the full CR.

> what if
> you want floating-point ge where the defined semantics are you need the
> output to be set if the greater or equal bits are set, but not the less or
> unordered bits? 

errr... i've not looked at the FP instructions (chapter 4, p123 v3.0B) at all.
i know it generates CR1 (when Rc=1)

thinking it through "out loud" so to speak, i'd say... mmmm... you'd do a FP
subtract (with Rc=1), this would normally store in CR1.

however if vectorised, it would put them into say.... CR8....CR(8+VL-1)

then you could perform standard crnand/cror operations on them to compute the
required predicate bit (in those exact same CRs if you wanted) and go from
there
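
roughly what that sequence could look like, as a sketch under assumptions:
the CR fields are assumed to hold compare-style results in FL, FG, FE, FU
order (as an fcmpu-style compare would produce), the CR base of 8 is just the
example number from above, and the helper names are invented.  the cror here
is modelled as a plain bitwise OR on a flat list of CR bits.

```python
import math


def fcmp_field(a, b):
    """Compare-style CR field as [FL, FG, FE, FU]; FU is set if either is NaN."""
    if math.isnan(a) or math.isnan(b):
        return [0, 0, 0, 1]
    return [int(a < b), int(a > b), int(a == b), 0]


def cror(crs, bt, ba, bb):
    """CR[bt] = CR[ba] | CR[bb], on a flat list of CR bits (4 per field)."""
    crs[bt] = crs[ba] | crs[bb]


def vector_fp_ge(fa, fb, cr_base=8):
    """Per-element 'ge' predicate bits, computed in the same CR fields."""
    vl = len(fa)
    crs = [0] * (4 * (cr_base + vl))
    for i in range(vl):
        lo = 4 * (cr_base + i)
        crs[lo:lo + 4] = fcmp_field(fa[i], fb[i])  # element i -> CR(cr_base+i)
    for i in range(vl):
        lo = 4 * (cr_base + i)
        cror(crs, lo + 2, lo + 1, lo + 2)  # overwrite the EQ bit with FG | FE
    return [crs[4 * (cr_base + i) + 2] for i in range(vl)]
```

an unordered (NaN) element only sets FU, so its combined bit stays 0, which
matches the "ge but not less or unordered" requirement.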

> (the unordered bit is where integer compares put SO) There
> isn't any one bit you could pick out of the CR that gives the required
> combination.

ah that's interesting.  no FP ops (p154) have an OE=1.  soOooo...
continuing this in reply to comment #39

> The other benefit of having the compare instruction directly generate the
> mask is that now or in the future implementations could need less clock
> cycles to execute due to taking less instructions, and also it takes less
> i-cache space.

this is achieved by turning the underlying microarchitecture into CISC
(explained above).  it's a complex trade-off.
