[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

Thu Oct 8 23:55:31 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #43 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #40)
> (In reply to Jacob Lifshay from comment #38)
> > (In reply to Luke Kenneth Casson Leighton from comment #35)
> > > an exception in the middle required a very messy design.
> > 
> > Why would you ever need to handle exceptions in the middle of a cmp
> > instruction? 
> 
> not in the middle of one cmp instruction: an exception in the middle of a
> *vector* batch of up to *64* cmp instructions.

that's what I meant.

> > cmp instructions won't ever generate their own exceptions and
> > interrupts would generate a fake interrupt instruction which would go either
> > before or after the sequence of cmp instructions.
> 
> if you forcibly mask out interrupts (exceptions) in the middle of a vector,
> latency goes to s*** :)

not by that much: an f32x32 operation on an f32x4 SIMD execution unit adds only
8 + pipeline-length cycles of latency. That wait is shorter than the time it
takes to load the interrupt handler code from DRAM, and the handler is probably
not in the cache anyway.
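
To make the arithmetic explicit, here is a rough sketch. The function name and
the pipeline depth of 5 are made up for illustration; only the 32-lane vector
and 4-lane execution unit come from the example above.

```python
import math

def worst_case_interrupt_delay(vector_lanes, simd_lanes, pipeline_length):
    """Cycles an interrupt waits if the in-flight vector op must drain first."""
    # One SIMD chunk is issued per cycle, so a 32-lane op on a 4-lane unit
    # needs 32 / 4 = 8 issue cycles.
    issue_cycles = math.ceil(vector_lanes / simd_lanes)
    # The last chunk still has to travel through the pipeline.
    return issue_cycles + pipeline_length

# pipeline_length=5 is an assumed value, not a figure from any real design.
delay = worst_case_interrupt_delay(32, 4, pipeline_length=5)
```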

> > All we need to do is design cmp to target only one of 2 integer registers
> 
> 2 integer registers as reserved as predicates for the whole of the vector
> of cmps?

yes, but we *can* treat each bit or group of 2 or 4 bits like independent
registers, avoiding the need for blocking exceptions during cmp and also
allowing starting a masked operation before the cmp that produces the mask
finishes executing -- the dependencies just operate on lanes rather than the
whole mask.
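
As a toy software model of what I mean (not a hardware design): every mask bit
carries its own "ready" flag, so a masked op on one group of lanes only waits
for the cmp results of that group. `LaneMask`, `masked_add_group` and the group
size of 4 are hypothetical names and values picked for the sketch.

```python
GROUP = 4  # lanes per group; assumed value for this sketch

class LaneMask:
    """An integer mask register whose bits are tracked as independent lanes."""

    def __init__(self):
        self.bits = 0   # the mask value itself
        self.ready = 0  # bit i is set once lane i's compare result is written

    def write_group(self, group, results):
        """Record the compare results for one group of lanes."""
        for offset, result in enumerate(results):
            lane = group * GROUP + offset
            if result:
                self.bits |= 1 << lane
            else:
                self.bits &= ~(1 << lane)
            self.ready |= 1 << lane

    def group_ready(self, group):
        """True when every mask bit of this group has been produced."""
        group_bits = ((1 << GROUP) - 1) << (group * GROUP)
        return (self.ready & group_bits) == group_bits

def masked_add_group(mask, group, dst, a, b):
    """Run dst[i] = a[i] + b[i] on the enabled lanes of one group.

    Returns False (stall) if that group's mask bits are not ready yet;
    other groups are unaffected.
    """
    if not mask.group_ready(group):
        return False
    for lane in range(group * GROUP, (group + 1) * GROUP):
        if (mask.bits >> lane) & 1:
            dst[lane] = a[lane] + b[lane]
    return True
```

With 8 lanes, once the cmp has written group 0 the masked add can already run
on lanes 0-3 while lanes 4-7 still wait for their part of the mask.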

The other benefit of using the integer registers for mask registers is that we
can use all the weird and wonderful bitwise ops on them (popcount, find first
set, find last set) that aren't supported on CR registers. We also trivially
get many more registers for storing multiple masks, which is quite important in
vectorized code where many different conditions get combined together. The CR
registers would be under very high pressure: only a few masks fit in them,
since a 64-lane compare would overwrite all 4 bits in all registers.
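
For illustration, this is the kind of mask manipulation I have in mind, with
plain Python integers standing in for integer registers. The helper names and
the example mask values are made up, and treating bit 0 as lane 0 is an
assumption of this sketch rather than a statement about the ISA's bit
numbering.

```python
def popcount(mask):
    """Number of enabled lanes."""
    return bin(mask).count("1")

def find_first_set(mask):
    """Index of the lowest enabled lane, or -1 if the mask is empty."""
    return (mask & -mask).bit_length() - 1

def find_last_set(mask):
    """Index of the highest enabled lane, or -1 if the mask is empty."""
    return mask.bit_length() - 1

# Two masks produced by two different vector compares (example values).
m_less_than = 0b01101100
m_nonzero = 0b11100101

# Combining conditions is a single ordinary bitwise op on integer registers.
combined = m_less_than & m_nonzero
```

Here `combined` is `0b01100100`, so three lanes are enabled, the first one is
lane 2 and the last one is lane 6.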

> if the answer to that is "yes", i didn't make a fuss about it but the OoO
> scheduling for that is an absolute pig.

if we split those 2 integer registers that you can write cmp results to (and
not any others) into many small groups of bits, that resolves the OoO
scheduling concern.

Instructions should be able to use more than 2 integer registers as mask inputs
since read dependencies are not a concern.

> it basically means that batches of elements actually depend on the same
> (integer) register, making it a *multi* targetted Write Hazard.
> 
> contrast this to each cmp independently targetting a completely independent
> CR that has a completely independent Dependency Matrix column that in no way
> overlaps with any other element, and it should be pretty clear that the use
> of CRs is an absolutely massive design simplification.

I'm advocating for those two integer registers being the targets for all vector
instructions that produce masks, because that gives us the benefits of integer
registers without the drawbacks of the CR registers, as explained above.

> 
> > and tell the scheduler that they only write to their destination register
> > and add individual bit write enables on those 2 integer registers or just
> > treat each bit as a separate register. 
> 
> right.  this requires the creation of at least a 32-wide Dependency Matrix
> just to cover individual bits of a register.
> 
> i mean - it _works_... but here's the thing: *that's exactly what's going to
> have to be done for the Condition Registers*.
> 
> so in addition to (say) a minimum of 32-wide DM columns added for CRs, on
> top of that you're proposing an ADDITIONAL 32 DM columns for covering
> single-bit predication...

ummm... you only need the 8 existing CRs if using the integer registers for
masking.

> > > think of it this way: a single bit predicate of compares effectively throws
> > > away the other 2 bits of the same op if using CR, doesn't it?
> > > 
> > > so to replicate that exact same behaviour it would be necessary to call at
> > > least 3 vector compares (single bit predicate) and even use 3 separate int
> > > regs to do so just to get what could have been done with only a single
> > > vector CR based compare.
> > 
> > Except that you rarely need more than one compare result, so all the extra
> > bits are usually ignored.
> 
> in standard scalar operations yes, however in predicated vector operations
> it's a different story.

no, it *is* basically the same. you don't generally need eq, gt, and lt all
from the same compare -- all you need is predication based on "did the
corresponding lane of the vector compare spit out a true" and the corresponding
"did it spit out a false" case, both of which can be achieved with a single
mask by having an invert-mask bit in the SimpleV prefix for predicated
instructions.
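
A small sketch of that idea: one mask from one compare drives both the "true"
and the "false" side, with the second op just setting the invert bit.
`predicated_op` and its `invert` parameter are hypothetical stand-ins for the
prefix bit, not an actual encoding, and bit 0 is assumed to be lane 0.

```python
def predicated_op(op, dst, src, mask, invert=False):
    """Apply op to src lanes whose (optionally inverted) mask bit is set."""
    lanes = len(dst)
    all_lanes = (1 << lanes) - 1
    # The invert bit flips the mask; no second compare or second mask needed.
    effective = (~mask & all_lanes) if invert else (mask & all_lanes)
    for lane in range(lanes):
        if (effective >> lane) & 1:
            dst[lane] = op(src[lane])
    return dst

x = [3, -1, 4, -5]
# Mask from a single vector compare "x < 0": lanes 1 and 3 are true.
mask = sum(1 << lane for lane, value in enumerate(x) if value < 0)

out = list(x)
predicated_op(lambda v: -v, out, x, mask)                # "true" lanes: negate
predicated_op(lambda v: v * 2, out, x, mask, invert=True)  # "false" lanes: double
```

After both ops `out` is `[6, 1, 8, 5]`, i.e. a vectorized if/else built from
one compare result.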
