[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

Thu Oct 8 20:36:42 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #38 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #35)
> (In reply to Jacob Lifshay from comment #34)
> 
> > the idea is that the compare would produce 1 bit per vector lane and
> > essentially directly generate a predicate mask into an integer register. For
> > that to work, the compare would need extra bits (normally in the branch
> > instruction for scalar powerpc) to know which of lt, le, eq, ne, etc. it
> > should use, those bits come from the prefix.
> > 
> > As long as it's one bit per lane, scalar integer ops are even better than cr
> > ops for the required bit manipulations.
> 
> i came up with an architectural plan to implement the hidden bitsetting in
> 6600 style OoO and to be honest it was a bit of a pig.
> 
> an exception in the middle required a very messy design.

Why would you ever need to handle exceptions in the middle of a cmp
instruction? cmp instructions never generate exceptions of their own, and an
interrupt would be injected as a fake interrupt instruction that goes either
before or after the sequence of cmp instructions.

All we need to do is design cmp to target only one of 2 integer registers,
tell the scheduler that cmp instructions write only to their destination
register, and either add individual bit write enables on those 2 integer
registers or just treat each bit as a separate register. The rest of the
destination register selection bits can be used to encode the compare
condition, along with the two other reserved bits in the cmp[l] instructions.

We could also keep the above encoding with cmp targeting one of 2 integer
registers and have an internal non-ISA-visible register for mask accumulation
that is copied to the destination integer register. The final copy could be
included in the final lane compare instruction or split out as a separate
micro-op.
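
To make the one-bit-per-lane idea concrete, here is a rough Python model of a
vector compare that writes a predicate mask straight into an integer. It is
only a sketch: the function name, the condition names and the `start` resume
argument are all made up for illustration, and nothing here reflects an
actual encoding.

```python
import operator

# Hypothetical condition names. In the proposal these would be encoded in
# the prefix / spare cmp bits; no real encoding is implied here.
CONDITIONS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "ge": operator.ge,
}


def vector_cmp_to_mask(cond, a, b, old_mask=0, start=0):
    """Compare a[i] with b[i] and write the result to bit i of a mask.

    Each lane touches only its own bit (modelling a per-bit write
    enable), so resuming at lane `start` with the partially written
    mask gives the same answer as an uninterrupted run.
    """
    op = CONDITIONS[cond]
    mask = old_mask
    for lane in range(start, len(a)):
        bit = 1 << lane
        if op(a[lane], b[lane]):
            mask |= bit
        else:
            mask &= ~bit
    return mask


a = [1, 5, 3, 7]
b = [2, 2, 4, 4]
full = vector_cmp_to_mask("lt", a, b)

# Model an interrupt arriving after lane 1: the first two lanes are
# already in the register and the remaining lanes resume at lane 2.
partial = vector_cmp_to_mask("lt", a[:2], b[:2])
resumed = vector_cmp_to_mask("lt", a, b, old_mask=partial, start=2)
```

In this model `full` and `resumed` are the same value, which is the point of
giving each lane its own bit write enable: an interrupt between lanes needs
no extra state beyond the lane index.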

> CRs on the other hand by being treated as actual "real" registers respected
> and each given their own Dependency Matrix column are far easier to handle.
> 
> exceptions in the middle of that, no problem, just restore VL forloop where
> it left off.
> 
> bottom line is that PowerISA has condition registers which store results
> that you then decide which bits to test to make different branches, i.e. the
> compare is separated from the branch *by* the CR.
> 
> this is conceptually similar to RV FP compare except it wastes an entire 64
> bit int reg to do it (RV FP cmp stores 1 or 0 in an int reg for FP
> GT/LT/LE/NE ops which you then follow up with an integer BEQzero)

An entire int reg? We have 128, so losing 1 won't hurt much, especially since
we'd need it for masking vector ops anyway (what compare results are usually
used for).
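
As a rough illustration of why the mask wants to live in an integer register,
here is a Python sketch of a predicated vector op plus mask manipulation done
with ordinary integer ops. The function names are invented for the example
and the lane semantics (unselected lanes keep the old destination value) are
an assumption, not something settled in this thread.

```python
def predicated_add(mask, a, b, dest):
    """Lane i of the result is a[i] + b[i] where mask bit i is set.

    Lanes whose mask bit is clear keep the old dest value (assumed
    semantics for this sketch).
    """
    return [
        x + y if (mask >> i) & 1 else d
        for i, (x, y, d) in enumerate(zip(a, b, dest))
    ]


def combine_masks(m0, m1, vl):
    """Combine two predicates for vl lanes using plain integer ops."""
    all_lanes = (1 << vl) - 1
    both = m0 & m1
    neither = ~(m0 | m1) & all_lanes
    return both, neither


result = predicated_add(0b0101, [1, 2, 3, 4], [10, 20, 30, 40], [0, 0, 0, 0])
both, neither = combine_masks(0b0110, 0b0011, 4)
```

Here the and/or/not on predicates are single scalar integer instructions
operating on the whole mask at once.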

> PowerISA *specifically* has these 4bit CRs  and i feel we should go with the
> flow on that rather than try to invent an alternative condition scheme that
> does not mesh with what the original PowerISA designers envisaged (for
> scalar)

There are other issues with the CRs: several of them are callee-save so any
function using vectorized compare would usually need to save and restore the
CRs.

> think of it this way: a single bit predicate of compares effectively throws
> away the other 2 bits of the same op if using CR, doesn't it?
> 
> so to replicate that exact same behaviour it would be necessary to call at
> least 3 vector compares (single bit predicate) and even use 3 separate int
> regs to do so just to get what could have been done with only a single
> vector CR based compare.

Except that you rarely need more than one compare result, so all the extra bits
are usually ignored.

Also, the isel instruction doesn't seem to have the right semantics: what if
you want floating-point ge, which is defined to be true if the greater or
equal bit is set, but not if the less or unordered bit is set? (The unordered
bit is where integer compares put SO.) There isn't any one bit you could pick
out of the CR that gives the required combination.
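
The floating-point ge problem can be checked by brute force. The sketch below
models a 4-bit CR field (LT, GT, EQ, unordered) for an FP compare and tries
every "bit set" and "bit clear" test against the four possible outcomes. The
helper names are made up for illustration; only the four-bit field layout is
taken from the discussion above.

```python
import math

# 4-bit CR field, most significant bit first: LT, GT, EQ, UN.
# UN (unordered) sits in the position integer compares use for SO.
LT, GT, EQ, UN = 0b1000, 0b0100, 0b0010, 0b0001


def fcmp_cr(a, b):
    """Model an FP compare: exactly one of the four bits is set."""
    if math.isnan(a) or math.isnan(b):
        return UN
    if a < b:
        return LT
    if a > b:
        return GT
    return EQ


def ge_from_cr(cr):
    """FP ge needs two bits: greater OR equal."""
    return (cr & (GT | EQ)) != 0


def single_bit_tests_matching_ge():
    """Return every single-bit test (set or clear) that equals ge."""
    cases = [(1.0, 2.0), (2.0, 1.0), (1.0, 1.0), (math.nan, 1.0)]
    matches = []
    for name, bit in (("lt", LT), ("gt", GT), ("eq", EQ), ("un", UN)):
        for want_set in (True, False):
            if all(
                (((fcmp_cr(a, b) & bit) != 0) == want_set) == (a >= b)
                for a, b in cases
            ):
                matches.append((name, want_set))
    return matches


matches = single_bit_tests_matching_ge()
```

In this model `matches` comes back empty: "LT clear" is the closest
candidate, but it gives the wrong answer for the unordered (NaN) case.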

The other benefit of having the compare instruction directly generate the mask
is that implementations, now or in the future, could need fewer clock cycles
to execute because they run fewer instructions, and the code also takes less
i-cache space.
