[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Mon Oct 19 03:34:51 BST 2020


https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #53 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #52)
> (In reply to Jacob Lifshay from comment #51)
> > Implementation strategies:
> > Optimize for the common case when VL is smaller than 16 (or 8). Using a
> > larger VL means we're likely to run out of registers very quickly for all
> > but the simplest of shaders, and our current microarchitecture is highly
> > unlikely to give much benefit past VL=8 or so.
> 
> see above about the POWER10 multi-issue strategy, and about the big.little
> idea.

I like the idea of having more execution capacity; however, the issue I'm
pointing out is that the compiler will simply run out of ISA-level registers.
If each SIMT shader needs space for one 4x4 f32 matrix (very common for vertex
shaders), then VL can't be >16 because there aren't enough registers to even
hold that many matrices. (Admittedly, the matrix is often the same for all
shaders and can be shared, but you get my point: there's not much space.)
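The register-budget arithmetic behind this can be sketched quickly. The sketch
below assumes a 128-entry register file (the extended SV file size; the exact
figure is an assumption here, not something stated in this thread):

```python
# Rough register-budget sketch: with a 128-entry register file (assumed)
# and a 4x4 f32 matrix (16 scalar values) held per shader invocation,
# VL elements need 16 * VL registers in total.
REGFILE_SIZE = 128
MATRIX_VALUES = 4 * 4  # one 4x4 f32 matrix per SIMT shader invocation

def max_vl_for_matrix(regfile=REGFILE_SIZE, per_element=MATRIX_VALUES):
    """Largest VL whose per-element matrix storage still fits the file."""
    return regfile // per_element

print(max_vl_for_matrix())  # 8: VL=16 would already need 256 registers
```

Sharing the matrix across invocations relaxes this, as noted above, but the
general point stands: large VL values exhaust the register file quickly.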

> 
> > We can split up the one or two integer registers optimized for masking into
> > subregisters, but, to allow instructions to not have dependency matrix
> > problems,
> 
> ahh actually, a single scalar intreg as a predicate mask is dead simple. 
> it's one read.  that's it.

That's true ... if you completely ignore the need to generate masks.

> now, all the predicated element ops have to have a shadow column waiting
> *for* that read to complete, but this is not hard.
> 
> > we split it up differently:
> > every 8th (or 16th) bit is grouped into the same subregister.
> 
> i *think* what you are saying is that the VL-based for-loop should do 8
> elements at a time, push these into SIMD ALUs 8 at a time, so if FP32 then
> that would be 4x SIMD 2xFP32 issue in one cycle.

Nope, what I had meant was to go back to the idea of having a
microarchitectural register for every bit of an ISA-level integer register,
which allows the equivalent of Cray-style vector instruction chaining. Then,
since having that many columns (or rows, I can't recall) in the scheduling
dependency matrix isn't good, we group bits together, reducing the number of
microarchitectural registers:
microarchitectural reg 0: bits 0, 8, 16, 24, 32, 40, 48, and 56 of the ISA-level reg
microarchitectural reg 1: bits 1, 9, 17, 25, 33, 41, 49, and 57 of the ISA-level reg
microarchitectural reg 2: bits 2, 10, 18, 26, 34, 42, 50, and 58 of the ISA-level reg
microarchitectural reg 3: bits 3, 11, 19, 27, 35, 43, 51, and 59 of the ISA-level reg
microarchitectural reg 4: bits 4, 12, 20, 28, 36, 44, 52, and 60 of the ISA-level reg
microarchitectural reg 5: bits 5, 13, 21, 29, 37, 45, 53, and 61 of the ISA-level reg
microarchitectural reg 6: bits 6, 14, 22, 30, 38, 46, 54, and 62 of the ISA-level reg
microarchitectural reg 7: bits 7, 15, 23, 31, 39, 47, 55, and 63 of the ISA-level reg

This allows a vector compare followed by a masked op to start elements of the
masked op as soon as the corresponding compare elements finish, without
waiting for all compares to complete -- just like vector chaining. This is
another reason not to have CRs hold the mask result of a vector compare
(which *can* differ from a scalar compare), since that just doubles the
number of registers the scheduler has to handle to get chaining right, and
introduces another instruction of delay.

> 
> this just leaves the completely separate issue of whether to vectorise CR
> production, *including* all current scalar production of CR0 (and CR7).
> 
> i am referring to all "add." "or." operations (Rc=1) as well as cmp, and
> also to CR operstions themselves (crand etc).
> 
> the reasons are as follows:
> 
> 1) any modification of the ISA to replace CR generation with storage in an
> integer scalar reg bitfield is a "hard sell" to OPF, as well as gcc and llvm
> scalar PowerISA maintainers.

I'm advocating for vector ops to target integer registers, while scalar ops
still do the standard CR things. Rc=1 for vector ops can just generate a mask
for eq/ne (I think the most common compare op), or we can reassign Rc to mean
something else for vector ops (one option is simply to declare it invalid).
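To make the proposal concrete, a vector eq-compare with Rc=1 would pack its
per-element results into an integer register as a bitmask, roughly like this
(a behavioural sketch only; function names are hypothetical, not from the SV
spec):

```python
# Behavioural sketch of "Rc=1 on a vector op generates an eq mask":
# each element compare sets one bit of an integer mask register.
def vec_cmp_eq_mask(a, b):
    """Pack per-element (a[i] == b[i]) results into an integer bitmask."""
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            mask |= 1 << i
    return mask

assert vec_cmp_eq_mask([1, 2, 3, 4], [1, 0, 3, 0]) == 0b0101
```

A subsequent masked vector op would then read this integer register directly
as its predicate, with no intermediate copy through CRs.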

binutils is just about the only place where you might want to treat scalar
and vector instructions the same; everywhere else (e.g. LLVM) treats vectors
differently.
> 
> 2) for suboptimal (easy, slow) microarchitectures it is easy, but for
> parallel architectures the Write Hazard of a single int reg becomes a
> serious bottleneck.

My idea for splitting up the integer register(s) optimized for masks into
separate bits handles this.

> 
> 3) the codepath in HDL actually requires modification to add "if in VL mode
> do something weird to select only one cmp bit otherwise do normal CR stuff".
> whereas if it is left as-is the *existing* CR handling HDL can be
> parallelised alongside the ALU element operation and it's an easy sell to
> HDL engineers.
> 
> bottom line is that it is not hard to vectorise CR production right
> alongside the result production, in fact if we *don't* do that i think we're
> going to face some tough questions from experienced OPF and IBM ISA people
> (once they grasp SV which Paul definitely does)

Well, I think the benefits of using integer registers as masks and skipping
the extra copy through CRs outweigh the loss of orthogonality (though one
could argue that having fewer register files to deal with increases
orthogonality).

-- 
You are receiving this mail because:
You are on the CC list for the bug.
