[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

Mon Oct 19 02:32:35 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #52 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #51)
> After spending some time to think, I think I came up with an idea:
> 
> I think we should go back to our ideal requirements for the ISA, therefore,
> I think we should account for the following:
> 1. We should design the ISA to be what would work well on future processors.
> 2. We should not add in extra ISA-level steps just because our current
> microarchitecture might require them, that would hamstring future
> microarchitectures that don't need the extra steps.

bear in mind that the basic fundamental principle is to squeeze SV in as a
conceptual for-loop between instruction decode and instruction issue.  it turns
out that if you stick to this then it is irrelevant what microarchitecture and
even to a large extent what ISA the SV concept is applied to.

in addition, keeping to this fundamental principle makes modifications to
existing binutils, compilers and simulators very simple because, again, it
turns out to be literally a for-loop.
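
a minimal sketch of that for-loop, in python purely for illustration.
sv_expand and its tuple format are made-up names, and it assumes all three
operands are marked as vectors so that each register number simply steps by
the element index:

```python
def sv_expand(op, rt, ra, rb, vl):
    """Hypothetical: sits between decode and issue, turning one decoded
    instruction into VL scalar element operations."""
    for i in range(vl):
        # each element op is an ordinary scalar op on the next register along
        yield (op, rt + i, ra + i, rb + i)


element_ops = list(sv_expand("add", rt=8, ra=16, rb=24, vl=4))
for element in element_ops:
    print(element)
```

nothing after the loop needs to know that the instruction was ever a vector
one, which is why the backend microarchitecture does not matter.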

> 3. It's fine for our current microarchitecture to be non-optimal for less
> common cases, such as very large VL values.

yes.  two things here: firstly, POWER10 shows a way forward where multi-issue
SIMD is possible.  we can do the same.  VL=8 can have multi-issue do 2 or 4
backend SIMD Vector ops per clock.

or

interestingly on the phoronix discussion a big.little concept came up, where
the idea was for little cores to have massive SIMD backends.  16 or 32
elements, dog slow at scalar but huge throughput on vector.

> 4. It's ok to add new instructions where necessary, we're doing new things
> after all.

yes.

> 5. It's ok to deviate from how Power's scalar ISA does things when there's a
> better way.

well... it is, as long as the development cost implications for the
hardware, toolchain and compilers are not too high.

this is really important because if we deviate too much then we face resistance
from the Power community as well.

remember that everything we come up with also has to be justified to the
OpenPOWER Foundation ISA WG.  it's not going to get rubberstamped, and if we
get resistance and have to maintain our own hard forks of toolchains it
completely defeats the objectives of the project.

> Based on the previous points, therefore I think we should do the following:
> 
> Use integer registers for vector masks.

i agree very much that the *application* of predicates should be done from
intregs.  the reason is that for the VL=64 max, no matter the microarchitecture
it's a single scalar regfile read.

in the dead-simple microarchitectural case (one element issued per clock, like
in microwatt's VSX patch that Paul Mackerras is doing) it's ridiculously easy
to add in an extra reg read at the start of the VL loop, shift it down on each
loop, test bit zero, and skip or not skip the operation.

this illustrates very clearly and cleanly precisely why it's called "Simple" V.
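
as a sketch (python for illustration only; predicated_loop is a hypothetical
name, and the one-element-per-iteration loop stands in for the
one-element-per-clock case):

```python
def predicated_loop(mask, vl, element_op):
    """Read the predicate once, then shift it down each iteration and
    test bit zero to decide whether to run or skip the element."""
    executed = []
    for i in range(vl):
        if mask & 1:          # bit zero set: element is active
            element_op(i)
            executed.append(i)
        mask >>= 1            # shift the next predicate bit into position
    return executed


ran = predicated_loop(0b1011, vl=4, element_op=lambda i: None)
print(ran)
```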

> I honestly think the CR registers are somewhat of a wart of Power's scalar
> ISA, 

true, except it's there, and it has some interesting advantages.  one of them
is that tests which take a long time to do (DIV ops) can be stored and
manipulated later.

the second is that branches are not significantly delayed by extra gates
that would otherwise force them to be split across extra cycles.

the combination of these factors reduces in-flight ops in loops and when coming
up to a branch point in high performance OoO designs.

> it works more-or-less fine for scalar, but should not be extended to
> vectors of CR registers.

no i agree: we should not try to use CRs for predicates (on the input side).  i
looked at the possible implementation: it's hell.  64 CRs being read even
before issuing the predicated operation would require... 64 CR Read Hazards.

compared to *ONE* scalar 64 bit int reg (regardless of microarchitecture) it is
blindingly obvious that using int regs for predicate masks is the winning
option.

> Running out of integer registers just because of
> masks is not a concern, we have 128. 

yes.

> Using CR registers violates point 2
> because one of the top 3 or 4 most common operations we want is testing to
> see if no lanes are active and skipping some section of code based on that
> (used to implement control flow in SIMT programs) -- We can just compare the
> generated mask to zero or all ones using scalar compare instructions,

funny i just described the same thing to Alain in a phone call yesterday.

> additionally we can have mask-generating instructions store to CR0 (or CR1
> for FP) if all unmasked lanes, no unmasked lanes, or some unmasked lanes
> generate set mask bits.

interesting.  do note this on the wiki page please.  the reason i like it is
because it means that branch can be entirely left alone.  we do not need to
modify PowerISA branch *at all*.

i wonder if some pre-existing integer operations already effectively do exactly
this.  for example a compare of an int reg against zero will produce a CR0
"equal to zero" bit.

this is important to check, because if int reg operations already do the job we
have less deviation and therefore a higher chance of acceptance.
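
a rough sketch of the none/all/some test on an integer mask (python for
illustration; mask_summary and its return strings are made-up names, not
anything from the SV draft or PowerISA):

```python
def mask_summary(mask, vl):
    """Classify the low VL bits of a predicate mask."""
    all_active = (1 << vl) - 1
    mask &= all_active
    if mask == 0:
        return "none"   # same information as the EQ bit of a compare with 0
    if mask == all_active:
        return "all"    # same as EQ from a compare against the all-ones value
    return "some"


print(mask_summary(0b0000, 4), mask_summary(0b1111, 4), mask_summary(0b0100, 4))
```

the "none" and "all" cases are each one scalar compare, which is why an
ordinary branch on the CR0 EQ bit would be enough for them.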

> Other benefits of integer registers as masks:
> load/store for spilling takes 1 instruction, not several. Also, takes waay
> less memory.
> All the fancy bit manipulation instructions operate directly on integer
> registers: find highest/lowest set bit, popcount, shifts, rotates,
> bit-cyclone, etc.

there are some additional crucial bitmanipulation instructions needed here,
including one that propagates from the first 1 and stops at the next 1.  some
of these are listed in RVV's "mask" opcodes, and they are essential for
efficiently doing strncpy and other operations.

we can drop these in as "scalar bitmanip" where they will benefit scalar as
well.
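
for reference, a sketch of three RVV-style mask operations (set-before-first,
set-including-first, set-only-first) done as plain scalar bitmanip on an
integer.  the function names are made up, and this is only my reading of the
RVV behaviour, not a proposal for the exact SV opcodes:

```python
def set_before_first(m, vl):
    """Ones in every position below the lowest set bit of m."""
    full = (1 << vl) - 1
    m &= full
    if m == 0:
        return full                 # no set bit found: everything is "before"
    return (m - 1) & ~m & full


def set_including_first(m, vl):
    """As above, but the lowest set bit itself is included."""
    full = (1 << vl) - 1
    m &= full
    if m == 0:
        return full
    return (m ^ (m - 1)) & full


def set_only_first(m, vl):
    """Only the lowest set bit of m survives."""
    m &= (1 << vl) - 1
    return m & -m


m = 0b10100
print(bin(set_before_first(m, 8)),
      bin(set_including_first(m, 8)),
      bin(set_only_first(m, 8)))
```

for strncpy-style loops, m would mark the bytes that compared equal to zero,
and set-including-first then gives the predicate for the bytes to copy.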

> Implementation strategies:
> Optimize for the common case when VL is smaller than 16 (or 8). Using a
> larger VL means we're likely to run out of registers very quickly for all
> but the simplest of shaders, and our current microarchitecture is highly
> unlikely to give much benefit past VL=8 or so.

see above about the POWER10 multi-issue strategy, and about the big.little
idea.

> We can split up the one or two integer registers optimized for masking into
> subregisters, but, to allow instructions to not have dependency matrix
> problems,

ahh actually, a single scalar intreg as a predicate mask is dead simple.  it's
one read.  that's it.

now, all the predicated element ops have to have a shadow column waiting *for*
that read to complete, but this is not hard.

> we split it up differently:
> every 8th (or 16th) bit is grouped into the same subregister.

i *think* what you are saying is that the VL-based for-loop should do 8
elements at a time, push these into SIMD ALUs 8 at a time, so if FP32 then that
would be 4x SIMD 2xFP32 issue in one cycle.

below... let us say that this is an elwidth of 8.

> register bit         subregister number
> bit  0               subregister 0
> bit  1               subregister 1
> bit  2               subregister 2
> bit  3               subregister 3
> bit  4               subregister 4
> bit  5               subregister 5
> bit  6               subregister 6
> bit  7               subregister 7

so this would go into one 64bit SIMD-aware ALU, with the Dynamic Partitions set
to 8 bit, and the first 8 bits of the intreg predicate would also be sent in as
"write enable" lines.

if however all 8 bits of the predicate mask 0-7 were ALL zero then the Shadow
Matrix would pull "GO_DIE" on that entire FU's operation.

> bit  8               subregister 0
> bit  9               subregister 1
> bit 10               subregister 2
> bit 11               subregister 3
> bit 12               subregister 4
> bit 13               subregister 5
> bit 14               subregister 6
> bit 15               subregister 7

likewise this would be sent to a separate SIMD-aware ALU, but this time using
bits 8-15 of the intregs predicate.

again it would still be shadowed, and again, if bits 8-15 were all zero
"GO_DIE" would be pulled.

in each case the operation goes ahead but is not allowed to write until the
predicate has actually been read from the regfile, and its bits divided up and
analysed.
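
a behavioural sketch of that division of the predicate (python, not HDL;
dispatch_predicate and the dict keys are hypothetical names, and it assumes 8
elements per SIMD ALU as in the elwidth=8 example):

```python
def dispatch_predicate(mask, vl, lanes=8):
    """Slice the predicate into one group of write-enable bits per
    SIMD ALU, and flag groups whose operation can be cancelled."""
    groups = []
    for base in range(0, vl, lanes):
        n = min(lanes, vl - base)               # last group may be partial
        write_enable = (mask >> base) & ((1 << n) - 1)
        groups.append({
            "base": base,                       # first element in this group
            "write_enable": write_enable,       # per-element write enables
            "go_die": write_enable == 0,        # all masked out: cancel the op
        })
    return groups


groups = dispatch_predicate(0xFF0001, vl=24)
for g in groups:
    print(g)
```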

> bit 16               subregister 0
> ...
> 
> This allows us to, for VL smaller than the number of subregisters per
> register, act as if every mask bit was an independent register, giving us
> all the dependency-matrix goodness that comes with that.
> 
> This also requires waay less additional registers than even the extending CR
> to 64 fields idea.

indeed.
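
for the record, the quoted table read literally (every 8th bit into the same
subregister) looks like this as a sketch.  python for illustration;
split_mask is a made-up name and 8 subregisters per 64-bit register is just
the value used in the table:

```python
def split_mask(mask, nsub=8, width=64):
    """Bit n of the mask goes to subregister n % nsub, at position
    n // nsub within that subregister."""
    subs = [0] * nsub
    for bit in range(width):
        if (mask >> bit) & 1:
            subs[bit % nsub] |= 1 << (bit // nsub)
    return subs


subs = split_mask(0b1_0000_0101)   # mask bits 0, 2 and 8 set
print(subs)
```

with VL no larger than nsub, each subregister holds at most one live mask bit,
which is what makes each bit look like an independent register.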

this just leaves the completely separate issue of whether to vectorise CR
production, *including* all current scalar production of CR0 (and CR7).

i am referring to all "add.", "or." operations (Rc=1) as well as cmp, and also
to CR operations themselves (crand etc).

the reasons are as follows:

1) any modification of the ISA to replace CR generation with storage in an
integer scalar reg bitfield is a "hard sell" to OPF, as well as gcc and llvm
scalar PowerISA maintainers.

2) for suboptimal (easy, slow) microarchitectures it is easy, but for parallel
architectures the Write Hazard of a single int reg becomes a serious
bottleneck.

3) the codepath in HDL actually requires modification to add "if in VL mode do
something weird to select only one cmp bit otherwise do normal CR stuff",
whereas if it is left as-is the *existing* CR handling HDL can be
parallelised alongside the ALU element operation and it's an easy sell to HDL
engineers.

bottom line is that it is not hard to vectorise CR production right alongside
the result production.  in fact if we *don't* do that i think we're going to
face some tough questions from experienced OPF and IBM ISA people (once they
grasp SV, which Paul definitely does).
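
a sketch of what "CR production alongside result production" means for a
vectorised "add." (python for illustration; sv_add_rc1 and cr_field are
hypothetical names, the SO bit is left out on the assumption that XER.SO is
clear, and a CR field is modelled as a 4-bit value with LT as the top bit):

```python
MASK64 = (1 << 64) - 1
LT, GT, EQ = 0b1000, 0b0100, 0b0010


def cr_field(result):
    """The usual Rc=1 test: signed compare of a 64-bit result with zero."""
    if result == 0:
        return EQ
    if result >> 63:        # sign bit set: negative
        return LT
    return GT


def sv_add_rc1(ra_vec, rb_vec):
    """Each element add produces its own result *and* its own CR field."""
    results, crs = [], []
    for a, b in zip(ra_vec, rb_vec):
        r = (a + b) & MASK64
        results.append(r)
        crs.append(cr_field(r))   # unchanged scalar CR logic, once per element
    return results, crs


results, crs = sv_add_rc1([1, 5, 0], [MASK64, 3, MASK64 - 6])
print(results, crs)
```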

it is also not hard to vectorise the transfer operations between CRs and
intregs, and if we allow transfer of Vectors of CRs into one scalar intreg
(which is what "mfcr" already does!) then we keep to existing PowerISA
design concepts, have the benefits of VRs, yet can still transfer vectors of CR
tests to an intreg and perform bitmanip operations, clz, popcount and many more
on it, efficiently and effectively.
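
a sketch of that transfer: one chosen test bit from each CR in a vector is
packed into a scalar integer, which can then be used as a predicate mask or
fed to bitmanip.  python for illustration; cr_vector_to_mask is a made-up
name, not an existing or proposed opcode, and CR fields are modelled as 4-bit
values with LT as the top bit:

```python
LT, GT, EQ, SO = 0b1000, 0b0100, 0b0010, 0b0001


def cr_vector_to_mask(cr_fields, test_bit):
    """Element i's test result becomes bit i of the returned integer."""
    mask = 0
    for i, cr in enumerate(cr_fields):
        if cr & test_bit:
            mask |= 1 << i
    return mask


def popcount(x):
    return bin(x).count("1")


eq_mask = cr_vector_to_mask([EQ, GT, EQ, LT], EQ)
print(bin(eq_mask), popcount(eq_mask))
```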
