[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed
bugzilla-daemon at libre-soc.org
Mon Oct 19 00:57:45 BST 2020
https://bugs.libre-soc.org/show_bug.cgi?id=213
--- Comment #51 from Jacob Lifshay <programmerjake at gmail.com> ---
After spending some time thinking about this, I came up with an idea: we should go
back to our ideal requirements for the ISA. Specifically, I think we should account
for the following:
1. We should design the ISA for what will work well on future processors.
2. We should not add extra ISA-level steps just because our current
microarchitecture might require them; that would hamstring future
microarchitectures that don't need those steps.
3. It's fine for our current microarchitecture to be non-optimal for less
common cases, such as very large VL values.
4. It's OK to add new instructions where necessary; we're doing new things,
after all.
5. It's OK to deviate from how Power's scalar ISA does things when there's a
better way.
Based on those points, I think we should do the following:
Use integer registers for vector masks.
I honestly think the CR registers are something of a wart in Power's scalar ISA:
they work more-or-less fine for scalar code, but they should not be extended into
vectors of CR registers. Running out of integer registers just because of masks is
not a concern; we have 128. Using CR registers also violates point 2, because one
of the top 3 or 4 most common operations we want is testing whether no lanes are
active and skipping a section of code based on that (used to implement control
flow in SIMT programs). With integer masks we can simply compare the generated
mask against zero or all-ones using scalar compare instructions. Additionally,
mask-generating instructions can record in CR0 (or CR1 for FP) whether all, none,
or some of the unmasked lanes produced set mask bits (a sketch of that CR0
semantics appears after the example below). That would shrink the following
common sequence to just 2 instructions:
outer_mask = ...
then_mask = vector_compare_le(outer_mask, a, b);
if (then_mask != 0) {
    // then_part with then_mask
}
...
assembly code:
...
    vec_cmp_le. then_mask, a, b, mask=outer_mask
    branch_not  cr0.any, skip
    // then_part with then_mask
skip:
...
which comes from the inner `if` in the following SIMT code:
if (outer_condition) { // could also be a loop instead
    ...
    if (a <= b) {
        // then_part
    }
    ...
}
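
For the `cr0.any` test above, here is a minimal sketch of how the proposed CR0
reflection could work, assuming a 64-bit integer mask; the struct and helper names
are purely illustrative, and which CR0 bits the all/none/some outcomes map to is
an open design choice:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical CR0 reflection for a mask-generating vector compare.
 * `result_mask` is the freshly generated mask; `enabled` is the mask of
 * lanes that were actually unmasked (active) for this instruction. */
struct cr0_bits {
    bool all_set;   /* every unmasked lane produced a 1 */
    bool none_set;  /* no unmasked lane produced a 1    */
    bool some_set;  /* at least one, but not all        */
};

static struct cr0_bits reflect_mask_to_cr0(uint64_t result_mask, uint64_t enabled)
{
    uint64_t active = result_mask & enabled;
    struct cr0_bits cr0;
    cr0.all_set  = (enabled != 0) && (active == enabled);
    cr0.none_set = (active == 0);
    cr0.some_set = !cr0.all_set && !cr0.none_set;
    return cr0;
}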
Other benefits of integer registers as masks:
- Load/store for spilling takes 1 instruction, not several, and uses far less
memory.
- All the fancy bit-manipulation instructions operate directly on integer
registers: find highest/lowest set bit, popcount, shifts, rotates, bit-cyclone,
etc. (see the sketch below).
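
As a rough illustration of that last point, here is a minimal sketch assuming a
64-bit integer mask, using compiler builtins that map to single scalar
instructions on Power (e.g. popcntd, cnttzd); the helper names are made up:

#include <stdint.h>

/* how many lanes are active? */
static inline int active_lane_count(uint64_t mask)
{
    return __builtin_popcountll(mask);
}

/* index of the first active lane (undefined if mask == 0) */
static inline int first_active_lane(uint64_t mask)
{
    return __builtin_ctzll(mask);
}

/* spilling a mask is a single ordinary store of one GPR */
static inline void spill_mask(uint64_t *slot, uint64_t mask)
{
    *slot = mask;
}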
Implementation strategies:
Optimize for the common case when VL is smaller than 16 (or 8). Using a larger
VL means we're likely to run out of registers very quickly for all but the
simplest of shaders, and our current microarchitecture is highly unlikely to
give much benefit past VL=8 or so.
We can split the one or two integer registers optimized for masking into
subregisters, but to avoid dependency-matrix problems between instructions, we
split them differently: every 8th (or 16th) bit is grouped into the same
subregister.
register bit    subregister number
bit 0           subregister 0
bit 1           subregister 1
bit 2           subregister 2
bit 3           subregister 3
bit 4           subregister 4
bit 5           subregister 5
bit 6           subregister 6
bit 7           subregister 7
bit 8           subregister 0
bit 9           subregister 1
bit 10          subregister 2
bit 11          subregister 3
bit 12          subregister 4
bit 13          subregister 5
bit 14          subregister 6
bit 15          subregister 7
bit 16          subregister 0
...
This allows us, for VL smaller than the number of subregisters per register, to
act as if every mask bit were an independent register, giving us all the
dependency-matrix goodness that comes with that. It also requires far fewer
additional registers than even the idea of extending CR to 64 fields. A small
sketch of the mapping follows.
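
As a concrete illustration of the interleaved layout, here is a minimal sketch
assuming 8 subregisters per mask register (the constant and function names are
illustrative only):

#include <stdint.h>

enum { NUM_SUBREGS = 8 };   /* assumed split; could also be 16 */

/* Which subregister a given mask bit belongs to, under the interleaved
 * layout where every 8th bit lands in the same subregister. */
static inline int subregister_of(int bit)
{
    return bit % NUM_SUBREGS;   /* bit 0 -> 0, bit 8 -> 0, bit 9 -> 1, ... */
}

/* Which element slot that bit occupies within its subregister. */
static inline int slot_within_subregister(int bit)
{
    return bit / NUM_SUBREGS;
}

With VL <= 8, each element's mask bit then lives in its own subregister, which is
what lets the dependency matrix treat the bits as independent.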