[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Mon Oct 19 17:56:22 BST 2020


https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #64 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #63)
> wait... wait... arrgh no that doesn't quite work, because in some cases you
> actually want 4 bits of the predicate mask to go to the SIMD-capable ALU,
> sometimes you want 2 bits (for 2xFP32), sometimes 1 bit (for 1xFP64) and so
> even an 8-bit subdivision is going to be sub-optimal.
> 
> argh.
> 
> haha.  you're going to find this amusing / ironic: this is precisely where
> using CRs as predicate masks would shine.
> 
> the load on the DMs would be horrendous unless we worked out a way to
> "batch" them.  and funnily enough, i've already implemented 8xCR "whole_reg"
> reading (and noted a bugreport to implement that "cascade" system when it
> comes to adding the DMs).

I'm advocating a similar thing, except at the bit-group level on 1 or 2
specially-optimized integer registers, instead of at the CR-field level with a
vector of CRs.

See comment 53 for details of how a 64-bit register can be broken into 8 8-bit
registers. I think we should support either 8 or 16 subgroups for 2 integer
registers designated as optimized for vector masks. We would design the
SVprefix such that compare ops target only those 2 registers, and the execution
mask for an instruction can be set to those 2 registers or their complement. If
we have space in SVprefix, we could expand to more than 2 registers. Also, not
all of the registers usable as masks have to be split up: we can fall back to
the less efficient non-vector-chaining approach when the compiler
intentionally picks a different register than the 1 or 2 we have optimized for
masks.
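As a rough illustration of the subgroup splitting mentioned above (a
hypothetical Python sketch, not the actual SVprefix encoding or hardware
implementation), a 64-bit mask register broken into 8-bit subgroups looks
like:

```python
def subgroups(reg64, width=8):
    """Split a 64-bit value into 64 // width chunks, LSB-first.

    Each chunk could be tracked as an independent sub-register for
    dependency-matrix purposes (illustrative only).
    """
    mask = (1 << width) - 1
    return [(reg64 >> (i * width)) & mask for i in range(64 // width)]

m = 0x0123456789ABCDEF
# LSB-first: lowest byte comes out first
assert subgroups(m) == [0xEF, 0xCD, 0xAB, 0x89, 0x67, 0x45, 0x23, 0x01]
# 16 subgroups of 4 bits each also works with the same scheme
assert len(subgroups(m, width=4)) == 16
```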

As for which registers to use for masks, I think at least 1 of them should be
the 1st argument register/return register (since passing execution masks
between functions is common) and one should be a callee-saved register; the
rest can be selected as needed.

Importantly, I'd strongly argue for a dense bitvector as the mask format,
rather than using the LSB of 8-bit (or wider) elements, since that works much
better with the bit-manipulation instructions. E.g. find-first-clear-mask-lane
becomes a not (potentially combinable with the op generating the mask)
followed by a find-lowest-set-bit, which directly gives the lane index.
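The find-first-clear sequence described above can be sketched in Python (the
function name is illustrative, not an actual ISA mnemonic; the `inv & -inv`
trick stands in for a hardware find-lowest-set-bit / count-trailing-zeros):

```python
def find_first_clear_lane(mask, vl):
    """Index of the lowest clear bit within VL lanes, or None if all set."""
    inv = ~mask & ((1 << vl) - 1)      # the NOT step, limited to VL lanes
    if inv == 0:
        return None                     # every lane is active
    return (inv & -inv).bit_length() - 1  # isolate lowest set bit -> index

# mask 0b10111 with VL=5: lanes 0,1,2,4 set, lane 3 is the first clear one
assert find_first_clear_lane(0b10111, 5) == 3
assert find_first_clear_lane(0b11111, 5) is None
```

With a dense mask this is two cheap scalar ops; no scaling or lane-widening
fixups are needed.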

By contrast, using 8-bit lanes for masks means we'd have to add extra logic to
handle VL > 8, we'd have to scale the result (an extra shift instruction), and
we'd have to ensure that lanes have all bits set before inverting them. If we
instead decide that a set ("on") lane generates 0xFF instead of 0x01, then
popcount is likewise messed up.

All of the above mess is solved efficiently by just having 1 bit per lane.
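To make the popcount contrast concrete (a hypothetical Python sketch; the
bit patterns are made-up examples):

```python
# Dense 1-bit-per-lane mask: popcount directly counts active lanes.
mask = 0b10110101
active = bin(mask).count("1")
assert active == 5

# 8-bit lanes with "on" == 0xFF: popcount over-counts by 8x, needing an
# extra divide/shift to recover the lane count (and "on" == 0x01 would
# instead need every lane masked to its LSB first).
wide = 0xFF00FFFF00FF00FF
active_wide = bin(wide).count("1") // 8
assert active_wide == 5
```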

Vectorized CRs still have a bunch of the above mess, because they aren't 1 bit
per lane. Also, they have an ISA-level limiting effect on large VLs, because
the 64 CRs are quickly exhausted when multiple masks are needed (common in
non-trivial shaders).

-- 
You are receiving this mail because:
You are on the CC list for the bug.


More information about the Libre-SOC-ISA mailing list