[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

Mon Oct 19 21:29:07 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #70 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #64)

> By contrast, using 8-bit lanes for masks means we'd have to add extra logic
> to handle VL > 8 and we'd have to handle scaling the result (an extra shift
> instruction), and we'd have to handle making sure that lanes have all bits
> set before inverting them. If we instead decide to have an on lane generate
> 0xFF instead of 0x01, then popcount is likewise messed up.
> 
> All of the above mess is solved efficiently by just having 1 bit per lane.

ok, the problem is that it's not that simple (never is).  there is no concept
of "lanes" in SV.  or rather there is, but they're the ALU widths (which will be
either 64-bit SIMD or, as we did discuss, 32-bit SIMD with the regfile split
into HI-32 and LO-32, so that 64-bit operations need a pair of 32-wide ALUs to
collaborate).

these ALU widths are completely divorced from architectural (ISA, SV) element
widths, and consequently no choice of bit-width for predicate lanes - whether
it be 8-bit, 16-bit or anything else - is going to cut it.

the reason is:

* when you request elwidth=8bit operations, you need *8* predicate bits
  to be allocated (routed) to a given 64-bit SIMD ALU
* when you request elwidth=16bit, that's 4 predicate bits
* elwidth=32 bit that's 2 predicate bits
* elwidth=64 bit, that's only 1 predicate bit

the routing and DM allocation on that - the subdivision of the 8-bit masks
concept - is going to be a pig.
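
as a rough sketch of the allocation listed above (not the actual hardware
routing): the number of predicate bits a 64-bit SIMD ALU consumes per operation
is simply the ALU width divided by elwidth, so the same mask has to be carved
up differently for each elwidth.  the function names and the "element 0 is
bit 0" mask layout are assumptions made purely for illustration.

```python
ALU_WIDTH = 64  # bits; the 64-bit SIMD ALU case from the list above


def predicate_bits_per_alu(elwidth, alu_width=ALU_WIDTH):
    """Number of elements (hence predicate bits) one ALU handles per operation."""
    if elwidth <= 0 or alu_width % elwidth != 0:
        raise ValueError("elwidth must be positive and divide the ALU width")
    return alu_width // elwidth


def route_predicates(mask, vl, elwidth, alu_width=ALU_WIDTH):
    """Split a VL-bit element mask into per-ALU-operation chunks.

    Assumed (hypothetical) layout: element 0 is bit 0 of `mask`.
    """
    n = predicate_bits_per_alu(elwidth, alu_width)
    chunks = []
    for start in range(0, vl, n):
        width = min(n, vl - start)  # the last chunk may be partial
        chunks.append((mask >> start) & ((1 << width) - 1))
    return chunks


if __name__ == "__main__":
    for ew in (8, 16, 32, 64):
        print("elwidth=%d -> %d predicate bits" % (ew, predicate_bits_per_alu(ew)))
    # VL=10 at elwidth=16: chunks of 4, 4 and 2 predicate bits
    print([bin(c) for c in route_predicates(0b1011011001, 10, 16)])
```

the chunk size changing with elwidth is the point: a fixed 8-bit mask
granularity only lines up with the elwidth=8 case.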

> Vectorized CRs still have a bunch of the above mess, because they aren't 1
> bit per lane.

again you're conflating the (false/inapplicable) concept of "lanes" as being an
architectural concept in SV elements, where it can't actually be applied.  i
know it works in Cray-style Vector ISAs, but it doesn't work here.

the only thing that's really going to work is to have *element* based
predicates.  Cray-style architectures (including RVV) do this by allocating an
entire element of a vector as a predicate (ignoring all but the *one* LSB).

our equivalent is "registers".  actual scalar registers.

in other words: to solve the problem that you highlighted (overlaps) we *need*
each predicate to be in *independent* scalar registers.

and it turns out that PowerISA has something that we already plan to allocate
DM space for, even though each one is only 4 bits wide: CRs.

so Vectorised CRs _are_ a bit of a mess, but they're a mess because unusual
bitmanip ops don't exist for them (only AND/OR/NAND/XOR etc.) and that can be
solved by just vectorising mfcr, and running int scalar bitmanip ops.  which we
can macro-op fuse if we really want to (later).
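
a minimal sketch of that idea, assuming the standard scalar mfcr layout (eight
4-bit CR fields packed into 32 bits, first field most significant, bits within
a field ordered LT, GT, EQ, SO).  how a vectorised mfcr would actually pack
more than eight CRs is not decided here, so this only models one group of
eight; gather_bit and popcount are hypothetical helpers standing in for the
scalar integer bitmanip ops.

```python
# Bit values within one 4-bit CR field, MSB first: LT, GT, EQ, SO.
LT, GT, EQ, SO = 8, 4, 2, 1


def mfcr(cr_fields):
    """Pack eight 4-bit CR fields into 32 bits, first field most significant."""
    if len(cr_fields) != 8:
        raise ValueError("this sketch models exactly eight CR fields")
    value = 0
    for field in cr_fields:
        value = (value << 4) | (field & 0xF)
    return value


def gather_bit(packed, bit, nfields=8):
    """Collect one chosen bit from each field into a dense mask.

    Assumed layout of the result: element 0 ends up in bit 0.
    """
    mask = 0
    for i in range(nfields):
        shift = 4 * (nfields - 1 - i)  # field i sits i nibbles from the top
        if (packed >> shift) & bit:
            mask |= 1 << i
    return mask


def popcount(x):
    """Count set bits, standing in for a scalar popcount instruction."""
    return bin(x).count("1")


if __name__ == "__main__":
    crs = [EQ, GT, EQ | SO, LT, EQ, GT, LT, EQ]
    packed = mfcr(crs)
    eq_mask = gather_bit(packed, EQ)
    print(hex(packed), bin(eq_mask), popcount(eq_mask))
```

once the chosen bit is in a dense integer mask, ops such as popcount behave
exactly as they would on a 1-bit-per-element predicate.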

> Also, they have a ISA-level limiting effect on large VLs
> because of quickly running out of the 64 CRs when you need multiple masks
> (common in non-trivial shaders).

i think we can solve that one by doing 128 CRs.  that gives a total of
128x4=512 bits worth of predicate mask space.  and intregs can be used as
"spill" if we absolutely have to.

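to put numbers on that, a back-of-envelope sketch.  the 128 CRs and 4 bits per
CR come from the paragraph above; the assumption that each mask occupies one
whole CR per element is mine, for illustration only, and masks_that_fit is a
hypothetical helper.

```python
NUM_CRS = 128    # proposed CR count from the paragraph above
BITS_PER_CR = 4  # each CR field is 4 bits wide


def total_predicate_bits(num_crs=NUM_CRS):
    """Raw predicate storage available across all CRs."""
    return num_crs * BITS_PER_CR


def masks_that_fit(vl, num_crs=NUM_CRS):
    """Whole masks available if each mask uses one CR per element (assumption)."""
    if vl <= 0:
        raise ValueError("vl must be positive")
    return num_crs // vl


if __name__ == "__main__":
    print(total_predicate_bits())            # 128 * 4
    for vl in (8, 16, 64):
        print(vl, masks_that_fit(vl, 64), masks_that_fit(vl))
```

under that assumption, going from 64 to 128 CRs doubles the number of
simultaneous masks at any given VL, with intregs as the fallback beyond that.
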