[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed
bugzilla-daemon at libre-soc.org
Mon Oct 19 21:29:07 BST 2020
https://bugs.libre-soc.org/show_bug.cgi?id=213
--- Comment #70 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #64)
> By contrast, using 8-bit lanes for masks means we'd have to add extra logic
> to handle VL > 8 and we'd have to handle scaling the result (an extra shift
> instruction), and we'd have to handle making sure that lanes have all bits
> set before inverting them. If we instead decide to have an on lane generate
> 0xFF instead of 0x01, then popcount is likewise messed up.
>
> All of the above mess is solved efficiently by just having 1 bit per lane.
ok, the problem is that it's not that simple (it never is). there is no
architectural concept of "lanes" in SV. the closest thing is the ALU widths:
these will be either 64-bit SIMD or, as we discussed, 32-bit SIMD with the
regfile split into HI-32 and LO-32, so that 64-bit operations need a pair of
32-wide ALUs to collaborate.
these ALU widths are completely divorced from the architectural (ISA, SV)
element widths, and consequently no choice of bit-width for predicate lanes,
whether 8-bit or 16-bit, is going to cut it.
the reason is:
* when you request elwidth=8 operations, you need *8* predicate bits
to be allocated (routed) to a given 64-bit SIMD ALU
* elwidth=16 needs 4 predicate bits
* elwidth=32 needs 2 predicate bits
* elwidth=64 needs only 1
the routing and DM allocation on that - the subdivision of the 8-bit masks
concept - is going to be a pig.
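to make the arithmetic above concrete, here is a minimal python sketch. the function names are hypothetical, and route_predicate assumes elements are packed contiguously into 64-bit ALU words starting from element 0, which is an illustration only and not a statement about the actual hardware routing.

```python
ALU_WIDTH = 64  # bits: the 64-bit SIMD ALU case discussed above


def predicate_bits_per_alu(elwidth, alu_width=ALU_WIDTH):
    """Number of predicate bits one ALU needs for a given element width."""
    if elwidth <= 0 or alu_width % elwidth != 0:
        raise ValueError("elwidth must evenly divide the ALU width")
    return alu_width // elwidth


def route_predicate(mask, vl, elwidth, alu_width=ALU_WIDTH):
    """Split a VL-bit predicate mask into the chunk each ALU word receives.

    Assumption (hypothetical): element i lives in ALU word i // n, where
    n is the number of elements per ALU word.
    """
    n = predicate_bits_per_alu(elwidth, alu_width)
    chunks = []
    for start in range(0, vl, n):
        width = min(n, vl - start)  # the last chunk may be partial
        chunks.append((mask >> start) & ((1 << width) - 1))
    return chunks


if __name__ == "__main__":
    for ew in (8, 16, 32, 64):
        print(ew, predicate_bits_per_alu(ew))
    print(route_predicate(0b1011011001, vl=10, elwidth=16))
```

the chunk size changes with every elwidth, so a fixed 8-bit (or 16-bit) predicate grouping only lines up with the ALU for one of the four cases.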
> Vectorized CRs still have a bunch of the above mess, because they aren't 1
> bit per lane.
again, you're treating "lanes" as an architectural concept that applies to SV
elements, where it can't actually be applied. i know it works in Cray-style
Vector ISAs, but it doesn't work here.
the only thing that's really going to work is to have *element* based
predicates. Cray-style architectures (including RVV) do this by allocating an
entire element of a vector as a predicate (ignoring all but the *one* LSB).
our equivalent is "registers". actual scalar registers.
in other words: to solve the problem that you highlighted (overlaps) we *need*
each predicate to be in *independent* scalar registers.
and it turns out that PowerISA already has something that we had planned to
allocate DM space for anyway, even though they're only 4 bits wide: CRs.
so Vectorised CRs _are_ a bit of a mess, but they're a mess because unusual
bitmanip ops don't exist for them (only AND/OR/NAND/XOR etc.). that can be
solved by just vectorising mfcr and running scalar int bitmanip ops on the
result, which we can macro-op fuse later if we really want to.
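a rough python model of that idea: gather one bit from each CR field in the vector into an ordinary integer, then use normal integer bitmanip (popcount, invert) on it. this is a sketch only. gather_cr_bit is a hypothetical stand-in, not the semantics of any actual vectorised mfcr, and placing element 0 in bit 0 of the result is an assumption.

```python
# A 4-bit CR field viewed as an integer, most significant bit first:
# LT, GT, EQ, SO.
LT, GT, EQ, SO = 0b1000, 0b0100, 0b0010, 0b0001


def gather_cr_bit(cr_fields, bit_mask):
    """Pack one chosen bit from each CR field into an integer mask.

    Hypothetical helper: element i's bit lands in bit i of the result.
    """
    mask = 0
    for i, cr in enumerate(cr_fields):
        if cr & bit_mask:
            mask |= 1 << i
    return mask


def popcount(x):
    """Count set bits with plain integer operations."""
    return bin(x).count("1")


def invert_mask(mask, vl):
    """Invert a predicate mask, keeping only the low VL bits."""
    return ~mask & ((1 << vl) - 1)


if __name__ == "__main__":
    crs = [EQ, LT, EQ, GT]        # one CR field per element, VL=4
    m = gather_cr_bit(crs, EQ)    # elements 0 and 2 compared equal
    print(bin(m), popcount(m), bin(invert_mask(m, len(crs))))
```

once the bits are in an integer register, popcount and inversion are the ordinary scalar operations, with none of the per-lane scaling issues raised in the quoted comment.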
> Also, they have a ISA-level limiting effect on large VLs
> because of quickly running out of the 64 CRs when you need multiple masks
> (common in non-trivial shaders).
i think we can solve that one by doing 128 CRs. that gives a total of
128x4=512 bits worth of predicate mask space. and intregs can be used as
"spill" if we absolutely have to.
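the capacity arithmetic, as a small python sketch. masks_available assumes one whole CR is consumed per element per mask (following the "independent scalar registers" argument above); that accounting and both function names are assumptions for illustration.

```python
CR_BITS = 4  # each CR field is 4 bits wide


def mask_space_bits(num_crs):
    """Total bits of predicate storage across all CR fields."""
    return num_crs * CR_BITS


def masks_available(num_crs, vl):
    """Independent VL-element masks that fit, one CR per element.

    Assumption: each element of each mask takes one whole CR field.
    """
    if vl <= 0:
        raise ValueError("vl must be positive")
    return num_crs // vl


if __name__ == "__main__":
    for n in (64, 128):
        print(n, mask_space_bits(n), masks_available(n, 16))
```

under that assumption, doubling from 64 to 128 CRs doubles the number of simultaneous masks at any given VL, and anything beyond that would have to spill to intregs.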