[Libre-soc-dev] cray-style vector of 40 years setting VL=0 at runtime

Sun Oct 2 18:22:59 BST 2022

On Sun, Oct 2, 2022, 09:41 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> > and therefore complaining about how they don't match Cray is pointless
> > because Cray didn't set any precedent for how they should behave.
>
> you still cannot put ambiguous comments into high-profile examples
> that could be *interpreted* by readers looking for excuses to deliberately
> misjudge our work as being that we have no idea what we are talking
> about, or that we have designed something that's "complete incompetent
> rubbish".
>

ok, good point. though if they're looking for ways to discredit us, this is
far from the only place that can be misinterpreted or twisted, so imho we
should generally instead worry about the average reader who just wants to
understand and isn't out to get us.

> > This was committed before SVP64 Scalar prefixes existed, so at that
> > point in time scalar operations that access high registers *were* poorly
> > thought out, as I have pointed out multiple times in the past (SVP64
> > Scalar prefix mostly fixed that -- imho we still need something for
> > subvectors, having all arguments be scalar (even if subvl!=1)
> > for the standard SVP64 prefix should ignore VL and only execute 1
> > subvector).
>
> that's a tricky one, which would need some strong justification as to
> why simply using setvl with VL=2/3/4 instead is insufficient, or if there
> is sufficient usage to warrant the 2 bit budget for subvl in SVP64Single.
>

imho it should be in SVP64, not SVP64Single: simply reuse the case where all
scalar/vector bits are set to scalar to mean "temporarily override VL to 1".
a much weaker justification is needed there, because subvl is already in
SVP64 and so no extra encoding bits are required. This leaves SVP64Single's
spare bits free for other important purposes, such as not taking up all the
encoding space, or for other options needed for SVP64Single.

One justification for why a VL=1 override with subvl is needed is compiling
traditional SIMD code that isn't rewritten to be SV-style (e.g. WASM's
128-bit SIMD extension, currently/soon widely supported as the main
SIMD/vectorization extension for WASM). There the code will often change
which element size (and therefore vector length) it operates on every few
instructions. scalar-mode SVP64 with subvl=2 (f64x2 or i64x2 (128-bit SIMD))
or subvl=4 (f32x4 or i32x4 (128-bit SIMD) or f64x4 or i64x4 (256-bit SIMD))
captures a large portion of that traditional SIMD code without requiring
setvl every few instructions. actually, if VL is left set to 4, then
scalar-mode SVP64 (VL=1 override mode) and vector-mode SVP64, combined with
subvl=1/2/4, cover all possible element counts for 128-bit traditional SIMD
(2, 4, 8 and 16), with no setvl needed (except at initial entry to the code).
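
As a rough sanity check of that counting argument, here is a minimal Python
sketch. It models the *proposed* behaviour only, not anything SVP64 currently
specifies: it assumes that an instruction with all operands marked scalar
overrides VL to 1, and that the number of elements processed is the effective
VL times subvl. The function names are made up for illustration.

```python
from itertools import product

SIMD_BITS = 128
ELEMENT_WIDTHS = (8, 16, 32, 64)  # element sizes used by 128-bit SIMD


def elements_processed(vl, subvl, all_scalar):
    """Hypothetical model: all-scalar operands temporarily override VL to 1."""
    effective_vl = 1 if all_scalar else vl
    return effective_vl * subvl


def reachable_counts(vl=4, subvls=(1, 2, 4)):
    """Map each reachable element count to the (mode, subvl) pairs giving it."""
    table = {}
    for all_scalar, subvl in product((True, False), subvls):
        count = elements_processed(vl, subvl, all_scalar)
        mode = "scalar" if all_scalar else "vector"
        table.setdefault(count, []).append((mode, subvl))
    return table


# Element counts a 128-bit SIMD instruction can need: 128/8, 128/16, ...
needed = {SIMD_BITS // width for width in ELEMENT_WIDTHS}
reachable = reachable_counts()

for count in sorted(needed):
    print(count, "elements via", reachable.get(count, "nothing (needs setvl)"))
```

With VL left at 4, every count in `needed` shows up as a key in `reachable`,
which is the "no setvl needed" claim; changing `vl` or `subvls` in the call
shows which combinations would leave gaps.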

>
> from this:
>    https://bugs.libre-soc.org/show_bug.cgi?id=905#c1
> it looks like there's 2 bits spare: the only question is, would even a
> small loop fly in the lower Compliancy Levels for Embedded?
>
> one of the advantages of SVP64Single (with no loops at all) is that
> it brings predication and elwidth overrides to the entire Scalar Power
> ISA as well as extending the regfile sizes, which is quite attractive
> on its own merit.  BF16 and FP16 is introduced right across the board
> with absolutely no need to design new opcodes, at all.
>
> adding even any kind of looping in there? i'm ambivalent but concerned
> about the cost of looping in an Embedded SVP64Single environment.
>

if you add subvl to SVP64Single, simply define SVP64Single subvl!=1 to
require an additional extension and trap on tiny embedded cpus that don't
support that extension.

Jacob