[Libre-soc-dev] cray-style vector of 40 years setting VL=0 at runtime

Tue Oct 4 18:28:36 BST 2022

On Tue, Oct 4, 2022 at 5:04 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Tue, Oct 4, 2022, 05:48 lkcl <luke.leighton at gmail.com> wrote:
>>
>> summary so far, almost complete:
>> https://libre-soc.org/openpower/sv/svp64/discussion/
>
>
> looks mostly good.
>
> it would also be useful to have an example where setvl is used to dynamically set vl where it could be zero and then need to use unconditional scalar/subvector ops.

yes, given that it may need justifying to the OPF ISA WG.

> maybe extract the example from svp64 utf-8 validation?

good idea.

> i just realized packed SIMD isn't such a good motivation because if we know VL=4, sv.add/subvl=2 r12, r14, r16 already just executes exactly one subvector with 2 elements as is needed for emulating 128-bit packed SIMD.

which would need sv.offset and no subvl anyway, or single-bit
predication to extract an elwidth-overridden element... yes.

> imho given that register numbers have to be decoded in the issue pipeline to set the dependency matrixes correctly anyway (or register renaming for more traditional OoO microarchitectures, or register dependencies for stalling on in-order microarchitectures), it shouldn't be that huge of a problem.

it's a huge concern: you are drastically underestimating the
complexity of multi-issue, which is critically dependent on
knowing the exact length VL at all times.

multi-issue is *not* straightforward, you need to sequentially
identify and issue *VL* worth of data to backend ALUs, all
sequentially interleaved with Scalar (v3.0) instructions.

without Auto-Scalar (Auto-VL=1), aside from FFirst, it is
dirt simple: "is it vector? yes: reserve VL RSes for first batch,
otherwise reserve QTY 1 of RSes for Scalar op. do same for next op".
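
a rough sketch of that rule, purely illustrative (Python model, not
HDL): the names Op, reserve_counts and issue_slots are hypothetical
and are not taken from any Libre-SOC source. it ignores FFirst and
the splitting of large VL into batches, and assumes VL is already
known at issue time. the point it shows is that each op's starting
slot is a running sum over every earlier op's count, so every
Vector op ahead of it puts VL on the critical path.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical decoded instruction: only the vector/scalar flag matters here."""
    name: str
    is_vector: bool


def reserve_counts(ops, vl):
    """Number of Reservation Stations to reserve per op, in program order.

    Vector ops reserve VL entries, Scalar ops reserve exactly one.
    """
    return [vl if op.is_vector else 1 for op in ops]


def issue_slots(ops, vl):
    """Starting RS slot of each op, plus the total reserved.

    The start of op N is the sum of the counts of ops 0..N-1, which is
    the sequential dependency chain on VL.
    """
    starts = []
    next_free = 0
    for count in reserve_counts(ops, vl):
        starts.append(next_free)
        next_free += count
    return starts, next_free


if __name__ == "__main__":
    ops = [Op("sv.add", True), Op("addi", False), Op("sv.mul", True)]
    for vl in (4, 0):
        print(vl, reserve_counts(ops, vl), issue_slots(ops, vl))
```

with VL=0 the Vector ops in this model reserve nothing at all and
only the Scalar addi takes a slot. anything that changes the
per-instruction count (such as an Auto-VL=1 override) has to be
resolved before the running sum can advance.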

that means there is a *chain* of critical dependencies on VL
in multi-issue (similar to identifying whether the instruction is
32 or 64 bit).

now throw partial-decode to get the EXTRA2/3 into the mix on
that one.

it's doable, and luckily it's still "forward-progressive": no need to
try to go back in time or try to link in state information half way
down one of the parallel Decode/Issue pipelines.

>> 2) is losing the ability to test all *relevant* bits of a predicate mask
>> worth it?

> imho if fail-first is illegal in scalar mode, the fail-first bits could be reused, for scalar mode only, as a please-give-me-full-vl mode (with a better name).

things are getting pretty complex already, and there is now a
high cost associated with changes in decode (binutils).

> this retains basically all relevant behavior since afaict fail-first wasn't that useful for scalar mode even without VL=1 overrides.

there's only one element, VL ends up being truncated to the first
nonzero predicate mask bit.  i can see that being useful even on a
nominal sv.nop (sv.ori./ff=eq/m=r3 r0,r0,0)

l.