[Libre-soc-dev] scalar instructions and SVP64

Wed Mar 10 18:15:15 GMT 2021

On Tue, Mar 9, 2021, 19:39 Adam Van Ymeren <adam at vany.ca> wrote:

> On March 9, 2021 5:05:02 p.m. PST, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >On Tue, Mar 9, 2021, 16:50 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> >wrote:
> >
> >> On Wednesday, March 10, 2021, Jacob Lifshay
> ><programmerjake at gmail.com>
> >> wrote:
> >>
> >> > https://libre-soc.org/irclog/%23libre-soc.2021-03-10.log.html#t2021-03-10T00:11:21
> >> >
> >> > You see why we need VL to be ignored when there aren't any vector
> >> > arguments?
> >>
> >>
> >> late (1am), the short answer is no.
> >>
> >
> >Well, then we need to change the spec since a "no" answer is
> >effectively
> >unworkable from a SW perspective, as (poorly) illustrated by that
> >example
> >code.
>
> Can you outline in more detail why it is unworkable from a SW
> perspective?  It seems from a cursory look easily workable by just setting
> VL=1 and element size to 64-bit.
>

The reason is that GPU code has scalar and vector code mixed together
everywhere, so SV not having separate scalar instructions could increase
instruction count by >10% due to the additional setvl instructions. It
could also greatly increase pipeline stalls on dynamic setvl, since the
VL-loop stage has to wait for the setvl to execute before it knows what VL
to use (for setvli a stall isn't needed, since the decoder can just use the
value from the immediate field, no wait required).

This comes from the basic compiler optimization of using scalar
instructions to do an operation once, then splat if needed, instead of
using vector instructions when all vector elements are known to be
identical. Basically all modern GPUs have that optimization, also described
as detecting the difference between uniform and varying SIMT variables
before the whole-function-vectorization pass. If that optimization isn't
done, power consumption will increase substantially and the code will take
longer to run due to the many additional instructions jamming up the issue
width.
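
As a rough illustration of where the extra setvl instructions come from,
here is a toy model, not a measurement. It assumes every switch between a
run of vector ops and a run of scalar ops costs one setvl, and that VL
starts out set for vector work. The function name and the "s"/"v" tagging
are made up for this sketch.

```python
def count_extra_setvl(ops):
    """Count the setvl instructions needed if scalar ops must run with VL=1.

    ops: program-order sequence of "s" (scalar) or "v" (vector) tags.
    Assumption: VL is initially set up for vector code, and every change
    between scalar and vector mode costs exactly one setvl.
    """
    extra = 0
    mode = "v"
    for op in ops:
        if op != mode:
            extra += 1  # one setvl to switch VL for the new mode
            mode = op
    return extra


# Seven useful ops with two isolated scalar ops in between vector ops.
print(count_extra_setvl("vvsvvsv"))  # 4
```

With separate scalar instructions the same sequence would need no mode
switches at all, which is the overhead the >10% estimate above refers to.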

Decoding should only take a few more gates (on the order of 10, certainly
fewer than 100), since you just have a few separate gates to OR all
vector/scalar SVP64 bits together for each SVP64 prefix kind (there are
only around 5 kinds) and use a mux based on the decoded number of registers
(which I expect the decoder to need anyway for dependency matrix stuff) to
select which OR gate's output to use. This produces a vector_and_not_scalar
signal that should be easy to add to the VL-loop stage.
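
A minimal behavioural sketch of that signal, in plain Python rather than
HDL. The field layout is hypothetical: the real SVP64 prefix kinds are not
modelled, and "one OR per register count" merely stands in for "one OR gate
per prefix kind". Only the name vector_and_not_scalar comes from the text
above; vl_loop_count is an illustrative helper.

```python
def vector_and_not_scalar(vec_bits, num_regs):
    """Return True if any register field the instruction uses is a vector.

    vec_bits: per-register-field vector/scalar flags from the SVP64 prefix
              (True = vector); the layout is an assumption of this sketch.
    num_regs: decoded number of register fields the instruction uses.
    """
    # One OR result per possible register count (the "OR gates").
    or_outputs = [any(vec_bits[:k]) for k in range(len(vec_bits) + 1)]
    # The mux: select the OR output matching the decoded register count.
    return or_outputs[num_regs]


def vl_loop_count(vl, is_vector):
    """Element operations the VL-loop stage would issue for one instruction."""
    # A scalar instruction ignores VL and runs exactly once, even if VL == 0.
    return vl if is_vector else 1


bits = [False, False, True]  # third field is marked vector
print(vector_and_not_scalar(bits, 2))  # only two fields used: scalar
print(vl_loop_count(0, vector_and_not_scalar(bits, 2)))
```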

If the VL-loop logic is a stage after the decoder instead of merged into
the last stage of the decoder, we won't need any of the speculative decode
logic in the following paragraph, since all the required information has
either already propagated in the previous clock cycle or has plenty of time
to propagate in the current clock cycle if just the vector/scalar part is
decoded in the VL-loop logic stage. This would be on the order of 10 gates
with a propagation delay of 2-3 gates from the beginning of the clock
cycle -- easily achievable.

If the above doesn't work, we can fall back to the more complex
implementation: decoding still won't take longer, since the decoder can
speculatively issue max(1, min(issue_width, VL)) instructions and cancel
all but 1 in the next cycle if it determines that the instruction is
scalar -- the canceled instructions may not even need to be scheduled,
since they can be removed before getting to the scheduler if there's enough
pipeline time.
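
The fallback can be sketched as two small functions. The function names are
invented for illustration. One detail is an assumption of this sketch and
not stated above: for a vector instruction with VL = 0, the single
speculatively issued slot is cancelled as well.

```python
def speculative_issue_count(issue_width, vl):
    """Slots issued before the decoder knows whether the op is scalar."""
    return max(1, min(issue_width, vl))


def kept_after_resolve(issued, is_scalar, vl):
    """Slots that survive once the scalar/vector decision is known."""
    if is_scalar:
        return 1  # cancel all but one
    # Assumption: a vector op keeps at most VL slots, so none when VL == 0.
    return min(issued, vl)


issued = speculative_issue_count(issue_width=4, vl=8)
print(issued, kept_after_resolve(issued, is_scalar=True, vl=8))  # 4 1
```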

Though,

> I can see it being a bit of extra work for compilers to know that
> accessing high numbered registers requires a bit of extra work but that
> really doesn't sound unworkable to me.
>

Technically that would work, it would just mean that scalar code could be
severely limited.

>
> I can however imagine that adding a special case for VL==0 could result in
> many more gates/complication for the OoO/scoreboard execution engine.
>

That's handled in the fetch pipeline before the scheduler in the VL-loop
stage, so the stuff after that doesn't know/care since the instructions are
removed before then.

Also, in my mind SV should always have had full separate scalar
arguments/instructions, since otherwise we get a half-done attempt at
having scalar code that makes the compiler *more* complex. The compiler
already has to handle separate codepaths for scalar and vector
instructions. Without the ISA-level concept of scalar SVP64 instructions,
translating scalar instructions picks up many more special cases, since
they may need to be converted to effectively vector instructions with VL
guaranteed to be non-zero. That often requires saving VL (if it was
modified by fail-on-first), overwriting VL, running the scalar op, then
overwriting VL again to restore it for the surrounding vector ops.
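
To make the compiler-side cost concrete, here is a hedged sketch of the
lowering decision described above. The pseudo-mnemonics (save_vl, set_vl,
restore_vl) and the function name are hypothetical placeholders, not real
SVP64 or Power ISA instructions.

```python
def lower_scalar_op(op, isa_has_scalar_svp64, vl_from_ffirst):
    """Return the instruction sequence a compiler would emit for one scalar op.

    op: the scalar operation, as a string.
    isa_has_scalar_svp64: True if the ISA lets the op ignore VL.
    vl_from_ffirst: True if VL was modified by fail-on-first, so its value
                    is only known at run time and must be saved.
    """
    if isa_has_scalar_svp64:
        return [op]  # nothing else needed
    seq = []
    if vl_from_ffirst:
        seq.append("save_vl tmp")  # hypothetical: copy VL to a temporary
    seq.append("set_vl 1")  # guarantee VL is non-zero for the scalar op
    seq.append(op)
    # Restore VL for the surrounding vector ops.
    seq.append("restore_vl tmp" if vl_from_ffirst else "set_vl <original>")
    return seq


for line in lower_scalar_op("add r3, r4, r5", False, True):
    print(line)
```

With ISA-level scalar instructions the same call returns just the one op.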

Jacob
