[Libre-soc-isa] [Bug 614] New: Modify SVP64 to support scalar instructions (where VL==0 doesn't convert them to nop)

Thu Mar 11 17:37:11 GMT 2021

https://bugs.libre-soc.org/show_bug.cgi?id=614

            Bug ID: 614
           Summary: Modify SVP64 to support scalar instructions (where
                    VL==0 doesn't convert them to nop)
           Product: Libre-SOC's first SoC
           Version: unspecified
          Hardware: Other
                OS: Linux
            Status: CONFIRMED
          Severity: enhancement
          Priority: ---
         Component: Specification
          Assignee: programmerjake at gmail.com
          Reporter: programmerjake at gmail.com
                CC: libre-soc-isa at lists.libre-soc.org
   NLnet milestone: ---

Motivation:
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-March/002116.html
> The reason is that GPU code has scalar and vector code mixed together
> everywhere, so SV not having separate scalar instructions could increase
> instruction count by >10% due to the additional setvl instructions, also it
> could greatly increase pipeline stalls on dynamic setvl, since the VL-loop
> stage has to wait for the setvl execution to know what VL to use (for
> setvli a stall isn't needed since the decoder can just use the value from
> the immediate field, no wait required). This comes from the basic compiler
> optimization of using scalar instructions to do an operation once, then
> splat if needed, instead of using vector instructions when all vector
> elements are known to be identical. Basically all modern GPUs have that
> optimization, also described as detecting the difference between uniform
> and varying SIMT variables before the whole-function-vectorization pass. If
> that optimization isn't done, it will increase power consumption
> substantially and will take longer to run due to the many additional
> instructions jamming up the issue width.
> ...
> Also, in my mind SV should have always had full separate scalar
> arguments/instructions, since otherwise we get a half-done attempt at
> having scalar code that makes the compiler *more* complex -- the compiler
> already has to handle having separate codepaths for scalar and vector
> instructions, just, without the ISA-level concept of scalar SVP64
> instructions, it adds many more special cases to translating scalar
> instructions, since they may need to be converted to effectively vector
> instructions with VL needing to be guaranteed non-zero, often requiring
> saving VL (if it was modified from fail-on-first), overwriting VL, running
> the scalar op, then overwriting VL again to restore VL for the surrounding
> vector ops.

Alternative #1:
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-March/002124.html
> An alternative option that achieves the same end goal without needing to
> move the decoder is to use the scalar/vector-bit for the first/dest reg
> (which is always in the same spot -- instruction forms without a dest reg
> can have their SVP64 register fields moved one reg field over to make
> space) as a whole-instruction scalar/vector-bit, the operations that that
> removes (those with scalar dest but vector arguments -- which are not
> common instructions) can be effectively substituted with scalar mv.x.
> Since the bit is always in the same spot and all instructions have that
> bit, decoding it from the SVP64 prefix then becomes utterly trivial.
> This also simplifies the logic for the SV loop FSM since it no longer needs
> to implement the write-once-then-finish logic which I expect to be quite
> complex.
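
A minimal Python sketch of the decode rule described in Alternative #1, assuming
the per-register scalar/vector bits are given as a sequence of booleans with
the first/dest register field at index 0. The function names and this input
layout are hypothetical, chosen only for illustration; they are not taken from
the SVP64 spec.

```python
from typing import Sequence


def whole_instruction_is_vector_alt1(reg_is_vector: Sequence[bool]) -> bool:
    """Alternative #1: the first/dest register's bit decides for the whole
    instruction. reg_is_vector[0] is assumed to be the first/dest field."""
    if not reg_is_vector:
        raise ValueError("expected at least one register field")
    return bool(reg_is_vector[0])


def given_up_by_alt1(reg_is_vector: Sequence[bool]) -> bool:
    """True for the combinations Alternative #1 no longer expresses directly:
    a scalar dest together with at least one vector argument."""
    return (not reg_is_vector[0]) and any(reg_is_vector[1:])
```

For example, a scalar dest with one vector argument (`[False, True, False]`)
decodes as a scalar instruction here, which is the case the quoted text
proposes to cover with a scalar mv.x instead.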

Alternative #2:
Move the register decoder (required in the fetch pipeline anyway to correctly
add instructions to the scheduling matrices) to before the VL-loop stage, so
that it can supply the vector/scalar bits of all SVP64 register fields, which
are then ORed together to form a whole-instruction vector/scalar bit.
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-March/002116.html
> Decoding should only take a few more gates (around 10, less than 100) since
> you just have a few separate gates to OR all vector/scalar SVP64 bits
> together for each SVP64 prefix kind (there's only around 5 kinds) and use a
> mux based on the decoded number of registers (which I expect the decoder to
> need anyway for dependency matrix stuff) to select which OR gate's output
> to use. This produces a vector_and_not_scalar signal that should be easy to
> add to the VL-loop stage.
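
The OR-then-mux arrangement in the quoted text can be modelled roughly as
follows. This is a behavioural Python sketch, not HDL. It assumes, for
simplicity, that each prefix kind is identified purely by its register count,
and the limit of 4 register fields is a placeholder. `MAX_REG_FIELDS` and the
function signature are hypothetical; only the `vector_and_not_scalar` name
comes from the quote.

```python
from typing import Sequence

MAX_REG_FIELDS = 4  # assumption: placeholder upper bound on register fields


def vector_and_not_scalar(vec_bits: Sequence[bool], num_regs: int) -> bool:
    """Return True if any of the first num_regs scalar/vector bits is vector.

    vec_bits holds the raw per-field bits from the prefix; bits at index
    num_regs and above belong to unused fields and must be ignored.
    """
    if len(vec_bits) != MAX_REG_FIELDS:
        raise ValueError("expected one bit per possible register field")
    if not 0 <= num_regs <= MAX_REG_FIELDS:
        raise ValueError("num_regs out of range")
    # One "OR gate" per possible register count, all computed in parallel.
    or_outputs = [any(vec_bits[:n]) for n in range(MAX_REG_FIELDS + 1)]
    # The "mux": the decoded register count selects which OR output to use.
    return or_outputs[num_regs]
```

With `num_regs=2`, the bits `[False, False, True, True]` give `False`, because
only the first two fields are in use and both are scalar.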

None of the alternatives increases the number of instructions, since all that
happens is we're reinterpreting some combinations of vector/scalar register
arguments as making the instruction bypass the VL loop, so that it executes
exactly once no matter what value VL currently has. These instructions ignore
vstart and do not modify it, since vstart is only used/modified by the VL-loop.
No new opcode mnemonic is needed. The SUBVL loop still runs, which allows
SUBVL=1 for scalar operations and SUBVL=2/3/4 for SIMT-uniform vec2/3/4
operations.
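
The proposed behaviour can be sketched as a small Python model that lists which
(element, sub-element) pairs an instruction would execute. It assumes that a
vector instruction iterates elements from vstart up to VL; the function name
and return format are hypothetical, for illustration only.

```python
from typing import List, Tuple


def executed_steps(is_vector: bool, vl: int, subvl: int,
                   vstart: int = 0) -> List[Tuple[int, int]]:
    """List the (element, subelement) index pairs that would be executed."""
    if is_vector:
        # Vector instruction: the VL loop runs (assumed to start at vstart).
        elements = range(vstart, vl)
    else:
        # Scalar instruction: bypass the VL loop, run exactly once,
        # ignoring both VL and vstart.
        elements = range(1)
    # The SUBVL loop runs in both cases.
    return [(i, j) for i in elements for j in range(subvl)]
```

With VL=0, `executed_steps(True, 0, 1)` is empty (the instruction acts as a
nop), whereas `executed_steps(False, 0, 3)` still yields the three sub-elements
of a single SIMT-uniform vec3 operation.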
