[Libre-soc-dev] sv.mv x: the instruction from hell

Sun Jun 5 08:39:35 BST 2022

On Sat, Jun 4, 2022, 00:50 lkcl <luke.leighton at gmail.com> wrote:

> On Sat, Jun 4, 2022 at 1:19 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > there's a pretty simple fix, make the *scalar* instruction limit itself
> > to VL:
> > idx = GPR(RA)
> > GPR(RT) = idx < VL ? GPR(RB + idx) : 0
>
> as it's a MV (2-operand) it'd be
>

sorry, having mv.x be 2-operand like that is totally unworkable -- sv.mv.x
is supposed to be dynamic swizzle where the indexes read from one vector
are used as the indexes of which elements to copy from a second vector to
the destination. It inherently needs 3 operands.
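
A minimal Python sketch of those 3-operand semantics, assuming the register
file is modelled as a flat list with one 64-bit element per register, and
that the vector form just repeats the VL-limited scalar operation quoted
above once per element. The function name `sv_mv_x` and the register numbers
are made up for illustration; this is not spec pseudocode.

```python
def sv_mv_x(gpr, rt, ra, rb, vl):
    """Model a 3-operand sv.mv.x as VL scalar element operations.

    gpr: list modelling the register file (hypothetical flat model)
    rt:  first register of the destination vector
    ra:  first register of the index vector
    rb:  first register of the data input vector
    """
    for i in range(vl):
        idx = gpr[ra + i]
        # VL limit: an out-of-range index yields 0 instead of reading
        # some arbitrary register outside the data input vector.
        gpr[rt + i] = gpr[rb + idx] if 0 <= idx < vl else 0
    return gpr


gpr = [0] * 32
gpr[8:12] = [3, 2, 9, 0]        # index vector in r8..r11 (9 is out of range)
gpr[16:20] = [10, 11, 12, 13]   # data input vector in r16..r19
sv_mv_x(gpr, rt=24, ra=8, rb=16, vl=4)
result = gpr[24:28]             # destination vector in r24..r27
```

Here the three vectors sit in disjoint register ranges, so each element
operation only ever reads original data.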

>    GPR(RT) = idx < VL ? GPR(RT + idx) : 0
>

that won't work, because an RA of [1, 0] is supposed to swap the two vector
elements (that's what it does on every other vector ISA), but with those
semantics it instead duplicates element 1 into both elements:
(RT) = (RT+1) # overwrites element 0 with the contents of element 1
(RT+1) = (RT) # overwrites element 1 with the contents of element 0,
              # which was just copied from element 1
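
The difference shows up in a small Python model. This is only a sketch: the
list-based vectors and both helper names are assumptions for illustration,
and the VL limit is left out because both indices are in range.

```python
def mv_x_in_place(regs, indices):
    """2-operand form: the data source and destination are the same vector."""
    regs = list(regs)
    for i, idx in enumerate(indices):
        # Element operations run in order, so this may read an element
        # that an earlier iteration has already overwritten.
        regs[i] = regs[idx]
    return regs


def mv_x_separate(src, indices):
    """3-operand form: the destination is distinct from the data source."""
    return [src[idx] for idx in indices]


in_place = mv_x_in_place(["a", "b"], [1, 0])   # duplicates "b"
separate = mv_x_separate(["a", "b"], [1, 0])   # swaps the two elements
```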

A similar instruction to the 3-operand mv.x is ldx -- mv.x is just loading
from registers instead of memory.

>
> which is the solution discussed a while back.  this still makes the
> cross-interference from actually modifying RT a problem.
> WAR and RAW Hazards are created *in between each scalar element*.
>

simple solution: the compiler won't ever have the data input vector overlap
with the output vector. The index vector can be the same as RT, but that
shouldn't be required, since the index vector doesn't need to be the same
type as the data input/output vector. Having the index vector in the same
registers as the output doesn't cause problems, because it's read in the
same order the output is written; only the data input vector is read out of
order.

>
> there are strict inviolate rules at play here: SV's inviolate
> rule is that the elements are as if they were done as actual
> scalar instructions.  therefore with each index being read
> and the next instruction potentially having an *index*
> modified, the entire sequence basically grinds to a halt.
>

simple: using the VL limit I proposed you know exactly what range of
registers can be read, so all that happens is the compiler picks the data
input vector to not overlap the output vector. the cpu knows the dynamic
reads must be in the data input vector (even for the scalar instruction,
the data input is still effectively a vector of 64-bit elements), so the
dependencies only cause it to run slowly when the output overlaps the data
input, which no compiler will do.
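
As a rough illustration of why the VL limit makes this tractable, here is a
Python sketch of the check a compiler (or a hazard-tracking unit) could make.
It assumes one 64-bit element per register and treats RT and RB as plain
register numbers; the function names are hypothetical.

```python
def data_read_range(rb, vl):
    """Registers a VL-limited mv.x can dynamically read: RB .. RB+VL-1."""
    return range(rb, rb + vl)


def output_overlaps_data_input(rt, rb, vl):
    """True if the output vector RT..RT+VL-1 overlaps the data input vector."""
    return rt < rb + vl and rb < rt + vl


disjoint = output_overlaps_data_input(rt=24, rb=16, vl=4)   # r24..r27 vs r16..r19
clashing = output_overlaps_data_input(rt=18, rb=16, vl=4)   # r18..r21 vs r16..r19
readable = list(data_read_range(rb=16, vl=4))
```

Without the VL limit the readable range would be the whole register file, so
no such static check would be possible.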

>
> >> by setting the rule that the Hazards are *NOT* to be observed,
> >> during the usage of this type of remap, all of the problems go away.
> >
> >
> > elaborate on what you mean by not to be observed...i don't understand
> > what you mean.
>
> https://www.thesaurus.com/browse/observed
>
> examined followed heeded regarded noted inspected watched checked.
>
> probably the best one is checked.  "not to be checked".
>

imho ignoring dependencies is a pretty bad idea in general: it's a good way
for the cpu state to get all messed up, or to have security problems
(because in a register renaming cpu, if you read from stale registers you
can see whatever is in them, which might be from a different
process/privilege level/etc.)

>
> > going off what I do understand, i think it's a pretty bad idea because
> > it takes a slow/expensive instruction and just makes it slower and more
> > expensive, also you'd need an additional instruction to set the remap
> > nearly every time.
>
> it's always going to be awful, because of retrofitting to a scalar ISA.
> VSX doesn't have this problem at all because the indices are in
> one single VSX register.
>
> the biggest advantage of the remap concept is that we do not
> have to propose a scalar mv.x instruction.  given that this is a
> prerequisite for being able to use it in SVP64, and given that as a
> scalar instruction it's a total nightmare where i would expect the
> OPF ISA WG to fight it tooth and nail *and i would agree with them*,
> any alternative is better because it can be, how to put it best...
> "slipped under the carpet" if you know what i mean, there.
>
> additionally, the whole reason for having the mv.x is so as to
> shuffle registers around so that they can be used elsewhere,
> e.g. by arithmetic operations.
>

it's to support the dynamic swizzle operation -- dynamic meaning which
elements go where is dynamic. if you're just moving data around with a
pattern that isn't selected dynamically, you can usually use more efficient
instructions (e.g. plain sv.mv, without the .x, on shorter pieces of the
vector).

>
> well... um... if they can be "shuffled" as inputs *to* the arithmetic
> operations because the re-indexing applies to one (or more)
> of the arithmetic operations' inputs *instead* of using mv.x,
> you just saved an instruction.
>

you saved a sv.mv.x but instead have several instructions to set up the
remap and tear it down after, and then you're still running the same complex
operation but it's also combined with an arithmetic operation, so the cpu
gets more complex.

Jacob