[Libre-soc-dev] twin predication and svp64

Fri Dec 11 17:25:58 GMT 2020

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Fri, Dec 11, 2020 at 7:36 AM Jacob Lifshay <programmerjake at gmail.com>
wrote:

> On Thu, Dec 10, 2020, 22:10 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > and if srcpred *equals* 1<<n *then* it is *as if* the op was
> > macro-fused with mv.x n
> >
>
> can you fully write that out in pseudo-code since that sounds like a
> single-element mv operation (a "vector extract" op, basically
> scalar_dest=vec_src[index]) and not a splat at all. the key part of what
> makes a splat is that one value/element is duplicated and written into
> every element of the dest vector (except masked-off vector elements, of
> course).
>
>
see the pseudo-code i posted out-of-order because i realised it's missing
context
http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001541.html

it should be clear that "splat" occurs when:

* the src pred is set to "all 1s"
* the src is set to "scalar"
* the dest pred is set to "all 1s"
* the dest is set to "vector"

in that instance, the source register does not "increment", whilst the dest
register does.  therefore, the src (scalar) reg gets copied repeatedly to
the dest.
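
a minimal sketch of that loop in Python, to make the "src does not increment, dest does" behaviour concrete. this is an illustrative model only, not the pseudo-code from the linked post: the function name `twin_pred_mv`, its parameters, and the rule that a predicate is only consulted on an operand marked as a vector are all assumptions made for the example.

```python
def twin_pred_mv(ireg, rd, rs, vl, src_pred, dst_pred, src_is_vec, dst_is_vec):
    """Hypothetical model of a twin-predicated register move (not the spec)."""
    i = j = 0  # i walks the source elements, j walks the destination elements
    while i < vl and j < vl:
        # skip masked-out elements, but only on operands marked as vectors
        while src_is_vec and i < vl and not (src_pred >> i) & 1:
            i += 1
        while dst_is_vec and j < vl and not (dst_pred >> j) & 1:
            j += 1
        if i >= vl or j >= vl:
            break
        ireg[rd + j] = ireg[rs + i]
        if src_is_vec:
            i += 1  # a scalar source never increments
        if not dst_is_vec:
            break  # a scalar destination receives exactly one element
        j += 1
    return ireg


VL = 4
ALL_ONES = (1 << VL) - 1
regs = [0] * 16
regs[2] = 42  # the scalar source register
twin_pred_mv(regs, rd=8, rs=2, vl=VL, src_pred=ALL_ONES, dst_pred=ALL_ONES,
             src_is_vec=False, dst_is_vec=True)
print(regs[8:12])  # [42, 42, 42, 42]
```

with both predicates all 1s, a scalar src and a vector dest, the one source value lands in every destination element, which is the splat.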

now let us take the case where:

* the src pred is set to "all 1s"
* the src is set to "scalar"
* the dest pred is set to "1<<r3"
* the dest is set to "vector"

this will basically take the scalar reg, then, because there is only 1 bit
in the dest pred, the for-loop will walk *all* the way up to that one bit
(which happens to be r3) and consequently will happen to put the src into
ireg[rd+ireg[r3]]

therefore the "operation" boils down to:

     ireg[rd+ireg[r3]] = ireg[rs]

which is by a total coincidence EXACTLY the definition of a mv.x.

now let us take the case where:

* the src pred is set to "1<<r3"
* the src is set to "vector"
* the dest pred is set to "all 1s"
* the dest is set to "scalar" *OR* "vector" (it doesn't matter which)

this will do a walk on the src pred, walking *all* the way up to that one
bit, and in effect get the register ireg[rs+ireg[r3]] as the src.  the dest
will hit only the very first item - ireg[rd].

therefore the operation boils down to:

    ireg[rd] = ireg[rs+ireg[r3]]

which is a different *type* of mv.x operation, but it is still a mv.x
operation.
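
the two single-bit cases can be checked with a small Python model. this is a sketch under stated assumptions, not the pseudo-code from the linked post: `twin_pred_mv`, its parameters and the register numbers are hypothetical, and the model assumes a predicate is only consulted on an operand marked as a vector.

```python
def twin_pred_mv(ireg, rd, rs, vl, src_pred, dst_pred, src_is_vec, dst_is_vec):
    """Hypothetical model of a twin-predicated register move (not the spec)."""
    i = j = 0  # i walks the source elements, j walks the destination elements
    while i < vl and j < vl:
        # walk up to the next set predicate bit on each vector operand
        while src_is_vec and i < vl and not (src_pred >> i) & 1:
            i += 1
        while dst_is_vec and j < vl and not (dst_pred >> j) & 1:
            j += 1
        if i >= vl or j >= vl:
            break
        ireg[rd + j] = ireg[rs + i]
        if src_is_vec:
            i += 1
        if not dst_is_vec:
            break
        j += 1
    return ireg


VL, RD, RS, R3 = 4, 8, 4, 3
base = list(range(100, 116))  # base[n] == 100 + n, so every value is traceable
base[R3] = 2                  # r3 holds the element index

# scalar src, dest pred = 1<<r3: the value lands in ireg[rd+ireg[r3]]
a = twin_pred_mv(list(base), RD, RS, VL, 0b1111, 1 << base[R3], False, True)
print(a[RD + base[R3]] == base[RS])  # True

# vector src with pred = 1<<r3: ireg[rs+ireg[r3]] lands in ireg[rd],
# whether the dest is marked scalar or vector
for dst_is_vec in (False, True):
    b = twin_pred_mv(list(base), RD, RS, VL, 1 << base[R3], 0b1111, True, dst_is_vec)
    print(b[RD] == base[RS + base[R3]])  # True
```

in both cases exactly one element moves, and the element index comes from the position of the single set bit, which is what makes the result behave like an indexed move.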

it gets exceptionally weird if we apply twin-predication *to* mv.x.  i'm
not going to go there just yet :)

> only for that specific mask, I was talking about the fully general vector
> case.
>

you've lost me, and because you're not familiar with SV and
twin-predication, can you come back to this once it's clear?

as we found with the discussion on Compressed with Alexandre a couple weeks
back, the issue we have here is that you don't quite fully understand the
way that twin-predication works... yet are recommending changes to the
algorithm and implementation (which took me about 4-5 months to work out)
before that understanding is complete.

this makes it extremely difficult to have discussions because i have to do
a "three-way diff": (1) the way SV works (2) the model in my head of how you
*might* think SV works (3) trying to understand the merit of ideas that
you're putting forward... and getting hopelessly lost.

> > ok so you are referring to the arith FU which means you definitely are
> > not aware of the Predicate FU for INT preds and its connection to
> > shadows.
> >
>
> it would be an *augmented* arith FU -- those are also useful for
> conditional move and int/fp select operations.
>

right.  so, again, i emphasise: you don't understand how predication works
(in a practical sense) at the hardware level.  it is critically important
that you understand the currently-designed microarchitecture before making
recommendations and suggestions.

otherwise i am the one that is burdened with the task of explaining why a
type 3 (part of the three-way-diff, see above) concept... you get the idea:
it's too much for me to handle, Jacob.

> For those ops it is waay more efficient to calculate the input needed then
> do the regfile read (if that element is not the output of another in-flight
> instruction),

exactly: and that's unfortunately where everything about the idea that
you're advocating collapses.  it is *fundamentally* critical that you
understand and accept that predicates are register resources that are *not
accessible* - not readable - by Function Units.  *no* FU is permitted
*arbitrary* non-hazard-managed access to regfiles.  *ever*.

the *only* way that FUs are permitted access to data is via
Dependency-Matrix-Managed access.  they are supplied *with* the data that
they need *including the predicate bit(s)* - they *cannot* and *must not*
get or be permitted to get register file contents in a way that bypasses
the Dependency Matrices.

the only way that such bypassing is permitted is if the *ENTIRE* execution
completely grinds to a halt, flushes or waits for completion of literally
all pipelines (except for itself), *then*, once it is literally the only
unit still outstanding with partial execution, gets access to whatever
regfile, and then signals to the Issue Engine that it may continue.

this would result in such piss-poor performance that it should be clear
that it is not a viable option except in emergency or very rare
circumstances where performance is non-critical.

> rather than read all possible inputs and have 64 input
> latches. Those augmented FUs could also be quite useful for vector mv.x,
> since each FU is 1 element of a mv.x.
>

given that the fundamental principle of SV is that the predication applies
uniformly, it is *all* FUs that need to be so "augmented".  consequently in
the micro-architectural design i abstracted that out into a special
"Predicate Function Unit" that, just like the Branch Unit, performs and
leverages "Shadowing".

please: i have said this at least two to three times now: please try to
understand in full how Predication Function Shadow Units work before
suggesting alternative hardware implementations that will take *literally
two weeks* to evaluate in full as to whether they are viable.

each bit of a predicate mask - *when obtained* and remember *the Function
Units cannot read regfiles directly* - will link into the Shadow
success/fail lines shown in this diagram:

    https://libre-soc.org/3d_gpu/shadow.jpg
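
a toy Python model of that idea, purely as an illustration. it assumes that each element operation computes its result while held under a shadow, and that the predicate bit (delivered to it, never read by it) then raises either "success" (the write goes ahead) or "fail" (the operation is cancelled). `ShadowedElement` and `resolve_shadows` are hypothetical names, and nothing here models the Dependency Matrices or the real hardware.

```python
from dataclasses import dataclass


@dataclass
class ShadowedElement:
    dest: int                # destination register number
    result: int              # value already computed, but not yet written
    state: str = "shadowed"  # write is held until the shadow is resolved


def resolve_shadows(regfile, elements, pred_mask):
    """Apply one predicate bit per element as a shadow success/fail signal."""
    for idx, el in enumerate(elements):
        if (pred_mask >> idx) & 1:
            el.state = "committed"       # success: shadow dropped, write allowed
            regfile[el.dest] = el.result
        else:
            el.state = "cancelled"       # fail: result is discarded, no write
    return regfile


regfile = [0] * 8
elements = [ShadowedElement(dest=4 + i, result=100 + i) for i in range(4)]
resolve_shadows(regfile, elements, pred_mask=0b0101)
print(regfile[4:8])                 # [100, 0, 102, 0]
print([el.state for el in elements])
```

the point of the sketch is only that the element operations never look at the predicate register themselves: the mask arrives from outside and decides, per element, whether the held result is written or dropped.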

> for memcpy with compile-time constant size (vast majority, e.g. struct
> copy), we can use setvli, which can be executed in-order in the decode
> pipe, no pipe flush needed.

in the case where the length is known and fixed, yes, no problem.

> This is part of the reason I advocated for
> setvli to be non-complicated.

again: it's already been taken into account.  setvli is a pseudo-op that
takes the immediate from the operation and places it, exactly as you
advocate and expect, into both RA and VL.  please, review the pseudo-code
again (which i reworked last week to take out some bugs), you should find
that the behaviour that you are expecting is in fact there.  if it's not,
then that's a fundamental design flaw: raise it at the bugreport.

   https://libre-soc.org/openpower/sv/setvl/

> If it's just a little smaller, it can compile
> directly to a 64-bit load and a 64-bit store or similar code for other
> sizes.
>

no... it really can't.  this is the danger of the SIMD approach.

see strncpy example (memcpy is a simplified version of that)
https://libre-soc.org/simple_v_extension/appendix/#strncpy

you need to read up about "fail-on-first" when applied to LOAD and STORE.
a full academic paper is somewhere in the resources, to do with ARM SVE-2,
which describes the concept well.

the essence is that data-dependent fail-on-first truncates a
previously-set VL to a *new value* - one that is based on whether a Vector
of LOAD operations had a page-fault or not.  the value of VL is *modified*
such that only those LOADs that did *not cause a page-fault* are covered.

subsequent parallel (Vector) operations can then complete successfully
knowing full well that the results *will* go back into memory (STORE)
without causing a page fault, because VL has been auto-truncated
specifically to the amount that will succeed.
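
a small Python sketch of the truncation, under stated assumptions: memory is a dict, `mapped_pages` is a set of valid page numbers, the page size is 4096, and a fault on the very first element still traps in the normal way (as fault-first loads do in other vector ISAs). `load_bytes_ffirst` is a hypothetical helper, not the SV pseudo-code.

```python
PAGE_SIZE = 4096  # assumed page size for the example


class PageFault(Exception):
    pass


def load_bytes_ffirst(memory, mapped_pages, addr, vl):
    """Byte-wise vector load with fail-on-first: returns (data, new VL)."""
    loaded = []
    for i in range(vl):
        ea = addr + i
        if ea // PAGE_SIZE not in mapped_pages:
            if i == 0:
                raise PageFault(hex(ea))  # first element faults: trap as usual
            break  # a later element faults: stop here instead of trapping
        loaded.append(memory.get(ea, 0))
    return loaded, len(loaded)  # VL is truncated to the elements that succeeded


mapped_pages = {0}  # only page 0 is mapped
memory = {PAGE_SIZE - 3 + k: b for k, b in enumerate(b"abc")}
data, vl = load_bytes_ffirst(memory, mapped_pages, PAGE_SIZE - 3, vl=8)
print(vl, bytes(data))  # 3 b'abc'
```

the request was for 8 elements, but only the 3 bytes before the unmapped page are loaded and VL comes back as 3, so the operations that follow work on exactly those 3 elements.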

assuming that byte-level LOAD/STORE may be aggregated into a 64-bit LOAD
results in serious problems when either crossing page boundaries or when
reaching the upper limit of memory.
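
a quick arithmetic check of the page-boundary case in Python. the 4096-byte page size and the helper `pages_touched` are assumptions for the example.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def pages_touched(addr, nbytes):
    """Set of page numbers covered by an access of nbytes starting at addr."""
    first = addr // PAGE_SIZE
    last = (addr + nbytes - 1) // PAGE_SIZE
    return set(range(first, last + 1))


addr = PAGE_SIZE - 5           # a 5-byte object ending exactly at the page end
print(pages_touched(addr, 5))  # {0}
print(pages_touched(addr, 8))  # {0, 1}
```

copying the 5 bytes one at a time stays inside page 0, whereas rounding the copy up to a single 64-bit access also touches page 1, which may not be mapped at all.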

*no SIMD architecture* has this solved.  as in: a pure SIMD architecture is
*guaranteed* by design to be problematic (that definitely includes VSX
SIMD).  it is only predicated and Vector architectures that have the
building blocks for adding data-dependent fail-on-first and solving this
problem.

l.