[Libre-soc-dev] twin predication and svp64

Fri Dec 11 18:26:58 GMT 2020

On Fri, Dec 11, 2020, 09:26 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Fri, Dec 11, 2020 at 7:36 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > On Thu, Dec 10, 2020, 22:10 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >
> > > and if srcpred *equals* 1<<n *then* it is *as if* the op was
> > > macro-fused with mv.x n
> > >
> >
> > can you fully write that out in pseudo-code since that sounds like a
> > single-element mv operation (a "vector extract" op, basically
> > scalar_dest=vec_src[index]) and not a splat at all. the key part of what
> > makes a splat is that one value/element is duplicated and written into
> > every element of the dest vector (except masked-off vector elements, of
> > course).
> >
> >
> see the pseudo-code i posted out-of-order because i realised it's missing
> context
>
> http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001541.html
>
> it should be clear that "splat" occurs when:
>
> * the src pred is set to "all 1s"
> * the src is set to "scalar"
> * the dest pred is set to "all 1s"
> * the dest is set to "vector"
>

yes, but that happens for any scalar -> vector op, not just twin predicated
ones, so I wouldn't call it a benefit of twin predication specifically.

>
> in that instance, the source register does not "increment", whilst the dest
> register does.  therefore, the src (scalar) reg gets copied repeatedly to
> the dest.
>
> now let us take the case where:
>
> * the src pred is set to "all 1s"
> * the src is set to "scalar"
> * the dest pred is set to "1<<r3"
> * the dest is set to "vector"
>

this happens for any scalar -> vector with a mask of 1<<r3, twin
predication is not necessary for it to work.

>
> this will basically take the scalar reg, then, because there is only 1 bit
> in the dest pred, the for-loop will walk *all* the way up to that one bit
> (which happens to be r3) and consequently will happen to put the src into
> ireg[rd+ireg[r3]]
>
> therefore the "operation" boils down to:
>
>      ireg[rs] = ireg[rd+ireg[r3]]
>

no, you get:
ireg[rd+ireg[r3]] = ireg[rs]

> which is by a total coincidence EXACTLY the definition of a mv.x.
>

exactly the reverse of scalar mv.x

>
Ok, I think the issue is that when I was saying mv.x, I meant the
vectorized version:
for i in 0..VL {
    let idx = reg[ra + i];
    if idx >= VL {
        trap();
    }
    reg[rd + i] = reg[rb + idx];
}

where by setting ra[0..VL] to [3, 7, 2, 5, 1, 0, 4, 6]
you can get in 1 vector mv.x instruction:
dest = [src[3], src[7], src[2], src[5], src[1], src[0], src[4], src[6]];

also, by setting ra[0..VL] to [5, 5, 3, 3, 4, 4, 4]
you can get in 1 vector mv.x instruction:
dest = [src[5], src[5], src[3], src[3], src[4], src[4], src[4]];
which isn't possible if the mv.x adds idx to rd instead of rb.
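
As a quick check of the two examples above, here is a minimal Python sketch of the same vectorized mv.x pseudo-code. The flat `reg` list, the register numbers and the `Trap` exception are just assumptions for illustration, not anything from the spec:

```python
class Trap(Exception):
    """Stand-in for the trap() in the pseudo-code (hypothetical name)."""


def vector_mvx(reg, rd, ra, rb, vl):
    """Vector mv.x: reg[rd + i] = reg[rb + reg[ra + i]] for i in 0..vl."""
    for i in range(vl):
        idx = reg[ra + i]
        if idx >= vl:
            raise Trap(f"index {idx} out of range for VL={vl}")
        reg[rd + i] = reg[rb + idx]


# Illustrative register layout: indices at RA, source at RB, dest at RD.
RA, RB, RD = 8, 16, 24

reg = [0] * 64
reg[RB:RB + 8] = list(range(100, 108))      # src[0..8] = 100..107
reg[RA:RA + 8] = [3, 7, 2, 5, 1, 0, 4, 6]   # arbitrary permutation
vector_mvx(reg, RD, RA, RB, 8)
permuted = reg[RD:RD + 8]

reg[RA:RA + 7] = [5, 5, 3, 3, 4, 4, 4]      # duplicated indices, VL=7
vector_mvx(reg, RD, RA, RB, 7)
duplicated = reg[RD:RD + 7]

print(permuted)    # [103, 107, 102, 105, 101, 100, 104, 106]
print(duplicated)  # [105, 105, 103, 103, 104, 104, 104]
```

The index is added to rb (the source), so the same source element can be read any number of times and in any order.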

> now let us take the case where:
>
> * the src pred is set to "1<<r3"
> * the src is set to "vector"
> * the dest pred is set to "all 1s"
> * the dest is set to "scalar" *OR* "vector" (it doesn't matter which)
>
> this will do a walk on the src pred, walking *all* the way up to that one
> bit, and in effect get the register ireg[rs+ireg[r3]] as the src.  the dest
> will hit only the very first item - ireg[rd].
>
> therefore the operation boils down to:
>
>     ireg[rs+ireg[r3]] = ireg[rd]
>

no, you get:
ireg[rd] = ireg[rs+ireg[r3]]

which is scalar mv.x, except that if r3 is out of range it doesn't trap; it
just doesn't write rd
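
To pin down the direction of each assignment, here is a minimal Python sketch of a twin-predicated mv. It is my simplified reading of the loop in the linked pseudo-code, not a copy of it: the function name, the parameters and the register numbers are assumptions, and corner cases such as a predicated-off scalar source are ignored.

```python
def twin_pred_mv(reg, rd, rs, vl, src_pred, dst_pred, src_is_vec, dst_is_vec):
    """Simplified twin-predicated mv: i walks the source, j the destination."""
    i = j = 0
    while i < vl and j < vl:
        if src_is_vec:                      # skip masked-out source elements
            while i < vl and not (src_pred >> i) & 1:
                i += 1
        if dst_is_vec:                      # skip masked-out dest elements
            while j < vl and not (dst_pred >> j) & 1:
                j += 1
        if i >= vl or j >= vl:
            break                           # ran off the end: nothing written
        reg[rd + j] = reg[rs + i]
        if src_is_vec:
            i += 1
        if dst_is_vec:
            j += 1
        else:
            break                           # scalar dest: one write only


def fresh_regs():
    return [10 * n for n in range(64)]      # reg[n] == 10*n, easy to trace


RS, RD, R3, VL = 16, 32, 3, 8
ALL_ONES = (1 << VL) - 1

# Splat: scalar src, vector dest, both preds all 1s.
reg = fresh_regs()
twin_pred_mv(reg, RD, RS, VL, ALL_ONES, ALL_ONES, False, True)
splat = reg[RD:RD + VL]

# Scalar src, dest pred 1<<r3: reg[rd + reg[r3]] = reg[rs].
reg = fresh_regs()
reg[R3] = 5
twin_pred_mv(reg, RD, RS, VL, ALL_ONES, 1 << reg[R3], False, True)
inserted = reg[RD:RD + VL]

# Vector src with pred 1<<r3, scalar dest: reg[rd] = reg[rs + reg[r3]].
reg = fresh_regs()
reg[R3] = 5
twin_pred_mv(reg, RD, RS, VL, 1 << reg[R3], ALL_ONES, True, False)
extracted = reg[RD]

# Same, but r3 out of range: no trap, rd is simply not written.
reg = fresh_regs()
reg[R3] = 9
twin_pred_mv(reg, RD, RS, VL, 1 << reg[R3], ALL_ONES, True, False)
untouched = reg[RD]
```

In this model `inserted` has only element 5 overwritten with reg[rs], `extracted` is reg[rs+5], and `untouched` keeps the old value of rd.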

>
> which is a different *type* of mv.x operation, but it is still a mv.x
> operation.
>
> it gets exceptionally weird if we apply twin-predication *to* mv.x.  i'm
> not going to go there quite just yet :)
>
>
>
> > only for that specific mask, I was talking about the fully general vector
> > case.
> >
>
> you've lost me,

what I meant was the pseudo-code I wrote earlier, which is the vector mv.x.
You can't replace the fully general vector mv.x with a single
twin-predicated vector mv no matter how hard you try. It's only possible
for some special index vectors -- those sorted low to high and without any
duplicates.
Replacing a scalar mv.x with a twin-predicated vector mv is possible, but
seems less efficient unless we have the special hw support I mentioned
earlier for reading r3 first and then only the selected input.
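
The sorted/no-duplicates restriction can be checked by brute force. The sketch below is a simplified model of a vector-to-vector twin-predicated mv (my own helper names, small VL so every mask pair can be enumerated); it records which source index lands in each written destination element:

```python
from itertools import product


def twin_pred_pairs(vl, src_pred, dst_pred):
    """(dest_index, src_index) pairs moved by a vector->vector twin-pred mv."""
    pairs = []
    i = j = 0
    while True:
        while i < vl and not (src_pred >> i) & 1:
            i += 1                          # next enabled source element
        while j < vl and not (dst_pred >> j) & 1:
            j += 1                          # next enabled dest element
        if i >= vl or j >= vl:
            break
        pairs.append((j, i))
        i += 1
        j += 1
    return pairs


def strictly_increasing(seq):
    return all(a < b for a, b in zip(seq, seq[1:]))


VL = 4
all_src_sequences = set()
for sp, dp in product(range(1 << VL), repeat=2):
    src_seq = tuple(i for _, i in twin_pred_pairs(VL, sp, dp))
    all_src_sequences.add(src_seq)

# Every reachable source-index sequence is sorted and duplicate-free.
always_sorted_unique = all(strictly_increasing(s) for s in all_src_sequences)
```

Under this model neither a reordering such as (1, 0) nor a duplication such as (2, 2) ever shows up in `all_src_sequences`, while vector mv.x can produce both.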

> and because you're not familiar with SV and
> twin-predication, can you come back to this once it's clear?
>

AFAIK I am familiar with SV and twin predication...

>
> as we found with the discussion on Compressed with Alexandre a couple weeks
> back, the issue we have here is that you don't quite fully understand the
> way that twin-predication works... yet are recommending changes to the
> algorithm and implementation (which took me about 4-5 months to work out)
> before that understanding is complete.
>
> this makes it extremely difficult to have discussions because i have to do
> a "three-way diff": (1) way SV works (2) model in my head of how you
> *might* think SV works (3) trying to understand the merit of ideas that
> you're putting forward... and getting hopelessly lost.
>
>
>
> > > ok so you are referring to the arith FU which means you definitely are
> > > not aware of the Predicate FU for INT preds and its connection to
> > > shadows.
> > >
> >
> > it would be an *augmented* arith FU -- those are also useful for
> > conditional move and int/fp select operations.
> >
>
> right.  so, again, i emphasise: you don't understand how predication works
> (in a practical sense) at the hardware level.  it is critically important
> that you understand the currently-designed microarchitecture before making
> recommendations and suggestions.
>
> otherwise i am the one that is burdened with the task of explaining why a
> type 3 (part of the three-way-diff, see above) concept... you get the idea:
> it's too much for me to handle, Jacob.
>
>
> > For those ops it is waay more efficient to calculate the input needed then
> > do the regfile read (if that element is not the output of another in-flight
> > instruction),
>
>
> exactly: and that's unfortunately where everything about the idea that
> you're advocating collapses.  it is *fundamentally* critical that you
> understand and accept that predicates are register resources that are *not
> accessible* - not readable - by Function Units.

I'm saying we should treat it differently than normal predicates since it's
known to be 1<<r3.

> *no* FU is permitted
> *arbitrary* non-hazard-managed access to regfiles.  *ever*.
>

I always meant that the augmented FUs would respect dependencies, reading
from result latches of preceding in-flight ops if necessary, reg file
otherwise. Perhaps that wasn't sufficiently clear.

>
> the *only* way that FUs are permitted access to data is via
> Dependency-Matrix-Managed access.  they are supplied *with* the data that
> they need *including the predicate bit(s)* - they *cannot* and *must not*
> get or be permitted to get register file contents in a way that bypasses
> the Dependency Matrices.
>
> the only way that such bypassing is permitted is if the *ENTIRE* execution
> completely grinds to a halt, flushes or waits for completion of literally
> all pipelines (except for itself), *then*, once it is literally the only
> unit still outstanding with partial execution, gets access to whatever
> regfile, and then signals to the Issue Engine that it may continue.
>
> this would result in such piss-poor performance that it should be clear
> that it is not a viable option except in emergency or very rare
> circumstances where performance is non-critical.
>
>
> > rather than read all possible inputs and have 64 input
> > latches. Those augmented FUs could also be quite useful for vector mv.x,
> > since each FU is 1 element of a mv.x.
> >
>
> given that the fundamental principle of SV is that the predication applies
> uniformly, it is *all* FUs that need to be so "augmented".

not necessarily, we can have some FUs that can't be used by 1<<r3 or mv.x
ops since those are relatively rare. all we need is at least some
augmented FUs for every pipeline where we want those optimizations to work.

> consequently in
> the micro-architectural design i abstracted that out into a special
> "Predicate Function Unit" that, just like the Branch Unit, performs and
> leverages "Shadowing".
>
> please: i have said this at least two to three times now: please try to
> understand in full how Predication Function Shadow Units work

I'm pretty sure I get the gist of how they work...

> before
> suggesting alternative hardware implementations that will take *literally
> two weeks* to evaluate in full as to whether they are viable.
>

well, I'm not saying to get rid of predication based on a predicate fu,
that's still needed for the much more common case of fully general masks.

>
> each bit of a predicate mask - *when obtained* and remember *the Function
> Units cannot read regfiles directly* - will link into the Shadow
> success/fail lines shown in this diagram:
>
>     https://libre-soc.org/3d_gpu/shadow.jpg
>
>
> > for memcpy with compile-time constant size (vast majority, e.g. struct
> > copy), we can use setvli, which can be executed in-order in the decode
> > pipe, no pipe flush needed.
>
>
> in the case where the length is known and fixed, yes, no problem.
>
> > This is part of the reason I advocated for
> > setvli to be non-complicated.
>
>
> again: it's already been taken into account.

yup, never said it wasn't.

> setvli is a pseudo-op that
> takes the immediate from the operation and places it, exactly as you
> advocate and expect, into both RA and VL.  please, review the pseudo-code
> again (which i reworked last week to take out some bugs), you should find
> that the behaviour that you are expecting is in fact there.  if it's not,
> then that's a fundamental design flaw: raise it at the bugreport.
>
>    https://libre-soc.org/openpower/sv/setvl/
>
>
> > If it's just a little smaller, it can compile
> > directly to a 64-bit load and a 64-bit store or similar code for other
> > sizes.
> >
>
> no... it really can't.  this is the danger of the SIMD approach.
>

it *can* and *does already* when the memory block size is known at
compile-time (assuming alignment requirements are met).

>
> see strncpy example (memcpy is a simplified version of that)
> https://libre-soc.org/simple_v_extension/appendix/#strncpy

That all applies to dynamic-size copies, which are optimized differently.
fully general dynamic code is waaay less optimal than just a single load
and store instruction when sizes are statically known.

>
>
> you need to read up about "fail-on-first" when applied to LOAD and STORE.
> a full academic paper is somewhere in the resources, to do with ARM SVE-2,
> which describes the concept well.
>

fail-on-first is not applicable to memcpy since it doesn't check for
specific byte values to stop at.

>
> the essence is that data-dependent fail-on-first truncates a
> previously-set VL to a *new value* - one that is based on whether a Vector
> of LOAD operations had a page-fault or not.  the value of VL is *modified*
> such that only those LOADs that did *not cause a page-fault* are covered.
>

if memcpy hits a page fault, it traps, lets the OS fill in the missing
page, then resumes where it left off. no VL size changing necessary.
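
A toy Python model of the difference, assuming a hypothetical byte-addressed memory where only the pages listed in `mapped` are present. The names (`PageFault`, `ffirst_load`, `trapping_copy`) are made up for this sketch and the "OS" is just a callback that maps the missing page:

```python
PAGE_SIZE = 4096


class PageFault(Exception):
    def __init__(self, addr):
        super().__init__(f"page fault at {addr:#x}")
        self.addr = addr


def load_byte(mapped, addr):
    """Return dummy data, faulting if the page is not mapped."""
    if addr // PAGE_SIZE not in mapped:
        raise PageFault(addr)
    return addr & 0xFF


def ffirst_load(mapped, base, vl):
    """Fail-on-first: a fault after element 0 truncates VL instead of trapping."""
    values = []
    for i in range(vl):
        try:
            values.append(load_byte(mapped, base + i))
        except PageFault:
            if i == 0:
                raise                       # first element still traps
            return values, i                # new, shorter VL
    return values, vl


def trapping_copy(mapped, base, n, os_map_page):
    """memcpy-style: trap, let the OS map the page, resume at the same element."""
    values = []
    i = 0
    while i < n:
        try:
            values.append(load_byte(mapped, base + i))
            i += 1
        except PageFault as fault:
            os_map_page(fault.addr // PAGE_SIZE)
    return values


# 16 bytes starting 6 bytes before the end of the only mapped page.
ff_values, new_vl = ffirst_load({0}, 4090, 16)

copy_pages = {0}
copied = trapping_copy(copy_pages, 4090, 16, copy_pages.add)
```

Here `new_vl` comes out as 6 (the bytes left on page 0), whereas `trapping_copy` returns all 16 bytes after the missing page has been mapped, with the length never changing.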

>
> subsequent parallel (Vector) operations can then complete successfully
> knowing full well that the results *will* go back into memory (STORE)
> without causing a page fault, because VL has been auto-truncated
> specifically to the amount that will succeed.
>
> assuming that byte-level LOAD/STORE may be aggregated into a 64-bit LOAD
> results in serious problems when either crossing page boundaries or when
> reaching the upper limit of memory.
>
> *no SIMD architecture* has this solved.  as in: a pure SIMD architecture is
> *guaranteed* by design to be problematic (that definitely includes VSX
> SIMD).  it is only predicated and Vector architectures that have the
> building blocks for adding data-dependent fail-on-first and solving this
> problem.
>

that's all well and good for data-dependent things like strcpy, however
memcpy *isn't* data-dependent so fail-on-first actually is unnecessary for
it.

Jacob