[Libre-soc-dev] conflictd (was: Triangular REMAP schedule)

Tue May 31 17:07:02 BST 2022

continuing from:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html

i thought it best to go over this in a separate post.  context: Tom Forsyth's excellent and funny presentation on Larabee / AVX512.
gaah this took longer than expected to find, luckily i'd put it on the resources page
https://player.vimeo.com/video/450406346

tom describes how they really wanted conflict detection (for the reasons you gave, jacob) and they talked directly with the hardware designers, who said, "no you can't have exactly what you want but if you change it to this then it's exactly what is already in the x86 hardware"

retrospectively i worked out that what "this" is that the hw engineers must have been referring to: the Multi-Issue LDST Address Conflict detection allowing multiple LDST Reservation Station Requests to be found and merged in one single cycle.

[the cost is absolutely insane btw: a full 64-bit O(N^2/2) XOR gate bank.]

let's look at the pseudocode:
https://libre-soc.org/openpower/sv/vector_ops/

    for i in range(VL):
        for j in range(1, i):
            if src1[i] == src2[j]:
                result[j] |= 1<<i

that's where the bat-shit-crazy resources come from, if you try
to do that in 64-bit say at 8-wide, it's 10 gates per XOR, so that's 640 per 64-bit compare, then triangular wise it's 8x8/2 so times another 32, so that's an astounding 20,480 gates for a full 8-to-8 conflictd comparator.

we have up to VL=64, there is absolutely no way in hell anyone is going to lay down over half a milluon gates to do a full single-cycle conflict detector.

therefore it *has* to be done a la FSM or as micro-code, over several clock cycles.

and, that micro-code is basically a triangular pair of nested loops over a single instruction [look again at the pseudocode, it's pretty obvious]

and, following the chain of logic: if it *has* to be micro-coded, and that micro-code is virtually identical to the entire purpose and definition of SVREMAP, then it is almost irresponsible *not* to separate them out!

why?

because if adding conflictd the hardware engineer is going to have to *duplicate the logic behind SVREMAP anyway*, as an FSM/micro-code.

the problem area is the bit-level ORing, that is where things go awry.  i *think* now that i am considering it, there is the possibility to use scalar mapreduce mode... but only if a very weird variant of cmp were added which performed that inner OR:

   weirdcmp RT, RA, RB, RC
      if RA == RB then
         RT |= 1<<RC

which is a 4-in 1-out instruction, all regs 64-bit wide.  NO on that one.

i won't completely copy the pseudocode from the vector_ops page, but highlight how it works:

* the outer loop is explicit, the inner loop uses varying VL to get the triangular effect
* uses "1<<r3" to perform a VEXTRACT to get the scalar to be compared against the 2nd vector.
* mfcrweird extracts the CR EQ Field bits and blats *all* of those bits sequentially into one single GPR
* an explicit scalar OR performs the simple "result |= comparison"

the assembler is amazingly elegant and short, relatively speaking, for such a complex operation.

my biggest problem with conflictd aside from the complexity is that it is so specialist.  the above pseudocode could do GT or LE or SO.  SO would be *really* handy, to check for example that some vector elements were not going to overflow (absdiff for example if doing a convolution or hamming window or motion estimation)

bottom line is that in-depth investigation turns up more than first seems, and it's not just that result |= comparison which causes the problems, there are a *lot* of problems.

l.

On May 31, 2022 3:27:00 PM GMT+01:00, lkcl <luke.leighton at gmail.com> wrote:
>continuing from:
>https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html
>
>i thought it best to go over this in a separate post.  context: Tom
>Forsyth's excellent and funny presentation on Larabee / AVX512.
>gaah this took longer than expected to find, luckily i'd put it on the
>resources page
>https://player.vimeo.com/video/450406346
>
>tom describes how they really wanted conflict detection (for the
>reasons you gave, jacob) and they talked directly with the hardware
>designers, who said, "no you can't have exactly what you want but if
>you change it to this then it's exactly what is already in the x86
>hardware"
>
>retrospectively i worked out that what "this" is that the hw engineers
>must have been referring to: the Multi-Issue LDST Address Conflict
>detection allowing multiple LDST Reservation Station Requests to be
>found and merged in one single cycle.
>
>[the cost is absolutely insane btw: a full 64-bit O(N^2/2) XOR gate
>bank.]
>
>let's look at the pseudocode:
>https://libre-soc.org/openpower/sv/vector_ops/
>
>    for i in range(VL):
>        for j in range(1, i):
>            if src1[i] == src2[j]:
>                result[j] |= 1<<i
>
>that's where the bat-shit-crazy resources come from, if you try
>to do that in 64-bit say at 8-wide, it's 10 gates per XOR, so that's
>640 per 64-bit compare, then triangular wise it's 8x8/2 so times
>another 32, so that's an astounding 20,480 gates for a full 8-to-8
>conflictd comparator.
>
>we have up to VL=64, there is absolutely no way in hell anyone is going
>to lay down over half a milluon gates to do a full single-cycle
>conflict detector.
>
>therefore it *has* to be done a la FSM or as micro-code, over several
>clock cycles.
>
>
>
>
>
>
>On May 31, 2022 2:53:45 PM GMT+01:00, lkcl <luke.leighton at gmail.com>
>wrote:
>>On Tue, May 31, 2022 at 5:27 AM Jacob Lifshay
>><programmerjake at gmail.com> wrote:
>>
>>> just a pure triangle schedule isn't sufficient for Knuth's Algorithm
>>D,
>>> since it has the conditionally-executed fix-up loop.
>>>
>>> Knuth's Algorithm M imho isn't a good justification for a triangle
>>> schedule since once you get past around 5 words you should be
>>> using Karatsuba multiplication (or more complex algorithms)
>>instead...
>>> it's faster. 5 words is short enough that just having the outer loop
>>> be fully unrolled is sufficient...it'd be like 10-15 instructions
>for
>>the
>>> whole outer loop.
>>
>>>>
>>>>
>>>> it would be a fascinating application of SVP64 in that for the
>first
>>time some of SVSTATE would need to be context-switched, i.e. the
>>computation of a partial intermediary result within an outer loop
>would
>>roll SVSTATE forward, and would need rolling back before calculating
>>the next stage.
>>>
>>>
>>> imho that just too complex....
>>
>>it's a complex conceptual / software problem not a complex hardware
>>problem, at all.  fairly close to the brain-melting threshold, i have
>>to
>>admit.
>>
>>>> this should be dead easy to calculate these kinds of offsets: i
>>would like to make it more general, and include simple triangular
>>mapping so as not to have to add conflictd.
>>>
>>>
>>> conflictd is still necessary...
>>
>>i'll answer this separately (separate thread). the background behind
>>why conflictd exists at all is itself quite involved.
>>
>>l.