[Libre-soc-dev] [RFC] svp64 "source zeroing" makes no sense

Mon Mar 22 08:31:08 GMT 2021

On Sun, Mar 21, 2021 at 10:13 PM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>
> On Monday, March 22, 2021, Richard Wilbur <richard.wilbur at gmail.com> wrote:
>
> >
> > This is not an optimization for a particular workload.  It will never
> > be slower than the comparison loop.  I only ask about the expected
> > sparseness in order to gauge how much of an improvement to expect.
>
>
> we have no idea, it's literally impossible to say.
>
> bear in mind two things.
>
> 1) Dynamic Partitioned ALUs will require receiving MULTIPLE predicate mask
> bits i.e. cover multiple src/dest steps.

Will the Dynamic Partitioned ALUs be receiving more than a source and
destination mask?  Will the mask size be related to the partitioned
size of the ALU?

> 2) Multi-issue will require multiple src/dest steps per clock.

How many predicate masks are we talking about per operation?  The code
you posted had two:  source and destination.  Each ALU sounds like
they might have a unique pair.

> > It wouldn't necessarily change how the predicate masks are used
> > anywhere else.  Just in determining the next value of {src|dst}step in
> > the vector execution loop.
>
>
> high petformance implementations cannot stall just because of a few mask
> bits being zero.  the only reason for doing it that way right now is
> because it is about 5 lines of code even in HDL.
>
> there is an instruction which hunts for the next 1 after a trigger point,
> "set before first" and a twin "set after first".
>
> also there is a count leading zeros instruction.  aka a "Priority Encoder".
>
> we need "count leading zeros starting from bit X".

If I'm not mistaken, the logic design I mentioned implements the
"count leading zeros starting from bit X".  That is basically what I
outlined.  It also is ready for the next iteration right after you use
the count of leading zeros because it generates its next state while
generating the count.

Here is an improved, simplified implementation of what I described earlier:
Load a 64-bit register with the value of the mask (bits 0 to
VL-1) with all bits at and above VL set to 1.
Connect the lowest 32 bits of register to zero-detector =
32-input NOR whose output runs a MUX { 0 := passes the 64 bits
through, 1 := shifts them down 32 bits (and passes 1's in the high 32
bits of output)}.
Connect the output of the previous stage to a stage of half the test
and shift size:  16, 8, 4, 2.

The last stage has a slightly more complex test,
if (mask.LSB == 0 && mask.NLSB == 1):  zero1 = 1
if (mask.LSB == 0 && mask.NLSB == 0):  zero64 = 1
where LSB = Least Significant Bit, NLSB = Next to Least Significant Bit
zero64 flags the situation when VL is a power of 2 (twice the first
zero-detector width) and the whole mask is zero. (This will happen
only on the first iteration, as after the first update the mask
register will always have removed all counted zeros.)
The MUX on the 64-bit mask runs as follows { zero1 = 0 => the 64 bits
are shifted down 1 bit with a 1 passed in the MSB, zero1 = 1 => the 64
bits are shifted down 2 bits with 1's passed in the 2 most significant
bits}.  (We don't worry about updating the register if zero64 = 1 as
we have counted the whole register as 0 and thus are done.)

{dst|src}step increment output:  If zero64 is 1 a mux is set to output
the bit pattern 0b100000 = 64, otherwise (zero64 = 0) the mux connects
the other zero detectors to the output (providing a value between
0-63).

After the step has been incremented to point to the next non-zero bit
of the mask, the register is
reloaded from the MUX output of the last stage if zero64 = 0,
otherwise we're done.

> for multi issue and Dynamic Partitioned SIMD, multiple of those are
> required.

A pair of source and destination predication masks for each ALU, right?

> > by contrast skipping the pipeline and inserting a zero into the outputs is
> > > relatively straightforward.
> >
> > I guess I was thinking of vector sums and multiplies.
>
>
> the normal Vector processor only has zeroing.  ORing in parallel is done to
> merge bit-inverted-masked operations together to do parallel if then else.

By this do you mean "the normal Vector processor only has" output "zeroing"?

I guess I can see that if you don't want certain elements of a vector
to be multiplied or added you can simply exclude them from the source
predication mask, no need to send it zero operands!

> > What if we made the semantics be that by default an instruction
> > ignores the src_zeroing flag and only considered the src_zeroing flag
> > for instructions for which it made sense?
>
>
> this requires going through every single instruction and marking it with a
> CSV File entry.
>
> will probably take about a week.
>
> we don't have a week to waste.

I realize we don't have a lot of time to do this design.  I also
realize we will have plenty of time to regret the mistakes we make at
this stage--especially the ones that can't be corrected easily in a
compatible fashion.