[Libre-soc-dev] [RFC] svp64 "source zeroing" makes no sense

Mon Mar 22 01:13:24 GMT 2021

On Sun, Mar 21, 2021 at 5:44 PM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>
> On Sunday, March 21, 2021, Richard Wilbur <richard.wilbur at gmail.com> wrote:
> > The optimization is very simple.
>
> it's not going to be as simple as a single bit test inside a loop.  that is
> the absolute top priority right now because we have people waiting on the
> critical path for working hardware and simulators.

Pretty close to that simple.  Here's the idea:
Load a 64-bit shift register with the value of the mask (bits 0 to
VL-1), with all bits at and above VL set to 1.
Connect the lowest 32 bits of the shift register to a zero-detector (a
32-input NOR) whose output drives a MUX { 0 := passes the 64 bits
through unchanged, 1 := shifts them down 32 bits (and passes 1's in
the high 32 bits of the output) }.
Follow each stage with another stage of half the test and shift size,
fed from the previous stage's MUX output:  16, 8, 4, 2, 1.
The outputs of the six zero detectors, weighted 32, 16, 8, 4, 2, 1,
form the increment to the {dst|src}step (a value between 0 and 63).
After the step has been incremented to point to the next non-zero bit
of the mask, the shift register is reloaded from the MUX output of the
last stage.  In the outside loop, when the step is incremented by one
after we finish the operation, we shift the register down 1 bit
(shifting in a 1 at the top).
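
A software model of the staged network described above may help check
the arithmetic.  This is only a sketch in Python, not RTL and not taken
from the Libre-SOC sources.  The names load_shift_register and
skip_to_next_set_bit are made up for illustration, and it assumes every
stage fills the bits it vacates at the top with 1's.

```python
MASK64 = (1 << 64) - 1

def load_shift_register(mask, vl):
    """Keep mask bits 0..vl-1 and set every bit at and above vl to 1."""
    below_vl = (1 << vl) - 1
    return (mask & below_vl) | (MASK64 & ~below_vl)

def skip_to_next_set_bit(reg):
    """Model the six cascaded stages (32, 16, 8, 4, 2, 1).

    Returns (increment, new_reg).  Each stage tests the low `width` bits;
    if they are all zero it shifts down by `width`, fills the top with
    1's, and contributes `width` to the increment.
    """
    increment = 0
    for width in (32, 16, 8, 4, 2, 1):
        low_bits = (1 << width) - 1
        if (reg & low_bits) == 0:          # zero-detector (NOR) output = 1
            reg = (reg >> width) | (low_bits << (64 - width))
            increment += width
    return increment, reg

# Lowest set bit of 0b101000 is bit 3, so the step advances by 3.
inc, reg = skip_to_next_set_bit(load_shift_register(0b101000, 8))
print(inc, reg & 1)
```

Note that a completely empty 64-bit register still yields an increment
of 63 in this model, which is why a separate all-zeros check is needed.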

The only other thing would be to have logic to detect when the mask is
all zeros and skip to the next instruction because there really isn't
anything to do in that case (all inputs and/or all outputs excluded).
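
Putting the pieces together, the outer loop with the all-zeros early
exit could be modelled roughly as below.  Again this is a hedged Python
sketch: active_elements and skip_zeros are hypothetical names, and the
"operation" is reduced to recording which element index would execute.

```python
MASK64 = (1 << 64) - 1

def skip_zeros(reg):
    """Staged skip: returns (increment, new_reg), filling the top with 1's."""
    inc = 0
    for width in (32, 16, 8, 4, 2, 1):
        low_bits = (1 << width) - 1
        if (reg & low_bits) == 0:
            reg = (reg >> width) | (low_bits << (64 - width))
            inc += width
    return inc, reg

def active_elements(mask, vl):
    """Return the element indices the vector loop would actually execute."""
    below_vl = (1 << vl) - 1
    if (mask & below_vl) == 0:
        return []                      # all-zero mask: skip the instruction
    reg = (mask & below_vl) | (MASK64 & ~below_vl)
    step = 0
    visited = []
    while True:
        inc, reg = skip_zeros(reg)     # jump to the next set mask bit
        step += inc
        if step >= vl:                 # ran into the 1's above VL: done
            break
        visited.append(step)           # the element operation happens here
        step += 1                      # outer-loop increment by one...
        reg = (reg >> 1) | (1 << 63)   # ...and shift down, 1 in at the top
    return visited

print(active_elements(0b10100100, 8))
```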

> > How sparse do you expect these
> > predication masks to be?
>
>
> literally absolutely all and anything.  complete arbitrary all set, one bit
> not set right the way down to single bit set, at any point.
>
> all and anything, from 0/1 when VL=1 right the way to all and any possible
> permutations 2^64 when VL=64.
>
> trying to optimise for one particular workload of predicate masks is
> guaranteed to backfire, basically.

This is not an optimization for a particular workload.  It will never
be slower than the comparison loop.  I only ask about the expected
sparseness in order to gauge how much of an improvement to expect.
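
To put rough numbers on that, here is a deliberately crude cost model
in Python.  It assumes (my assumption, not a measurement) that the
single-bit-test loop spends one iteration per element position whether
or not its mask bit is set, while the skip-ahead version spends one
iteration per set bit.  The function names are hypothetical.

```python
def bit_test_iterations(mask, vl):
    """Single-bit-test loop: one iteration per element position below vl."""
    return vl

def skip_ahead_iterations(mask, vl):
    """Skip-ahead loop: one iteration per set mask bit below vl."""
    return bin(mask & ((1 << vl) - 1)).count("1")

examples = (
    ("all set", (1 << 64) - 1),
    ("every other bit", 0x5555555555555555),
    ("single bit", 1 << 40),
    ("empty", 0),
)
for label, mask in examples:
    print(label, bit_test_iterations(mask, 64), skip_ahead_iterations(mask, 64))
```

Under this model the skip-ahead count can equal the bit-test count
(fully set mask) but cannot exceed it, since a mask has at most VL set
bits below VL.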

> > > this will be possible as a choice for individual implementors where it
> > > makes sense based on gate count, performance and power consumption for
> > > their needs.
> > >
> > > it will be helpful to record such optimisations for when there is time to
> > > implement them.
> >
> > I'm happy to do that.  It's just so simple that I was thinking it
> > sounded like an easy win if we expect predication masks to be fairly
> > sparse as you save the cycles every time you perform an instruction
>
>
> all and any time spent on optimisations prevents and prohibits people
> waiting from proceeding.
>
> this is a critical path right now and we cannot afford the luxury at this
> time.
>
> please do record it so that as i already said, when there is time, it may
> be examined, and at that point, further time will be saved because we have
> a procedure.
>
> bear in mind that i have been planning this for a long while.  the
> predicate masks when element width overrides are implemented will go
> directly into the PartitionedSignal as well as into the byte-level
> write-enable lines on the register file.

It wouldn't necessarily change how the predicate masks are used
anywhere else.  Just in determining the next value of {src|dst}step in
the vector execution loop.

> > I guess that's because I don't understand the intent.  To me, source
> > zeroing just passes 0's into whatever you were going to do.
>
>
> yes, that was the old behaviour, which is "nice and logical".  the problem
> is, it makes no sense for e.g. LD or ST to try to LD or ST from address 0
> when the input parameters have zero-predication, does it? in fact it would
> be dangerous to try because it will throw exceptions or worse produce
> garbage.
>
> and for divide operations this will cause overflow, garbage, or in FP it
> will cause spurious exceptions.
>
> etc etc.
>
> the task of going through "what does it mean for inputs to be zero" on each
> and every single operation is a very large one.
>
> worse than that it is necessary to define a procedure for people in the
> future.
>
> worse than that it interferes with the logic for reading operands, in ways
> that i am not looking forward to implementing.
>
> by contrast skipping the pipeline and inserting a zero into the outputs is
> relatively straightforward.

I guess I was thinking of vector sums and multiplies.

What if we made the semantics such that, by default, an instruction
ignores the src_zeroing flag, and the flag is only honoured by
instructions for which it makes sense?
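
As a sketch of what that opt-in could look like (Python, purely
illustrative): the set HONOURS_SRC_ZEROING, the operation names in it,
and resolve_source are all hypothetical, and the choice to skip a
masked-out element when the flag is ignored is my assumption rather
than anything in the spec.

```python
# Hypothetical opt-in list: only these operations honour src_zeroing.
HONOURS_SRC_ZEROING = {"add", "mul"}

def resolve_source(op, value, pred_bit, src_zeroing):
    """Return the operand to feed the operation, or None to skip the element.

    pred_bit is the source predicate bit for this element.
    """
    if pred_bit:
        return value                   # element enabled: use the real operand
    if src_zeroing and op in HONOURS_SRC_ZEROING:
        return 0                       # opted in: masked-out source reads as 0
    return None                        # default: flag ignored, element skipped

print(resolve_source("add", 7, 0, True))      # opted-in op with zeroing
print(resolve_source("ld", 0x1000, 0, True))  # LD never sees address 0
```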

I guess the other question is, "What advantage is there to having two
output zeroing masks, one of which could be the source predication
mask you just used for some other operation?"  If that is a useful
semantic, I'm all for it.

I agree that for a number of operations feeding in zeros for the input
is setting ourselves up for a world of hurt.