[Libre-soc-dev] [RFC] svp64 "source zeroing" makes no sense

Mon Mar 22 04:12:25 GMT 2021

On Monday, March 22, 2021, Richard Wilbur <richard.wilbur at gmail.com> wrote:

>
> This is not an optimization for a particular workload.  It will never
> be slower than the comparison loop.  I only ask about the expected
> sparseness in order to gauge how much of an improvement to expect.

we have no idea, it's literally impossible to say.

bear in mind two things.

1) Dynamic Partitioned ALUs will require receiving MULTIPLE predicate mask
bits, i.e. they cover multiple src/dest steps.

2) Multi-issue will require multiple src/dest steps per clock.

>
> It wouldn't necessarily change how the predicate masks are used
> anywhere else.  Just in determining the next value of {src|dst}step in
> the vector execution loop.

high performance implementations cannot stall just because of a few mask
bits being zero.  the only reason for doing it that way right now is
because it is about 5 lines of code even in HDL.

there is an instruction which hunts for the next 1 after a trigger point,
"set before first" and a twin "set after first".

also there is a count leading zeros instruction.  aka a "Priority Encoder".

we need "count leading zeros starting from bit X".

for multi issue and Dynamic Partitioned SIMD, multiple of those are
required.
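
a rough python sketch of the difference, purely illustrative: the names
next_set_bit, step_by_step and next_n_set_bits are made up here, and bit i
of the mask is assumed to correspond to element i (LSB0 numbering, which is
an assumption, not the SVP64 spec).  step_by_step is the "5 lines" loop that
burns one iteration per zero mask bit, next_set_bit is the "find first 1
starting from bit X" priority encoder, and next_n_set_bits is what
multi-issue would need: several steps resolved at once.

```python
def step_by_step(mask: int, start: int, vl: int) -> int:
    """Simple loop: advance one element at a time past zero mask bits."""
    step = start
    while step < vl and not (mask >> step) & 1:
        step += 1
    return step


def next_set_bit(mask: int, start: int, vl: int) -> int:
    """Index of the first 1 bit at or after `start`, or `vl` if there is none.

    Models a priority encoder that starts from bit `start`
    (hypothetical helper, LSB0 bit numbering assumed).
    """
    remaining = (mask & ((1 << vl) - 1)) >> start
    if remaining == 0:
        return vl
    lowest = remaining & -remaining          # isolate the lowest set bit
    return start + lowest.bit_length() - 1   # convert it to an index


def next_n_set_bits(mask: int, start: int, vl: int, n: int) -> list:
    """Up to `n` active element indices from `start`, as multi-issue needs."""
    found = []
    pos = start
    while len(found) < n:
        pos = next_set_bit(mask, pos, vl)
        if pos >= vl:
            break
        found.append(pos)
        pos += 1
    return found


if __name__ == "__main__":
    mask = 0b10010010  # elements 1, 4 and 7 are active
    print(next_set_bit(mask, 2, 8))
    print(next_n_set_bits(mask, 0, 8, 2))
```

both single-step functions return the same answer; the point is only that
the second form does not depend on how many zero bits sit in between.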

> > by contrast skipping the pipeline and inserting a zero into the outputs is
> > relatively straightforward.
>
> I guess I was thinking of vector sums and multiplies.

the normal Vector processor only has zeroing.  ORing in parallel is done to
merge bit-inverted-masked operations together to do parallel if then else.

it works even for FP if you use add because zero is FP Zero.
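
as a minimal sketch of that parallel if-then-else (plain python lists
standing in for vector registers; masked_op_zeroing and if_then_else are
hypothetical names, not anything from the SVP64 spec): each branch runs
under its own mask with zeroing, the else branch uses the bit-inverted
mask, and the two results are merged with OR for integers or add for FP.

```python
import operator


def masked_op_zeroing(op, a, b, mask):
    """Apply op elementwise; masked-out elements produce zero (zeroing)."""
    return [op(x, y) if m else 0 for x, y, m in zip(a, b, mask)]


def if_then_else(mask, then_op, else_op, a, b, merge=operator.or_):
    """Parallel if/then/else built from two zeroing operations.

    `merge` is OR for integers; pass operator.add for FP, where the
    zeroed elements act as FP zero.
    """
    inverted = [not m for m in mask]
    then_part = masked_op_zeroing(then_op, a, b, mask)
    else_part = masked_op_zeroing(else_op, a, b, inverted)
    return [merge(t, e) for t, e in zip(then_part, else_part)]


if __name__ == "__main__":
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    mask = [1, 0, 1, 0]
    # where mask is set: a + b, elsewhere: b - a
    print(if_then_else(mask, operator.add, lambda x, y: y - x, a, b))
```

because every masked-out element is exactly zero, the two partial results
never overlap, so the merge needs no further masking.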

>
> What if we made the semantics be that by default an instruction
> ignores the src_zeroing flag and only considered the src_zeroing flag
> for instructions for which it made sense?

this requires going through every single instruction and marking it with a
CSV File entry.

will probably take about a week.

we don't have a week to waste.

>
> I guess the other question is, "What advantage is there to having two
> output zeroing masks, one of which could be the source predication
> mask you just used for some other operation?"

see the 4 combinations i described.

now combine them with the fact that if then else is done as a parallel
construct using masks.

even just not having to do the AND of two mask sets is useful, as it will
save one instruction.

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68