[Libre-soc-dev] [RFC] svp64 "source zeroing" makes no sense

Wed Mar 24 04:49:36 GMT 2021

On Wednesday, March 24, 2021, Richard Wilbur <richard.wilbur at gmail.com>
wrote:

>
> being similar to "just store 0 in the destination".  I asked the above
> question because of something you said in an E-mail message in this
> thread which I received on 21 Mar (you sent on 22 Mar),
> "bear in mind two things.
>
> 1) Dynamic Partitioned ALUs will require receiving MULTIPLE predicate mask
> bits i.e. cover multiple src/dest steps.

right.  sorry.  wrong words.

because 64 bit DynamicPartitioned ALUs can represent / compute multiple
elements (8 in the case of elwidth=8 overrides) this is *effectively*
multi-issue.

therefore @ ew=8 you need to shove 8 lots of elements into each one.

therefore that requires 8 src/dest steps

where the squishy brown (blue in the case of mythbusters because it was
easier for the cameras) hits the axially rotating bladed device is in
routing arbitrary data from regfiles in different lanes.

it may be easier if the predicate masks are full of holes to just go "screw
it" and run say 50% of the 8 bit DynamicPartitions empty.

however if there happens to be a masked parallel if-then-else construct,
these typically use the exact same predicate just with inverted mask.

if those parallel then-else clauses *happen* to require the exact same ALU
*and* the registers are in the same lanes we *might* be able to match up
the opposing masks and fill the ALUs to run 100%.

i stress might.
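
to illustrate the mask-matching idea, here is a minimal sketch, not a
description of any actual hardware. it assumes 8 lanes of 8-bit adds, LSB0
mask numbering (bit i controls lane i), and that operands for both clauses
can be routed to the same lane. the names masked_add and merged_add are
hypothetical. two half-empty passes (mask, then inverted mask) are compared
against one pass with every lane busy.

```python
LANES = 8
FULL = (1 << LANES) - 1


def masked_add(dest, src1, src2, mask):
    """One ALU pass: lanes whose mask bit is clear do no work (dest kept)."""
    return [
        (a + b) & 0xFF if (mask >> i) & 1 else d
        for i, (d, a, b) in enumerate(zip(dest, src1, src2))
    ]


def merged_add(then_srcs, else_srcs, mask):
    """One ALU pass, all lanes busy: each lane picks its operand pair by mask."""
    out = []
    for i in range(LANES):
        srcs = then_srcs if (mask >> i) & 1 else else_srcs
        out.append((srcs[0][i] + srcs[1][i]) & 0xFF)
    return out


def two_pass(then_srcs, else_srcs, mask):
    """The unmerged form: 'then' under mask, 'else' under the inverted mask."""
    dest = [0] * LANES
    dest = masked_add(dest, then_srcs[0], then_srcs[1], mask)
    return masked_add(dest, else_srcs[0], else_srcs[1], ~mask & FULL)
```

both forms give the same destination vector; the merged one uses one pass
instead of two, which is the "fill the ALUs to run 100%" case.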

>
> So, it sounds like the source and destination predication masks are
> important to the issuer in determining which parts of the source
> vector to read and process and which parts of the destination vector
> to write.

ahh yes, very.

>  The byte-level write-enable lines look like they have more
> to do with how the SIMD ALUs are partitioned

>
ahh no.  the partition sizes are determined by the element-width overrides.

as the element widths are 8, 16, 32 and 64 this tells you *how many*
byte-write lines to enable per element, where the mask bits apply *per
element*.

> and store their results.

 yes.
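
the relationship between elwidth, per-element mask bits and byte-write
lines can be sketched as below. this is a rough illustration only: it
assumes a 64-bit register with 8 byte-write-enable lines, LSB0 numbering,
and that element i occupies bytes i*(elwidth/8) upwards. the function name
byte_write_enables is made up for the example.

```python
def byte_write_enables(predicate_mask: int, elwidth: int) -> int:
    """Expand per-element predicate bits into 8 byte-write-enable bits."""
    if elwidth not in (8, 16, 32, 64):
        raise ValueError("elwidth must be 8, 16, 32 or 64")
    bytes_per_element = elwidth // 8          # how many lines each mask bit drives
    elements = 64 // elwidth                  # elements held in one 64-bit register
    group = (1 << bytes_per_element) - 1      # e.g. 0b1111 for elwidth=32
    enables = 0
    for i in range(elements):
        if (predicate_mask >> i) & 1:
            enables |= group << (i * bytes_per_element)
    return enables
```

so at elwidth=8 each mask bit drives one byte-write line, and at elwidth=64
a single mask bit drives all eight.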

> > start from a position other than the start.  basically shift the value
> > down, trash N bits, then count.
>
> Latest revision has that as well.  That is what is required to start
> where we left off after returning from an interrupt.

to cope with reentrancy, my feeling is the algorithm should be like this:

* already_done = (1<<srcstep) - 1 # zero on start
* temp = predicatemask | already_done
* startfrom_srcstep = cntlzero(temp)
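
a runnable sketch of one way to read this, following the "shift the value
down, trash N bits, then count" description quoted above. note the
assumptions: LSB0 numbering (mask bit i belongs to element i), and the
already-done bits are discarded by shifting and the count is of trailing
zeros, which is not literally the OR-then-cntlzero form in the three steps.
treat it as an interpretation, not the definitive algorithm. next_step and
vl are names chosen for the example.

```python
def next_step(predicate_mask: int, srcstep: int, vl: int) -> int:
    """First step >= srcstep whose predicate bit is set, or vl if none is."""
    mask = predicate_mask & ((1 << vl) - 1)   # ignore bits beyond the vector length
    remaining = mask >> srcstep               # shift down: already-done bits are trashed
    if remaining == 0:
        return vl                             # nothing left to do
    # lowest set bit isolated with (x & -x); its position is the trailing-zero count
    trailing_zeros = (remaining & -remaining).bit_length() - 1
    return srcstep + trailing_zeros
```

with srcstep=0 this is an ordinary "find first enabled element"; with a
nonzero srcstep restored after an interrupt it resumes from where the loop
left off.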

>
> > > Where in the loop is the valid exit point if an interrupt occurs?
> >
> > at any time.  it's a Sub-Program-Counter and should be treated as such.
>
> I don't see the "Sub-Program-Counter" in the SVSTATE documentation.

it's a conceptual one.

>   I
> see the srcstep, dststep,

that's the conceptual Sub-PC

> and svstep.

that's the conceptual Sub-Sub-PC

> Do we always finish an issue in
> progress?

no choice there.  at least not until we add OperationCancellation
(Shadowing)

>  In other words, after we update srcstep, if we get an
> interrupt (hardware) before we update dststep, do we jump out of the
> loop before we update dststep?

yes... where the interrupt handling is REQUIRED to save SVSTATE along with
MSR and PC.

>   If this is how it works, this could be
> difficult to restart at that particular spot.

nope.  not at all.  the rfid instruction restores SVSTATE, PC and MSR from
the SPRs.
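
a toy model of why that works, under loud assumptions: srcstep stands in
for the whole of SVSTATE, the "instruction" is an element-by-element
add-one from src to a separate dest, and trap / rfid here are plain python
functions named after the concepts, not models of the real instructions or
SPRs.

```python
import copy
from dataclasses import dataclass


@dataclass
class State:
    pc: int = 0
    msr: int = 0
    srcstep: int = 0  # stand-in for SVSTATE (the conceptual Sub-PC)


def run_vector_op(state, dest, src, vl, interrupt_at=None):
    """Element loop. Returns False if interrupted before step interrupt_at."""
    while state.srcstep < vl:
        if state.srcstep == interrupt_at:
            return False                      # leave srcstep pointing at unfinished work
        dest[state.srcstep] = src[state.srcstep] + 1
        state.srcstep += 1
    state.srcstep = 0                         # vector op complete: Sub-PC resets
    return True


def trap(state):
    """Interrupt entry: save PC, MSR and SVSTATE, then let the handler clobber them."""
    saved = copy.copy(state)
    state.pc, state.msr, state.srcstep = 0, 0, 0
    return saved


def rfid(state, saved):
    """Return from interrupt: restore all three, so the loop resumes mid-vector."""
    state.pc, state.msr, state.srcstep = saved.pc, saved.msr, saved.srcstep
```

because srcstep comes back with PC and MSR, re-executing the same
instruction continues at the interrupted element rather than at element 0.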

l.
