[Libre-soc-dev] [RFC] svp64 "source zeroing" makes no sense

Hendrik Boom hendrik at topoi.pooq.com
Wed Mar 31 12:33:52 BST 2021


On Tue, Mar 30, 2021 at 05:24:55PM -0600, Richard Wilbur wrote:
> > On Mar 23, 2021, at 22:50, Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
> >
> > On Wednesday, March 24, 2021, Richard Wilbur <richard.wilbur at gmail.com>
> > wrote:
> >>
> >> being similar to "just store 0 in the destination".  I asked the above
> >> question because of something you said in an E-mail message in this
> >> thread which I received on 21 Mar (you sent on 22 Mar),
> >> "bear in mind two things.
> >>
> >> 1) Dynamic Partitioned ALUs will require receiving MULTIPLE predicate mask
> >> bits i.e. cover multiple src/dest steps.
> >
> >
> > right.  sorry.  wrong words.
> >
> > because 64 bit DynamicPartitioned ALUs can represent / compute multiple
> > elements (8 in the case of elwidth=8 overrides) this is *effectively*
> > multi-issue.
> >
> > therefore @ ew=8 you need to shove 8 lots of elements into each one.
> >
> > therefore that requires 8 src/dest steps
> 
> This implies we have 8 byte-size adds that are called for in the code.
> If this were a couple of vectors of 8 bytes the operation fits the ALU
> perfectly but the operands are also likely stored in a couple 64-bit
> registers making the job of marshalling the arguments very
> straight-forward.
> 
> What seems significantly less likely is having several unrelated 8-bit
> (byte-sized) adds to marshal together.  Do you have a particular
> algorithm in mind?
> 
> Likewise, having a dynamically partitioned ALU is great for changing
> width of parameters but my guess is we are still running the same
> instruction across the whole business at one time.
> 
> > where the squishy brown (blue in the case of mythbusters because it was
> > easier for the cameras) hits the axially rotating bladed device is in
> > routing arbitrary data from regfiles in different lanes.
> >
> > it may be easier if the predicate masks are full of holes to just go "screw
> > it" and run say 50% of the 8 bit DynamicPartitions empty.
> >
> > however if there happens to be a masked parallel if-then-else construct,
> > these typically use the exact same predicate just with inverted mask.
> >
> > if those parallel then-else clauses *happen* to require the exact same ALU
> > *and* the registers are in the same lanes we *might* be able to match up
> > the opposing masks and fill the ALUs to run 100%.
> >
> > i stress might.
> 
> I guess here it would be wonderful to consider what applications we
> might have that could use these constructs to good effect.  If they
> are something the compiler could be expected to generate, what code
> constructs would be translated this way?  If not, would this amount to
> a significant optimisation for a particular algorithm, in which case
> we could hand-code the assembly and provide it to implementers as a C
> function call.
> 
> […]
> 
> >>> start from a position other than the start.  basically shift the value
> >>> down, trash N bits, then count.
> >>
> >> Latest revision has that as well.  That is what is required to start
> >> where we left off after returning from an interrupt.
> >
> >
> > to cope with reentrancy, my feeling is the algorithm should be like this:
> >
> > * already_done = (1<<srcstep) - 1 # zero on start
> > * temp = predicatemask | already_done
> > * startfrom_srcstep = cntlzero(temp)
> 
> According to my understanding of cntlzero(mask) it is "count leading
> zeros in mask":
> start counting zeros from least-significant bit to first set bit (=1).

I find these two sentences confusing.
Aren't the leading zeros the most significant bits, not the least 
significant bits? 
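
To make the difference concrete, here is a small sketch that applies both
readings to the numbers from the example quoted further down.  The function
names and the 32-bit width are my own assumptions for illustration (the
width is picked only to match the 8-hex-digit values in the example); they
are not taken from the spec.

```python
WIDTH = 32  # assumed register width, for illustration only


def count_leading_zeros(value, width=WIDTH):
    """Count zero bits from the most-significant end down to the first 1."""
    value &= (1 << width) - 1
    return width - value.bit_length()


def count_trailing_zeros(value, width=WIDTH):
    """Count zero bits from the least-significant end up to the first 1."""
    value &= (1 << width) - 1
    if value == 0:
        return width  # no set bit at all
    # value & -value isolates the lowest set bit
    return (value & -value).bit_length() - 1


# numbers from the quoted example
predicatemask = 0x00ffff00
srcstep = 8
already_done = (1 << srcstep) - 1
temp = predicatemask | already_done

print(hex(temp))                   # 0xffffff
print(count_leading_zeros(temp))   # 8
print(count_trailing_zeros(temp))  # 0
```

So the result of 0 in the quoted example only comes out if the count runs
from the least-significant end, which is what I would call trailing zeros.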

-- hendrik

> 
> If that is the case, the method you outline above isn't likely to work.
> 
> It erases the zeros that have all been already counted but then
> starts over at the bottom.
> For predicatemask = 0x00ffff00, srcstep = 8
> already_done = (1<<8) -1 = 0x000000ff
> temp = 0x00ffff00 | 0x000000ff = 0x00ffffff
> startfrom_srcstep = cntlzero(0x00ffffff) -> 0
> 
> With my proposed solution:
> initialize(mask = 0x00ffff00, count = 8):
>   register = mask
>   # next_state = register >> count, shifting in 1's from above
>   next_state = 0x00ffff00 >> 8 = 0xff00ffff
>   register = next_state # load new state into register
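
For anyone wanting to check the arithmetic in the step above, a minimal
sketch of a right shift that fills with 1s from the top.  The helper name
and the 32-bit width are assumptions of mine, not anything from the SVSTATE
documentation.

```python
WIDTH = 32  # assumed register width, matching the 8-hex-digit values above


def shift_right_filling_ones(value, count, width=WIDTH):
    """Shift value right by count bits, shifting 1s in from the top."""
    full = (1 << width) - 1
    fill = full & ~(full >> count)  # the top `count` bits set, rest clear
    return ((value & full) >> count) | fill


next_state = shift_right_filling_ones(0x00ffff00, 8)
print(hex(next_state))  # 0xff00ffff
```

This reproduces the 0xff00ffff given above for mask = 0x00ffff00 and
count = 8.
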
> 
> >>>> Where in the loop is the valid exit point if an interrupt occurs?
> >>>
> >>> at any time.  it's a Sub-Program-Counter and should be treated as such.
> >>
> >> I don't see the "Sub-Program-Counter" in the SVSTATE documentation.
> >
> >
> > it's a conceptual one.
> 
> Is there a state-transition diagram for this?
> 


