[Libre-soc-dev] [RFC] svp64 "source zeroing" makes no sense

Richard Wilbur richard.wilbur at gmail.com
Wed Mar 31 00:24:55 BST 2021


> On Mar 23, 2021, at 22:50, Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
>
> On Wednesday, March 24, 2021, Richard Wilbur <richard.wilbur at gmail.com>
> wrote:
>>
>> being similar to "just store 0 in the destination".  I asked the above
>> question because of something you said in an E-mail message in this
>> thread which I received on 21 Mar (you sent on 22 Mar),
>> "bear in mind two things.
>>
>> 1) Dynamic Partitioned ALUs will require receiving MULTIPLE predicate mask
>> bits i.e. cover multiple src/dest steps.
>
>
> right.  sorry.  wrong words.
>
> because 64 bit DynamicPartitioned ALUs can represent / compute multiple
> elements (8 in the case of elwidth=8 overrides) this is *effectively*
> multi-issue.
>
> therefore @ ew=8 you need to shove 8 lots of elements into each one.
>
> therefore that requires 8 src/dest steps

This implies we have 8 byte-size adds that are called for in the code.
If this were a couple of vectors of 8 bytes, the operation fits the ALU
perfectly, and the operands are also likely stored in a couple of 64-bit
registers, making the job of marshalling the arguments very
straightforward.
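
To make the "fits the ALU perfectly" case concrete, here is a minimal
software sketch of eight independent byte adds on operands packed into two
64-bit values.  It only models the arithmetic, not the hardware
DynamicPartitioned ALU, and the name partitioned_add8 is my own, made up
for illustration.

```python
MASK64 = (1 << 64) - 1
LOW7 = 0x7F7F7F7F7F7F7F7F   # low 7 bits of every byte lane
HIGH1 = 0x8080808080808080  # top bit of every byte lane


def partitioned_add8(a, b):
    """Add eight independent 8-bit lanes packed into two 64-bit values.

    Hypothetical helper: a software model only, not the hardware design.
    """
    # Adding only the low 7 bits of each lane cannot carry past bit 7,
    # so no carry ever crosses into the neighbouring lane.
    low = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is the XOR of both top bits and the carry
    # out of the low 7 bits (already sitting in bit 7 of `low`).
    return (low ^ ((a ^ b) & HIGH1)) & MASK64


a = 0x01FF7F8010203040
b = 0x0101018001020304
result = partitioned_add8(a, b)  # each byte wraps modulo 256 on its own
```

Each byte wraps on its own, so the 0xff + 0x01 lane gives 0x00 without
disturbing the lane above it.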

What seems significantly less likely is having several unrelated 8-bit
(byte-sized) adds to marshal together.  Do you have a particular
algorithm in mind?

Likewise, having a dynamically partitioned ALU is great for changing
width of parameters but my guess is we are still running the same
instruction across the whole business at one time.

> where the squishy brown (blue in the case of mythbusters because it was
> easier for the cameras) hits the axially rotating bladed device is in
> routing arbitrary data from regfiles in different lanes.
>
> it may be easier if the predicate masks are full of holes to just go "screw
> it" and run say 50% of the 8 bit DynamicPartitions empty.
>
> however if there happens to be a masked parallel if-then-else construct,
> these typically use the exact same predicate just with inverted mask.
>
> if those parallel then-else clauses *happen* to require the exact same ALU
> *and* the registers are in the same lanes we *might* be able to match up
> the opposing masks and fill the ALUs to run 100%.
>
> i stress might.

I guess here it would be wonderful to consider what applications we
might have that could use these constructs to good effect.  If they
are something the compiler could be expected to generate, what code
constructs would be translated this way?  If not, would this amount to
a significant optimisation for a particular algorithm?  In that case
we could hand-code the assembly and provide it to implementers as a C
function call.

[…]

>>> start from a position other than the start.  basically shift the value
>>> down, trash N bits, then count.
>>
>> Latest revision has that as well.  That is what is required to start
>> where we left off after returning from an interrupt.
>
>
> to cope with reentrancy, my feeling is the algorithm should be like this:
>
> * already_done = (1<<srcstep) - 1 # zero on start
> * temp = predicatemask | already_done
> * startfrom_srcstep = cntlzero(temp)

According to my understanding of cntlzero(mask), it is "count leading
zeros in mask":
start counting zeros from the least-significant bit up to the first set
bit (=1).

If that is the case, the method you outline above isn't likely to work.

It erases the zeros that have already been counted but then
starts over at the bottom.
For predicatemask = 0x00ffff00, srcstep = 8
already_done = (1<<8) -1 = 0x000000ff
temp = 0x00ffff00 | 0x000000ff = 0x00ffffff
startfrom_srcstep = cntlzero(0x00ffffff) -> 0
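
The trace above can be reproduced with a small sketch.  It assumes a 32-bit
mask and my reading of cntlzero (zeros counted upward from bit 0); the
function names are mine, for illustration only.

```python
def count_zeros_from_lsb(value, width=32):
    """Count zero bits starting at bit 0, stopping at the first set bit.

    Assumption: this models cntlzero as described in the text above.
    """
    for bit in range(width):
        if (value >> bit) & 1:
            return bit
    return width  # no bit set at all


def startfrom_srcstep_quoted(predicatemask, srcstep, width=32):
    """The three quoted steps, transcribed as written (OR with already_done)."""
    already_done = (1 << srcstep) - 1       # zero on start
    temp = predicatemask | already_done
    return count_zeros_from_lsb(temp, width)


fresh = startfrom_srcstep_quoted(0x00FFFF00, 0)    # 8: first set bit is bit 8
resumed = startfrom_srcstep_quoted(0x00FFFF00, 8)  # 0: the OR fills bits 0..7
```

On a fresh start the count is 8 as expected, but resuming at srcstep = 8
gives 0 rather than 8.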

With my proposed solution:
initialize(mask = 0x00ffff00, count = 8):
  register = mask
  # next_state = register >> count, shifting in 1's from above
  next_state = 0x00ffff00 >> 8 = 0xff00ffff
  register = next_state # load new state into register
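
As a sanity check of the arithmetic, a minimal sketch of that initialize
step, assuming a 32-bit register (the width parameter is my own addition
and not part of the proposal):

```python
def initialize(mask, count, width=32):
    """Shift mask right by count, shifting in 1s from above.

    Sketch only: width=32 is an assumption made for this example.
    """
    full = (1 << width) - 1
    # The top `count` bits set: these are the 1s shifted in from above.
    ones_from_above = full & ~(full >> count)
    return ((mask & full) >> count) | ones_from_above


register = initialize(0x00FFFF00, 8)  # 0xff00ffff, matching the trace above
```

With count = 0 the mask is returned unchanged.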

>>>> Where in the loop is the valid exit point if an interrupt occurs?
>>>
>>> at any time.  it's a Sub-Program-Counter and should be treated as such.
>>
>> I don't see the "Sub-Program-Counter" in the SVSTATE documentation.
>
>
> it's a conceptual one.

Is there a state-transition diagram for this?


