[Libre-soc-dev] memcpy optimization

Sat Dec 12 11:23:35 GMT 2020

On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:

> what I meant by data-invariance is that memcpy doesn't suddenly change size
> because a byte is zero.

it does not change VL in strncpy for the same reason either.

please look again at the assembly code.  *predicate masking* is
used to mask out up to the zero.

the predicate mask computation occurs *within the range of VL that has
been detected to be valid LDs*

not, repeat not, "ffirst analyses if the data being loaded is zero
data and truncates VL".

even when there are zeros in strncpy *VL IS NOT MODIFIED*.

it is PREDICATION that stops at the zero WITHOUT CHANGING VL.
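
to make the separation concrete, here is a minimal python model of the two phases. it is a sketch, not the SV spec: the names `ffirst_load`, `zero_predicate` and `mapped_limit` are hypothetical, memory is modelled as a flat byte string that becomes unmapped at `mapped_limit`, and the predicate is assumed to include the terminating zero itself.

```python
def ffirst_load(memory, mapped_limit, addr, vl):
    """Model of a fail-first LD: VL shrinks only where addresses are unmapped.

    The loaded *values* are never inspected here.
    """
    if addr >= mapped_limit:
        # A fault on element 0 is assumed to be taken as an ordinary trap.
        raise MemoryError("fault on first element")
    new_vl = min(vl, mapped_limit - addr)
    return list(memory[addr:addr + new_vl]), new_vl


def zero_predicate(data, vl):
    """Build a predicate mask within VL: active up to and including the first zero.

    VL is read, never written.
    """
    mask = []
    active = True
    for i in range(vl):
        mask.append(active)
        if data[i] == 0:
            active = False
    return mask


if __name__ == "__main__":
    mem = b"hi\x00junk!"
    data, vl = ffirst_load(mem, mapped_limit=8, addr=0, vl=8)
    print(vl, zero_predicate(data, vl))
```

in this model the zero at offset 2 only affects the mask. the only thing that shortens VL is lowering `mapped_limit`, i.e. the load running into an unmapped address.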

> VL doesn't change because a byte is zero --
> data-invariant.

correct.  i did not ever say that it was.

> VL getting changed by ffirst isn't because it hit a zero
> byte, but because ffirst hit a page-fault.

correct.

therefore, conclusion: there is no difference between how LD-based
ffirst is used in strncpy and how it is used in memcpy.

if you are tempted to believe otherwise please read the strncpy
assembly code again and again until that belief goes away :)
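
as an illustration of the memcpy side, a rough python model of a copy loop that simply advances by whatever VL the load delivered. everything here is an assumption made for the sketch: the 16-byte `PAGE`, the `mapped_pages` set, and the choice to model a fault on element 0 as a trap that maps the page and retries. nothing in it looks at the data being copied.

```python
PAGE = 16  # hypothetical tiny page size, for illustration only


def ffirst_load(src, addr, vl, mapped_pages):
    """Model a fail-first LD: return (data, vl), with vl cut at the first unmapped page."""
    if addr // PAGE not in mapped_pages:
        # Element 0 faults: modelled as a trap that maps the page, then retries.
        mapped_pages.add(addr // PAGE)
    n = 0
    while n < vl and (addr + n) // PAGE in mapped_pages:
        n += 1
    return src[addr:addr + n], n


def memcpy_ffirst(dst, src, length, maxvl, mapped_pages):
    """Copy `length` bytes; return the list of VLs actually used per iteration."""
    done = 0
    vls = []
    while done < length:
        vl = min(maxvl, length - done)                        # requested VL
        data, vl = ffirst_load(src, done, vl, mapped_pages)   # LD may shrink VL
        dst[done:done + vl] = data                            # ST uses the reduced VL
        done += vl
        vls.append(vl)
    return vls
```

a strncpy loop in the same model would have the identical load step, with the zero-detecting predicate applied afterwards to the data that the load returned.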

>>
>> the fact that VL need not be exactly the requested amount (i.e. is
>> modified by the LD) can be exploited to optimise subsequent LDs.
>>
>
> yes, but that only happens if there *is* an unmapped page. if there isn't
> an unmapped page, you're still stuck with the bad alignment because the load
> succeeded.

i would advocate that it was reasonable to modify VL right from the
very first iteration of the loop, such that subsequent iterations are
aligned.

the resource utilisation on the LDSTBuffers is far higher (double the
allocation of in-flight data) for misaligned accesses, which could easily
create a cascading backlog that impacts multi-issue.
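
a small python sketch of what "shorten the first VL so that the rest are aligned" could look like. `plan_chunks` and its parameters are hypothetical names, elements are assumed to be bytes, only the source address is considered, and `maxvl` is assumed to be a multiple of `align` so that alignment is kept once it has been reached.

```python
def plan_chunks(addr, length, maxvl, align):
    """Return (address, vl) pairs for a copy of `length` bytes starting at `addr`.

    A misaligned start gets a short first chunk that ends on an `align`
    boundary; later chunks then start aligned.
    """
    chunks = []
    while length > 0:
        to_boundary = (-addr) % align      # bytes until the next aligned address
        vl = min(maxvl, length)
        if to_boundary:
            vl = min(vl, to_boundary)      # stop at the boundary on a misaligned start
        chunks.append((addr, vl))
        addr += vl
        length -= vl
    return chunks


if __name__ == "__main__":
    print(plan_chunks(addr=5, length=20, maxvl=8, align=8))
```

the cost is one short iteration at the start; in exchange every later load starts on a boundary and avoids the doubled buffer allocation described above.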

> yes, hence why I proposed the 3-argument setvl instruction in the previous
> email -- we want good performance even if copying from/to already mapped
> pages.

apologies, far too late.  make a note, but otherwise this should have
been discussed 18 months ago at the time that ffirst was added to the
SV spec.

>> those exact same characteristics *also apply to memcpy and memset*.
>>
>
> not really, since we know the length ahead of time (VL, *not* maxvl) and
> don't have to try to load 64 bytes only to find out we only should have
> read 3 because that's where the null is.

again, i repeat, again, for about the fifth time in under 8 hours:
you're conflating the post-load analysis with the LD.

the LD is where VL is truncated.

the zero-analysis phase has absolutely nothing to do with the LD.

it is completely independent.

no connection whatsoever in any way shape or form

how many times do i have to repeat this.

what is it that is stopping you from going, "hang on why is he
repeating this five or six times, what have i missed?"

l.