[Libre-soc-dev] memcpy optimization

Fri Dec 11 22:33:28 GMT 2020

On Fri, Dec 11, 2020 at 2:25 PM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>
> here however what is the max that VL can be... ah, up to 64.
>
> so there will be up to 8x 64 bit LDs in one hit.
>
> that means that the 8 LDs are very likely to fault.
>
> that in turn, because there are so many, results in an average of 4 64
> bit LDs being chucked out of the LDST Buffer (cancelled) due to a page
> fault and associated trap handling.
>
> that throwing page faults is SERIOUSLY suboptimal and if they are all
> misaligned the resource utilisation is absolutely dreadful.
>
> so i repeat again: strncpy zero detection is *not* the driver behind
> the use of ffirst.  getting the parallel LDs to exclude misalignments
> (and other faulting) is the key driving factor behind why ffirst is
> used in strncpy.
>
> those exact same characteristics *also apply to memcpy and memset*.
>
> if that's really not clear can i recommend finding and reading the
> paper written by ARM's SVE team?

This one https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf?
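
A rough C model of the fail-first argument quoted above may help make it
concrete. It is only a software sketch, not SVP64 or SVE code. The names
ffirst_vl and ffirst_memcpy, the 4096-byte page size, and the rule that
every page crossing would fault are all assumptions made for illustration;
only the maximum VL of 64 comes from the quoted text. The point it shows is
that a fail-first load truncates VL at the page boundary instead of issuing
loads that get cancelled, and the copy loop simply carries on with whatever
VL came back.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u /* assumed page size, not from the thread */
#define MAX_VL    64u   /* maximum VL in byte elements, as quoted above */

/* Hypothetical model of a fail-first load: return how many byte elements
 * complete when starting at simulated address sim_addr. Worst-case
 * assumption: any element past the current page would fault, so VL is
 * truncated at the page boundary rather than trapping. Element 0 is
 * always on the current page, so the result is at least 1 when
 * requested > 0. */
static size_t ffirst_vl(uint64_t sim_addr, size_t requested)
{
    size_t vl = requested < MAX_VL ? requested : MAX_VL;
    size_t to_boundary = (size_t)(PAGE_SIZE - (sim_addr % PAGE_SIZE));
    if (vl > to_boundary)
        vl = to_boundary;
    return vl;
}

/* Strip-mined copy loop: each pass asks for up to MAX_VL bytes and copies
 * only the number the fail-first model reports. sim_addr stands in for
 * the source address, so page crossings are deterministic. Returns the
 * number of passes taken. */
static size_t ffirst_memcpy(unsigned char *dst, const unsigned char *src,
                            size_t n, uint64_t sim_addr)
{
    size_t passes = 0;
    while (n > 0) {
        size_t vl = ffirst_vl(sim_addr, n);
        memcpy(dst, src, vl); /* stands in for the vector LD/ST pair */
        dst += vl;
        src += vl;
        sim_addr += vl;
        n -= vl;
        passes++;
    }
    return passes;
}
```

With a simulated source address 6 bytes short of a page boundary, the first
pass copies just those 6 bytes. The next pass then starts exactly on the new
page, so any fault lands on element 0 and is taken as one ordinary trap
rather than cancelling a batch of in-flight LDs.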

Cole