[Libre-soc-dev] memcpy optimization

Fri Dec 11 22:25:10 GMT 2020

On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
> On Fri, Dec 11, 2020, 11:20 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
>> the general dynamic case [of memcpy], when either the count or alignment
>> is *not* known, is however flat-out impossible to use 64 bit granularity:
>> that's the seductive SIMD way.
>
>
> agreed.
>
>> the only way to make dynamic general
>> memcpy efficient is to use fail-on-first.
>>
>
> no, fail on first is used when you are using a data-dependent loop count,

again: this is incorrect.  it was a mistake for me to refer you to the
strncpy example to illustrate that ffirst applies to memcpy.

the end-of-string detection has *nothing to do with the LD*.

i repeat, again: ffirst has NOTHING to do with the LD or with zero detection.

look again at the assembly code.

the zero detection *uses* the (new) VL.  the zero detection uses mask
ops to find the zero point.

> memcpy is data-independent (copies the same number of bytes no matter what
> byte values it sees).

you're misunderstanding how vectors work, and not listening.  VL is
not a hard fixed quantity, *MVL* is the invariant quantity.

the fact that VL need not be exactly the requested amount (i.e. is
modified by the LD) can be exploited to optimise subsequent LDs.

let me try again.  let's keep it to byte level LDs.

* a memcpy of bytes with VL=16 starts only 7 bytes from the end of the page.
* this is nonaligned to 64 bit by one byte.
* normally a misaligned pagefault would occur, requiring bytes 7 thru
15 to be in memory in the following page.
* let us assume that they are not.
* the data-dependent ffirst flag on the LD **STOPS** at loop index 6
(after only 7 items) and **TRUNCATES** VL to 7 (setting the new value
in the SPR)
* the 7 LDs **SUCCEED WITHOUT A PAGE FAULT**

the next and subsequent LDs then continue from 16-byte-aligned
quantities **EVEN THOUGH THE LDs STARTED INITIALLY AT A NONALIGNED
POINT**

this is *hugely* beneficial to performance to have the loop
miraculously self-align, because those non-aligned LDs are actually
incredibly expensive at the hardware level.
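
to make the scenario above concrete, here is a minimal python sketch of it.  it is a toy model, not the actual SVP64 semantics: the page size (4096), the function name `ffirst_memcpy_chunks` and the `resident_pages` set are all assumptions made up for illustration, and the model assumes that a fault on the *first* element traps as normal (the handler maps the page in) while a fault on any later element just truncates VL.

```python
PAGE = 4096   # assumed page size, for illustration only
MVL = 16      # maximum vector length, in bytes (byte-level LDs)

def ffirst_memcpy_chunks(src, count, resident_pages, page=PAGE, mvl=MVL):
    """Toy model: return (start_address, VL) for each vector LD issued."""
    resident = set(resident_pages)
    chunks = []
    while count > 0:
        vl = min(mvl, count)          # setvl: request up to MVL elements
        # assumption: element 0 behaves like a normal LD, so a fault here
        # traps and the (hypothetical) handler makes the page resident.
        resident.add(src // page)
        # fail-on-first: stop at the first later element that would fault
        # and truncate VL to the number of elements that succeeded.
        ok = 1
        while ok < vl and (src + ok) // page in resident:
            ok += 1
        vl = ok
        chunks.append((src, vl))
        src += vl                     # advance by the *actual* VL
        count -= vl
    return chunks

# start 7 bytes before the end of page 0; page 1 is not yet resident
chunks = ffirst_memcpy_chunks(PAGE - 7, 40, resident_pages={0})
print(chunks)
# [(4089, 7), (4096, 16), (4112, 16), (4128, 1)]
```

the first LD is cut down to 7 elements without faulting, and every LD after it starts on a 16-byte boundary, even though the copy started at a nonaligned address.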

in our implementation we need double the number of LDST Buffers to be
able to cope with misalignment, and coping with those misalignments
across page boundaries is going to get real hairy.

> if there's a page-fault (even if not using vector instructions at all)
> either that's a sigsegv or invisible to user code, so memcpy doesn't use
> fail-on-first.

scalar code with byte quantities does single bytes, which is hugely
suboptimal, and consequently yes, no page fault hits in the middle of
the LDs.

> code (ignoring memcpy's return value):
> memcpy: # r3=dest, r4=src, r5=count
>     setvl r6, r5, maxvl=64
>     ld <vec>r64, (<scalar>r4), elwidth=1
>     st <vec>r64, (<scalar>r3), elwidth=1
>     sub. r5, r5, r6
>     add r3, r3, r6
>     add r4, r4, r6
>     bne memcpy
>     blr

here however what is the max that VL can be... ah, up to 64.

so there will be up to 8x 64 bit LDs in one hit.

that means that the 8 LDs are very likely to fault.

that in turn, because there are so many, results in an average of 4 64
bit LDs being chucked out of the LDST Buffer (cancelled) due to a page
fault and associated trap handling.

throwing page faults like that is SERIOUSLY suboptimal, and if the LDs
are all misaligned the resource utilisation is absolutely dreadful.
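
for comparison, a rough python model of the quoted setvl loop (maxvl=64, no ffirst), started from the same nonaligned address.  the function names and the 4096-byte page size are assumptions for illustration; the model only tracks where each vector LD starts, nothing about the LDST Buffers themselves.

```python
def plain_chunk_starts(src, count, mvl=64):
    """Toy model of the plain setvl loop: start address of each vector LD."""
    starts = []
    while count > 0:
        vl = min(mvl, count)   # setvl r6, r5, maxvl=64
        starts.append(src)
        src += vl              # add r4, r4, r6
        count -= vl            # sub. r5, r5, r6
    return starts

def misaligned(starts, align=8):
    """Start addresses that are not aligned to `align` bytes (64 bit = 8)."""
    return [s for s in starts if s % align != 0]

# same starting point as before: 7 bytes from the end of a 4096-byte page
starts = plain_chunk_starts(4096 - 7, 200)
print(starts)              # [4089, 4153, 4217, 4281]
print(misaligned(starts))  # [4089, 4153, 4217, 4281]
```

because VL is never truncated, the loop never gets the chance to re-align: every single vector LD in this model starts one byte off a 64-bit boundary.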

so i repeat again: strncpy zero detection is *not* the driver behind
the use of ffirst.  getting the parallel LDs to exclude misalignments
(and other faulting) is the key driving factor behind why ffirst is
used in strncpy.

those exact same characteristics *also apply to memcpy and memset*.

if that's really not clear can i recommend finding and reading the
paper written by ARM's SVE team?

l.