[Libre-soc-dev] memcpy optimization

Fri Dec 11 19:19:28 GMT 2020

On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:

>> it *can* and *does already* when the memory block size is known at
>> compile-time (assuming alignment requirements are met).

ok right, got it.  now it's clear what you mean (i was referring to
the general case).

yes.  when the size is fixed and the alignment is guaranteed, memcpy
can be statically substituted with certain inline patterns.

here, yes, with static known sizes, the pseudo-op setvli can be used, no loop.
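
as a rough illustration of the static case in plain C (not SV): a
minimal sketch, assuming a copy of exactly 16 bytes.  the name copy16
is hypothetical.  each fixed-size memcpy into a uint64_t is the
portable way to write a 64-bit load or store, and compilers commonly
lower it to a single move, so no loop remains.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy exactly 16 bytes with no loop.
 * Both loads happen before both stores, so overlapping buffers are fine. */
void copy16(void *dst, const void *src)
{
    uint64_t lo, hi;

    memcpy(&lo, src, sizeof lo);                             /* 64-bit load  */
    memcpy(&hi, (const unsigned char *)src + 8, sizeof hi);  /* 64-bit load  */
    memcpy(dst, &lo, sizeof lo);                             /* 64-bit store */
    memcpy((unsigned char *)dst + 8, &hi, sizeof hi);        /* 64-bit store */
}
```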

btw this is *the* primary reason why i specifically added a static VL
option to the old SVP-64, so that situations like this could literally
be covered with a single 64 bit instruction.

memcpy where the size is known to be 16 bytes becomes only two static
instructions (one LD, one ST), each covering 2x 64-bit elements:

    LD.VL=immed2 r4, 0(r5)
    ST.VL=immed2

hence i was pissed that we had to drop the old SVP-64 encoding because
we needed 27 bits to do it.  sigh.

it is not so bad though.  twice as many instructions: one extra to set
VL to an immediate, one to set it back to zero.

    setvli VL=2 # 2x 64-bit LD/STs
    LD.SV r4, 0(r5)
    ST.SV
    setvli VL=0 # disable SV
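
to make the effect of that sequence concrete, here is a toy model in
C.  it is a sketch under assumptions, not the SV spec: all toy_* names
are hypothetical, VL=0 is taken to mean "scalar", a vector LD/ST is
assumed to walk consecutive registers and consecutive 64-bit memory
elements, and the ST operands (elided above) are assumed to mirror the
LD.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NREGS = 32 };

/* Toy machine state: NOT the real SV register file or semantics. */
struct toy_cpu {
    uint64_t r[NREGS];  /* integer registers */
    unsigned vl;        /* 0 = SV disabled, operations are scalar */
};

/* setvli: set VL to an immediate. */
void toy_setvli(struct toy_cpu *c, unsigned vl)
{
    assert(vl < NREGS);
    c->vl = vl;
}

/* LD.SV rt, 0(addr): with VL=n, load n 64-bit elements into rt, rt+1, ... */
void toy_ld_sv(struct toy_cpu *c, unsigned rt, const unsigned char *addr)
{
    unsigned n = c->vl ? c->vl : 1;
    assert(rt + n <= NREGS);
    for (unsigned i = 0; i < n; i++)
        memcpy(&c->r[rt + i], addr + 8 * i, 8);
}

/* ST.SV rs, 0(addr): with VL=n, store n 64-bit elements from rs, rs+1, ... */
void toy_st_sv(const struct toy_cpu *c, unsigned rs, unsigned char *addr)
{
    unsigned n = c->vl ? c->vl : 1;
    assert(rs + n <= NREGS);
    for (unsigned i = 0; i < n; i++)
        memcpy(addr + 8 * i, &c->r[rs + i], 8);
}

/* The four-instruction sequence for a 16-byte copy. */
void toy_memcpy16(struct toy_cpu *c, unsigned char *dst,
                  const unsigned char *src)
{
    toy_setvli(c, 2);       /* setvli VL=2 */
    toy_ld_sv(c, 4, src);   /* LD.SV r4, 0(src) */
    toy_st_sv(c, 4, dst);   /* ST.SV r4, 0(dst), operands assumed */
    toy_setvli(c, 0);       /* setvli VL=0 */
}
```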

ok ok SUBVL could be used to set a vec2.

    LD.SUBVL=2 ...
    ST....

make VL larger then. say... 5.  when using SUBVL=2/3/4 that covers
quite a lot of use-cases and does not involve VL.

in the general dynamic case, when either the count or the alignment is
*not* known, it is however flat-out impossible to use 64 bit
granularity: that's the seductive SIMD way.  the only way to make
dynamic general memcpy efficient is to use fail-on-first.
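
a rough sketch of what such a loop could look like, modelled in C at
byte granularity.  everything here is an assumption for illustration:
the toy_* names, the tiny TOY_MAXVL and TOY_PAGE values, and above all
the model of fail-on-first itself, which pessimistically treats every
page crossing as a would-be fault and truncates VL there instead of
trapping.  on real hardware the next page is usually valid and VL
would not be cut short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TOY_MAXVL = 16, TOY_PAGE = 64 };  /* deliberately tiny, hypothetical */

/* Toy fail-on-first byte load: ask for vl elements, return how many were
 * actually loaded.  The load is cut short at the next TOY_PAGE boundary.
 * Element 0 always succeeds in this model (a real fault on element 0
 * would trap as normal). */
size_t toy_ld_ffirst(unsigned char *vreg, const unsigned char *src, size_t vl)
{
    size_t to_boundary = TOY_PAGE - (size_t)((uintptr_t)src % TOY_PAGE);

    if (vl > to_boundary)
        vl = to_boundary;       /* VL truncated, no trap */
    memcpy(vreg, src, vl);
    return vl;
}

/* Dynamic memcpy: neither count nor alignment known at compile time. */
void *toy_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char vreg[TOY_MAXVL];
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n > 0) {
        size_t vl = n < TOY_MAXVL ? n : TOY_MAXVL;  /* request min(n, MAXVL) */
        vl = toy_ld_ffirst(vreg, s, vl);            /* may come back smaller */
        assert(vl >= 1 && vl <= TOY_MAXVL);
        memcpy(d, vreg, vl);                        /* store what was loaded */
        d += vl;
        s += vl;
        n -= vl;
    }
    return dst;
}
```

the loop never inspects alignment or special-cases the tail: it just
advances by however many elements the load reports.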

l.