[Libre-soc-dev] memcpy optimization

Sun Dec 13 16:48:18 GMT 2020

> On Dec 12, 2020, at 06:41, Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
[…]
> 
> memcpy is therefore pretty much exactly the same with the predicate
> mask detection and zero detection stripped out.
> 
>        c.mv a3, a0               # Copy dst
>    loop:
>        setvli x0, a2, vint8    # Vectors of bytes.
>        vlbff.v v1, (a1)        # Get src bytes
>        vseq.vi v0, v1, 0       # Flag zero bytes
>        vsb.v v1, (a3)        # Write out bytes
>        csrr t1, vl             # Get number of bytes fetched
>        c.bgez t1, exit           # Done
>        c.add a1, a1, t1          # Bump src pointer
>        c.sub a2, a2, t1          # Decrement count.
>        c.add a3, a3, t1          # Bump dst pointer
>        c.bnez a2, loop           # Anymore?
> 
>    exit:
>        c.ret
> 
> the vmfirst and vmsif have gone, the ST has the predicate mask gone,
> and the CSR load of VL has a bgez t1 after it instead of a bgez a3.
> 
> those are the *only* modifications.

Why is memcpy still flagging zero bytes (vseq.vi)?  Nothing reads v0 afterwards now that the store is unmasked, so it seems like wasted work here.

I get your point about not needing vmfirst, vmsif, or direct manipulation of VL.
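
To make sure we are talking about the same loop, here is a rough scalar C model of what I would expect memcpy to reduce to once the zero-flagging step is dropped as well.  It is only a sketch: VLMAX_BYTES, setvl_model and vec_memcpy_model are names I made up, the 16-byte vector length is an arbitrary assumption, and the fail-first behaviour of the load (which may shorten VL) is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical maximum vector length in bytes; real hardware picks this. */
#define VLMAX_BYTES 16

/* Models the setvli step: grant min(requested, VLMAX_BYTES) elements. */
static size_t setvl_model(size_t requested)
{
    return requested < VLMAX_BYTES ? requested : VLMAX_BYTES;
}

/* Strip-mined byte copy. Note there is no compare-against-zero anywhere. */
static void *vec_memcpy_model(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;          /* plays the role of a3 */
    const unsigned char *s = src;    /* plays the role of a1 */

    while (n != 0) {                 /* "Anymore?" test on the count (a2) */
        size_t vl = setvl_model(n);  /* number of bytes handled this pass */
        assert(vl > 0 && vl <= n);

        for (size_t i = 0; i < vl; i++)
            d[i] = s[i];             /* vector load followed by vector store */

        s += vl;                     /* bump src pointer */
        n -= vl;                     /* decrement count */
        d += vl;                     /* bump dst pointer */
    }
    return dst;                      /* original dst is returned unchanged */
}
```

Each pass needs only the element count granted by the setvl step, which is why the compare looks redundant to me.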