[Libre-soc-dev] memcpy optimization
programmerjake at gmail.com
Fri Dec 11 23:06:34 GMT 2020
On Fri, Dec 11, 2020, 14:25 Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
> On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
> > On Fri, Dec 11, 2020, 11:20 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >> the general dynamic case [of memcpy], when either the count or alignment
> >> is *not*
> >> known, is however flat-out impossible to use 64 bit granularity:
> >> that's the seductive SIMD way.
> > agreed.
> > the only way to make dynamic general
> >> memcpy efficient is to use fail-on-first.
> > no, fail on first is used when you are using a data-dependent loop count,
> again: this is incorrect. it was a mistake for me to refer you to the
> strncpy example to illustrate that ffirst applies to memcpy.
> the end-of-string detection has *nothing to do with the LD*.
> i repeat, again: ffirst has NOTHING to do with the LD or with zero
> look again at the assembly code.
> the zero detection *uses* the (new) VL. the zero detection uses mask
> ops to find the zero point.
> > memcpy is data-independent (copies the same number of bytes no matter
> > what byte values it sees).
> you're misunderstanding how vectors work, and not listening. VL is
> not a hard fixed quantity, *MVL* is the invariant quantity.
what I meant by data-invariance is that memcpy doesn't suddenly change size
because a byte is zero. VL doesn't change because a byte is zero --
data-invariant. VL getting changed by ffirst isn't because it hit a zero
byte, but because ffirst hit a page-fault.
> the fact that VL need not be exactly the requested amount (i.e. is
> modified by the LD) can be exploited to optimise subsequent LDs.
yes, but that only happens if there *is* an unmapped page. if there isn't
an unmapped page, you're still stuck with the bad alignment because the load
> this is *hugely* beneficial to performance to have the loop
> miraculously self-align, because those non-aligned LDs are actually
> incredibly expensive at the hardware level.
yes, hence why I proposed the 3-argument setvl instruction in the previous
email -- we want good performance even if copying from/to already mapped
> in our implementation we need double the number of LDST Buffers to be
> able to cope with misalignment, and coping with those misalignments
> across page boundaries is going to get real hairy.
good thing SV already has a way to indicate that an instruction is
partially complete: vstart
> > if there's a page-fault (even if not using vector instructions at all)
> > either that's a sigsegv or invisible to user code, so memcpy doesn't use
> > fail-on-first.
> scalar code with byte quantities does single bytes which is hugely
> suboptimal and consequently yes no page fault hits in the middle of
> the LDs.
> > code (ignoring memcpy's return value):
> > memcpy: # r3=dest, r4=src, r5=count
> > setvl r6, r5, maxvl=64
> > ld <vec>r64, (<scalar>r4), elwidth=1
> > st <vec>r64, (<scalar>r3), elwidth=1
> > sub. r5, r5, r6
> > add r3, r3, r6
> > add r4, r4, r6
> > bne memcpy
> > blr
> here however what is the max that VL can be... ah, up to 64.
> so there will be up to 8x 64 bit LDs in one hit.
> that means that the 8 LDs are very likely to fault.
> that in turn, because there are so many, results in an average of 4 64
> bit LDs being chucked out of the LDST Buffer (cancelled) due to a page
> fault and associated trap handling.
> that throwing page faults is SERIOUSLY suboptimal and if they are all
> misaligned the resource utilisation is absolutely dreadful.
yeah, but page faults are really slow anyways, just dropping ops instead of
using vstart which is designed for this will not cost very much more than a
> so i repeat again: strncpy zero detection is *not* the driver behind
> the use of ffirst. getting the parallel LDs to exclude misalignments
> (and other faulting) is the key driving factor behind why ffirst is
> used in strncpy.
> those exact same characteristics *also apply to memcpy and memset*.
not really, since we know the length ahead of time (VL, *not* maxvl) and
don't have to try to load 64 bytes only to find out we only should have
read 3 because that's where the null is. ffirst for strcpy makes it so we
can try to load all 64 bytes without causing a sigsegv, which then allows
us to check those bytes for zeros using a later instruction.
memcpy doesn't have that issue since we don't have to speculatively read to
avoid causing a sigsegv (different than page-fault) for memcpy to be
correct -- we just read up to the end and use setvl to stop reading at the
right spot. if it causes a sigsegv, then the scalar version would have also
sigsegv-ed so that's correct.
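
to make the structure of the quoted setvl loop concrete, here is a minimal
sketch of it in plain C. it is only a model under stated assumptions:
setvl_model, MAXVL and memcpy_stripmined are names made up for illustration,
setvl is assumed to return min(count, MAXVL), and the vector ld/st pair is
modelled as byte copies through a temporary buffer.

```c
#include <stddef.h>
#include <string.h>

enum { MAXVL = 64 };  /* assumed maximum vector length, as in "maxvl=64" */

/* Hypothetical model of setvl: VL = min(requested, MAXVL). */
static size_t setvl_model(size_t requested)
{
    return requested < (size_t)MAXVL ? requested : (size_t)MAXVL;
}

/* Strip-mined copy: each pass handles VL bytes, like the quoted loop. */
static void *memcpy_stripmined(void *dest, const void *src, size_t count)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    while (count != 0) {
        unsigned char vec[MAXVL];      /* stands in for the vector register */
        size_t vl = setvl_model(count);

        memcpy(vec, s, vl);            /* models: ld <vec>, elwidth=1 */
        memcpy(d, vec, vl);            /* models: st <vec>, elwidth=1 */

        count -= vl;                   /* sub. r5, r5, r6 */
        d += vl;                       /* add  r3, r3, r6 */
        s += vl;                       /* add  r4, r4, r6 */
    }
    return dest;
}
```

the trip count depends only on count, never on the bytes being copied, which
is the data-invariance described above. the last pass simply gets a shorter
VL, so nothing is read past src + count.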
now, ffirst can help with alignment, but only if a page was swapped-out
(not that common for a lot of code), otherwise it loads the full VL and
doesn't change it.
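
a small C sketch of that behaviour, as a model only. it assumes
fail-on-first works the way this thread describes it: a fault on element 0
is an ordinary trap, a fault on any later element just truncates VL to the
elements before it. PAGE_SIZE, ffirst_vl_model and the single
"unmapped_page" parameter are hypothetical names invented for illustration,
and no real memory is touched.

```c
#include <stddef.h>
#include <stdint.h>

enum { PAGE_SIZE = 4096 };  /* assumed page size */

/*
 * Hypothetical model of a byte-element fail-on-first load.
 * Returns the VL the load would leave behind:
 *   - vl, unchanged, if no element touches the unmapped page;
 *   - i (> 0) if element i is the first one on the unmapped page;
 *   - 0 if element 0 itself faults (a normal trap would be taken instead).
 */
static size_t ffirst_vl_model(uint64_t addr, size_t vl, uint64_t unmapped_page)
{
    for (size_t i = 0; i < vl; i++) {
        if ((addr + i) / PAGE_SIZE == unmapped_page)
            return i;
    }
    return vl;
}
```

with addr = 4096 - 10, vl = 64 and page 1 unmapped, the model returns 10, so
the next load would start at 4096 and be page-aligned. with every page
mapped it returns 64 and the following loads stay exactly as misaligned as
the first one, which is the point made above.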