[Libre-soc-dev] twin predication and svp64

Fri Dec 11 20:54:17 GMT 2020

briefly, as i think the same things are being said multiple ways, i
get the mv.x vector thing finally, and the static memcpy.

On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:

> I'm saying we should treat it differently than normal predicates since it's
> known to be 1<<r3.

i agree: just not in a first implementation.  i would like to do a
separate pass at macro-op fusion and other optimisations, and this
would be a good one.

however... with there being so much to do i advocate leaving it for a
separate incremental change once we have a stable base.

>
> I always meant that the augmented FUs would respect dependencies, reading
> from result latches of preceding in-flight ops if necessary, reg file
> otherwise. Perhaps that wasn't sufficiently clear.

appreciated.  i am having difficulty sustaining an 18+ month
architectural map in my head and mixing that with alternative designs.

> that's all well and good for data-dependent things like strcpy, however
> memcpy *isn't* data-dependent so fail-on-first actually is unnecessary for
> it

it is.  the end-of-string is a red herring.  when the sizeof block is
1, 2, 4 there is still the possibility that any given VL=16 (say) may
produce a suite of LDs that crosses a page boundary or hits an end of
memory point.

the page boundary crossover is considered unacceptably expensive, and
the end of memory causes SIMD operations to catastrophically fail when
they shouldn't even have been used.

even for memcpy the 16x LDs @ 2byte may be chopped off by reducing VL
to the point where the page fault doesn't occur.

on the next loop the page fault *does* occur but it occurs on an
entirely new page.

i.e. by using fail-on-first the need to keep 2 pages in memory is
gone, reducing VM working set maximum requirements.

also, the ffirst happens to get VL aligned onto a page boundary, such
that for really large memcpys *all* subsequent memcpy LD/STs will
never hit a page fault.

whereas without ffirst, if you started at a nonaligned batch of LDs,
you remain on a nonaligned batch of LDs and cause page faults
requiring both current and next page to be in memory, restarting the
LD operations every single time.
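
to illustrate the alignment argument above, here is a rough sketch in
python.  it is a toy model only, not SVP64 semantics: it assumes a 4096
byte page, assumes the next page is never resident (so ffirst always
truncates VL at the page boundary), and the names ffirst_vl, batches
and crosses_page are made up for the example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def ffirst_vl(addr, elwidth, max_vl, page_size=PAGE_SIZE):
    """Toy fail-on-first: truncate VL so no element crosses the next page
    boundary. The first element is always attempted, so VL is at least 1."""
    bytes_to_boundary = page_size - (addr % page_size)
    return max(1, min(max_vl, bytes_to_boundary // elwidth))


def batches(start, nbytes, elwidth, max_vl, ffirst):
    """Return the (address, VL) pairs a memcpy loop would issue."""
    assert nbytes % elwidth == 0, "sketch assumes a whole number of elements"
    addr, end, out = start, start + nbytes, []
    while addr < end:
        vl = min(max_vl, (end - addr) // elwidth)
        if ffirst:
            vl = ffirst_vl(addr, elwidth, vl)
        out.append((addr, vl))
        addr += vl * elwidth
    return out


def crosses_page(addr, vl, elwidth, page_size=PAGE_SIZE):
    """True if one batch of LDs touches two different pages."""
    return addr // page_size != (addr + vl * elwidth - 1) // page_size


# a memcpy of three pages, starting 20 bytes before a page boundary,
# with VL=16 and 2-byte elements (the "16x LDs @ 2byte" case)
start, nbytes, elwidth, max_vl = PAGE_SIZE - 20, 3 * PAGE_SIZE, 2, 16
plain = batches(start, nbytes, elwidth, max_vl, ffirst=False)
ff = batches(start, nbytes, elwidth, max_vl, ffirst=True)

plain_crossings = sum(crosses_page(a, vl, elwidth) for a, vl in plain)
ff_crossings = sum(crosses_page(a, vl, elwidth) for a, vl in ff)
print("without ffirst, batches spanning two pages:", plain_crossings)
print("with ffirst, batches spanning two pages:", ff_crossings)
print("first two ffirst batches:", ff[:2])
```

in this model the first ffirst batch is cut short at the page boundary
and every later batch starts page-aligned, whereas the plain loop keeps
its original misalignment and straddles every page boundary it meets.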

this is extremely costly, enough that ARM and RVV and i think one
other Vector ISA decided fail-on-first was important to include.

l.