[Libre-soc-dev] [RFC] Matrix and DCT/FFT SVP64 REMAP
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Mon Jul 5 02:09:55 BST 2021
On Mon, Jul 5, 2021 at 12:59 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Sun, Jul 4, 2021, 16:18 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > On 7/4/21, Cesar Strauss <cestrauss at gmail.com> wrote:
> >
> > > More precisely:
> > >
> > > for y in y_r:
> > > for x in x_r:
> > > for z in z_r:
> > > result[y][x] +=
> > > a[y][z] *
> > > b[z][x]
> >
> > ah thank you.
> >
> > > I don't think there can be such a thing as an "in-place" algorithm for
> > > matrix multiplication.
> >
> > indeed there can :)
>
>
> > by loading the entirety of A and B into registers, and assuming A B
> > and result have been flattened to 1D, then using a "REMAP" schedule,
> > which calculates sequentially the 3 offsets, FMAC can be scheduled
> > with the required sequence...
> >
>
> that only works if the destination isn't any of the source registers,
yes. and that can be arranged by reordering the loops. instead of:
for y in y_r:
for x in x_r:
for z in z_r:
result[y][x] +=
a[y][z] *
b[z][x]
it can be done:
for z in z_r:
for y in y_r:
for x in x_r:
result[y][x] +=
a[y][z] *
b[z][x]
what's good about that is that B can be done as sequential elements,
and the results likewise. this results in sequentlal allocations which
SIMD back-ends can pick up.
z remains static for prolonged periods, as well. so it's a SIMD vector
times a scalar constant, which reduces the number of reads.
it's perfectly possible and pretty well-suited to the back-end engine.
and yes, entirely in-place.
l.
More information about the Libre-soc-dev
mailing list