# [Libre-soc-dev] [RFC] Matrix and DCT/FFT SVP64 REMAP

Luke Kenneth Casson Leighton lkcl at lkcl.net
Mon Jul 5 02:09:55 BST 2021

```On Mon, Jul 5, 2021 at 12:59 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Sun, Jul 4, 2021, 16:18 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > On 7/4/21, Cesar Strauss <cestrauss at gmail.com> wrote:
> >
> > > More precisely:
> > >
> > > for y in y_r:
> > >  for x in x_r:
> > >    for z in z_r:
> > >      result[y][x] +=
> > >         a[y][z] *
> > >         b[z][x]
> >
> > ah thank you.
> >
> > > I don't think there can be such a thing as an "in-place" algorithm for
> > > matrix multiplication.
> >
> > indeed there can :)
>
>
> > by loading the entirety of A and B into registers, and assuming A B
> > and result have been flattened to 1D, then using a "REMAP" schedule,
> > which calculates sequentially the 3 offsets, FMAC can be scheduled
> > with the required sequence...
> >
>
> that only works if the destination isn't any of the source registers,

yes.  and that can be arranged by reordering the loops.  instead of:
for y in y_r:
for x in x_r:
for z in z_r:
result[y][x] +=
a[y][z] *
b[z][x]

it can be done:

for z in z_r:
for y in y_r:
for x in x_r:
result[y][x] +=
a[y][z] *
b[z][x]

what's good about that is that B can be done as sequential elements,
and the results likewise.  this results in sequentlal allocations which
SIMD back-ends can pick up.

z remains static for prolonged periods, as well.  so it's a SIMD vector
times a scalar constant, which reduces the number of reads.

it's perfectly possible and pretty well-suited to the back-end engine.
and yes, entirely in-place.

l.

```