[Libre-soc-dev] [RFC] Matrix and DCT/FFT SVP64 REMAP

Mon Jul 5 12:41:31 BST 2021

On Mon, Jul 05, 2021 at 02:09:55AM +0100, Luke Kenneth Casson Leighton wrote:
> On Mon, Jul 5, 2021 at 12:59 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> >
> > On Sun, Jul 4, 2021, 16:18 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >
> > > On 7/4/21, Cesar Strauss <cestrauss at gmail.com> wrote:
> > >
> > > > More precisely:
> > > >
> > > > for y in y_r:
> > > >  for x in x_r:
> > > >    for z in z_r:
> > > >      result[y][x] +=
> > > >         a[y][z] *
> > > >         b[z][x]
> > >
> > > ah thank you.
> > >
> > > > I don't think there can be such a thing as an "in-place" algorithm for
> > > > matrix multiplication.
> > >
> > > indeed there can :)
> >
> >
> > > by loading the entirety of A and B into registers, and assuming A B
> > > and result have been flattened to 1D, then using a "REMAP" schedule,
> > > which calculates sequentially the 3 offsets, FMAC can be scheduled
> > > with the required sequence...
> > >
> >
> > that only works if the destination isn't any of the source registers,
> 
> yes.  and that can be arranged by reordering the loops.  instead of:
>  for y in y_r:
>   for x in x_r:
>     for z in z_r:
>       result[y][x] +=
>          a[y][z] *
>          b[z][x]
> 
> it can be done:
> 
>   for z in z_r:
>    for y in y_r:
>     for x in x_r:
>       result[y][x] +=
>          a[y][z] *
>          b[z][x]

And this ordering of loops *does* work for transitive closure, when a, 
b, and result are the very same matrix, updated while being used.

By the way, I believe there is a graph algorithm that does the
transitive closure thing, but instead of using boolean, "and", and "or",
they use real numbers, addition, and minimum.  I think that one computes 
shortest paths between vertices.

By the time the z'th iteration of the z loop begins, the algorithm has 
already peocessed paths that go through vertices numbered < z, and it 
adds paths that go through vertices numbered z.

For this to work, the outer loop has to be the one on teh subscript that 
bridges a and b (which in this case are teh same matrix, of course).

-- hendrik

> 
> what's good about that is that B can be done as sequential elements,
> and the results likewise.  this results in sequentlal allocations which
> SIMD back-ends can pick up.
> 
> z remains static for prolonged periods, as well.  so it's a SIMD vector
> times a scalar constant, which reduces the number of reads.
> 
> it's perfectly possible and pretty well-suited to the back-end engine.
> and yes, entirely in-place.
> 
> l.
> 
> _______________________________________________
> Libre-soc-dev mailing list
> Libre-soc-dev at lists.libre-soc.org
> http://lists.libre-soc.org/mailman/listinfo/libre-soc-dev