[Libre-soc-dev] [RFC] Matrix and DCT/FFT SVP64 REMAP
Hendrik Boom
hendrik at topoi.pooq.com
Mon Jul 5 12:41:31 BST 2021
On Mon, Jul 05, 2021 at 02:09:55AM +0100, Luke Kenneth Casson Leighton wrote:
> On Mon, Jul 5, 2021 at 12:59 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> >
> > On Sun, Jul 4, 2021, 16:18 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >
> > > On 7/4/21, Cesar Strauss <cestrauss at gmail.com> wrote:
> > >
> > > > More precisely:
> > > >
> > > > for y in y_r:
> > > > for x in x_r:
> > > > for z in z_r:
> > > > result[y][x] +=
> > > > a[y][z] *
> > > > b[z][x]
> > >
> > > ah thank you.
> > >
> > > > I don't think there can be such a thing as an "in-place" algorithm for
> > > > matrix multiplication.
> > >
> > > indeed there can :)
> >
> >
> > > by loading the entirety of A and B into registers, and assuming A B
> > > and result have been flattened to 1D, then using a "REMAP" schedule,
> > > which calculates sequentially the 3 offsets, FMAC can be scheduled
> > > with the required sequence...
> > >
> >
> > that only works if the destination isn't any of the source registers,
>
> yes. and that can be arranged by reordering the loops. instead of:
> for y in y_r:
> for x in x_r:
> for z in z_r:
> result[y][x] +=
> a[y][z] *
> b[z][x]
>
> it can be done:
>
> for z in z_r:
> for y in y_r:
> for x in x_r:
> result[y][x] +=
> a[y][z] *
> b[z][x]
And this ordering of loops *does* work for transitive closure, when a,
b, and result are the very same matrix, updated while being used.
By the way, I believe there is a graph algorithm that does the
transitive closure thing, but instead of using boolean, "and", and "or",
they use real numbers, addition, and minimum. I think that one computes
shortest paths between vertices.
By the time the z'th iteration of the z loop begins, the algorithm has
already peocessed paths that go through vertices numbered < z, and it
adds paths that go through vertices numbered z.
For this to work, the outer loop has to be the one on teh subscript that
bridges a and b (which in this case are teh same matrix, of course).
-- hendrik
>
> what's good about that is that B can be done as sequential elements,
> and the results likewise. this results in sequentlal allocations which
> SIMD back-ends can pick up.
>
> z remains static for prolonged periods, as well. so it's a SIMD vector
> times a scalar constant, which reduces the number of reads.
>
> it's perfectly possible and pretty well-suited to the back-end engine.
> and yes, entirely in-place.
>
> l.
>
> _______________________________________________
> Libre-soc-dev mailing list
> Libre-soc-dev at lists.libre-soc.org
> http://lists.libre-soc.org/mailman/listinfo/libre-soc-dev
More information about the Libre-soc-dev
mailing list