[Libre-soc-dev] [RFC] Matrix and DCT/FFT SVP64 REMAP

Mon Jul 5 08:23:09 BST 2021

On Sun, Jul 4, 2021, 20:17 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On 7/5/21, Jacob Lifshay <programmerjake at gmai
>
> >
> > that's even worse, since now you need a whole temporary matrix
>
> nonsense, it is not "worse", it has been the entire goal of REMAP
> since its inception.
>

it's only worse in the in-place case. for example: take a 4x4 f32 matrix
"A" stored in r32-r39 (ignoring that it would actually use the fp regs)
and an identically shaped matrix "B" stored in r40-r47, and multiply
A * B to produce the result matrix "Y", where:
- Y is stored in r32-r39 (or alternatively r40-r47),
- no other registers are used as temporaries (other than the standard SV
  SPRs), and
- the semantics must be as if each element operation is done serially (no
  relying on OoO execution to create more temporaries).
in that case the operation resulting from a REMAPed fmuladd fails to be a
matrix-multiply:
pseudocode for Y same as A:
for z in range(4):
    for y in range(4):
        for x in range(4):
            A[y][x] += A[y][z] * B[z][x]
pseudocode for Y same as B:
for z in range(4):
    for y in range(4):
        for x in range(4):
            B[y][x] += A[y][z] * B[z][x]
both of the above loops completely lose the original value of one of the
input matrices by the time the z == 1 iteration starts, since every index
[y][x] is written during the z == 0 iteration.
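
a minimal sketch of the two aliased loops in plain Python, to make the
failure concrete. it models the matrices as nested lists rather than
registers, and the helper names (fma_loop, matmul_ref) are made up for
illustration, not anything from SVP64. B is chosen as the identity, so the
correct product is A itself and any corruption is easy to spot:

```python
import copy

def matmul_ref(a, b):
    """Reference product computed into a fresh matrix (no aliasing)."""
    n = len(a)
    return [[sum(a[y][z] * b[z][x] for z in range(n)) for x in range(n)]
            for y in range(n)]

def fma_loop(dst, a, b):
    """The z/y/x multiply-accumulate loop from the pseudocode above.

    dst may be the same object as a or b, which models Y sharing
    registers with one of the inputs.
    """
    n = len(a)
    for z in range(n):
        for y in range(n):
            for x in range(n):
                dst[y][x] += a[y][z] * b[z][x]
    return dst

# Illustrative inputs: integers so that comparisons are exact.
A = [[y * 4 + x + 1 for x in range(4)] for y in range(4)]
B = [[1 if x == y else 0 for x in range(4)] for y in range(4)]  # identity
expected = matmul_ref(A, B)  # equals A, since B is the identity

# Y same as A: destination and first input are the same object.
a_alias = copy.deepcopy(A)
y_as_a = fma_loop(a_alias, a_alias, copy.deepcopy(B))

# Y same as B: destination and second input are the same object.
b_alias = copy.deepcopy(B)
y_as_b = fma_loop(b_alias, copy.deepcopy(A), b_alias)

print(y_as_a == expected)  # False
print(y_as_b == expected)  # False
```

neither aliased run reproduces A * B: the accumulator starts out holding an
input instead of zero, and that input is then overwritten while it is still
needed.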

it works great if Y doesn't overlap A or B (aka. not in-place) (and Y is
zeroed beforehand), which is my whole point about in-place mat mul not
working:
for z in range(4):
    for y in range(4):
        for x in range(4):
            Y[y][x] += A[y][z] * B[z][x]
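
for comparison, a small Python sketch of the non-overlapping case, again
using nested lists as a stand-in for registers and hypothetical helper
names. it checks that the same z/y/x loop gives the correct product when Y
is separate and zeroed, that A and B are left untouched, and that skipping
the zeroing leaves the old contents of Y added into the result:

```python
import copy

def matmul_ref(a, b):
    """Reference product computed row-by-column into a fresh matrix."""
    n = len(a)
    return [[sum(a[y][z] * b[z][x] for z in range(n)) for x in range(n)]
            for y in range(n)]

def fma_loop(dst, a, b):
    """The z/y/x multiply-accumulate loop, accumulating into dst."""
    n = len(a)
    for z in range(n):
        for y in range(n):
            for x in range(n):
                dst[y][x] += a[y][z] * b[z][x]

# Illustrative integer inputs so that comparisons are exact.
A = [[y * 4 + x + 1 for x in range(4)] for y in range(4)]
B = [[(y + 2 * x) % 5 for x in range(4)] for y in range(4)]
A_before, B_before = copy.deepcopy(A), copy.deepcopy(B)

# Y does not overlap A or B and is zeroed beforehand.
Y = [[0] * 4 for _ in range(4)]
fma_loop(Y, A, B)
zeroed_ok = (Y == matmul_ref(A, B))
inputs_intact = (A == A_before and B == B_before)

# Y does not overlap A or B but was NOT zeroed: old contents leak in.
Y_dirty = [[1] * 4 for _ in range(4)]
fma_loop(Y_dirty, A, B)
dirty_is_offset = all(Y_dirty[y][x] == Y[y][x] + 1
                      for y in range(4) for x in range(4))

print(zeroed_ok, inputs_intact, dirty_is_offset)  # True True True
```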

None of the above has anything to do with load/store, just architectural
cpu registers.

Jacob