[Libre-soc-dev] [RFC] Matrix and DCT/FFT SVP64 REMAP

Mon Jul 5 12:00:12 BST 2021

On 7/5/21, Jacob Lifshay <programmerjake at gmail.com> wrote:
> On Sun, Jul 4, 2021, 20:17 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
>> On 7/5/21, Jacob Lifshay <programmerjake at gmai
>>
>> >
>> > that's even worse, since now you need a whole temporary matrix
>>
>> nonsense, it is not "worse", it has been the entire goal of REMAP
>> since its inception.
>>
>
> it's only worse if, for example, you have a 4x4 f32 matrix "A" stored in
> r32-r39 (ignoring that it would actually use the fp regs) and an
> identically shaped matrix "B" stored in r40-r47, and you want to multiply
> A * B producing the result matrix "Y", where the result matrix is stored
> in r32-r39 (or alternatively r40-r47) and no other registers are used as
> temporaries

right, got it, now. thought you were referring to LD/ST spill.

yes, you can't have the result registers overlap with the sources.
well, it's possible, but it would only work on OoO systems with an
implementation-specific amount of in-flight buffers. can't do that, it
would punish lower-resource systems.

one of the downsides of not having explicit Vector registers, oh well.

honestly if people want to overwrite registers, i'm sure there's a way
to do it: just not with a single instruction plus REMAP triple-loop.

there's nothing stopping people from using only 2 REMAP dimensions
instead of the full possible 3.

some register spill can be performed, or only partial loads of either
source etc etc.

also it's possible to do columns of the result or rows of the result,
using 2D plus an explicit outer loop.

treat B as a series of independent Nx1 Vectors that just happen to be
one after the other.

then i think you only need N extra registers.  if B is 4x3, and B
occupies r40-r51, then the result can begin 4 earlier, at r36.
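
The register arithmetic behind that can be sketched in Python. This is
only an illustration under assumptions that are not in the post: one
matrix element per register, B stored column by column with each column
N registers long, and a square NxN left-hand matrix so that each result
column is also N long. The helper name column_slots is hypothetical.

```python
# Hypothetical layout helper: one element per register, columns stored
# back to back (both are assumptions made for this sketch).
def column_slots(base, n_rows, n_cols):
    """Return, per column, the list of register numbers it occupies."""
    return [list(range(base + c * n_rows, base + (c + 1) * n_rows))
            for c in range(n_cols)]

N, COLS = 4, 3          # B is 4x3, as in the example above
B_BASE = 40             # B begins at r40
Y_BASE = B_BASE - N     # result begins N registers earlier, at r36

b_cols = column_slots(B_BASE, N, COLS)
y_cols = column_slots(Y_BASE, N, COLS)

# Result column 0 uses the N extra registers. Every later result column
# lands exactly on the B column that was consumed one step earlier.
for j in range(1, COLS):
    assert y_cols[j] == b_cols[j - 1]

b_regs = {r for col in b_cols for r in col}
y_regs = {r for col in y_cols for r in col}
extra = sorted(y_regs - b_regs)
print("extra registers:", extra)
```

Under these assumptions the only registers used beyond B itself are the
N that sit immediately below it.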

it *might* even be possible to reorder the 3D schedule to fit that
pattern.... have to check and make sure it doesn't result in R-W
Hazard chaining...

hmm..

yes, regs need zeroing. hmm, two instructions, actually four because
of two setvls.  one of those setvls will be NxM for the zeroing. the
other will be NxMxO to create the schedule of FMACs.

   setvl 12       # VL = NxM: zero the result registers
   sv.mv r12, r0
   setvl 12*4     # VL = NxMxO: the schedule of FMACs
   sv.fmac

that's slightly annoying.
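
As a rough model of those two vector lengths, the Python sketch below
flattens the zeroing pass into NxM steps and the multiply-accumulate pass
into NxMxO steps. The shapes (N=4, M=3, O=4, giving 12 and 48) are chosen
to match the setvl values above. The function names and the loop order
are assumptions for illustration only, and say nothing about the order a
real REMAP schedule would use.

```python
from itertools import product

N, M, O = 4, 3, 4  # result is N x M, inner dimension O (assumed shapes)

def zero_schedule(n, m):
    """Flat list of result indices to clear: n*m steps."""
    return list(product(range(n), range(m)))

def fmac_schedule(n, m, o):
    """Flat list of (i, j, k) steps: y[i][j] += a[i][k] * b[k][j]."""
    return list(product(range(n), range(m), range(o)))

def run(a, b):
    y = [[None] * M for _ in range(N)]
    # First pass: the result must be zeroed before anything accumulates.
    for i, j in zero_schedule(N, M):
        y[i][j] = 0
    # Second pass: one fused multiply-accumulate per schedule entry.
    for i, j, k in fmac_schedule(N, M, O):
        y[i][j] += a[i][k] * b[k][j]
    return y

a = [[i + k for k in range(O)] for i in range(N)]      # N x O test data
b = [[k * M + j for j in range(M)] for k in range(O)]  # O x M test data
y = run(a, b)
print(len(zero_schedule(N, M)), len(fmac_schedule(N, M, O)))
```

Skipping the first pass leaves the accumulators undefined, which is why
the zeroing costs its own setvl and instruction.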

ah.  yes, the overwrite would result in FMAC accumulation on top of B.

no, 3D overwrite isn't going to work.

2D plus explicit loop zeroing each overlapped row of B before use
should be fine though.
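
A small Python simulation of that idea, as a sketch under stated
assumptions: one element per register, A is NxN and held in registers
away from B (A_BASE is hypothetical), B is stored column by column from
r40, and the result starts N registers earlier at r36. In this layout the
overlapped slice of B is a column. The outer loop is explicit; the inner
two loops stand in for a 2D schedule.

```python
N, COLS = 4, 3            # A is N x N; B and Y are N x COLS (assumed)
A_BASE, B_BASE = 16, 40   # A_BASE is hypothetical; B_BASE as in the example
Y_BASE = B_BASE - N       # result begins at r36

def matmul_overlapped(a, b):
    regs = [0] * 64
    for i in range(N):                      # load A, row by row
        for k in range(N):
            regs[A_BASE + i * N + k] = a[i][k]
    for j in range(COLS):                   # load B, column by column
        for k in range(N):
            regs[B_BASE + j * N + k] = b[k][j]

    for j in range(COLS):                   # explicit outer loop
        y = Y_BASE + j * N                  # destination for result column j
        bcol = B_BASE + j * N               # source column j of B
        for i in range(N):
            # For j > 0 this clears the stale, already-consumed B column j-1.
            regs[y + i] = 0
        for i in range(N):                  # 2D multiply-accumulate
            for k in range(N):
                regs[y + i] += regs[A_BASE + i * N + k] * regs[bcol + k]

    return [[regs[Y_BASE + j * N + i] for j in range(COLS)]
            for i in range(N)]

a = [[i * N + k + 1 for k in range(N)] for i in range(N)]
b = [[k * COLS + j + 1 for j in range(COLS)] for k in range(N)]
y = matmul_overlapped(a, b)
reference = [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(COLS)]
             for i in range(N)]
print(y == reference)
```

The destination of each step never overlaps the B column being read in
that same step, so no accumulation happens on top of live source data.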

l.