[Libre-soc-dev] [RFC] Matrix and DCT/FFT SVP64 REMAP

Tue Jul 6 09:53:45 BST 2021

On Tue, Jul 6, 2021 at 3:42 AM Richard Wilbur <richard.wilbur at gmail.com> wrote:

> I suppose this is where having the semantics to code a dedicated “0”
> source in the register specification could be useful to allow the dispatcher
> to tell the vector unit to send 0’s for a particular operand.
> (Avoiding the explicit initialization of C.)

the general rule for SVP64 (which has just been obliterated, out
of necessity, by a special DCT/FFT butterfly instruction) is:

     no new opcodes.

or more strictly:

   *definitely* no new opcodes that involve "re-interpretation"
   of base (scalar) 32-bit v3.0B ones.

that rule was broken for the very first time with a bit-reverse
LD (to add an RC shift field, partly-embedded into the LD
immediate) and i can tell you for free it will be a royal nuisance.
the mess it's made of PowerDecoder2 is... gaah.

however in this particular case (C=0) modifying the 4-operand
operations, or adding a new one, is so expensive that the
"justification" cost is almost 100% likely to be too high.

fnmadd etc. sit within an A-Form, maddhd etc. sit within
a VA-Form. these are extremely expensive in terms of opcode
space, to the point where maddhd etc. don't even have Rc=1
variants.

having an explicit zeroing series of instructions to initialise
the matrix to zero *when it is needed* is perfectly acceptable.
i mean, we're talking 5 instructions, now, for an arbitrary-sized
runtime-selectable matrix multiply (up to 64 FMACs).
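
for reference, a minimal scalar C sketch of the "zero first, then
accumulate" pattern described above. this is not the SVP64 code and
not the code behind any instruction count quoted in this mail: the
names matrix_zero, matrix_fmac and MAX_FMACS are hypothetical, and the
64 limit is simply taken from the "up to 64 FMACs" figure.

```c
#include <assert.h>
#include <stddef.h>

/* assumption: limit taken from the "up to 64 FMACs" figure above */
#define MAX_FMACS 64

/* step 1: explicitly zero the destination matrix (row-major, rows x cols) */
void matrix_zero(float *c, size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows * cols; i++)
        c[i] = 0.0f;
}

/* step 2: accumulate c += a * b with one multiply-accumulate per (i, j, k).
 * a is rows x inner, b is inner x cols, c is rows x cols, all row-major.
 * the sizes are chosen at runtime; returns the number of FMACs performed. */
size_t matrix_fmac(float *c, const float *a, const float *b,
                   size_t rows, size_t inner, size_t cols)
{
    size_t count = 0;

    assert(rows * inner * cols <= MAX_FMACS);

    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            for (size_t k = 0; k < inner; k++) {
                c[i * cols + j] += a[i * inner + k] * b[k * cols + j];
                count++;
            }
        }
    }
    return count;
}
```

calling matrix_zero(c, 4, 4) followed by matrix_fmac(c, a, b, 4, 4, 4)
performs 4*4*4 = 64 multiply-accumulates. the zeroing is a separate
step that is only paid for when the destination actually needs
clearing, which is the point being made above.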

in scalar form - fixed-size 4x4 matrices - it comes out at
80 instructions.  enable -O3 and it jumps to a whopping 340
(full explicit unrolling of all 3 nested loops occurs)

      https://godbolt.org/z/1GrqfMMzf

given that we're looking at greater than an 8x reduction in
code size against the non-optimised version and a stunning 60x
reduction against the compiler-optimised version i don't believe
it's worthwhile pursuing further optimisations for zeroing.

l.