[Libre-soc-dev] [RFC] Matrix and DCT/FFT SVP64 REMAP

Fri Jul 2 01:03:06 BST 2021

i need some help with this one, i am not coping with the permutations
or the number of options.

http://libre-soc.org/openpower/sv/remap

the basic principle behind REMAP is that normally a Vector loop (a MAC
example) is this:

for i in 0..VL-1:
   GPR(RT+i) +=
       GPR(RA+i) *
       GPR(RB+i)

the insight is that if the registers are considered to be 1D linear
"remaps" of 2D Matrices then the following may be done instead:

for i in 0..VL-1:
  GPR(RT+remap0(i)) +=
   GPR(RA+remap1(i)) *
   GPR(RB+remap2(i))

and the functions may be arranged so as to produce the correct array
sequences for full *in-place* Matrix Multiply *of arbitrary size* (not
limited to power of two).

by using a *standard* Multi-Issue Superscalar execution engine, the
Issue Phase may spam out as many MACS/FMACS as the back-end hardware
can handle, and standard R/W Hazard Tracking takes care of the problem
of potentially overlapping registers.

as long as the same result registers are not issued simultaneously
then there is no stalling, either.

this is easy to arrange and has been coded up in the following demo:

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/remapmatrix.py;hb=HEAD

three python generators, one for result, one for matrix X, one for
matrix Y, create the *appearance* of there being three nested
for-loops (of arbitrary non-power-of-two dinensions)

there are a number of issues:

1) i am not a mathematician, i am an engineer: i know what the right
answer should be and have gone through all possible permutations for
the algorithm until it produced the right answer.  therefore it needs
checking.

2) there is very little bit-space in the SPRs (SVSHAPE0-3) and DCT/FFT
Tukey-Cooley schedules have yet to be fitted in as well:

# Radix-2 decimation-in-time FFT
size = 2
while size <= n:
	halfsize = size // 2
	tablestep = n // size
	for i in range(0, n, size):
		k = 0
		for j in range(i, i + halfsize):
			temp = vec[j + halfsize] * exptable[k]
			vec[j + halfsize] = vec[j] - temp
			vec[j] += temp
			k += tablestep
	size *= 2

there are (again, like Matrices) three loops, and there are (again)
three indices.

i would very much like to see these merged into the same SPR however
there are very few bits available to do so.

the alternative is to start adding extra opcodes, one for Matrix
Multiply "schedule" establishment and another completely separate
opcode that establishes an FFT schedule.  and probably one for DCT as
well.

thoughts appreciated.

l.