[Libre-soc-dev] FFT, DCT, REMAP

Mon Jun 21 15:50:23 BST 2021

the paper was fascinating: a compiler for parallel algorithms, looks like IR.

sigh why do papers leave out critical context, pepijn was particularly
funny here
http://pepijndevos.nl/2018/07/04/loefflers-discrete-cosine-transform-algorithm-in-futhark.html

the listing 3, what is the definition of the dimensions? i can't
understand it sigh.  luckily that fft.py is blindingly obvious.

being able to claim and state, boldly, "we do power of two FFT kernels
with one instruction", i cannot emphasise enough how much of a big
damn deal that is.  being able to do it "in-place" even more so.
maximising the size of "Load, process, Store" batches is critical.

i also found this which explains *another* group of crucial scientific
algorithms which need bitreverse

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429

three crucial steps are needed for this to work

1) the Load must be capable of bitreversed indices.

2) Coeff(k) from fft.py must be possible to put on a different REMAP
schedule from the mul/adds

3) twin mul/adds must not corrupt each other no matter the internal
architecture.

going through each of these:

LOADS
=====

ordinarily i would suggest using vectorised RA RB Load but the
calculation itself of the indices is quite inefficient... and also
requires a Vector RB.

sv.iota (an alias for "sv.addi/mr r1.v, r2.v, 1" when elwidth=default)
seems like a good candidate for adding bitreverse as an option, but
again, even if it were added, some shifts are needed *and* it still is
the RB of LD which takes up an entire Vector for the offsets.

therefore, bit-reverse option to LDST it is, and luckily there is a
free mode in SVP64 RM MODE field.

therefore the bitreverse of 0..VL-1 is straightforward to add,
important thing being it is at the right bitwidth (log2(VL)).

REMAP Schedules
===============

there are four REMAP Shapes, each of them independent and all sourced
from the same "stepping".  they therefore all remain in lockstep,
however k goes like this:

# Radix-2 decimation-in-time FFT
  size = 2
  while size <= n:
    halfsize = size // 2
    tablestep = n // size
    for i in range(0, n, size):
      k = 0
      for j in range(i, i + halfsize):
          k += tablestep
  size *= 2

whilst i and j are sequentially incrementing.  hmm, k is actually
slaved to j, isn't it?  as in: k could be expressed as:

   k = (j-i) * tablestep

hmm, this *should* be possible to specify with the "permute" field of
REMAP Shape.  with Shape0 being used for srces, Shape1 for dest,
Shape2 with a different permute option can be used for k

have to work out the details but i think it's good.

Data Corruption
===============

the simplest way to think of this is that each alterating MULADD has
an extra immediate, -1 1 -1 1 ...

then it becomes a matter of optimisation on a per microarchitectural basis.

* out-of-order: dead simple.  set the read/write hazards on the
*pair*.  neither can write back in-place until *both* have read.

* in-order single-issue: as long as the pair are done in time such
that the results are in in-flight pipelines it's fine

* in-order multi-issue, also fine.

* FSM, not fine! ironically TestIssuer is screwed.  it would be
necessary to add some sort of state capture or better a special
pipeline with multiple sets of input operands and double outputs.
whoops.

further optimisations involve micro-ops just like for the FSM case,
double-result production basically.

Conclusion
==========

it's doable.  there are some niggles, but it's doable.  given that FFT
is one of the world's top algorithms, and so is DCT and others like
it, it's worth doing.

i'd also like to see if it can also be used directly for in-place DCT
as well: there are indications in that nih paper that the top half of
the input may be bit-reversed, and also that multiple applications of
bitreverse result in a full butterfly.

we should also explore that as well because DCT is crucial to Video
and Audio processing.

l.