[Libre-soc-dev] [RFC] REMAP, SHAPE, complex (full) FFT

Mon Jul 19 12:29:15 BST 2021

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Mon, Jul 19, 2021 at 10:20 AM Hendrik Boom <hendrik at topoi.pooq.com> wrote:

> I'm not sure how helpful I can be, but I'd like to have a look at the
> problem.

that'd be fantastic.

> Could you show me the versions of the recursive and buttefly code you
> are trying to iterativize?

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/fastdctlee.py;hb=HEAD

and associated test program:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/fastdct-test.py;hb=HEAD

these now work.  very frustrating to have worked out, but they're working.
iterative DCT, but *not* in-place.... yet.

now... this diagram could be wrong, but notice the inputs are in
order 0 1 2 3.... 7 but the output is in bit-reversed order 0 4 2 6 1 5 3 7?
https://ftp.libre-soc.org/plonka_dct1.png

now notice how the outer butterly does **NOT** have to have bit-reversing
done after it?

that's because i *PRE-LOADED* the data in bit-reversed order...
*and then created an index lookup table to invert that, in-place*
during the inner butterfly.

however... sigh... the original code does this:

0 MUL 7 --> +/- 0 4
1 MUL 6 -> +/- 1 5
2 MUL 5 -> +/- 2 6
3 MUL 4 -> +/- 3 7

which of course is not in-place, at all.  the first half is in-place, but
the second half is reverse-number order:
* 7 goes in position 4
* 6 in position 5
* 5 in position 6
* 4 in position 7

the next trick to investigate, therefore, is to *pre-load* the data with
a *pre-reversed* top half, such that the MULs can be done as:

0 MUL 7 --> +/- 0 7
1 MUL 6 -> +/- 1 6
2 MUL 5 -> +/- 2 5
3 MUL 4 -> +/- 3 4

> And any iterative versions you've been able to derive from them?
> If I remember correctly, the essence of FFT is recursive decomposition,
> and making it iterative is likely tricky.

it's a standard computer science trick from Functional Programming
that any recursive algorithm may be proven to be the same as its
iterative variant.

FFT has already been done as an iterative variant: solved many years
ago and even in-place, the trick being there to *bit-reverse* the indices
of the data being loaded *while* loading it into memory.  algorithm can
be found here:

   https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;hb=HEAD

where the original is here:

   https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py

l.