[Libre-soc-dev] [RFC] REMAP, SHAPE, complex (full) FFT

Sun Jul 11 12:02:53 BST 2021

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Sun, Jul 11, 2021 at 6:44 AM Lauri Kasanen <cand at gmx.com> wrote:
>
> Hi Luke,
>
> Just a warning, what you're showing here is dangerously close to what
> you described earlier as "10 hours per line of asm". The complexity is
> reaching high heaven.

if it wasn't for this massive list of applications of DCT alone, i would have
stopped this exercise weeks ago:

    https://en.wikipedia.org/wiki/Discrete_cosine_transform#Applications

* 10 speech encoding algorithms
* 14 general audio encoding algorithms
* 13 video encoding algorithms
* 6 image compression standards

for RADIX-2 DFT and DCT it's only ever going to be around 7 lines of
assembler (with fixed tables).
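a rough software analogue of the "fixed tables" point, assuming a plain iterative radix-2 decimation-in-time FFT: the twiddle factors and the bit-reversal permutation are computed once per size, and the triple-nested index loop is pulled out into a generator so the inner body is a single butterfly. the names `bit_reverse_table`, `butterfly_schedule` and `fft_radix2` are hypothetical, and this is only an analogy for what REMAP does in hardware, not the SV REMAP specification.

```python
import cmath


def bit_reverse_table(n):
    """Fixed permutation table: entry i is i with its log2(n) bits reversed."""
    bits = n.bit_length() - 1
    table = []
    for i in range(n):
        r = 0
        for b in range(bits):
            if (i >> b) & 1:
                r |= 1 << (bits - 1 - b)
        table.append(r)
    return table


def butterfly_schedule(n):
    """Yield (lo, hi, twiddle_index) for every radix-2 DIT butterfly, stage by stage.

    This triple-nested loop is the part a hardware index generator would replace.
    """
    size = 2
    while size <= n:
        half = size // 2
        stride = n // size  # twiddle-table step for this stage
        for start in range(0, n, size):
            for k in range(half):
                yield start + k, start + k + half, k * stride
        size *= 2


def fft_radix2(x):
    """Radix-2 DIT FFT driven by fixed tables and a precomputed schedule."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    # Fixed tables, computed once per transform size.
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    perm = bit_reverse_table(n)
    data = [x[p] for p in perm]
    # The entire "inner loop": one butterfly per scheduled index triple.
    for lo, hi, t in butterfly_schedule(n):
        a, b = data[lo], data[hi] * twiddle[t]
        data[lo], data[hi] = a + b, a - b
    return data
```

`fft_radix2([1, 0, 0, 0])` gives four values of (approximately) 1, the transform of an impulse. for N = 1024 the schedule yields (N/2)·log2 N = 5120 butterflies while the loop body stays one statement.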

for FFT:
* i expect it to be around 25 lines of assembler that would normally
(commercially) go in an app note (see the TI ref below),
* that app note only needs writing *once*,
* the loop and index complexity is hidden in the hardware,
* REMAP has been planned for 2 years,
* and it's exactly what high-end DSPs do:

* Qualcomm's Hexagon DSP (used in signal processing)
    https://en.wikipedia.org/wiki/Qualcomm_Hexagon#Code_sample
    the inner loop is 1 VLIW instruction.

* TMS320C80 DSP, DIF FFT
    https://www.ti.com/lit/an/spra152/spra152.pdf
    equations 17 and 18 can be done in 1 clock cycle,
    and Zero-Overhead Loops keep the inner loop 100% pipelined.

* ZOLC hardware looping, Nikolaus Kavvadias
    https://opencores.org/websvn/filedetails?repname=hwlu&path=%2Fhwlu%2Ftrunk%2Fdoc%2Fhwlu_spec.pdf
    this was commercially implemented in hardware
    by ST Microelectronics.
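one way to see why zero-overhead loops suit FFT: every radix-2 stage has exactly N/2 butterflies, so the whole transform is a rectangular 2-level loop nest (log2 N stages by N/2 butterflies), which is the shape a hardware loop unit can count without any compare-and-branch in software. a sketch, assuming radix-2 DIT with bit-reversed input; `zolc_fft_indices` is a hypothetical name, `itertools.product` stands in for the hardware counters, and the divmod/shift arithmetic stands in for index logic. none of it is the hwlu or SV interface.

```python
from itertools import product


def zolc_fft_indices(n):
    """Radix-2 DIT butterfly indices from a rectangular (stage, butterfly) nest.

    Hypothetical model: product() plays the role of a two-level hardware loop
    unit stepping its counters; the arithmetic below plays the role of the
    index-generation logic. Assumes the data was already bit-reversed.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    stages = n.bit_length() - 1
    out = []
    # Outer counter: stage (log2 n of them). Inner counter: butterfly (always n/2).
    for stage, b in product(range(stages), range(n // 2)):
        half = 1 << stage
        size = half << 1
        group, k = divmod(b, half)
        lo = group * size + k
        out.append((lo, lo + half, k * (n // size)))  # (lo, hi, twiddle index)
    return out
```

`zolc_fft_indices(4)` returns `[(0, 1, 0), (2, 3, 0), (0, 2, 0), (1, 3, 1)]`. because the loop bounds never change, only the index mapping varies per stage, which is the part that has to live in dedicated hardware.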

> It may still be worth doing,

that's the point of the exercise: to see if it is indeed worth it.
i'm about 30% of the way through things at the moment, so i haven't
got enough pieces yet to spot "patterns" that could bring down code-size.

> not saying that, but it will highly limit
> who can (and will) write such a fft for SV. fftw is a popular lib yes,
> but each of the specialized math libs will have their own.

there's nothing stopping people from *not* using these
optimisations: the standard Power ISA code-paths will not be
"punished" by the addition of REMAP and the twin +/- MADD and ADD
operations.
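for reference, the arithmetic a twin +/- operation pair targets is the textbook radix-2 DIT butterfly $a' = a + w\,b$, $b' = a - w\,b$ with twiddle $w$. writing $t = w\,b$ and splitting into real ($r$) and imaginary ($i$) parts shows that each output pair shares one product term and differs only in the sign of the final add. this is the standard butterfly, not the exact operand layout of the proposed SV instructions.

```latex
\begin{aligned}
t_r  &= w_r b_r - w_i b_i, & t_i  &= w_r b_i + w_i b_r,\\
a'_r &= a_r + t_r,         & a'_i &= a_i + t_i,\\
b'_r &= a_r - t_r,         & b'_i &= a_i - t_i.
\end{aligned}
```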

they'll just end up with an "industry-standard-normal" amount
of hard-coded assembler (macros that get unrolled to hundreds
of instructions).

bottom line is, the bang-per-buck ratio for this is so high, its
applications so important in computer science, *and* it's such a "general"
abstraction, that i feel it's worth continuing.

remember: the abstractions (app notes) only need be written once.

l.