[Libre-soc-dev] [RFC] REMAP, SHAPE, complex (full) FFT

Luke Kenneth Casson Leighton lkcl at lkcl.net
Mon Jul 19 08:50:56 BST 2021

crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Mon, Jul 19, 2021 at 6:44 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> On Sun, Jul 18, 2021 at 2:58 PM Luke Kenneth Casson Leighton
> <lkcl at lkcl.net> wrote:
> >
> > i've moved on to DCT now, not the version that uses FFT followed by
> > extraction of real coefficients.
> >
> > most DCT algorithms unfortunately are either expressed inline as a
> > massive batch of unintelligible hand-created MULADDs, or as assembler,
> > or as recursive algorithms.
> >
> > i found *one* iterative butterfly implementation in c++ and it
> > completely failed to work.
> >
> > currently struggling with converting a recursive implementation to iterative.
> >
> > i could really do with some assistance.
> I wasn't able to find any really good examples, but I did find some
> slightly-bitrotted code in the AV1 reference library:
> https://aomedia.googlesource.com/aom/+/refs/heads/master/tools/txfm_analyzer/txfm_gen_code.cc

it looks like it's generating inline (loop-unrolled) c-code, where what i need -
in order to be able to replace the nested loops with a REMAP Schedule -
is *actual* nested loops, like this:

    size = n
    while size >= 2:
        halfsize = size // 2
        tablestep = n // size
        ir =
        for i in list(range(0, n, size)):
            k = 0
            j = list(range(i, i + halfsize))
            jr = list(range(i+halfsize, i + size))
            vec2 = deepcopy(vec)
            for ci, (jl, jh) in enumerate(zip(j, jr)):
                t1, t2 = vec[jl], vec[jh]
                coeff = (math.cos((ci + 0.5) * math.pi / size) * 2.0)
                vec2[jl] = t1 + t2
                vec2[jl+halfsize] = (t1 - t2) * (1/coeff)
                k += tablestep
            vec = vec2

that code actually works, and does the job.  it's not "in-place" (yet),
i will be able to sort that out later.

however this is just the "inner" butterfly: there is a second
triple-loop afterwards called the "outer" loop.

> Apply the below patch to switch it to generate generic C instead of
> SSE intrinsics,

unfortunately it's not useful.  if it generated actual loops - actual
for-loops - rather than inline c code - it would be useful.

by generating inline code it's one of many dozens of examples
of what's deeply significantly wrong with how DCT and FFT are
being approached (because programmers don't have any other

trying to understand what on earth such code is doing is much
easier in python than it is in any other language.

currently i have some code which generates a pretty-printed
"schedule" of the actual adds that need to be performed,
from the recursive variant of the outer butterfly:

 ('add', 4, ('add', 6, 7)),
 ('add', 2, 3),
 ('add', ('add', 6, 7), 5),
 ('add', 5, 7),

which has allowed me to write it out as a 2D graph.


More information about the Libre-soc-dev mailing list