[Libre-soc-dev] FFT, DCT, REMAP

Mon Jun 21 18:42:51 BST 2021

On Mon, Jun 21, 2021, 07:51 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> the paper was fascinating: a compiler for parallel algorithms, looks like
> IR.
>
> sigh why do papers leave out critical context, pepijn was particularly
> funny here
>
> http://pepijndevos.nl/2018/07/04/loefflers-discrete-cosine-transform-algorithm-in-futhark.html
>
> the listing 3, what is the definition of the dimensions? i can't
> understand it sigh.  luckily that fft.py is blindingly obvious.
>
> being able to claim and state, boldly, "we do power of two FFT kernels
> with one instruction", i cannot emphasise enough how much of a big
> damn deal that is.  being able to do it "in-place" even more so.
> maximising the size of "Load, process, Store" batches is critical.
>
> i also found this which explains *another* group of crucial scientific
> algorithms which need bitreverse
>
> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
>
> three crucial steps are needed for this to work
>
> 1) the Load must be capable of bitreversed indices.
>

we'd need to correctly handle loading from the middle of a bigger array
using bit-reversed indices, I suggest using indexed load/store where RA is
bit-reversed:
assembly (adjust mnemonic as needed):
lwzux_brev r10.v, r6.s, r7.s
Rust pseudo-code:
let rt = 10;
let ra = 6;
let rb = 7;
let index_step: u64 = read_from_some_spr(); // maybe CTR?
let mut index: u64 = regs[ra];
for i in 0..vl {
    let address = regs[rb].wrapping_add(index.reverse_bits());
    index = index.wrapping_add(index_step);
    regs.write_u32(rt, i, mem.load_u32(address)); // store instructions
store here
}
// writing the index back instead of the address allows the next
instruction to handle
// the next 64 elements for arrays >64 elements
regs[ra] = index;

index_step is dependent on the array size:
so for a 1kB array (1 << 10), index_step = 1 << (64 - 10)
for a 16kB array (1 << 14), index_step = 1 << (64 - 14)
for a 2^N byte array, index_step = 1 << (64 - N)

remember that not all code has a constant array size, hence why index_step
can't just be an immediate.

example of indexing a 16B array at address 0x1_0000 with VL == 4:
index_step = 1 << (64 - 4) == 0x1000_0000_0000_0000

index (in RA) starts as 0
the array address (in RB) is 0x1_0000

i = 0:
address == 0x1_0000
index <- 0x1000_0000_0000_0000 (is 0x8 bit-reversed)

i = 1:
address == 0x1_0008
index <- 0x2000_0000_0000_0000 (is 0x4 bit-reversed)

i = 2:
address == 0x1_0004
index <- 0x3000_0000_0000_0000 (is 0xC bit-reversed)

i = 3:
address == 0x1_000C
index <- 0x4000_0000_0000_0000 (is 0x2 bit-reversed, but it doesn't matter
since it's unused)

Jacob