[Libre-soc-dev] [RFC] REMAP, SHAPE, complex (full) FFT

Luke Kenneth Casson Leighton lkcl at lkcl.net
Sat Jul 10 23:46:26 BST 2021


i am half-way through the FFT unit test, Vertical-First is in place,
a crude "svremap" instruction is in place, twin +/- add is in place,
twin +/- mul-add is in place, all the components needed are in
place and it's all gone to hell as far as optimisation is concerned.
here's the code that needs to be run in each inner loop:

                # both products are computed before vec_r[jh] is overwritten
                tpre =  vec_r[jh] * cos_r[k] + vec_i[jh] * sin_i[k]
                tpim = -vec_r[jh] * sin_i[k] + vec_i[jh] * cos_r[k]

                vec_r[jh] = vec_r[jl] - tpre
                vec_r[jl] += tpre

                vec_i[jh] = vec_i[jl] - tpim
                vec_i[jl] += tpim
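
for context, here is a minimal sketch of where that inner loop sits in a complete in-place radix-2 FFT.  it assumes the usual decimation-in-time layout (bit-reversed reorder first, then the triple loop) and that cos_r/sin_i are tables of cos and sin of 2*pi*k/n.  the names fft_radix2, reverse_bits and naive_dft are made up for illustration and are not taken from the unit test.

```python
import cmath
import math


def reverse_bits(value, width):
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def fft_radix2(vec_r, vec_i):
    """In-place radix-2 FFT on split real/imaginary lists (illustrative)."""
    n = len(vec_r)
    if n < 1 or n != len(vec_i) or n & (n - 1):
        raise ValueError("lengths must match and be a power of two")
    levels = n.bit_length() - 1
    # twiddle tables (assumed layout): cos and sin of 2*pi*k/n
    cos_r = [math.cos(2 * math.pi * k / n) for k in range(n // 2)]
    sin_i = [math.sin(2 * math.pi * k / n) for k in range(n // 2)]
    # bit-reversed reordering of the input
    for i in range(n):
        j = reverse_bits(i, levels)
        if j > i:
            vec_r[i], vec_r[j] = vec_r[j], vec_r[i]
            vec_i[i], vec_i[j] = vec_i[j], vec_i[i]
    # triple nested loop: stage size, block start, butterfly within the block
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for jl in range(i, i + halfsize):
                jh = jl + halfsize
                # both products use the old vec_r[jh] and vec_i[jh]
                tpre = vec_r[jh] * cos_r[k] + vec_i[jh] * sin_i[k]
                tpim = -vec_r[jh] * sin_i[k] + vec_i[jh] * cos_r[k]
                vec_r[jh] = vec_r[jl] - tpre
                vec_r[jl] += tpre
                vec_i[jh] = vec_i[jl] - tpim
                vec_i[jl] += tpim
                k += tablestep
        size *= 2


def naive_dft(xs):
    """O(n^2) reference DFT on a list of complex numbers, for checking only."""
    n = len(xs)
    return [sum(xs[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

calling fft_radix2(re, im) on two equal-length lists overwrites them with the transform, and comparing against naive_dft on the same data is a quick way to check the butterfly signs.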

now, if that was DFT - or if we had first-order support for complex
numbers, it would be:

        lst = SVP64Asm( ["setvl 0, 0, 11, 1, 1, 1",
                        # twin in-place +/- mul-add
                        "svremap 8, 1, 1, 1",
                        "sv.ffmadds 0.v, 0.v, 0.v, 8.v",
                        # svstep and loop
                        "setvl. 0, 0, 0, 1, 0, 0",
                        "bc 4, 2, -16"
                        ])

which is like... utterly cool.  yes, really, that's the entire DFT triple
nested loop, in 5 instructions.
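
the collapse from six statements to a single twin mul-add can be checked in plain python, where complex is a built-in type.  this is a minimal sketch, assuming the twiddle factor is w = cos_r[k] - j*sin_i[k] and that "twin +/- mul-add" means one product feeding both an add and a subtract.  butterfly_split and butterfly_complex are hypothetical names.

```python
def butterfly_split(vec_r, vec_i, jl, jh, c, s):
    """Butterfly on split real/imaginary lists, as in the complex FFT loop."""
    tpre = vec_r[jh] * c + vec_i[jh] * s
    tpim = -vec_r[jh] * s + vec_i[jh] * c
    vec_r[jh] = vec_r[jl] - tpre
    vec_r[jl] += tpre
    vec_i[jh] = vec_i[jl] - tpim
    vec_i[jl] += tpim


def butterfly_complex(vec, jl, jh, w):
    """Same butterfly on a list of complex values: one multiply, one +/- pair."""
    t = vec[jh] * w          # the single multiply
    vec[jh] = vec[jl] - t    # "minus" half of the twin add
    vec[jl] = vec[jl] + t    # "plus" half of the twin add
```

with w = complex(c, -s) the two functions produce the same values, which is why the real-only version needs four multiplies and extra adds where the complex version needs one operation.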

this is what it currently looks like for complex FFT:

        lst = SVP64Asm( ["setvl 0, 0, 11, 1, 1, 1",
                        # tpre
                        "svremap 8, 1, 1, 1",
                        "sv.fmuls 24, 0.v, 16.v",
                        "svremap 8, 1, 1, 1",
                        "sv.fmuls 25, 8.v, 20.v",
                        "fadds 24, 24, 25",
                        # tpim
                        "svremap 8, 1, 1, 1",
                        "sv.fmuls 26, 0.v, 20.v",
                        "svremap 8, 1, 1, 1",
                        "sv.fmuls 27, 8.v, 16.v",
                        "fsubs 26, 26, 27",
                        # vec_r jh/jl
                        "svremap 8, 1, 1, 1",
                        "sv.ffadds 0.v, 24, 25",
                        # vec_i jh/jl
                        "svremap 8, 1, 1, 1",
                        "sv.ffadds 8.v, 26, 27",

                        # svstep loop
                        "setvl. 0, 0, 0, 1, 0, 0",
                        "bc 4, 2, -16"
                        ])

blegh.  anything prefixed "sv." is a 64-bit instruction.  svremap is
only 32-bit but it's a one-off application to the "following instruction"
and consequently needs to be repeated.
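
the cost of repeating svremap can be put in numbers.  the sketch below counts bytes for the two listings (mnemonics only), assuming 8 bytes for anything prefixed "sv." and 4 bytes for everything else, and also works out the displacement the final branch needs to land back on the first svremap.  encoded_size, listing_size and loop_back_offset are hypothetical helpers, not part of SVP64Asm.

```python
def encoded_size(mnemonic):
    """Bytes per instruction: 64-bit if prefixed with "sv.", else 32-bit."""
    return 8 if mnemonic.startswith("sv.") else 4


def listing_size(listing):
    """Total encoded size of a listing in bytes."""
    return sum(encoded_size(m) for m in listing)


def loop_back_offset(listing, target_index):
    """Byte displacement from the last instruction back to listing[target_index]."""
    target_addr = listing_size(listing[:target_index])
    branch_addr = listing_size(listing[:-1])
    return target_addr - branch_addr


# the 5-instruction DFT-style loop
DFT_LOOP = ["setvl", "svremap", "sv.ffmadds", "setvl.", "bc"]

# the complex FFT loop, with svremap repeated before every sv. instruction
FFT_LOOP = ["setvl",
            "svremap", "sv.fmuls", "svremap", "sv.fmuls", "fadds",
            "svremap", "sv.fmuls", "svremap", "sv.fmuls", "fsubs",
            "svremap", "sv.ffadds",
            "svremap", "sv.ffadds",
            "setvl.", "bc"]

if __name__ == "__main__":
    for name, loop in (("DFT", DFT_LOOP), ("FFT", FFT_LOOP)):
        print(name, listing_size(loop), "bytes, loop-back",
              loop_back_offset(loop, 1))
```

under those assumptions the short loop is 24 bytes and its loop-back displacement comes out as -16, matching the "bc 4, 2, -16" in the listing.  the long loop is 92 bytes, and by the same arithmetic a branch back to its first svremap would need -84 rather than -16.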

the issue is that the REMAP needs to be set up just the once, with
only *occasional* redirection of which registers the SVSHAPE0-3
actually apply to.... obviously it's even better if the instructions
they're applied to do not have to be altered.

the route i'd like to take is to split svremap out into two:

1) from the current svremap, create a new "setvl-with-remap-butterfly-for-FFT"
    instruction, to be merged with the functionality of setvl.  this will
    set up the SVSHAPE0/1/2 as can currently be seen here:
    https://libre-soc.org/openpower/isa/simplev/

    (yes there would also be a corresponding setvl-with-remap-for-matrix-mul)

2) a new SPR, called SVREMAP, whose format and purpose would be to set
    what is currently established in REMAP "propagation"
    https://libre-soc.org/openpower/sv/propagation/

SVREMAP needs a *lot* of space.  up to *five* registers need REMAP
(ffmadds has 3 inputs FRA, FRB and FRC and produces 2 outputs FRT and FRS)
and there are 2 bits to select which SVSHAPE (0-3).  that's a total of 5*2 bits,
plus an extra bit per register (qty 5) to say if they're enabled or
not.  that's 15 bits.
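
the 15-bit count can be sanity-checked by actually packing the fields.  the layout below (2-bit SVSHAPE selectors in bits 0-9, enable bits in bits 10-14, operand order FRA, FRB, FRC, FRT, FRS) is purely an assumption for illustration, not a proposed SVREMAP format, and pack_svremap / unpack_svremap are hypothetical names.

```python
OPERANDS = ("FRA", "FRB", "FRC", "FRT", "FRS")  # assumed field order
SELECT_BITS = 2                                 # selects SVSHAPE0-3
ENABLE_BASE = SELECT_BITS * len(OPERANDS)       # enables start at bit 10


def pack_svremap(mapping):
    """Pack {operand: svshape index 0-3} into an int; absent operands are disabled."""
    value = 0
    for name, shape in mapping.items():
        if name not in OPERANDS:
            raise ValueError("unknown operand: %s" % name)
        if not 0 <= shape <= 3:
            raise ValueError("SVSHAPE index must be 0-3")
        pos = OPERANDS.index(name)
        value |= shape << (SELECT_BITS * pos)   # 2-bit selector
        value |= 1 << (ENABLE_BASE + pos)       # 1-bit enable
    return value


def unpack_svremap(value):
    """Inverse of pack_svremap: return only the enabled operands."""
    mapping = {}
    for pos, name in enumerate(OPERANDS):
        if (value >> (ENABLE_BASE + pos)) & 1:
            mapping[name] = (value >> (SELECT_BITS * pos)) & 0b11
    return mapping
```

with all five operands enabled the packed value still fits below 1 << 15, and unpack_svremap(pack_svremap(m)) returns m for any valid mapping.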

that's as far as i've got, so far.

l.

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


