[Libre-soc-dev] [RFC] SVP64 Vertical-First Mode loops
luke.leighton at gmail.com
Wed Aug 18 20:06:29 BST 2021
On August 18, 2021 5:08:09 PM UTC, Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
>extreme large DCTs and FFTs, you end up strip-mining the L2 cache *as
basically, to do large DCT / FFT recursively, you split into two halves, do each half at half the DCT/FFT size, then recombine the results.
the further down the recursion you go, the larger the element offsets get: first you get offsets of 2 for every element, then 4, then 8, etc.
by the time you get to an offset of 64 you've hit the L1 cache row size, and thereafter EVERY SINGLE LD/ST for the ENTIRE sub-FFT/DCT hits the EXACT same cache line.
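to make the offset doubling concrete, here is a minimal sketch of a recursive radix-2 FFT that reads its input through a (start, stride) pair instead of copying sub-arrays. the names fft_strided and element_offsets are made up for illustration, and the twiddle factors are computed on the fly purely to keep the sketch self-contained. it records the element stride seen at each recursion depth, in elements, not bytes:

```python
import cmath


def fft_strided(data, start=0, stride=1, n=None, strides_seen=None, depth=0):
    """Recursive radix-2 decimation-in-time FFT over data[start::stride].

    Hypothetical illustration only: len(data) must be a power of two.
    """
    if n is None:
        n = len(data)
    if strides_seen is not None:
        # remember the element offset in use at this recursion depth
        strides_seen.setdefault(depth, stride)
    if n == 1:
        return [complex(data[start])]
    # split into two halves, each half the size and at twice the stride
    even = fft_strided(data, start, stride * 2, n // 2, strides_seen, depth + 1)
    odd = fft_strided(data, start + stride, stride * 2, n // 2, strides_seen, depth + 1)
    # recombine the two half-size results
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle computed, not loaded
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def element_offsets(n):
    """Return the element stride used at each recursion depth for size n."""
    strides = {}
    fft_strided([0.0] * n, strides_seen=strides)
    return [strides[d] for d in sorted(strides)]
```

calling element_offsets(256) lists the stride per depth, which doubles at every level of the recursion.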
Mitch pointed out very plainly and simply that using a reasonably efficient pipelined cos implementation is therefore way faster than hammering L1 and L2 even more than they already are.
so for this one example alone it justifies VF Mode's existence.
having thought it through i don't think it's going to be possible to do batches. only one element (one srcstep, one dststep) at a time.
which in turn makes the VFHint field also kinda unnecessary.
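as a rough illustration of "one element at a time", here is a toy model of the two loop orderings. it is not the SVP64 specification: the function names, the register dictionary and the two-instruction program are all hypothetical, and a single step counter stands in for both srcstep and dststep (assumed equal here).

```python
def run_horizontal_first(program, regs, vl):
    """Each instruction runs over all vl elements before the next one starts."""
    trace = []
    for name, op in program:
        for step in range(vl):
            op(regs, step)
            trace.append((name, step))
    return trace


def run_vertical_first(program, regs, vl):
    """All instructions run on one element, then the step advances by one."""
    trace = []
    step = 0  # stands in for srcstep and dststep (assumed equal in this sketch)
    while step < vl:
        for name, op in program:
            op(regs, step)
            trace.append((name, step))
        step += 1  # models an explicit step advance at the end of the loop body
    return trace


def op_mul(regs, i):
    # t[i] = a[i] * b[i]
    regs["t"][i] = regs["a"][i] * regs["b"][i]


def op_add(regs, i):
    # r[i] = t[i] + 1
    regs["r"][i] = regs["t"][i] + 1


def make_regs():
    """Fresh hypothetical register contents for a vector length of 3."""
    return {"a": [1, 2, 3], "b": [10, 20, 30], "t": [0, 0, 0], "r": [0, 0, 0]}


PROGRAM = [("mul", op_mul), ("add", op_add)]
```

both orderings leave the same values in r for this program; only the order of the (instruction, element) pairs in the trace differs, with the vertical-first trace finishing every instruction for element 0 before touching element 1.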