[Libre-soc-dev] [RFC] SVP64 Vertical-First Mode loops

Thu Aug 19 01:13:56 BST 2021

On August 18, 2021 11:50:41 PM UTC, Richard Wilbur <richard.wilbur at gmail.com> wrote:

>> you can share those values *in* a row but unlike an FFT you cannot
>*reuse* them on a *different* row due to the +0.5
>
>That stinks for the DCT.

yes it does.

>  Thanks for reminding me!  How much
>coefficient sharing could be done on a single row?

each row being a butterfly goes double the distance and therefore needs half the coefficients compared to the previous row.

* when you have 16 values you need twin mul-add on 8 values and therefore 8 coefficients
* next row you have 2 batches of 8 values therefore the 4 coefficients can be used 2x
* next row you have 4 batches of 4 values, 2 coefficients can be used 4x
* next row 8 batches of 2 values, the 1 coefficient can be used 8x

8 4 2 1 on batch sizes and num coefficients
1 2 4 8 times reuse on coefficients

>I agree about specialist single-purpose instructions unless we can make
>a good case for how such instructions would be clearly superior
>performance-wise for an important algorithm!

if the cache was purely transparent i.e. spotted input values and returned a lookup that would need no ISA augmentation.

only thing being: multiple lookups of 64 bit FP numbers is one hell of a lot of gates.  10 gates per XOR bit, 640 gates per reg, you want say 32 cache entries, that's 20k gates just in lookups let alone DFFs.

that's likely competing with an efficient COS pipeline for gate count

>An MRI (Magnetic Resonance Imaging) workload would likely use a large
>number of same-sized FFTs to reconstruct the 2-D slice images.

FFT you can do a couple of tricks, one of which is a Matrix variant of FFT.

* FFT small rows
* transpose
* FFT small col.. errr now rows again
* transpose

something like that.

if it's a square matrix you can reuse the same coefficients.

l.