# [Libre-soc-dev] [RFC] SVP64 Vertical-First Mode loops

Richard Wilbur richard.wilbur at gmail.com
Thu Aug 19 02:46:59 BST 2021

```> On Aug 18, 2021, at 18:13, lkcl <luke.leighton at gmail.com> wrote:

>> On August 18, 2021 11:50:41 PM UTC, Richard Wilbur <richard.wilbur at gmail.com> wrote:
>> How much coefficient sharing could be done on a single row?
>
> each row being a butterfly goes double the distance and therefore needs half the coefficients compared to the previous row.
>
> * when you have 16 values you need twin mul-add on 8 values and therefore 8 coefficients
> * next row you have 2 batches of 8 values therefore the 4 coefficients can be used 2x
> * next row you have 4 batches of 4 values, 2 coefficients can be used 4x
> * next row 8 batches of 2 values, the 1 coefficient can be used 8x
>
> 8 4 2 1 on batch sizes and num coefficients
> 1 2 4 8 times reuse on coefficients

Is this for FFT?  Very cool, I suspected it would be pretty good reuse.

I wasn’t specific enough when I asked, “How much coefficient reuse in a particular row?”  I meant to ask concerning the DCT since it isn’t an option to share coefficients between rows in that algorithm.

>> I agree about specialist single-purpose instructions unless we can make
>> a good case for how such instructions would be clearly superior
>> performance-wise for an important algorithm!
>
> if the cache was purely transparent i.e. spotted input values and returned a lookup that would need no ISA augmentation.
>
> only thing being: multiple lookups of 64 bit FP numbers is one hell of a lot of gates.  10 gates per XOR bit, 640 gates per reg, you want say 32 cache entries, that's 20k gates just in lookups let alone DFFs.
>
> that's likely competing with an efficient COS pipeline for gate count

Except that the input numbers are rationals with a common denominator for a particular row in DCT.  I think we could effectively store them with a particular structure based on the denominator, indexed with the integer count along the row.  More of a coefficient array/RAM than cache (your usage of this term was more loaded than mine, I simply was referring to a convenient place to stow the numbers where we could easily and quickly get them back when needed).

>> An MRI (Magnetic Resonance Imaging) workload would likely use a large
>> number of same-sized FFTs to reconstruct the 2-D slice images.
>
> FFT you can do a couple of tricks, one of which is a Matrix variant of FFT.
>
> * FFT small rows
> * transpose
> * FFT small col.. errr now rows again
> * transpose
>
> something like that.
>
> if it's a squa

Did you mean to describe the case where the matrix is square?

```