[Libre-soc-dev] [RFC] SVP64 Vertical-First Mode loops

Thu Aug 19 00:50:41 BST 2021

> On Aug 18, 2021, at 16:17, lkcl <luke.leighton at gmail.com> wrote:
> 

>> On August 18, 2021 10:02:49 PM UTC, Richard Wilbur <richard.wilbur at gmail.com> wrote:
> 
>> Each half [of FFT] could use the same scalar coefficients. 
> 
> could... but remember: for an FFT of size N you need N coefficients. now you can only hold in regfile half an FFT as if you did it with Vertical-First Mode

That’s why I proposed a coefficient cache.

>> Seems that for a
>> particular size of data set we are using recursively sized
>> transforms to compute the transform.  If they are always related by
>> powers of two then calculating the coefficients once should be
>> sufficient.  If we could calculate them and store them either in the
>> order they are used (in a non-destructive FIFO with capability to set a
>> step size) or with an easy scheme to access them via an index, we might
>> at once calculate the coefficients using our vector engine and then use
> 
> DCT unfortunately doesn't work that way.  in order to complete all butterflies you need, in each row, cos((i+0.5)/n) from i=0..n-1 where n goes up in powers of two per butterfly row.
> 
> you can share those values *in* a row but unlike an FFT you cannot *reuse* them on a *different* row due to the +0.5

That stinks for the DCT.  Thanks for reminding me!  How much coefficient sharing could be done on a single row?
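
A rough sketch of the difference, in Python rather than anything SVP64-specific.  It compares the coefficient angles needed by two adjacent butterfly rows, using exact fractions so there is no floating-point fuzz.  The function names are made up for illustration, and I am assuming the usual factor of pi inside the cos() quoted above.

```python
from fractions import Fraction


def fft_twiddle_angles(n):
    """Angles, as fractions of a full turn, of the n/2 twiddles
    exp(-2*pi*j*k/n) used by a size-n radix-2 butterfly row."""
    return {Fraction(k, n) for k in range(n // 2)}


def dct_row_angles(n):
    """Angles, as fractions of pi, of cos(pi*(i+0.5)/n) for i = 0..n-1.
    The factor of pi is an assumption; it is not in the quoted formula."""
    return {Fraction(2 * i + 1, 2 * n) for i in range(n)}


def shared(angle_fn, small, large):
    """Angles that a row of size `small` has in common with a row of size `large`."""
    return angle_fn(small) & angle_fn(large)


if __name__ == "__main__":
    # FFT: every twiddle of the smaller row is already in the larger row's table.
    print(shared(fft_twiddle_angles, 4, 8) == fft_twiddle_angles(4))
    # DCT: adjacent rows have no coefficient in common because of the +0.5.
    print(len(shared(dct_row_angles, 4, 8)))
```

For the FFT the smaller row's table is just the larger one read with a step of two, which is what the "step size" on the FIFO was meant to cover.  For the DCT the two sets are disjoint, so each row needs its own values.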

>> If we had such a coefficient cache, I think VFHint could still be
>> useful.
> 
> interesting idea, to have a special separate cache for coefficients.  it is however pretty specialist.  if it really becomes a focus for performance it's worth pursuing.
> 
> right now issuing cos instructions is "generic".  specialist single-purpose instructions make me twitchy.

I agree about specialist single-purpose instructions unless we can make a good case for how such instructions would be clearly superior performance-wise for an important algorithm!

> for 3D texture interpolation it's fine / great / obvious payoff.

If FFT turns out to win handsomely with said coefficient cache, I would suggest we implement a status register for the cache that stores relevant information, such as the characteristics of the stored coefficients:  n, ….  Then when we prepare to perform another FFT we can quickly check whether we can reuse the coefficients or need to recalculate them.

An MRI (Magnetic Resonance Imaging) workload would likely use a large number of same-sized FFTs to reconstruct the 2-D slice images.
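
A minimal software model of the status-register check, assuming the only recorded characteristic is n.  Everything here (the class name, `status_n`, the `recalculations` counter) is hypothetical and only meant to show the reuse-or-recalculate decision; it is not a proposal for the actual hardware interface.

```python
import cmath


class CoefficientCache:
    """Hypothetical model: a coefficient store plus a status field holding n."""

    def __init__(self):
        self.status_n = None      # models the status register: size of cached set
        self.coeffs = []          # models the cached twiddle factors
        self.recalculations = 0   # bookkeeping for this demo only

    def get(self, n):
        # Check the status field first; recalculate only on a mismatch.
        if self.status_n != n:
            self.coeffs = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
            self.status_n = n
            self.recalculations += 1
        return self.coeffs


def _fft(x, w, stride):
    # Recursive radix-2 FFT; smaller stages read the same table with a larger step.
    n = len(x)
    if n == 1:
        return list(x)
    even = _fft(x[0::2], w, stride * 2)
    odd = _fft(x[1::2], w, stride * 2)
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = w[k * stride] * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def fft(x, cache):
    """FFT of a power-of-two length sequence, using cached coefficients."""
    return _fft(x, cache.get(len(x)), 1)


if __name__ == "__main__":
    cache = CoefficientCache()
    for _ in range(100):                 # many same-sized transforms
        fft([1, 2, 3, 4, 5, 6, 7, 8], cache)
    print(cache.recalculations)          # coefficients computed once
```

With a run of same-sized transforms, as in the MRI case, the check hits every time after the first call and the coefficients are computed only once.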