# [Libre-soc-dev] [RFC] SVP64 Vertical-First Mode loops

Richard Wilbur richard.wilbur at gmail.com
Sat Aug 21 22:30:21 BST 2021

```> On Aug 19, 2021, at 04:43, lkcl <luke.leighton at gmail.com> wrote:
>
>> On August 19, 2021 1:46:59 AM UTC, Richard Wilbur <richard.wilbur at gmail.com> wrote:
>>
>>>>> On Aug 18, 2021, at 18:13, lkcl <luke.leighton at gmail.com> wrote:
>>>
>>> 8 4 2 1 on batch sizes and num coefficients
>>> 1 2 4 8 times reuse on coefficients
>>
>> Is this for FFT?
>
> no, DCT.  N ln N coefficients actually N/2 ln N
>
>> Very cool, I suspected it would be pretty good reuse.
>
> it's not.

Well, from my analysis above, the DCT for N elements (where N is a power of 2) seems to require calculating N-1 unique coefficients, which are reused N+1 times in total over the course of the transform, for 2*N total coefficient uses.

If I’m correct, caching is a safe bet to eliminate roughly half of the hard work of calculating coefficients!

Now, let us turn to the RADIX2 FFT.

> RADIX2 FFT on the other hand there are N coefficients and you *can* reuse them, for row 2 you jump every other coefficient for all 2 sub-crossbars, for row 3 you jump every 4th coefficient for sub-sub-crossbars but they are the same N coefficients.

So, if I understand you correctly, with dataset comprising 16 samples:

N = 16
coefficients: 16  8  4  2  1
reuse in row:  1  2  4  8 16

unique coefficients = 16 = N
total coefficient uses = 5 * 16 = 80 = N * (log2(N) + 1)
reuses = 1 * (2 + 4 + 8 + 16) + 1 * (2 + 4 + 8) + 2 * (2 + 4) + 4 * 2
       = 1 * 30 + 1 * 14 + 2 * 6 + 4 * 2
       = 30 + 14 + 12 + 8
       = 64
       = N * log2(N)

So, if my calculations are correct, the DCT and the FFT at the same sample size require essentially the same number of unique coefficients to be calculated (the hard part).  The reuse is considerably better in the FFT, but exploiting it could have a big impact on both algorithms.

>> Except that the input numbers are rationals with a common denominator
>> for a particular row in DCT.  I think we could effectively store them
>> with a particular structure based on the denominator, indexed with the
>> integer count along the row.  More of a coefficient array/RAM than
>> cache (your usage of this term was more loaded than mine, I simply was
>> referring to a convenient place to stow the numbers where we could
>> easily and quickly get them back when needed).
>
> indeed... at the cost of designing and adding yet more instructions, this time with an extremely small probability that they can be put to use elsewhere.

The overhead of the following proposal is, I believe, two new instructions, described near the bottom of this message so that they benefit from the context established along the way.

> the butterfly REMAP schedule is generic and i can foresee it being used elsewhere.

Here’s where we can use the properties of the REMAP schedule to our considerable advantage.  I envision a hard result cache with tags structured as follows:
algorithm id = 8 bits
loop1 index = 8 bits
loop2 index = 8 bits
loop3 index = 8 bits

This buys us some headroom to grow if and when we decide to further expand our register set, and in effect puts off expanding it simply to hold coefficients.  (It also means that initial implementations could be sparser, depending on the size of the register set and the semantics of the REMAP schedule: the loop indices might require only 7, 6, and 5 bits for a register set of size 128 = 2^7, thus 18 bits instead of 24.)  It would likely be useful to make the allocation of bits below the algorithm id configurable when the algorithm is initialized in the hard result management unit (a knock-off of the memory management unit?).  That would allow us to tailor the 16.7M-entry tag space per algorithm to the loop or index sizes of particular algorithms.

Also associated with the hard result management unit is a table of pointers to subroutines (indexed by algorithm id) that calculate the hard result should it not be present in the hard result cache.  The semantics of each routine are restricted to using whatever the REMAP schedule implies (indices in particular registers).

When a particular hard calculation routine (algorithm id) is changed, all of its associated cached hard results would be invalidated.

When a particular algorithm is set up in a REMAP section, the decoder or dispatch unit could schedule the calculation of vectors of coefficients to pre-load the hard result cache, provided that doing so would not interfere with the calculations (functional-unit contention) or data flow (register-usage contention) of the algorithm.

The proposed two new instructions:
1.  Load or calculate hard result, with parameters: destination register, algorithm id, loop1 index, loop2 index, loop3 index.
2.  Register algorithm with the hard result management unit, with parameters including: algorithm id, entry point of the routine that calculates the hard result, and the bit widths of the loop indices in the tag.

The effects of the hard result cache that I can think of right now:
1.  It cuts down the pressure on the register set to hold algorithm-specific coefficients that are often reused many times in the course of a particular calculation.
2.  On the other side of the same coin, it reduces the number of times we have to calculate the same coefficient, considerably lightening the load on the arithmetic units from redundant calculations and reserving arithmetic and memory bandwidth for the data to which the algorithm is being applied.
3.  Dedicating a higher percentage of our register set to data and results, and avoiding the recalculation of coefficients, should work together to improve the performance of the libre-soc Simple-V architecture.

(The hard result cache needn’t be tied specifically to REMAP; it could be used by normal vector or scalar code.)

Richard
```
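The coefficient arithmetic in the message (N-1 unique coefficients and 2N total uses for the DCT; N unique and N(log2 N + 1) uses for the FFT) can be sanity-checked with a short Python sketch.  The per-row tallies come straight from the thread (DCT rows: 8, 4, 2, 1 coefficients reused 1, 2, 4, 8 times; FFT rows: 16, 8, 4, 2, 1 reused 1, 2, 4, 8, 16 times); everything else is plain arithmetic.

```python
from math import log2

N = 16
levels = int(log2(N))  # 4 butterfly levels for N = 16

# DCT: level k has N >> (k+1) coefficients, each reused 1 << k times.
dct_unique = sum(N >> (k + 1) for k in range(levels))             # 8+4+2+1 = 15
dct_uses = sum((N >> (k + 1)) * (1 << k) for k in range(levels))  # 8+8+8+8 = 32
assert dct_unique == N - 1               # N-1 unique coefficients
assert dct_uses == 2 * N                 # 2N total coefficient uses
assert dct_uses - dct_unique == N + 1    # N+1 reuses

# FFT: every row draws on the same N twiddle factors.
fft_unique = N
fft_uses = sum((N >> k) * (1 << k) for k in range(levels + 1))    # 5 * 16 = 80
assert fft_uses == N * (levels + 1)          # N (log2 N + 1) total uses
assert fft_uses - fft_unique == N * levels   # N log2 N = 64 reuses
```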
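The proposed hard result management unit and its two instructions can be sketched in software to make the intended semantics concrete.  Everything below — the class and method names, the dict-based cache, the tag-packing layout, and the example coefficient routine — is a hypothetical illustration of the proposal's load-or-calculate, register, and invalidate behaviour, not part of the proposal itself.

```python
import math

class HardResultUnit:
    """Toy model of the proposed hard result management unit.
    register_algorithm() plays the role of proposed instruction 2,
    lookup() the role of instruction 1 (load or calculate)."""

    def __init__(self):
        self.routines = {}  # algo_id -> (routine, loop-index field widths)
        self.cache = {}     # packed tag -> cached hard result

    @staticmethod
    def make_tag(algo_id, i1, i2, i3, widths):
        # Pack the tag: algorithm id in the top bits, then the three
        # loop indices in their (per-algorithm configurable) fields.
        tag = algo_id
        for idx, w in ((i1, widths[0]), (i2, widths[1]), (i3, widths[2])):
            assert 0 <= idx < (1 << w), "loop index overflows its tag field"
            tag = (tag << w) | idx
        return tag

    def register_algorithm(self, algo_id, routine, widths=(8, 8, 8)):
        # Changing a routine invalidates its cached hard results.
        if algo_id in self.routines:
            old_shift = sum(self.routines[algo_id][1])
            self.cache = {t: v for t, v in self.cache.items()
                          if (t >> old_shift) != algo_id}
        self.routines[algo_id] = (routine, widths)

    def lookup(self, algo_id, i1, i2, i3):
        # Load the hard result if cached, otherwise calculate and cache it.
        routine, widths = self.routines[algo_id]
        tag = self.make_tag(algo_id, i1, i2, i3, widths)
        if tag not in self.cache:
            self.cache[tag] = routine(i1, i2, i3)
        return self.cache[tag]


# Usage: a made-up DCT-like coefficient routine; the second lookup with
# the same indices is served from the cache without recalculation.
calls = []
def dct_coeff(row, col, _unused):
    calls.append((row, col))
    return math.cos(math.pi * (2 * col + 1) * row / 32)

unit = HardResultUnit()
unit.register_algorithm(1, dct_coeff, widths=(5, 5, 5))
a = unit.lookup(1, 3, 7, 0)
b = unit.lookup(1, 3, 7, 0)
assert a == b and len(calls) == 1  # second lookup was a cache hit
```

Keeping the field widths per-algorithm, as the proposal suggests, is what lets one 8-bit algorithm id front very different loop-index geometries.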