[Libre-soc-dev] [RFC] SVP64 Vertical-First Mode loops

Wed Aug 18 18:08:09 BST 2021

On Wed, Aug 18, 2021 at 5:53 PM Jacob Lifshay <programmerjake at gmail.com> wrote:

> Even if we get a HW cos pipeline, it will almost always be much faster to
> load the constant from memory...

turns out from an analysis by Mitch Alsup that this is a mistaken assumption.
i also believed it to be true until he explained it on comp.arch.

for particularly large DCTs (used in ffmpeg) the regularity of the LDs results
in regular power-of-two hammering of L1 cache lines so badly that it results
in the *L2* cache getting hammered as well.
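
a rough way to see the power-of-two effect, as a sketch only: assume a
hypothetical set-associative cache with 64-byte lines and 64 sets (numbers
picked for illustration, not taken from any particular CPU) and count how
many distinct sets a strided run of LDs lands in. `sets_touched` is a
made-up helper name.

```python
def sets_touched(stride_bytes, n_accesses, line_bytes=64, n_sets=64):
    """Count distinct cache sets hit by n_accesses LDs at a fixed byte stride.

    Assumes a simple (hypothetical) cache where the set index is
    (address // line_bytes) % n_sets, starting from address 0.
    """
    return len({(i * stride_bytes // line_bytes) % n_sets
                for i in range(n_accesses)})


# power-of-two stride: every LD maps to the same set
pow2 = sets_touched(4096, 512)
# same stride plus one cache line: LDs spread across all the sets
skewed = sets_touched(4096 + 64, 512)
```

under those assumptions the 4096-byte stride puts all 512 LDs in a single
set, so only that set's handful of ways is usable and everything else gets
evicted to the next level, whereas the skewed stride uses all 64 sets.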

for extremely large DCTs and FFTs, you end up strip-mining the L2 cache *as well*.

under these circumstances it is imperative to reduce the number of LDs.
computing the cos values on-demand *significantly* improves performance
by cutting the total number of LDs, compared to assuming that pre-computed
tables are "always the best option under all circumstances".
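
to make the two options concrete, here is a minimal sketch of a naive
O(N^2) DCT-II done both ways. it is not ffmpeg's code and not how a HW cos
pipeline would be driven; the function names are hypothetical, and python
cannot show the cache behaviour. it only shows where the coefficient LDs
sit in the table version and that the on-demand version produces the same
numbers without them.

```python
import math


def make_cos_table(n):
    # cos(pi * j / (2n)) is periodic in j with period 4n, so 4n entries suffice
    return [math.cos(math.pi * j / (2 * n)) for j in range(4 * n)]


def dct2_table(x, table):
    """Unnormalised DCT-II: one coefficient load from `table` per multiply."""
    n = len(x)
    return [sum(x[i] * table[((2 * i + 1) * k) % (4 * n)] for i in range(n))
            for k in range(n)]


def dct2_on_demand(x):
    """Same transform, computing each cos value instead of loading it."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
            for k in range(n)]


x = [1.0, 2.0, 3.0, 4.0]
from_table = dct2_table(x, make_cos_table(len(x)))
computed = dct2_on_demand(x)
```

in the table version every multiply costs a data LD plus a coefficient LD;
in the on-demand version only the data LD remains, which is the reduction
being argued for above.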

l.