[Libre-soc-dev] [RFC] SVP64 Vertical-First Mode loops
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Wed Aug 18 18:08:09 BST 2021
On Wed, Aug 18, 2021 at 5:53 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> Even if we get a HW cos pipeline, it will almost always be much faster to
> load the constant from memory...
turns out from an analysis by Mitch Alsup that this is a mistaken assumption.
i also believed it to be true until he explained it on comp.arch.
for particularly large DCTs (used in ffmpeg) the regularity of the LDs results
in regular power-of-two hammering of L1 cache lines so badly that it results
in the *L2* cache getting hammered as well.
for extremely large DCTs and FFTs, you end up strip-mining the L2 cache *as well*.
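
One plausible reading of the "power-of-two hammering" is set conflict: loads separated by a power-of-two stride keep landing in the same cache set. The sketch below only illustrates that arithmetic. The geometry (`LINE_BYTES`, `NUM_SETS`) and the helper names `cache_set` and `sets_touched` are hypothetical, picked for illustration and not taken from any real core or from Mitch Alsup's analysis.

```python
# Hypothetical cache geometry, chosen only for illustration.
LINE_BYTES = 64   # bytes per cache line (assumed)
NUM_SETS = 64     # number of sets in the cache (assumed)


def cache_set(addr):
    """Set index an address maps to in a simple modulo-indexed cache."""
    return (addr // LINE_BYTES) % NUM_SETS


def sets_touched(stride_bytes, count):
    """Distinct sets hit by `count` loads spaced `stride_bytes` apart."""
    return {cache_set(i * stride_bytes) for i in range(count)}


# A power-of-two stride equal to LINE_BYTES * NUM_SETS: every load
# maps to the same set, so the loads evict one another.
pow2_sets = sets_touched(4096, 64)

# The same stride plus one cache line: the loads spread across sets.
skewed_sets = sets_touched(4096 + LINE_BYTES, 64)

print(len(pow2_sets), len(skewed_sets))
```

With these assumed numbers the power-of-two stride touches a single set while the skewed stride touches all of them, which is the kind of regularity that would push the traffic out to the next cache level.
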
under these circumstances it is imperative to reduce the number of LDs:
computing the cos values on-demand *significantly* improves performance
by reducing the total number of LDs, compared to assuming that pre-computed
tables are "always the best option under all circumstances".
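
The two strategies being compared can be sketched as follows. This is a naive O(n^2) DCT-II written for clarity, not the fast DCT that ffmpeg uses, and the function names `dct2_table` and `dct2_on_demand` are hypothetical. It counts coefficient loads only; it does not model cache behaviour or the cost of a hardware cos pipeline, so it shows where the LDs disappear rather than how much time is saved.

```python
import math


def dct2_table(x):
    """Naive DCT-II using a pre-computed cosine table.

    Returns (output, coefficient_loads). Only table reads are counted;
    reads of the input vector x are the same in both variants.
    """
    n = len(x)
    table = [[math.cos(math.pi * (2 * i + 1) * k / (2 * n))
              for i in range(n)] for k in range(n)]
    loads = 0
    out = []
    for k in range(n):
        acc = 0.0
        for i in range(n):
            acc += x[i] * table[k][i]   # one coefficient load per term
            loads += 1
        out.append(acc)
    return out, loads


def dct2_on_demand(x):
    """Same transform, computing each cosine when it is needed."""
    n = len(x)
    out = []
    for k in range(n):
        acc = 0.0
        for i in range(n):
            # no table read: the coefficient is computed in place
            acc += x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(acc)
    return out, 0


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
table_out, table_loads = dct2_table(samples)
demand_out, demand_loads = dct2_on_demand(samples)
print(table_loads, demand_loads)
```

Both variants produce the same output; the table version issues n*n coefficient loads where the on-demand version issues none, trading memory traffic for arithmetic.
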