[Libre-soc-dev] [RFC] SVP64 Vertical-First Mode loops
luke.leighton at gmail.com
Wed Aug 18 17:14:50 BST 2021
whilst doing the video above i encountered a design flaw in Vertical-First batching which needs fixing.
Vertical-First is important for scenarios where even with 128 registers there is still not enough space to Vectorise all input, output and temporary regs in a given loop, if done Horizontally.
a solution is to keep *most* of the input, temp and output regs Vectorised but make some of them scalars, with temp regs being first in line to become scalars.
one crucial example here is the DCT cosine values, which form quite a big table (O(N log N)) and therefore take up considerably more registers if done in Horizontal Mode.
instead of pre-calculating the entire table, which itself results in considerably more LDs, and in strip-mining of the L1 Cache, Vertical-First Mode allows each cosine value to be calculated *on demand* as a scalar element, for a SPECIFIC src/dststep at the EXACT moment it is needed.
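to illustrate the difference in register pressure, here is a rough sketch in plain python (not SVP64 assembler, and not the simulator). everything in it is an assumption made for illustration: the function names, the simple per-element multiply, and the cosine formula, which is only a stand-in and not the actual DCT coefficient schedule. the "horizontal" version holds the whole cosine table live at once, whereas the "vertical-first" version computes one scalar cosine per element step, with the explicit `step += 1` standing in for the explicit stepping of src/dststep.

```python
import math


def cos_coeff(i, n):
    # hypothetical stand-in for a DCT cosine value (illustrative formula only)
    return math.cos(math.pi * (i + 0.5) / n)


def horizontal(x):
    """each 'instruction' runs over ALL elements before the next one starts,
    so the full cosine table must be live (as a vector) at the same time."""
    n = len(x)
    table = [cos_coeff(i, n) for i in range(n)]   # n temporaries held at once
    return [x[i] * table[i] for i in range(n)]


def vertical_first(x):
    """all 'instructions' run on ONE element, then the step is advanced
    explicitly, so the cosine value only ever needs a single scalar temp."""
    n = len(x)
    out = [0.0] * n
    step = 0
    while step < n:
        c = cos_coeff(step, n)   # scalar, computed on demand for this step
        out[step] = x[step] * c  # the one element selected by the step
        step += 1                # explicit step advance (assumed model)
    return out


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(horizontal(data) == vertical_first(data))
```

both functions produce the same result; the only difference is how many cosine values are held at any one time (n versus 1), which is the point being made about register usage.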
what i also wanted to allow is for *batches* of such scalar values to be calculated, but i realise as i am writing this that the concepts "batch" and "scalar" are mutually incompatible by definition.