[Libre-soc-dev] [RFC] SVP64 Vertical-First Mode loops
luke.leighton at gmail.com
Thu Aug 19 14:15:01 BST 2021
On August 19, 2021 1:04:02 AM UTC, Jacob Lifshay <programmerjake at gmail.com> wrote:
>i completely spaced that you were talking about vertical-first
>if you only are running one element per inner loop, it makes me think it
>won't be any faster than scalar code ... not a good look.
that's where out-of-order multi-issue comes into play.
Mitch Alsup's VVM system is designed around exactly and precisely the Vertical First Vector concept. Mitch has been describing how it works using OoO for 3 years, it just took me 2 to understand it :)
the multi-issue engine analyses loops, spots the element slots in the in-flight data, and merges them into the same SIMD Reservation Station (if their data width is smaller than that of the SIMD ALUs) or just goes with the flow if the data width is the same as the ALU width.
thus you have to ensure that the total number of available RSes is big enough to cover at least the entire loop, preferably 2x or 3x bigger.
Mitch points out in many many discussions over the past 3 years that the majority of scenarios and algorithms for which Vertical First can be deployed successfully and parallelism exploited through OoO in-flight "RS stuffing" are in fact short loops.
he also points out that when that is not possible the fallback is simple scalar execution, and with a Monster Multi-Issue Engine that scalar execution is going to scream along even without parallelism in most situations.
if however elwidth overrides are deployed and there are no scalar 8-bit or 16-bit ALUs then, yeah, things run sub-optimally.
i can live with that.
>I watched the video you made earlier, and it mostly matches what I
>from vertical-first mode.
yeah, the mistake i made though, you can see i realised it towards the end, is that the VFirst batching has no src/dst-step of its own.
basically VFirst Batching would be:
for i in srcstep .. srcstep+VFHint-1
whilst also running j on dststep... but also skipping masked-out elements...
*then winding back* on the next instruction in order to do srcstep..srcstep+VFHint-1 *again*, and that needs a pair of independent counters (subsrcstep, subdststep) which reset back to srcstep/dststep after each VFirst instruction.
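the batching scheme described above can be sketched roughly as follows. this is an illustrative model only, not the simulator or the spec: the function name vfirst_batch, the bitmask arguments and the choice to collect up to VFHint *unmasked* element pairs are all assumptions made for the sketch. the point it shows is that every instruction in the batch needs its own temporary pair of counters, which start from the architectural srcstep/dststep each time.

```python
def vfirst_batch(srcstep, dststep, vfhint, src_mask, dst_mask, vl):
    """Return the (src, dst) element pairs ONE instruction would process.

    Hypothetical model: subsrcstep/subdststep are temporary copies of
    srcstep/dststep, so the architectural steps are left untouched and
    the next instruction in the loop body "winds back" to the same start.
    Assumption: a batch gathers up to vfhint unmasked element pairs.
    """
    subsrcstep, subdststep = srcstep, dststep
    pairs = []
    while len(pairs) < vfhint:
        # skip masked-out source elements
        while subsrcstep < vl and not (src_mask >> subsrcstep) & 1:
            subsrcstep += 1
        # skip masked-out destination elements
        while subdststep < vl and not (dst_mask >> subdststep) & 1:
            subdststep += 1
        if subsrcstep >= vl or subdststep >= vl:
            break  # ran off the end of the vector
        pairs.append((subsrcstep, subdststep))
        subsrcstep += 1
        subdststep += 1
    return pairs


# two consecutive instructions in the loop body see the same batch,
# because each one restarts from the architectural srcstep/dststep
first_insn = vfirst_batch(0, 0, 2, 0b1011, 0b1111, 4)
second_insn = vfirst_batch(0, 0, 2, 0b1011, 0b1111, 4)
```

even in this simplified form the extra state is visible: two more counters per instruction, on top of srcstep and dststep themselves.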
then, what do you do with the COS coefficients? they're supposed to be scalar. where do you put the 2nd *scalar* coefficient if the VFirst Batch size is 2, or 3? nowhere, it's impossible.
there isn't enough spare space in SVSTATE to add a pair of 7-bit sub-steps, and to be honest it's getting scarily complex.
therefore i think the simplest thing is just to have svstep increment only by ONE, to have the VFirst Mode only do scalar (one element) interaction, and use Mitch's research and OoO multi-issue concepts.
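as a minimal sketch of what "increment only by ONE" means for execution order, the model below compares the usual horizontal-first order with vertical-first. it is a toy: the instruction names are placeholders, predication masks and separate src/dst steps are left out, and the single srcstep variable stands in for whatever svstep advances.

```python
def horizontal_first(program, vl):
    """Each instruction runs over all vl elements before the next starts."""
    return [(insn, el) for insn in program for el in range(vl)]


def vertical_first(program, vl):
    """Each instruction touches ONE element; the step then advances by one.

    Hypothetical model: the increment at the bottom of the loop plays the
    role of svstep, and always moves on by exactly one element.
    """
    order = []
    srcstep = 0
    while srcstep < vl:
        for insn in program:
            order.append((insn, srcstep))  # scalar (one-element) operation
        srcstep += 1  # advance by exactly one
    return order


# placeholder three-instruction loop body
body = ["ld", "mul", "st"]
hf = horizontal_first(body, 2)
vf = vertical_first(body, 2)
```

both orders perform the same set of (instruction, element) operations. in the vertical-first order it is then up to the OoO multi-issue engine to overlap successive passes through the loop body.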