[Libre-soc-dev] [RFC] SVP64 Vertical-First Mode, batch processing

lkcl luke.leighton at gmail.com
Thu Aug 12 13:21:44 BST 2021

since adding Vertical-First Mode, which is very cool, a lot simpler to add into compilers, and closer to Mitch Alsup's MyISA 66000 Virtual Vectors, the implications have taken some time to sink in.

VF Mode does *not* increment srcstep/dststep automatically on running an instruction: srcstep/dststep *remain where they are*.  an explicit instruction, svstep, is called to increment src/dststep, followed by a branch-conditional test of whether VL has been reached.  if it has not, the branch loops back over a BATCH of instructions to do the next element(s).
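a minimal sketch of that loop shape, in python rather than Power ISA assembler.  the names `run_vertical_first`, `batch` and `trace` are made up for illustration, a single `step` stands in for both srcstep and dststep, and VL is assumed to be at least 1:

```python
def run_vertical_first(batch, vl):
    """Model of a Vertical-First loop, one element per instruction.

    batch: list of instruction names making up the loop body (hypothetical)
    vl:    vector length, assumed >= 1
    Returns the (instruction, element) pairs in execution order.
    """
    step = 0          # stands in for srcstep/dststep
    trace = []
    while True:
        # every instruction in the batch works on element `step` only:
        # running an instruction does NOT move step along
        for op in batch:
            trace.append((op, step))
        step += 1     # the explicit svstep
        if step >= vl:  # branch-conditional test: has VL been reached?
            break
    return trace
```

the point to notice is the ordering: all instructions of the batch run on element 0 before any of them runs on element 1, which is the opposite of Horizontal-First.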

the next logical evolution on that is: do you allow just the one element per instruction to be executed? or do you allow up to a certain explicit set limit?

in Mitch Alsup's MyISA 66000 it is entirely up to the hardware to determine and decide that "batch size".

the idea being: for very simple hardware, the batch size (number of elements executed per instruction) is definitely one.  this means that the VVM Loop is basically very similar to Power ISA Branch CTR automatic decrementing.  this is also the "fallback" position for complex hardware if it cannot determine it can do multiple elements safely.

more complex hardware in MyISA 66000 can use OoO in-flight buffers.  the caveat: the VVM loop has to be short enough that the engine can analyse the entire loop (a couple of cache lines), and establish that even memory accesses inside the loop are "safe", and from that work out the element batch size, which, obviously, has to be fixed for the ENTIRE loop.

(it's no good executing 3 elements of the vector for the first instruction then doing 5 for the next, you are guaranteed data corruption that way)

the limitations: you can't do branches inside the loop, you can't call functions, and the only way to get Vectors per se is to use memory LD/STs.  for most situations this is perfectly fine, for us it's not.

also, critically relying on an OoO engine to determine the batch size, i am not happy with that.

so the initial idea is, to have a "Batch Hint" size, very similar to VL.  the compiler informs the hardware "you can safely do up to this many elements per instruction, please tell me exactly how many you CAN do".

ironically you should recognise that as the EXACT same rules for Cray Vectors setvl!
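as a sketch of that request/grant handshake (the function name `grant_batch` and the `hw_limit` parameter are hypothetical, and taking the minimum is my reading of "up to this many", not a defined instruction):

```python
def grant_batch(requested, hw_limit):
    """Compiler asks for up to `requested` elements per instruction;
    hardware answers with how many it will actually do.

    hw_limit is a hypothetical stand-in for whatever the implementation
    can safely issue in one go.
    """
    return min(requested, hw_limit)
```

so a request of 8 on hardware that can manage 3 comes back as 3, and a request of 2 on hardware that can manage 4 comes back as 2.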

here's where it gets complicated, given how far along we are.

i initially thought, "we need a new hint SPR, like VL and MAXVL, called VFHintLen". this hint would be completely separate from VL and MVL, still within the limits of VL and MVL.

     VFHintLen <= VL <= MVL

and you execute batches of length VFHintLen until hitting VL

however what i have just come to realise is: actually, VFHintLen is redundant.... *if VL is made to do its job*.

in Horizontal-First Mode we have:

* MVL set to max reservation (statically determined by compiler)
* VL set dynamically at runtime to explicit value
* loops go from 0 to VL-1

in VF Mode currently it is:

* MVL set to max reservation (statically determined by compiler)
* VL set dynamically at runtime to explicit value
* VFHint *requested* but is set to hw limit
* VFHint elements are run in batches limited by VL

example, MVL=12, VL=10, VFHint=3:

* first time round a loop
   elements 0 1 2 are executed in parallel
* svstep called, src/dststep incremented by VFHint
* second loop
   elements 3 4 5 executed in parallel
* svstep called, src/dst incremented to 6
* third loop
   elements 6 7 8 executed in parallel
* svstep called, src/dst incremented to 9
* fourth loop
   ONLY element 9 executed because VL=10
* svstep sets CR0 to 1 to indicate "src/dst exceeds VL"
* Branch-Conditional fails, loop is exited
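the walkthrough above can be sketched as follows.  `vf_batches` is a made-up helper name, and the model assumes svstep always adds VFHint to the step and that the final batch is clipped at VL, as in the fourth loop:

```python
def vf_batches(vl, vfhint):
    """Return the element indices executed on each trip round the loop.

    vl:     vector length (loop ends once the step reaches it)
    vfhint: number of elements hardware runs per instruction
    """
    batches = []
    step = 0                              # srcstep/dststep
    while step < vl:
        count = min(vfhint, vl - step)    # last batch is clipped by VL
        batches.append(list(range(step, step + count)))
        step += vfhint                    # svstep: incremented by VFHint
    return batches

# MVL=12, VL=10, VFHint=3 from the example: MVL is not consulted at all
print(vf_batches(10, 3))
```

MVL does not appear anywhere in the function: under this scheme it plays no part in deciding the batches.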

notice how MVL was wasted, there?

what i *believe* we may be able to do is: do without VFHint and use *VL and MVL instead*.

example, in Vertical-First mode:

* MVL would be set to 10 (as an immediate)
* VL would be *requested* to be set to a given
   dynamic value, but would be set to a value that
   HARDWARE determines it can cope with
* proceed same as above but src/dst step test against **VL** not VFHint and
* svstep tests a limit against **MVL** not VL.

basically all testing of the limit of src/dststep right now is:

    if srcstep < VL
         srcstep increments

i propose this change to:

     if HorizontalFirst
          if srcstep < VL
               srcstep increments
     else if VerticalFirst
          if srcstep < *MAXVL*
               srcstep increments
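the same proposal as a small runnable sketch.  `step_limit` and `advance` are illustrative names, and leaving srcstep unchanged when the test fails is an assumption, since the pseudocode only says what happens when it passes:

```python
def step_limit(vertical_first, vl, maxvl):
    """Pick which register bounds srcstep under the proposal."""
    return maxvl if vertical_first else vl

def advance(srcstep, vertical_first, vl, maxvl):
    """Increment srcstep only while it is below the mode's limit.

    Assumption: when the test fails, srcstep is left where it is.
    """
    if srcstep < step_limit(vertical_first, vl, maxvl):
        return srcstep + 1
    return srcstep
```

with VL=4 and MAXVL=10, a srcstep of 4 stays put in Horizontal-First but moves on to 5 in Vertical-First, because there the bound is MAXVL.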

questions, comments?

