[Libre-soc-dev] SVP64 bclrl

Thu Apr 7 05:02:32 BST 2022

On April 7, 2022 2:24:57 AM UTC, lkcl <luke.leighton at gmail.com> wrote:
>
>
>On April 6, 2022 11:50:56 PM UTC, Jacob Lifshay
><programmerjake at gmail.com> wrote:

>>used as a condition for the scalar branch...in other words the full
>>vector
>>loop would complete (up through VL, not stopping early) and only then
>>would
>>the branching part of the operation run. 

>it is all pretty mind-melting.

so, what you describe is exactly how i would advocate the instruction to be designed, *if* in fact we were designing a Vector ISA. an explicit opcode would be allocated, etc etc.

but we are not designing a Vector ISA

we are designing something that can "use and abuse the internal microarchitecture of a high performance supercomputing multi-issue out-of-order design" [without its designers throwing their toys out the pram / freaking out / frowning as they thoughtfully puff their pipes / delete analogy as appropriate]

a precise multi-issue OoO Scalar Execution engine will typically have, say, 8 Branch Computation Units. ordinarily this would allow it to run up to 8 hot-loops speculatively, as in-flight computation.

in some fashion there will be a Transitive "Shadow Matrix" where branch 0 throws a "speculative shadow" across branches 1 thru 7, branch 1 across 2 thru 7, ... branch 6 only across branch 7.

* if branch0 is mispredicted then it *and all other* branches are cancelled.
* if branch1 is mispredicted then branch0 at least went ahead but 1 thru 7 are cancelled

you get the idea.
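
the cancellation rule above can be sketched in a few lines of Python. this is only an illustrative software model, not a description of any real hardware: the names `shadow_matrix` and `cancelled` are hypothetical, and the 8-unit width is just the example figure used above.

```python
N_BRANCH_UNITS = 8  # example width from the text, not a fixed requirement


def shadow_matrix(n=N_BRANCH_UNITS):
    """shadow[i][j] is True when branch i casts a shadow over branch j.

    Branch i shadows every later branch (j > i) and nothing else,
    so the matrix is strictly upper-triangular.
    """
    return [[j > i for j in range(n)] for i in range(n)]


def cancelled(mispredicted, n=N_BRANCH_UNITS):
    """Return the sorted list of branches cancelled by the given mispredictions.

    A mispredicted branch is cancelled itself, along with every
    branch that sits under its shadow.
    """
    shadow = shadow_matrix(n)
    out = set()
    for i in mispredicted:
        out.add(i)
        out.update(j for j in range(n) if shadow[i][j])
    return sorted(out)


if __name__ == "__main__":
    print(cancelled([0]))  # every branch is cancelled
    print(cancelled([1]))  # branch 0 survives, 1 thru 7 are cancelled
    print(cancelled([6]))  # only branches 6 and 7 are cancelled
```

with several mispredictions at once the earliest one dominates, because its shadow already covers all the later branches.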

note: up to this point there has been NO MENTION of SVP64 at all. i am merely describing how a STANDARD multi-issue OoO system works.

here's the kicker:

    SVP64 Branches are specifically designed to spam
    such an 8-way multi-issue Engine (having 8 Branch
    Units) with a massive hit of *8* branches (one per
    element) in a *single* clock cycle, with virtually no
    modifications required to the Branch Units
    themselves.

an Architect unfamiliar with SVP64 will at first have a bit of an "oink" moment at that, but once the SVP64 paradigm sinks in an epiphany lightbulb will start to glow.

there are some subtle changes needed to support the LRu option, CTR mode, Fail-First mode, and predication, but the whole point is *NOT* to force the Hardware Engineers to have to create an entire new pipeline or consider having to throw away everything they've done and start again from scratch with a completely alien Architectural Paradigm.

the "naive" (explicit) Vector Branch operation would force them to do exactly that, and i would expect them to put their foot down and say, emphatically, "No".

even CTR Mode when sz=0 is not hard to do in OoO multi-issue, because each Branch Unit can be passed multiple CR Fields and multiple Predicate bits plus the current value of CTR and, using count-leading-ones, compute the future value of CTR deterministically.

this is a little bit more tricky than just passing in an LRu bit from the operation alongside LK, but it is not insurmountable and, because it is deterministic based solely on register values available at multi-issue time, can truly be done in parallel.
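
a rough Python sketch of why that determinism allows parallel evaluation follows. it is a deliberately simplified model under loudly stated assumptions, not the SVP64 specification: it assumes CTR is decremented once per predicated-in element, that masked-out elements are skipped without touching CTR, and that element 0 is the least-significant predicate bit. it also swaps in a prefix population count where the text names count-leading-ones, and it ignores CTR reaching zero. all function names are hypothetical.

```python
def ctr_sequential(ctr, pred_mask, vl):
    """Reference model: walk the elements strictly in order."""
    seen = []
    for i in range(vl):
        if (pred_mask >> i) & 1:
            ctr -= 1            # assumption: one decrement per active element
            seen.append(ctr)
        else:
            seen.append(None)   # assumption: skipped element leaves CTR alone
    return seen, ctr


def ctr_for_unit(ctr, pred_mask, i):
    """What unit i can work out alone from issue-time register values."""
    if not (pred_mask >> i) & 1:
        return None
    lower = pred_mask & ((1 << (i + 1)) - 1)   # predicate bits 0..i
    return ctr - bin(lower).count("1")          # prefix population count


def ctr_parallel(ctr, pred_mask, vl):
    """Every unit computes independently; no unit waits on another."""
    seen = [ctr_for_unit(ctr, pred_mask, i) for i in range(vl)]
    final = ctr - bin(pred_mask & ((1 << vl) - 1)).count("1")
    return seen, final


if __name__ == "__main__":
    # both models should agree for any initial CTR and predicate mask
    print(ctr_sequential(10, 0b10110101, 8))
    print(ctr_parallel(10, 0b10110101, 8))
```

each unit needs only the initial CTR, the predicate mask and its own element index, so no unit has to wait on the result of an earlier one.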

l.