[Libre-soc-dev] Vector Supercomputing ISA and 3D GPU resources

Wed Sep 15 19:28:08 BST 2021

On September 14, 2021 10:01:48 PM UTC, lkcl <luke.leighton at gmail.com> wrote:

>that's one plus [in a specialist area] and everything else is minuses.

one extremely important insight came from observing and analysing many different ISAs, including Larrabee (the precursor to AVX-512) and ARM SVE.

the actual difference at the microarchitectural (hardware implementation) level between Vector ISAs and SIMD ISAs is

      negligible.

i repeat, in case the significance is missed:

     the ***EXACT*** same underlying ***HARDWARE ARCHITECTURE*** may be deployed for a SIMD ISA as for a Vector ISA.

the difference then is in a *micro-coding* layer, which *translates* Front-End Vector ISA instructions into operations on the *underlying* Back-End SIMD microarchitecture.
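
a rough python sketch of what such a translation layer does, purely illustrative: the names (`SIMD_WIDTH`, `simd_add`, `vector_add`) and the 4-lane back-end width are assumptions made up for this example, not taken from any real design. one front-end vector add of length VL is issued as a series of fixed-width back-end SIMD operations, with a lane mask covering the tail:

```python
SIMD_WIDTH = 4  # assumed number of back-end SIMD lanes (hypothetical)


def simd_add(a, b, mask):
    """One back-end SIMD operation: fixed width, per-lane mask."""
    return [x + y if m else None for x, y, m in zip(a, b, mask)]


def vector_add(va, vb, vl):
    """Front-end vector add of length vl, split into back-end SIMD ops.

    Returns the result elements and the number of SIMD ops issued.
    """
    result = []
    issued = 0
    for start in range(0, vl, SIMD_WIDTH):
        active = min(SIMD_WIDTH, vl - start)
        mask = [lane < active for lane in range(SIMD_WIDTH)]
        # pad the tail chunk so the back-end always sees full-width operands
        a = (va[start:start + active] + [0] * SIMD_WIDTH)[:SIMD_WIDTH]
        b = (vb[start:start + active] + [0] * SIMD_WIDTH)[:SIMD_WIDTH]
        lanes = simd_add(a, b, mask)
        result.extend(lanes[:active])  # masked-off tail lanes are dropped
        issued += 1
    return result, issued


if __name__ == "__main__":
    out, ops = vector_add(list(range(10)), [10] * 10, 10)
    print(out, ops)
```

the back-end function is the same one a SIMD ISA would drive directly; only the loop in front of it differs.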

Swizzle, Permute, Shuffle, predication: these are all present in the world's leading SIMD-based *and* Vector ISAs, and consequently drive the underlying back-end hardware decision making.

thus, when i said "SIMD ISAs have virtually no modern benefits except in ultra specific domains" i did ***NOT*** also say "Hardware implementations and Microarchitectures BEHIND those SIMD ISAs must therefore also be piles of steaming dog-poo"

far from it, i totally respect the amazing level of expertise that has gone into POWER8, 9 and 10 at the hardware level.

there is unfortunately however one well-known widely-discussed key thing missing from VSX Packed SIMD, compared to AVX-512 SIMD and ARM SVE:

      VSX Packed SIMD entirely lacks predication.

when originally designed (2003?) VSX was a significant innovation, but because of its lack of predication, which cannot easily be retrofitted, it is now falling behind the innovations by Intel, ARM, and others.
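
for anyone unfamiliar with what predication buys you, here is a minimal python sketch (function names are hypothetical, and this models behaviour only, not any particular ISA's encoding). with predication, one instruction writes only the lanes whose predicate bit is set. without it, the same effect needs every lane computed first and then a separate select/merge step:

```python
def predicated_add(dest, a, b, pred):
    """Predicated add: only lanes with the predicate bit set are written."""
    return [x + y if p else d for d, x, y, p in zip(dest, a, b, pred)]


def add_then_select(dest, a, b, pred):
    """No predication: compute all lanes, then merge in a second step."""
    full = [x + y for x, y in zip(a, b)]                         # step 1: every lane
    return [f if p else d for d, f, p in zip(dest, full, pred)]  # step 2: select


if __name__ == "__main__":
    dest = [0, 0, 0, 0]
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    pred = [True, False, True, False]
    print(predicated_add(dest, a, b, pred))
    print(add_then_select(dest, a, b, pred))
```

both give the same answer; the second simply takes an extra operation and a temporary to get there.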

[i am aware that IBM has had requests from customers to add predication to Packed SIMD.  this will be extremely challenging].

rather than attempt to fix Packed SIMD (750 instructions, now you have to add 750 *more* to add predication, *and* then also add a Predicate Register File *and* predicate opcodes, just like they did in Larrabee? i don't think so)

(Tom Forsyth's video on Larrabee)
https://m.youtube.com/watch?v=DfJOt9iLqH4

rather than do that, we chose to create SVP64 on top of *only 214* instructions.

which would you choose to design and implement, if doing your own Vector Processor, starting from scratch?

214 scalar + 750 Packed SIMD + another 750 for predicated variants of SIMD

or

214 scalar + Prefix Format + 3 or 4 SV opcodes

in my mind, it's real easy math, there.
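
spelled out as a trivial python sum, using only the counts quoted above (the Prefix Format is an encoding rather than a set of opcodes, so this sketch assumes it adds nothing to the count):

```python
SCALAR = 214            # scalar instructions, as quoted above
PACKED_SIMD = 750       # Packed SIMD instructions, as quoted above
PREDICATED_SIMD = 750   # predicated variants of the SIMD instructions
SV_OPCODES = (3, 4)     # "3 or 4 SV opcodes"

# option 1: scalar + Packed SIMD + predicated Packed SIMD
packed_simd_total = SCALAR + PACKED_SIMD + PREDICATED_SIMD

# option 2: scalar + Prefix Format (counted as zero opcodes) + SV opcodes
svp64_totals = [SCALAR + n for n in SV_OPCODES]

if __name__ == "__main__":
    print(packed_simd_total)  # 1714
    print(svp64_totals)       # [217, 218]
```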

l.