[Libre-soc-dev] svp64 review

Mon Jul 25 15:16:11 BST 2022

On Mon, Jul 25, 2022 at 6:22 AM Jacob Bachmeyer via Libre-soc-dev <libre-soc-dev at lists.libre-soc.org> wrote:
>
> Luke Kenneth Casson Leighton wrote:
> > On Sun, Jul 24, 2022 at 10:11 PM Luke Kenneth Casson Leighton <lkcl at lkcl.net>

> > the LOOP preamble instruction helps identify all loop-invariant
> > registers plus identifies the counter register.
> >
> > it's extremely neat.
> >  
>
> It is neat, and it forms a basis (using slightly different terms) for
> what my second message termed "FlexiVec" which essentially amounts to
> hardware loop unrolling across a vector ALU array.

ahhh.

do you know how long it took me to understand VVM? 18 months! :)

i just couldn't get it, until i derived it independently
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/sv_horizontal_vs_vertical.svg;hb=HEAD

at which point and only at which point i was able to *recognise* that VVM was the same concept.

> The catch is that I suspect most of the OpenPOWER experts are likely to
> have a similar reaction to mine when they first read that.  By the end
> of chapter 4, I understood better (to see "Vertical Vector Mode" as a
> form of hardware loop unrolling) 

it's... i have tried explaining it on comp.arch, many times. Mitch tried explaining it to me multiple times, i got embarrassed and stopped asking.

i did this youtube video
https://m.youtube.com/watch?v=fn2KJvWyBKg
and put it in front of people and they *still* do not understand.

i put that SVG image in front of them and they still do not understand.

at some point you just have to stop and go ahead without them.

fortunately, Paul, the OPF ISA WG Chair, got the concept so fast that he actually started explaining it to *me*! it was a very funny moment.

> but I chose to leave that comment stand
> because I expect that impression could be a problem for you if the
> OpenPOWER experts are less curious than myself.

with it literally taking 8-12 months to know enough about Power ISA to be able to be reasonably confident with it, i am simply going to tell them (not you) up-front to "suck it up".

["it took a year for me to be familiar with Power ISA, i have been advising you for 18 months to begin the process of getting familiar with SVP64, if you haven't done so then please don't try to make that *my* problem"]

people cannot reasonably expect our team to be instant experts in Power ISA, which has 25 years of history (and 1,600 pages) but then expect SVP64 to be "instantly absorbable material".

they need to be patient and persistent.  i can't spend NLnet's money repeatedly doing more and more repetitions of the documentation saying the same things over and over, it means we have more to maintain apart from anything.

with Paul fully understanding the concept, i'm hoping he'll take the time to explain it, or at least mitigate needing me to be so blunt. only thing is, Paul has limited time.

> > yes, isn't it great? a high-performance implementation can apply
> > the same trick above, but in the case of SVP64 is not limited to
> > Memory-only Vectors, it can use registers.
> >  
>
>  From an ISA perspective, it is not so great:  here is a duplicate
> opcode that has effectively the same function but must be *different* to
> indicate vector operations.

i'm not totally sure i get you.

in SVP64 we do not actually add any vector opcodes at all.  (fascinatingly and paradoxically, neither does Mitch Alsup's VVM extension for MyISA66000, it relies heavily on that loop-construct)

other ISAs *do* add explicit similar opcodes and it results in an out of control proliferation problem.

what opcode(s) are duplicated?

> > this can be done only up to batches of 5, safely, and hphint would
> > be set to 5 to make that clear to the underlying hardware which
> > performs the in-flight-merging trick described by Mitch Alsup.
> >  
>
> The problem is that this in-flight-merging trick can only work in big,
> complex, OoO microarchitectures.  

ok, so some background.  Mitch was the designer of the Motorola 88100, AMD K9, AMD's Opteron Series which pissed all over Intel CPUs, and Samsung's new GPU. he stopped working for AMD because the n00b kiddies couldn't comprehend gate-level design and they were getting disrespectful of his expertise. he had enough money having been with the company for so long, and now basically does what he likes, and that happens to be, "design my own ISA and talk about it on comp.arch"

Mitch has been analysing VVM from a gate-level architectural perspective for many years, now, and has very very specifically designed it with *multiple* micro-architectures in mind.

where a micro-architecture does not have OoO then it may instead use SIMD in-order for VVM looping.

where a micro-architecture does not have SIMD it may instead use Scalar for VVM looping.

in other words, and this is extremely important: the VVM ISA Extension of MyISA66000 *does not in any way* impose, require or punish a specific micro-architecture.

and SVP64 (or, SimpleV), is the same.
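
a rough illustration of that micro-architecture independence, as a toy Python model rather than anything from VVM or SVP64 themselves. the function name `execute_loop` and the `lanes` parameter are made up for this sketch. the same element loop is run on back-ends of different widths, and the architecturally visible result is the same every time:

```python
# Toy model only: "lanes" is a hypothetical stand-in for how many elements
# a given micro-architecture can process per batch. It is not an ISA field.

def execute_loop(a, b, lanes):
    """Add two vectors element-wise, processing `lanes` elements per batch.

    lanes == 1 models a plain scalar core; lanes > 1 models a SIMD or
    wide multi-issue back-end. The program (the loop) is identical.
    """
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    out = [0] * len(a)
    for base in range(0, len(a), lanes):
        # Real hardware would do this batch in parallel; the model
        # simply iterates, which gives the same visible result.
        for i in range(base, min(base + lanes, len(a))):
            out[i] = a[i] + b[i]
    return out


a = [1, 2, 3, 4, 5, 6, 7]
b = [10] * 7
scalar = execute_loop(a, b, lanes=1)
simd = execute_loop(a, b, lanes=4)
print(scalar == simd)
```

the point of the sketch is only that the width is an implementation choice: nothing in the loop itself names it.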

> The alternate model that I am now
> tentatively calling "FlexiVec" due to its hardware flexibility can
> instead work all the way from actual scalar processors (that effectively
> have MAXVL=1) to simple in-order scalar units driving a vector unit
> chain to complex out-of-order systems that use multiple issue ports and
> parallel ALU pipelines to emulate a vector unit.

yes.  this is the beauty of a Vertical-First ISA.  there's a couple of unit tests using it if you're interested

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;h=98dcbad5c3cfc9dc410f4ed407d256af88387754;hb=c42523e2cb7b0f95fe7a2da58689ebea3a4a2f85#l566

hm that's probably the best "demo", all others are lowlevel unit tests

> Then it is a very bad analogy for your use.  The x86 REP prefix does not
> introduce a secondary program counter:  it works by inhibiting the
> normal program counter advance until a condition is met, causing (at
> least notionally) the same opcode to be read and executed repeatedly.

yes.  ta-daa, you got it.  that's exactly how srcstep and dststep, of SVSTATE, work.  the normal program counter is "inhibited" whilst the sub-steps are advanced, and the same opcode is read and executed repeatedly.

this is literally the definition of "Horizontal-First" Mode.

you have got it perfectly.
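
the stepping described above can be sketched as a toy Python model. this is an assumption-laden illustration, not the SVP64 simulator: the function names, the `regs` dictionary and the single `step` value (standing in for both srcstep and dststep) are all hypothetical. Vertical-First is included for contrast, with the step advance folded into the outer loop for brevity.

```python
# Toy model: names are illustrative and not taken from the SVP64 spec.

def run_horizontal_first(program, vl, regs):
    """PC is held on one instruction while the element step runs 0..VL-1."""
    trace = []
    pc = 0
    while pc < len(program):
        for step in range(vl):       # sub-step advances, PC is "inhibited"
            program[pc](regs, step)  # same opcode executed repeatedly
            trace.append((pc, step))
        pc += 1                      # only now does the PC move on
    return trace


def run_vertical_first(program, vl, regs):
    """PC walks the whole loop body for one element, then the step advances."""
    trace = []
    for step in range(vl):
        for pc, op in enumerate(program):
            op(regs, step)
            trace.append((pc, step))
    return trace


def double(regs, i):
    regs["b"][i] = regs["a"][i] * 2


def add_one(regs, i):
    regs["c"][i] = regs["b"][i] + 1


def fresh_regs():
    return {"a": [1, 2, 3], "b": [0, 0, 0], "c": [0, 0, 0]}


r1, r2 = fresh_regs(), fresh_regs()
t1 = run_horizontal_first([double, add_one], 3, r1)
t2 = run_vertical_first([double, add_one], 3, r2)
print(t1)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(t2)  # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```

both orders produce the same final registers for this simple element-wise program; only the (pc, step) visiting order differs.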

> This (introducing a secondary program counter) is likely to be a major
> sticking point with the OpenPOWER experts.

they'll just have to live with it.  i mean, it's not even a new original idea! 
Peter Hsu, the designer of the MIPS R8000, came up with the exact same idea back in 1995! even the prefixing, the vector/scalar marking, and regfile number-extending.

the only reason they did not go ahead was because Peter's team recognised that for best performance you need to rely heavily on a wide multi-issue OoO engine, which MIPS simply did not have the in-house expertise to create at the time.

> > major changes at this point would be... difficult, shall we say.
> > that said i'm happy to go through this because we have to demonstrate
> > completeness.
> >  
>
> Fair enough.

these kinds of conversations also turn up wonderful gems such as FlexiVec, which i am still getting to grips with, and which looks like it is a Vertical-First ISA.

can you express it in pseudocode at all? i like to make sure i properly understand, and these are such subtle complex concepts it is really challenging to be clear.

> > so, a completely separate regfile / register-memory area for vectors
> > from scalars?  is that right?
> >  
>
> Notionally yes, although the simplest implementation (that does not
> actually have vector hardware) uses the scalar registers to store the
> single element allowed in each vector and the scalar ALU to process the
> elements serially.

it's sounding like a cross between VVM and the ETA-10 (CDC 205).

> > if you're proposing a separate vector regfile / register-memory-area
> > the downside of that are that you then have to add inter-regfile transfer
> > instructions, in between the scalar regfile and the [new] vector
> > regfile/reg-mem-area.
> >  
>
> Nope!  Transfers between scalar and vector register files go through
> memory.  (There are a few instructions in x86 SSE(?) for direct
> transfers between the SIMD and general registers -- they turn out to be
> significantly slower than using LOAD and STORE operations.)

ah. right.  this may be quicker (due to an internal arbitrary micro-architectural decision by intel) but the power consumption is awful. Jeff Bush did power analysis in Nyuzi and he *very specifically* warned that the reason why 3D GPUs have such large regfiles is to make damn sure that workloads are kept to LOAD-PROCESS-STORE.

the moment it becomes LOAD-PARTPROCESS-SPILL-PARTPROCESS-STORE then due to the insanely heavily repeated workloads you end up with a noncompetitive unsaleable product due to its power consumption.

we have to be similarly very very careful.

> Not counting the vector setup (which would be CSRs on RISC-V and
> probably SPRs on OpenPOWER) I tentatively believe that FlexiVec could
> require *zero* additional instructions.

yes this is the beauty of Vertical-First.

> > in other words, i took the concept of "Sub-PC" very seriously and
> > treated it literally as part of the [absolutely] critical Context, aka
> > a peer of PC and MSR.
> >  
>
> If nested traps are possible, the trap handler still must preserve
> SVSRR1 somewhere.

right next to preserving SRR0 (copy of PC) and SRR1 (copy of MSR).

once called, trap handlers must *not* let the exception mask bit go low until they have saved SRR0/SRR1/SVSRR1 somewhere.

hypervisor mode has to have corresponding HSRR0, HSRR1 (and now HSVSRR1) because it *can* interrupt a [supervisor] trap. nested.

this is all standard fare, it has all been in place literally for decades, now. SVSTATE and SVSRR1 (and HSVSRR1) therefore literally get a "free ride" off the back of an existing astonishingly-well-documented spec and associated implementation.
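
a minimal sketch of why the handler has to save those registers before another trap can be taken. this is a toy Python model, not Power ISA behaviour in detail: the `ToyCore` class, its method names and the handler addresses are hypothetical, and it assumes SVSRR1 simply receives a copy of SVSTATE in the same way SRR0/SRR1 receive PC/MSR.

```python
# Toy model of nested traps. Only the save/restore ordering is illustrated;
# MSR bit changes on trap entry and the hypervisor registers are left out.

class ToyCore:
    def __init__(self, pc, msr, svstate):
        self.pc, self.msr, self.svstate = pc, msr, svstate
        self.srr0 = self.srr1 = self.svsrr1 = None
        self.saved = []  # stands in for the handler's stack in memory

    def trap(self, handler_pc):
        # "Hardware" side: snapshot the live context. A second trap
        # overwrites this snapshot, which is why handlers must save it.
        self.srr0, self.srr1, self.svsrr1 = self.pc, self.msr, self.svstate
        self.pc = handler_pc

    def save_context(self):
        # Handler side: must happen before further traps are allowed.
        self.saved.append((self.srr0, self.srr1, self.svsrr1))

    def return_from_trap(self):
        self.srr0, self.srr1, self.svsrr1 = self.saved.pop()
        self.pc, self.msr, self.svstate = self.srr0, self.srr1, self.svsrr1


def nested_trap_demo():
    core = ToyCore(pc=0x1000, msr=0b1, svstate=7)
    core.trap(0x700)         # first trap
    core.save_context()
    core.svstate = 3         # handler uses its own (hypothetical) vector state
    core.trap(0x900)         # nested trap overwrites SRR0/SRR1/SVSRR1
    core.save_context()
    core.return_from_trap()  # back in the first handler
    inner = (core.pc, core.msr, core.svstate)
    core.return_from_trap()  # back in the interrupted program
    outer = (core.pc, core.msr, core.svstate)
    return inner, outer


print(nested_trap_demo())
```

skipping the first `save_context()` in this model would leave the original PC, MSR and SVSTATE unrecoverable once the nested trap fires, which is the whole reason for the rule.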

> > i thought about it, and realised that it made Register Hazard Management
> > for Multi-Issue OoO designs really, *really* complicated.
> >  
>
> Actually, POWER already has a loop counter register CTR, so the
> incremental cost of using that cannot be too high.

right. ok.  CTR goes into its own separate Hazard Management.  transfers between the GPRs and CTR (mtspr, mfspr) are explicit instructions that allow clean GPR-SPR Hazard interaction in OoO RaW/WaR tables.

it's... complicated. RISC-V's "simplicity" such as not having Condition Codes has led people to believe everything can be done "real simple". fact is that Intel, AMD and IBM have Condition Codes and Special Purpose SPRs (like CTR) for *really good reasons* which start to matter in high performance designs.

bottom line: Patterson has a hell of a lot to answer for.

> Effectively adding bits to PC widens those internal buses and registers;
> this may have far-reaching consequences in actual hardware, possibly
> extending critical propagation delays and therefore the minimum cycle time.

PC and MSR, both 64 bit, are already carried around as "state".  in the TestIssuer design i added DEC and TB to that as well (interrupt counters for watchdogs).  that's 256 bits.  adding 64 more is not such a hardship and can be stored in 1R1W SRAM.

carrying around state in SRAMs is pretty normal for high performance designs.  i have the architectural design guide for the 88100, very kindly sent by Mitch Alsup, i can pass on to you if you're interested.

the other thing to note is that Microwatt is ENORMOUS compared to any 32-bit non-MMU, non-RADIX, non-TLB-aware-L1-Cache RISC-V implementation.

you can do a 32-bit non-MMU RISC-V core in about 3,000 LUT4s in an FPGA.

a 64 bit RADIX MMU Power ISA core you are pushing your luck trying to get it into 25,000 LUT4s, and that's without an FPU (+8,000 more) and without a partial VSX implementation (another 50% increase).

this is not a "shrinking violet", "designed-by-academics" ISA, it's a full-on in-yer-face Supercomputing-class ISA with 25 years back-story.

> Agreed.  Consider the proposal to change the Simple-V execution model
> withdrawn and tentatively offered for comment as an early draft for a
> second vector execution proposal, FlexiVec.

appreciated!  it... i... it's difficult to truly express to people how hard this stuff really is. IBM's internal engineers tried expressing it to Hugh Blemings, many years ago.  he thought he understood it when they said, simply, "Hugh: hardware design is HARD". several years later he realised they were talking several orders of magnitude out of sync with what he'd imagined :)

> On another note, having had a bit more time to examine the Simple-V
> document, I propose splitting the additional scalar operations in Part
> III into a separate "New Instructions for Parallel Applications"
> proposal.  Most (maybe all?) of them should be able to stand on their
> own, without requiring the Simple-V pipeline extensions.

yes.  that's stated (in pretty much exactly those words) right at the top.  section 3:

     https://libre-soc.org/openpower/sv/

i repeat it 4 times, i have just altered the wording slightly.  they *are* very much with almost no exceptions at all designed for scalar-only use, oh and by a not-coincidence-at-all happen to have uses when Vectorised.

ah. i know. Part III has no introductory preamble.  i think what you're missing is that the *entirety* of Part III is Scalar-independent and does not require any of SVP64!

i'll add a preamble chapter.

> Consider it this way:  SVP64 significantly alters the execution pipeline

no, it sits *between* issue and decode.  ah sorry, i think you used "execution" to refer to the whole chain, where normally "execute" is used to refer to one phase of that chain.

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/svp64-primer/img/power_pipelines.svg;hb=HEAD

the important takeaway from that diagram is that, actually, the alteration really is not that significant.  this is borne out in TestIssuer, a FSM-based HDL implementation.

> and adds to the processor context.  

yes.

> Most of the scalar instructions
> proposed are additional ALU operations, orthogonal to Simple-V proper.

yes.  this is very deliberate.

> I suspect that the latter, having much less far-reaching effects on
> processor design, will be easier to convince the OpenPOWER experts to adopt.

that's the general idea.  allows for incremental proposals.

> Having just now obtained a copy of the OpenPOWER v3.1B spec and having
> barely begun reading its almost 1600 pages, I have already found
> something that might be a problem for Simple-V:  all but the lowest
> compliance levels already have VMX/VSX as a required feature.

ah.

rright.

deep breath.

yes.  RV64GC is 96 instructions.  Linux Compliancy Level (equivalent) is 950.  this is just plain stupid, no need to sugar coat it.

IBM is only now, after 2 years of me banging on about it, and, some confidential stuff i can't tell you about until it's published, just beginning to realise that their incremental 25 *YEARS* of lead on Power ISA gives them an accidental and unintentional myopic view of how much is really truly involved in Power ISA implementation.

putting it bluntly they *genuinely* thought it was perfectly fine to smash 1,000 instructions in people's faces and expect them to get on with it. after all, they managed to do it, right? we have linux running perfectly fine, right? so what's the problem, again?

what they completely overlooked was that they got to 1,000 instructions in an *incremental* fashion over a 25 year period.
(VSX was originally only VMX and that was added as far back as *2003*).

nobody else has IBM's resources *and there are no reference implementations* [A2O and A2I are Scalar v2.06/8 from *15* years ago]

if you put say a scant 2 days per instruction including unit tests and Compliance Test Suite Validation which honestly is barely enough then multiply that by 750 (total MANDATORY for Linux Compliancy with VSX on top of SFFS which is 214) it comes out to 1,500 **DAYS** just to add VSX.

divide by 250 and you have a jaw-dropping six YEARS of development effort at a crash-course speed that will leave the engineers desperately exhausted, uncomfortable, underconfident and highly likely to leave the team long before those 750 instructions are complete.
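
the arithmetic above, written out as a short Python calculation. the figures are the ones quoted in this message; reading "250" as working days per year is an assumption, and the function and constant names are made up for the sketch.

```python
# Back-of-envelope estimate using the numbers quoted in the email.
DAYS_PER_INSTRUCTION = 2      # including unit tests and compliance validation
VSX_INSTRUCTIONS = 750        # mandatory on top of SFFS for Linux Compliancy
WORKING_DAYS_PER_YEAR = 250   # assumption: interpretation of "divide by 250"


def vsx_effort(instructions=VSX_INSTRUCTIONS,
               days_each=DAYS_PER_INSTRUCTION,
               days_per_year=WORKING_DAYS_PER_YEAR):
    """Return (total engineer-days, engineer-years) for the given workload."""
    total_days = instructions * days_each
    return total_days, total_days / days_per_year


days, years = vsx_effort()
print(days, years)  # 1500 6.0
```

that is for a single engineer working serially; the estimate scales linearly with either input, so even one day per instruction would still be three years.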

i made these insights very clear multiple times and the message seems to have gotten through.

what probably did it though was the fact that Microwatt, A2O and A2I are all in exactly the same position: they are all on-track for SFS and SFFS Compliancy Level and absolutely nowhere near Linux Compliancy Level.

IBM had *no idea* how serious a problem this is, and up until about a year ago was continuing a pathological submission of "#ifdef POWER9" patches upstream to libc6 and major software packages, putting in *even more* VSX dependency behind "#ifdef POWER9" all of which has to be ripped out.

so what is happening instead (at last) is that the message has been finally understood, future patches will be "#ifdef VSX" and "#ifdef MMA" respecting the *Compliancy* Level, NOT the IBM product make/model number.  Tulio's libc6 patch dealing with that has landed, finally, about 6 months ago.

also totally missing is official ABI documents for SFS and SFFS!
(what's ~100% likely going to happen there is that automatically whatever "-mnovsx" does, that's what will end up in those documents)

this message has also gotten through after banging on about it for 18 months, but mostly it was the difficulty that LibreBMC had which brought the point home, there.

the IBM LibreBMC team did recompile everything with "-mnovsx -mnoaltivec" and for the most part this worked except for QFP and except for when Tulio's patch wasn't added upstream.

Paul did add a subset of VSX to Microwatt (increasing the LUT4 count by a whopping 50%) in order to get libc6 to run when it was still "#ifdef POWER9" but every few months if he tried again he would find *yet more* patches had added *yet more* VSX instructions and in some rare cases i think gcc violated the "-mnovsx" rule!

bottom line here is that the lack of enthusiastic adoption of Power ISA after its release (Nov 2019!) combined with insights from a lot of people all saying the same thing, has finally gotten through, and there is quiet scrambling going on behind the scenes which i don't know about so can't even tell you (we're not OPF Members), to fix this and make SFFS an actual proper Linux OS peer.

without needing soft-emulation of VSX.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/lib/sstep.c?id=e0dccc3b76fb35bb257b4118367a883073d7390e

i appreciate that's a hell of a lot of context and backstory, but this is a big project :)

l.