[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Tue Jul 26 06:08:30 BST 2022

lkcl wrote:
> On Mon, Jul 25, 2022 at 6:22 AM Jacob Bachmeyer via Libre-soc-dev 
> <libre-soc-dev at lists.libre-soc.org> wrote:
> >
> > Luke Kenneth Casson Leighton wrote:
> > > On Sun, Jul 24, 2022 at 10:11 PM Luke Kenneth Casson Leighton
> > > <lkcl at lkcl.net>
>
> [...]
>
> fortunately, Paul, the OPF ISA WG Chair, got the concept so fast that 
> he actually started explaining it to *me*! it was a very funny moment.

Well then, it looks like you may actually have a chance to get Simple-V 
adopted after all, if the WG Chair favors it.

> [...]
>
> > > yes, isn't it great? a high-performance implementation can apply
> > > the same trick above, but in the case of SVP64 is not limited to
> > > Memory-only Vectors, it can use registers.
> > >
> >
> > From an ISA perspective, it is not so great: here is a duplicate
> > opcode that has effectively the same function but must be *different* to
> > indicate vector operations.
>
> i'm not totally sure i get you.
>
> in SVP64 we do not actually add any vector opcodes at all. 
> (fascinatingly and paradoxically, neither does Mitch Alsup's VVM 
> extension for MyISA66000, it relies heavily on that loop-construct)
>
> other ISAs *do* add explicit similar opcodes and it results in an out 
> of control proliferation problem.
>
> what opcode(s) are duplicated?

The loop branch.  The SVSTEP instruction is effectively equivalent to an 
ordinary branch instruction.  (In fact, FlexiVec would *use* the 
ordinary BC instruction at the end of a vector loop.)

> > > this can be done only up to batches of 5, safely, and hphint would
> > > be set to 5 to make that clear to the underlying hardware which
> > > performs the in-flight-merging trick described by Mitch Alsup.
> > >
> >
> > The problem is that this in-flight-merging trick can only work in big,
> > complex, OoO microarchitectures.
>
> ok, so some background. Mitch was the designer of the Motorola 88100, 
> AMD K9, AMD's Opteron Series which pissed all over Intel CPUs, and 
> Samsung's new GPU. he stopped working for AMD because the n00b kiddies 
> couldn't comprehend gate-level design and they were getting 
> disrespectful of his expertise. he had enough money having been with 
> the company for so long, and now basically does what he likes, and 
> that happens to be, "design my own ISA and talk about it on comp.arch"
>
> Mitch has been analysing VVM from a gate-level architectural 
> perspective for many years, now, and has very very specifically 
> designed it with *multiple* micro-architectures in mind.
>
> where a micro-architecture does not have OoO then it may instead use 
> SIMD in-order for VVM looping.
>
> where a micro-architecture does not have SIMD it may instead use 
> Scalar for VVM looping.
>
> in other words and this is extremely important the VVM ISA Extension 
> of MyISA66000 *does not in any way* impose require or punish a 
> specific micro-architecture.

It does appear that I have basically drawn inspiration from Simple-V and 
reinvented VVM as "FlexiVec".  :-)

The main sticking point that I see with Simple-V is the way Simple-V 
uses the main register file.

> [...]
>
> > This (introducing a secondary program counter) is likely to be a major
> > sticking point with the OpenPOWER experts.
>
> they'll just have to live with it. i mean, it's not even a new 
> original idea!

There is, of course, the alternative (unless you know something that I 
do not) that they simply reject Simple-V because they decide that the 
secondary program counter costs too many architectural resources.

> Peter Hsu, the designer of the MIPS R8000, came up with the exact same 
> idea back in 1995! even the prefixing, the vector/scalar marking, and 
> regfile number-extending.
>
> the only reason they did not go ahead was because Peter's team 
> recognised that for best performance you need to rely heavily on a 
> wide multi-issue OoO engine, which MIPS simply did not have the 
> inhouse expertise to create one, at the time.

Right, this is the problem I see with Simple-V:  best performance 
requires multi-issue OoO.  FlexiVec can give optimal performance with 
multi-issue OoO or an in-order scalar unit driving a chain of vector 
lanes.  I expect that the latter model can scale up farther (== more 
vector lanes, straight into GPU territory) than a multi-issue design, 
especially if the vector lanes are allowed to lag the control unit in a 
pipeline design, such that lane251 might be executing its part of the 
operation that was issued to lane0 64 cycles ago.

> > > major changes at this point would be... difficult, shall we say.
> > > that said i'm happy to go through this because we have to demonstrate
> > > completeness.
> > >
> >
> > Fair enough.
>
> these kinds of conversations also turn up wonderful gems such as 
> FlexiVec, which i am still getting to grips with, looking like it is a 
> Vertical-First ISA.

If I understand the "Vertical-First" term correctly, FlexiVec is exactly 
so, just like the old Cray "classic" vector model.

> can you express it in pseudocode at all? i like to make sure i 
> properly understand, and these are such subtle complex concepts it is 
> really challenging to be clear.

I will try a program example, in this case adding two arrays of 
integers:  (register numbers made up without reference to standard ABI 
and code untested; I /think/ I have this right)

----

	[address of array A in R3, B in R4, C in R5, length in R6]
	[assemble vector configuration in memory with address in R7...]
	[...declaring R20, R21, R22 as vectors of words]
	fvsetup	R7
	; we are using 32-bit elements, so 4 bytes per step
	li	R10, 4
	; the initial load will start +4 bytes from base
	addi	R3, R3, -4
	addi	R4, R4, -4
	addi	R5, R5, -4
	; load counter
	mtctr	R6
	; begin vector loop
    1:	lwaux	R20, R3, R10
	lwaux	R21, R4, R10
	add	R22, R20, R21
	stwux	R22, R5, R10
	bdnz	1b
	; end vector loop

----

For the null implementation, FlexiVec is simply ordinary scalar 
processing, with the FlexiVec setup instruction ignored.
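
As a sanity check on the listing above, here is a minimal Python sketch 
of what the null implementation computes.  It models only the 
update-form addressing (EA := RA + RB, then RA := EA), which is why the 
base registers are biased by -4 before the loop.  The name 
`scalar_loop`, the dict-based memory, and the example addresses are all 
made up for illustration; sign extension and 32-bit wraparound are 
ignored, and a length of at least 1 is assumed, as in the assembly.

```python
def scalar_loop(mem, a, b, c, length, step=4):
    """Model the loop one element at a time (null implementation).

    mem maps byte addresses to word values; a, b, c stand in for the
    array base addresses in R3, R4, R5 and length for the count in R6.
    """
    r3, r4, r5 = a - step, b - step, c - step  # the three addi instructions
    ctr = length                               # mtctr R6
    while True:
        r3 += step                             # lwaux R20, R3, R10
        r20 = mem[r3]
        r4 += step                             # lwaux R21, R4, R10
        r21 = mem[r4]
        r22 = r20 + r21                        # add   R22, R20, R21
        r5 += step                             # stwux R22, R5, R10
        mem[r5] = r22
        ctr -= 1                               # bdnz decrements CTR ...
        if ctr == 0:                           # ... and falls through at zero
            break
    return mem


# Hypothetical layout: A at 0x100, B at 0x200, C at 0x300, 32 elements each.
mem = {}
for i in range(32):
    mem[0x100 + 4 * i] = i
    mem[0x200 + 4 * i] = 100 * i
scalar_loop(mem, 0x100, 0x200, 0x300, 32)
result = [mem[0x300 + 4 * i] for i in range(32)]
```

Any full implementation has to leave memory in the same state as this 
element-at-a-time model.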

For a full implementation, the program first executes FVSETUP with the 
address of a memory buffer containing the architectural vector 
configuration; the hardware examines this buffer and configures the 
vector execution machinery accordingly.  For a concrete example, assume a Hwacha-like 
multi-lane vector unit with 4 lanes is available.  Three word-element 
vectors are requested and the vector memory is partitioned accordingly.  
Assume this results in MAXVL=20 (must be a multiple of N (lane count) 
since each lane holds a slice of each vector; these numbers are much 
smaller than a practical implementation would be expected to have) and 
the arrays involved are 32 elements each.

FlexiVec is activated by a write to a vector register; in the example 
above, the "lwaux R20" instruction.  Activating FlexiVec clears the 
physical scalar registers configured for vector use; these are 
subsequently used for vector offset tracking and referred to as "pR20", 
"pR21", and "pR22" below.  In this example, each vector lane knows its 
own position VLANE (0,1,2,3 here) in hardware and the scalar unit knows 
that there are N=4 lanes.  (Assume all of the data is in the same 
accessible page for now to avoid virtual memory issues; they are not 
hard to handle but add some complexity to each step; hinted at below.)  
The vector length is VL := MIN(MAXVL, CTR).  Since CTR=32 (32 element 
arrays) and MAXVL=20 in this example, VL is 20 for the first iteration 
through the loop.

For the first load:  The scalar unit distributes a LOAD-STRIDE operation 
to the vector chain with a vector target of V20 (the vector replacing 
R20), the real address derived from the initial value of R3 as base 
address, the value of R10 as the stride length, and the value of pR20 as 
vector offset.  Each vector lane notes the vector offset (and does 
nothing if that offset inhibits the lane; defined as inhibit if 
((offset%N)&&((offset%N)<VLANE))||((offset+VLANE)>=VL); this occurs if 
an operation only partially completes across the vector unit, as when a 
load crosses a page boundary, or on the last pass through the loop if 
some lanes are unneeded), computes EA := base + (stride*(1+VLANE)), and 
issues a load for its EA to be copied to the appropriate element of 
V20.  The scalar unit increments pR20 by N (since N lanes each read one 
element), and computes EA := base + (stride*N) and updates R3 
accordingly.  Since pR20 is less than VL, the program counter advance is 
inhibited and (since N=4 here) the instruction repeated for vector 
offsets 4, 8, 12, 16.  After offset 16 is processed, pR20 == VL, pR20 is 
cleared, and the program counter advances.
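
The address arithmetic for this step can be sketched in Python as 
follows.  This is only an illustration under simplifying assumptions: 
the function name `strided_load` and the array address are hypothetical, 
lane inhibition and page crossings are left out, and the sketch refuses 
any VL that is not a multiple of N because the handling of a partial 
final step is not spelled out here.

```python
def strided_load(base, stride, vl, n_lanes):
    """Replay one LOAD-STRIDE instruction across the lanes.

    Returns (addresses, final_base): addresses[k] is the EA used for
    vector element k, final_base is the value left in the base register.
    """
    if vl % n_lanes:
        raise ValueError("sketch only covers VL that is a multiple of N")
    addresses = []
    offset = 0                       # plays the role of pR20
    while offset < vl:               # PC advance is inhibited until offset == VL
        for vlane in range(n_lanes):
            # each lane: EA := base + stride * (1 + VLANE)
            addresses.append(base + stride * (1 + vlane))
        base += stride * n_lanes     # scalar unit: EA := base + stride * N
        offset += n_lanes            # pR20 += N
    return addresses, base


array_a = 0x1000                     # hypothetical address of array A
addresses, final_base = strided_load(array_a - 4, 4, 20, 4)
```

With the example numbers, element k is fetched from array_a + 4*k and 
the base register ends up pointing at element 19, exactly where 20 
scalar executions of the same update-form load would have left it.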

The second load using R21/V21 proceeds analogously.

For the addition:  The scalar unit distributes an ADD operation to the 
vector chain with a vector target of V22 (the vector replacing R22), 
vector sources of V20 and V21 (analogously), and the value of pR22 as 
vector offset.  Each vector lane notes the vector offset and performs 
the indicated addition.  The scalar unit increments pR22 by N and 
repeats the instruction for vector offsets 4, 8, 12, 16.  After offset 
16 is processed, pR22 == VL, pR22 is cleared, and the program counter 
advances.

The store operation is processed analogously to the earlier loads.  Each 
vector lane computes its effective address for each element as the 
scalar unit walks the base address along.

The loop branch is another bit of magic in FlexiVec:  the scalar unit 
knows that FlexiVec is active, so CTR is decremented by VL=20 instead of 
1.  32-20 = 12 -> CTR, so the branch is taken the first time.

For the next iteration, VL is now 12, since CTR<MAXVL.  Each instruction 
proceeds analogously, with vector offsets 0, 4, 8.  This time VL=12, 
CTR=12, 12 - 12 = 0 -> CTR, so the loop branch is not taken, and 
FlexiVec is deactivated.
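
The strip-mining behaviour of the two passes can be traced with a small 
Python sketch.  The function name `flexivec_passes` is made up for 
illustration, and the model assumes that every instruction in the loop 
body steps through the same vector offsets.

```python
def flexivec_passes(ctr, maxvl, n_lanes):
    """Return one (vl, offsets, ctr_after) tuple per trip through the loop."""
    passes = []
    while ctr > 0:
        vl = min(maxvl, ctr)                    # VL := MIN(MAXVL, CTR)
        offsets = list(range(0, vl, n_lanes))   # vector offsets per instruction
        ctr -= vl                               # loop branch subtracts VL, not 1
        passes.append((vl, offsets, ctr))
    return passes


passes = flexivec_passes(ctr=32, maxvl=20, n_lanes=4)
for vl, offsets, ctr_after in passes:
    print(f"VL={vl:2d} offsets={offsets} CTR after branch={ctr_after}")
```

For the example this gives two passes: VL=20 with offsets 0, 4, 8, 12, 
16 leaving CTR=12, then VL=12 with offsets 0, 4, 8 leaving CTR=0.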

For an out-of-order multi-issue implementation, the vector lanes are 
emulated by issuing the relevant element-wise operations to the 
available execution ports.  Here, N is the number of simultaneous issue 
ports available instead of the number of vector lanes and MAXVL is 
determined by the availability of scratch registers in the OoO 
microarchitecture to hold the vector elements.

> > > so, a completely separate regfile / register-memory area for vectors
> > > from scalars? is that right?
> > >
> >
> > Notionally yes, although the simplest implementation (that does not
> > actually have vector hardware) uses the scalar registers to store the
> > single element allowed in each vector and the scalar ALU to process the
> > elements serially.
>
> it's sounding like a cross between VVM and the ETA-10 (CDC 205).

Some implementations might be.  The idea is that FlexiVec is, well, 
flexible here.

> > > if you're proposing a separate vector regfile / register-memory-area
> > > the downside of that are that you then have to add inter-regfile
> > > transfer
> > > instructions, in between the scalar regfile and the [new] vector
> > > regfile/reg-mem-area.
> > >
> >
> > Nope! Transfers between scalar and vector register files go through
> > memory. (There are a few instructions in x86 SSE(?) for direct
> > transfers between the SIMD and general registers -- they turn out to be
> > significantly slower than using LOAD and STORE operations.)
>
> ah. right. this may be quicker (due to an internal arbitrary 
> micro-architectural decision by intel) but the power consumption is 
> awful. Jeff Bush did power analysis in Nyuzi and he *very 
> specifically* warned that the reason why 3D GPUs have such large 
> regfiles is to make damn sure that workloads are kept to 
> LOAD-PROCESS-STORE.
>
> the moment it becomes LOAD-PARTPROCESS-SPILL-PARTPROCESS-STORE then 
> due to the insanely heavily repeated workloads you end up with a 
> noncompetitive unsaleable product due to its power consumption.
>
> we have to be similarly very very careful.

The idea of FlexiVec for Power ISA is that every operation normally 
available in the Fixed-Point Facility, Floating-Point Facility, and 
Vector Facility (VMX/VSX) [(!!!)] would be available vectorized when 
those facilities are extended using FlexiVec.  (Yes, in theory, FlexiVec 
could extend VSX too!)

> > Not counting the vector setup (which would be CSRs on RISC-V and
> > probably SPRs on OpenPOWER) I tentatively believe that FlexiVec could
> > require *zero* additional instructions.
>
> yes this is the beauty of Vertical-First.

Correction:  3 (maybe 4) additional instructions, requiring 2 encodings, 
since two of them are privileged (FlexiVec context save/restore) and one 
(FlexiVec setup) is only available in problem state (which is Power 
ISA's name for user mode).  FlexiVec setup is a subset of FlexiVec 
context restore and FlexiVec is only available in problem state, to 
prevent chaos when register definitions are changed out from underneath 
system code.  The tentative concept is that the architecture would 
define a fixed configuration layout which is a prefix of an 
implementation-defined context structure.  Live migration between 
different implementations is accomplished by waiting for the program to 
finish using FlexiVec before migrating it.  (Processors capable of live 
migration would need to provide some way to trap on the branch that sets 
CTR to zero and ends the FlexiVec loop.  I think the existing Debug 
and/or Performance Monitor Facilities *might* be able to do this.  
Embedded SOCs would not need this subfeature because live migration is 
not practical in their environments.)

> > > in other words, i took the concept of "Sub-PC" very seriously and
> > > treated it literally as part of the [absolutely] critical Context, aka
> > > a peer of PC and MSR.
> > >
> >
> > If nested traps are possible, the trap handler still must preserve
> > SVSRR1 somewhere.
>
> right next to preserving SRR0 (copy of PC) and SRR1 (copy of MSR).
>
> once called, trap handlers must *not* let the exception mask bit go 
> low until they have saved SRR0/SRR1/SVSRR1 somewhere.
>
> hypervisor mode has to have corresponding HSRR0, HSRR1 (and now 
> HSVSRR1) because it *can* interrupt a [supervisor] trap. nested.
>
> this is all standard fare, it has all been in place literally for 
> decades, now. SVSTATE and SVSRR1 (and HSVSRR1) therefore literally get 
> a "free ride" off the back of an existing 
> astonishingly-well-documented spec and associated implementation.

There is still an incremental software cost.  To be fair, FlexiVec has 
similar costs, since it also adds thread context.  FlexiVec, however, 
can be ignored by the system unless a task switch is to be performed, so 
the runtime cost is very slightly lower.

> > > i thought about it, and realised that it made Register Hazard
> > > Management
> > > for Multi-Issue OoO designs really, *really* complicated.
> > >
> >
> > Actually, POWER already has a loop counter register CTR, so the
> > incremental cost of using that cannot be too high.
>
> right. ok. CTR goes into its own separate Hazard Management. transfers 
> between the GPRs and CTR (mtspr, mfspr) are explicit instructions that 
> allow clean GPR-SPR Hazard interaction in OoO RaW/WaR tables.
>
> it's... complicated. RISC-V's "simplicity" such as not having 
> Condition Codes has led people to believe everything can be done "real 
> simple". fact is that Intel AMD and IBM have Condition Codes and 
> Special Purpose SPRs (like CTR) for *really good reasons* which start 
> to matter in high performance designs.

IBM may have known what they were doing, but I am fairly sure that Intel 
and AMD are stuck with condition codes because the original 8086 used a 
FLAGS register to control conditional branches.  If x86 condition codes 
are useful like that, I am convinced that Intel had a lucky guess all 
those years ago.

On the other hand, I view RISC-V as an experimental architecture in "how 
simple can we make it?" and I am uncertain if we would even have 
OpenPOWER if RISC-V did not exist as competition.

> bottom line: Patterson has a hell of a lot to answer for.
>
> > Effectively adding bits to PC widens those internal buses and registers;
> > this may have far-reaching consequences in actual hardware, possibly
> > extending critical propagation delays and therefore the minimum
> > cycle time.
>
> PC and MSR, both 64 bit, are already carried around as "state". in the 
> TestIssuer design i added DEC and TB to that as well (interrupt 
> counters for watchdogs). that's 256 bits. adding 64 more is not such a 
> hardship and can be stored in 1R1W SRAM.
>
> carrying around state in SRAMs is pretty normal for high performance 
> designs. i have the architectural design guide for the 88100, very 
> kindly sent by Mitch Alsup, i can pass on to you if you're interested.

It will be an interesting bit of history to read some day if nothing else.

> [...]
> > Agreed. Consider the proposal to change the Simple-V execution model
> > withdrawn and tentatively offered for comment as an early draft for a
> > second vector execution proposal, FlexiVec.
>
> appreciated! it... i... it's difficult to truly express to people how 
> hard this stuff really is. IBM's internal engineers tried expressing 
> it to Hugh Blemings, many years ago. he thought he understood it when 
> they said, simply, "Hugh: hardware design is HARD". several years 
> later he realised they were talking several orders of magnitude out of 
> sync with what he'd imagined :)

This does not change my views on Simple-V; just that Simple-V is too far 
along in development to meaningfully change at this point.

> > On another note, having had a bit more time to examine the Simple-V
> > document, I propose splitting the additional scalar operations in Part
> > III into a separate "New Instructions for Parallel Applications"
> > proposal. Most (maybe all?) of them should be able to stand on their
> > own, without requiring the Simple-V pipeline extensions.
>
> yes. that's stated (in pretty much exactly those words) right at the 
> top. section 3:
>
> https://libre-soc.org/openpower/sv/
>
> i repeat it 4 times, i have just altered the wording slightly. they 
> *are* very much with almost no exceptions at all designed for 
> scalar-only use, oh and by a not-coincidence-at-all happen to have 
> uses when Vectorised.
>
> ah. i know. Part III has no introductory preamble. i think what you're 
> missing is that the *entirety* of Part III is Scalar-independent and 
> does not require any of SVP64!
>
> i'll add a preamble chapter.

I suggest splitting the document.  Put Simple-V and its instructions in 
one document and the SVP64-independent instructions in a separate 
proposal -- or multiple proposals.  Break the huge block into more 
manageable chunks.

> > Consider it this way: SVP64 significantly alters the execution pipeline
>
> no, it sits *between* issue and decode. ah sorry, i think you used 
> "execution" to refer to the whole chain, where normally "execute" is 
> used to refer to one phase of that chain.

OK, "processing" pipeline.  Yes, you are correct that "Execution" is 
also normally a pipeline stage, so that was a poor choice of words.

> [...]
>
> > Having just now obtained a copy of the OpenPOWER v3.1B spec and having
> > barely begun reading its almost 1600 pages, I have already found
> > something that might be a problem for Simple-V: all but the lowest
> > compliance levels already have VMX/VSX as a required feature.
>
> ah.
>
> rright.
>
> deep breath.
>
> yes. RV64GC is 96 instructions. Linux Compliancy Level (equivalent) is 
> 950. this is just plain stupid, no need to sugar coat it.
>
> IBM is only now, after 2 years of me banging on about it, and, some 
> confidential stuff i can't tell you about until it's published, just 
> beginning to realise that their incremental 25 *YEARS* of lead on 
> Power ISA gives them an accidental and unintentional myopic view of 
> how much is really truly involved in Power ISA implementation.
> putting it bluntly they *genuinely* thought it was perfectly fine to 
> smash 1,000 instructions in peoples' faces and expect them to get on 
> with it. after all, they managed to do it, right? we have linux 
> running perfectly fine, right? so what's the problem, again?

Eh, I am fairly sure that Linux was adapted to POWER, not the other way 
around.  :-)

> what they completely overlooked was that they got to 1,000 
> instructions in an *incremental* fashion over a 25 year period.
> (VSX was originally only VMX and that was added as far back as *2003*).
>
> nobody else has IBM's resources *and there are no reference 
> implementations* [A2O and A2I are Scalar v2.06/8 from *15* years ago]
>
> if you put say a scant 2 days per instruction including unit tests and 
> Compliance Test Suite Validation which honestly is barely enough then 
> multiply that by 750 (total MANDATORY for Linux Compliancy with VSX on 
> top of SFFS which is 214) it comes out to 1,500 **DAYS** just to add VSX.
>
> divide by 250 and you have a jaw-dropping six YEARS of development 
> effort at a crash-course speed that will leave the engineers 
> desperately exhausted, uncomfortable, underconfident and highly likely 
> to leave the team long before those 750 instructions are complete.
>
> i made these insights very clear multiple times and the message seems 
> to have gotten through.
>
> what probably did it though was the fact that Microwatt, A2O and A2I 
> are all in exactly the same position: they are all on-track for SFS 
> and SFFS Compliancy Level and absolutely nowhere near Linux Compliancy 
> Level.

Wait... are SFS and SFFS abbreviations for "Scalar Fixed" and "Scalar 
Float"?  If so, then BE is mandatory for those after all, and there is a 
*different* (32-bit) Linux port that runs on the SFS/SFFS platform, 
separate from the Linux Compliancy Subset, which is for 64-bit LE 
Linux.  (Confused yet?)

> IBM had *no idea* how serious a problem this is, and up until about a 
> year ago was continuing a pathological submission of "#ifdef POWER9" 
> patches upstream to libc6 and major software packages, putting in 
> *even more* VSX dependency behind "#ifdef POWER9" all of which has to 
> be ripped out.
>
> so what is happening instead (at last) is that the message has been 
> finally understood, future patches will be "#ifdef VSX" and "#ifdef 
> MMA" respecting the *Compliancy* Level, NOT the IBM product make/model 
> number. Tulio's libc6 patch dealing with that has landed, finally, 
> about 6 months ago.

Really, the GNU libc maintainers should have been pushing back on that 
-- the GNU policy is supposed to be to test features rather than 
processor models.

> also totally missing is official ABI documents for SFS and SFFS!
> (what's ~100% likely going to happen there is that automatically 
> whatever "-mnovsx" does, that's what will end up in those documents)

There may also not *be* a standard ABI, if those are intended for 
embedded systems, which do not need a standard ABI.

> this message has also gotten through after banging on about it for 18 
> months, but mostly it was the difficulty that LibreBMC had which 
> brought the point home, there.
>
> the IBM LibreBMC team did recompile everything with "-mnovsx 
> -mnoaltivec" and for the most part this worked except for QFP and 
> except for when Tulio's patch wasn't added upstream.

Is this the POWER9 service processor I have heard about that runs its 
very own Linux-based system, such that, after power-on, the machine 
"plays dead" for about a minute while the service processor boots?

> Paul did add a subset of VSX to Microwatt (increasing the LUT4 count 
> by a whopping 50%) in order to get libc6 to run when it was still 
> "#ifdef POWER9" but every few months if he tried again he would find 
> *yet more* patches had added *yet more* VSX instructions and in some 
> rare cases i think gcc violated the "-mnovsx" rule!
>
> bottom line here is that the lack of enthusiastic adoption of Power 
> ISA after its release (Nov 2019!) combined with insights from a lot of 
> people all saying the same thing, has finally gotten through, and 
> there is quiet scrambling going on behind the scenes which i don't 
> know about so can't even tell you (we're not OPF Members), to fix this 
> and make SFFS an actual proper Linux OS peer.
>
> without needing soft-emulation of VSX.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/lib/sstep.c?id=e0dccc3b76fb35bb257b4118367a883073d7390e
>
> i appreciate that's a hell of a lot of context and backstory, but this 
> is a big project :)

I always appreciate the historical view.

-- Jacob