[Libre-soc-dev] svp64 review and "FlexiVec" alternative
jcb62281 at gmail.com
Tue Jul 26 06:08:30 BST 2022
> On Mon, Jul 25, 2022 at 6:22 AM Jacob Bachmeyer via Libre-soc-dev
> <libre-soc-dev at lists.libre-soc.org> wrote:
> > Luke Kenneth Casson Leighton wrote:
> > > On Sun, Jul 24, 2022 at 10:11 PM Luke Kenneth Casson Leighton
> <lkcl at lkcl.net>
> fortunately, Paul, the OPF ISA WG Chair, got the concept so fast that
> he actually started explaining it to *me*! it was a very funny moment.
Well then, it looks like you may actually have a chance to get Simple-V
adopted after all, if the WG Chair favors it.
> > > yes, isn't it great? a high-performance implementation can apply
> > > the same trick above, but in the case of SVP64 is not limited to
> > > Memory-only Vectors, it can use registers.
> > >
> > From an ISA perspective, it is not so great: here is a duplicate
> > opcode that has effectively the same function but must be *different* to
> > indicate vector operations.
> i'm not totally sure i get you.
> in SVP64 we do not actually add any vector opcodes at all.
> (fascinatingly and paradoxically, neither does Mitch Alsup's VVM
> extension for MyISA66000, it relies heavily on that loop-construct)
> other ISAs *do* add explicit similar opcodes and it results in an out
> of control proliferation problem.
> what opcode(s) are duplicated?
The loop branch. The SVSTEP instruction is effectively equivalent to an
ordinary branch instruction. (In fact, FlexiVec would *use* the
ordinary BC instruction at the end of a vector loop.)
> > > this can be done only up to batches of 5, safely, and hphint would
> > > be set to 5 to make that clear to the underlying hardware which
> > > performs the in-flight-merging trick described by Mitch Alsup.
> > >
> > The problem is that this in-flight-merging trick can only work in big,
> > complex, OoO microarchitectures.
> ok, so some background. Mitch was the designer of the Motorola 88100,
> AMD K9, AMD's Opteron Series which pissed all over Intel CPUs, and
> Samsung's new GPU. he stopped working for AMD because the n00b kiddies
> couldn't comprehend gate-level design and they were getting
> disrespectful of his expertise. he had enough money having been with
> the company for so long, and now basically does what he likes, and
> that happens to be, "design my own ISA and talk about it on comp.arch"
> Mitch has been analysing VVM from a gate-level architectural
> perspective for many years, now, and has very very specifically
> designed it with *multiple* micro-architectures in mind.
> where a micro-architecture does not have OoO then it may instead use
> SIMD in-order for VVM looping.
> where a micro-architecture does not have SIMD it may instead use
> Scalar for VVM looping.
> in other words and this is extremely important the VVM ISA Extension
> of MyISA66000 *does not in any way* impose require or punish a
> specific micro-architecture.
It does appear that I have basically drawn inspiration from Simple-V and
reinvented VVM as "FlexiVec". :-)
The main sticking point that I see with Simple-V is the way Simple-V
uses the main register file.
> > This (introducing a secondary program counter) is likely to be a major
> > sticking point with the OpenPOWER experts.
> they'll just have to live with it. i mean, it's not even a new
> original idea!
There is, of course, the alternative (unless you know something that I
do not) that they simply reject Simple-V because they decide that the
secondary program counter is too many architectural resources to allocate.
> Peter Hsu, the designer of the MIPS R8000, came up with the exact same
> idea back in 1995! even the prefixing, the vector/scalar marking, and
> regfile number-extending.
> the only reason they did not go ahead was because Peter's team
> recognised that for best performance you need to rely heavily on a
> wide multi-issue OoO engine, which MIPS simply did not have the
> in-house expertise to create at the time.
Right, this is the problem I see with Simple-V: best performance
requires multi-issue OoO. FlexiVec can give optimal performance with
multi-issue OoO or an in-order scalar unit driving a chain of vector
lanes. I expect that the latter model can scale up farther (== more
vector lanes, straight into GPU territory) than a multi-issue design,
especially if the vector lanes are allowed to lag the control unit in a
pipeline design, such that lane 251 might be executing its part of the
operation that was issued to lane 0 some 64 cycles earlier.
> > > major changes at this point would be... difficult, shall we say.
> > > that said i'm happy to go through this because we have to demonstrate
> > > completeness.
> > >
> > Fair enough.
> these kinds of conversations also turn up wonderful gems such as
> FlexiVec, which i am still getting to grips with, looking like it is a
> Vertical-First ISA.
If I understand the "Vertical-First" term correctly, FlexiVec is exactly
so, just like the old Cray "classic" vector model.
> can you express it in pseudocode at all? i like to make sure i
> properly understand, and these are such subtle complex concepts it is
> really challenging to be clear.
I will try a program example, in this case adding two arrays of
integers: (register numbers made up without reference to standard ABI
and code untested; I /think/ I have this right)
[address of array A in R3, B in R4, C in R5, length in R6]
[assemble vector configuration in memory with address in R7...]
[...declaring R20, R21, R22 as vectors of words]
; we are using 32-bit elements, so 4 bytes per step
li R10, 4
; the initial load will start +4 bytes from base
addi R3, R3, -4
addi R4, R4, -4
addi R5, R5, -4
; load loop counter into CTR
mtctr R6
; begin vector loop
1: lwaux R20, R3, R10
lwaux R21, R4, R10
add R22, R20, R21
stwux R22, R5, R10
bdnz 1b
; end vector loop (ordinary CTR branch)
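For reference, the scalar meaning of the loop above can be sketched in
Python; this is an illustrative model only (invented names, not part of
the proposal):

```python
# Illustrative scalar model of the FlexiVec example loop:
# C[i] = A[i] + B[i] over 32-bit word elements.
def add_arrays(A, B):
    # lwaux/stwux step each base pointer by 4 bytes per element;
    # here we simply index element-wise, wrapping each sum to
    # 32 bits as the hardware add would.
    return [(a + b) & 0xFFFFFFFF for a, b in zip(A, B)]
```

The null implementation executes exactly this computation, one element
per pass through the loop.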
For the null implementation, FlexiVec is simply ordinary scalar
processing, with the FlexiVec setup instruction ignored.
For a full implementation, the program first executes FVSETUP with the
address of a memory buffer containing the architectural vector
configuration, which hardware examines to configure the vector
execution machinery. For a concrete example, assume a Hwacha-like
multi-lane vector unit with 4 lanes is available. Three word-element
vectors are requested and the vector memory is partitioned accordingly.
Assume this results in MAXVL=20 (must be a multiple of N (lane count)
since each lane holds a slice of each vector; these numbers are much
smaller than a practical implementation would be expected to have) and
the arrays involved are 32 elements each.
FlexiVec is activated by a write to a vector register; in the example
above, the "lwaux R20" instruction. Activating FlexiVec clears the
physical scalar registers configured for vector use; these are
subsequently used for vector offset tracking and referred to as "pR20",
"pR21", and "pR22" below. In this example, each vector lane knows its
own position VLANE (0,1,2,3 here) in hardware and the scalar unit knows
that there are N=4 lanes. (Assume all of the data is in the same
accessible page for now to avoid virtual memory issues; they are not
hard to handle but add some complexity to each step; hinted at below.)
The vector length is VL := MIN(MAXVL, CTR). Since CTR=32 (32 element
arrays) and MAXVL=20 in this example, VL is 20 for the first iteration
through the loop.
For the first load: The scalar unit distributes a LOAD-STRIDE operation
to the vector chain with a vector target of V20 (the vector replacing
R20), the real address derived from the initial value of R3 as base
address, the value of R10 as the stride length, and the value of pR20 as
vector offset. Each vector lane notes the vector offset (and does
nothing if that offset inhibits the lane, defined as inhibit if
((offset % N) && ((offset % N) < VLANE)) || ((offset + VLANE) >= VL);
this occurs if an operation only partially completes across the vector
unit, as when a load crosses a page boundary, or on the last pass
through the loop if some lanes are unneeded), computes
EA := base + (stride*(1+VLANE)), and
issues a load for its EA to be copied to the appropriate element of
V20. The scalar unit increments pR20 by N (since N lanes each read one
element), and computes EA := base + (stride*N) and updates R3
accordingly. Since pR20 is less than VL, the program counter advance is
inhibited and (since N=4 here) the instruction is repeated for vector
offsets 4, 8, 12, 16. After offset 16 is processed, pR20 == VL, pR20 is
cleared, and the program counter advances.
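The inhibit predicate and per-lane effective-address calculation above
can be modelled in Python (lane_inhibited and lane_ea are invented
names; this is a sketch of the rule as described, not an
implementation):

```python
def lane_inhibited(offset, vlane, N, VL):
    # A lane sits out when a prior partial completion left the
    # offset mid-group (offset % N nonzero and below this lane's
    # position VLANE), or when the lane would run past the vector
    # length VL.
    partial = (offset % N) != 0 and (offset % N) < vlane
    past_end = (offset + vlane) >= VL
    return partial or past_end

def lane_ea(base, stride, vlane):
    # EA := base + (stride * (1 + VLANE)), as described above.
    return base + stride * (1 + vlane)
```

With N=4 and VL=20 as in the example, no lane is inhibited at offsets
0 through 16; a non-multiple-of-N VL would inhibit the trailing lanes
on the final group.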
The second load using R21/V21 proceeds analogously.
For the addition: The scalar unit distributes an ADD operation to the
vector chain with a vector target of V22 (the vector replacing R22),
vector sources of V20 and V21 (analogously), and the value of pR22 as
vector offset. Each vector lane notes the vector offset and performs
the indicated addition. The scalar unit increments pR22 by N and
repeats the instruction for vector offsets 4, 8, 12, 16. After offset
16 is processed, pR22 == VL, pR22 is cleared, and the program counter
advances.
The store operation is processed analogously to the earlier loads. Each
vector lane computes its effective address for each element as the
scalar unit walks the base address along.
The loop branch is another bit of magic in FlexiVec: the scalar unit
knows that FlexiVec is active, so CTR is decremented by VL=20 instead of
1. 32-20 = 12 -> CTR, so the branch is taken the first time.
For the next iteration, VL is now 12, since CTR<MAXVL. Each instruction
proceeds analogously, with vector offsets 0, 4, 8. This time VL=12,
CTR=12, 12 - 12 = 0 -> CTR, so the loop branch is not taken, and
FlexiVec is deactivated.
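The CTR arithmetic of the loop branch can be sketched as follows
(hypothetical Python model; flexivec_trip_counts is an invented name):

```python
def flexivec_trip_counts(total_elements, maxvl):
    # Each pass processes VL = MIN(MAXVL, CTR) elements, and the
    # loop branch decrements CTR by VL instead of 1; the branch
    # falls through (ending the loop) when CTR reaches zero.
    ctr, per_pass = total_elements, []
    while ctr > 0:
        vl = min(maxvl, ctr)
        per_pass.append(vl)
        ctr -= vl
    return per_pass
```

With the example's CTR=32 and MAXVL=20 this yields passes of 20 and
then 12 elements, covering all 32.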
For an out-of-order multi-issue implementation, the vector lanes are
emulated by issuing the relevant element-wise operations to the
available execution ports. Here, N is the number of simultaneous issue
ports available instead of the number of vector lanes and MAXVL is
determined by the availability of scratch registers in the OoO
microarchitecture to hold the vector elements.
> > > so, a completely separate regfile / register-memory area for vectors
> > > from scalars? is that right?
> > >
> > Notionally yes, although the simplest implementation (that does not
> > actually have vector hardware) uses the scalar registers to store the
> > single element allowed in each vector and the scalar ALU to process the
> > elements serially.
> it's sounding like a cross between VVM and the ETA-10 (CDC 205).
Some implementations might be. The idea is that FlexiVec is, well,
flexible.
> > > if you're proposing a separate vector regfile / register-memory-area
> > > the downside of that are that you then have to add inter-regfile
> > > instructions, in between the scalar regfile and the [new] vector
> > > regfile/reg-mem-area.
> > >
> > Nope! Transfers between scalar and vector register files go through
> > memory. (There are a few instructions in x86 SSE(?) for direct
> > transfers between the SIMD and general registers -- they turn out to be
> > significantly slower than using LOAD and STORE operations.)
> ah. right. this may be quicker (due to an internal arbitrary
> micro-architectural decision by intel) but the power consumption is
> awful. Jeff Bush did power analysis in Nyuzi and he *very
> specifically* warned that the reason why 3D GPUs have such large
> regfiles is to make damn sure that workloads are kept to
> LOAD-PROCESS-STORE.
> the moment it becomes LOAD-PARTPROCESS-SPILL-PARTPROCESS-STORE then
> due to the insanely heavily repeated workloads you end up with a
> noncompetitive unsaleable product due to its power consumption.
> we have to be similarly very very careful.
The idea of FlexiVec for Power ISA is that every operation normally
available in the Fixed-Point Facility, Floating-Point Facility, and
Vector Facility (VMX/VSX) [(!!!)] would be available vectorized when
those facilities are extended using FlexiVec. (Yes, in theory, FlexiVec
could extend VSX too!)
> > Not counting the vector setup (which would be CSRs on RISC-V and
> > probably SPRs on OpenPOWER) I tentatively believe that FlexiVec could
> > require *zero* additional instructions.
> yes this is the beauty of Vertical-First.
Correction: 3 (maybe 4) additional instructions, requiring 2 encodings,
since two of them are privileged (FlexiVec context save/restore) and one
(FlexiVec setup) is only available in problem state (which is Power
ISA's name for user mode). FlexiVec setup is a subset of FlexiVec
context restore and FlexiVec is only available in problem state, to
prevent chaos when register definitions are changed out from underneath
system code. The tentative concept is that the architecture would
define a fixed configuration layout which is a prefix of an
implementation-defined context structure. Live migration between
different implementations is accomplished by waiting for the program to
finish using FlexiVec before migrating it. (Processors capable of live
migration would need to provide some way to trap on the branch that sets
CTR to zero and ends the FlexiVec loop. I think the existing Debug
and/or Performance Monitor Facilities *might* be able to do this.
Embedded SOCs would not need this subfeature because live migration is
not practical in their environments.)
> > > in other words, i took the concept of "Sub-PC" very seriously and
> > > treated it literally as part of the [absolutely] critical Context, aka
> > > a peer of PC and MSR.
> > >
> > If nested traps are possible, the trap handler still must preserve
> > SVSRR1 somewhere.
> right next to preserving SRR0 (copy of PC) and SRR1 (copy of MSR).
> once called, trap handlers must *not* let the exception mask bit go
> low until they have saved SRR0/SRR1/SVSRR1 somewhere.
> hypervisor mode has to have corresponding HSRR0, HSRR1 (and now
> HSVSRR1) because it *can* interrupt a [supervisor] trap. nested.
> this is all standard fare, it has all been in place literally for
> decades, now. SVSTATE and SVSRR1 (and HSVSRR1) therefore literally get
> a "free ride" off the back of an existing
> astonishingly-well-documented spec and associated implementation.
There is still an incremental software cost. To be fair, FlexiVec has
similar costs, since it also adds thread context. FlexiVec, however,
can be ignored by the system unless a task switch is to be performed, so
the runtime cost is very slightly lower.
> > > i thought about it, and realised that it made Register Hazard
> > > for Multi-Issue OoO designs really, *really* complicated.
> > >
> > Actually, POWER already has a loop counter register CTR, so the
> > incremental cost of using that cannot be too high.
> right. ok. CTR goes into its own separate Hazard Management. transfers
> between the GPRs and CTR (mtspr, mfspr) are explicit instructions that
> allow clean GPR-SPR Hazard interaction in OoO RaW/WaR tables.
> it's... complicated. RISC-V's "simplicity" such as not having
> Condition Codes has led people to believe everything can be done "real
> simple". fact is that Intel AMD and IBM have Condition Codes and
> Special Purpose SPRs (like CTR) for *really good reasons* which start
> to matter in high performance designs.
IBM may have known what they were doing, but I am fairly sure that Intel
and AMD are stuck with condition codes because the original 8086 used a
FLAGS register to control conditional branches. If x86 condition codes
turn out to be useful like that, I am convinced it was a lucky guess on
Intel's part all those years ago.
On the other hand, I view RISC-V as an experimental architecture in "how
simple can we make it?" and I am uncertain if we would even have
OpenPOWER if RISC-V did not exist as competition.
> bottom line: Patterson has a hell of a lot to answer for.
> > Effectively adding bits to PC widens those internal buses and registers;
> > this may have far-reaching consequences in actual hardware, possibly
> > extending critical propagation delays and therefore the minimum
> cycle time.
> PC and MSR, both 64 bit, are already carried around as "state". in the
> TestIssuer design i added DEC and TB to that as well (interrupt
> counters for watchdogs). that's 256 bits. adding 64 more is not such a
> hardship and can be stored in 1R1W SRAM.
> carrying around state in SRAMs is pretty normal for high performance
> designs. i have the architectural design guide for the 88100, very
> kindly sent by Mitch Alsup, i can pass on to you if you're interested.
It will be an interesting bit of history to read some day if nothing else.
> > Agreed. Consider the proposal to change the Simple-V execution model
> > withdrawn and tentatively offered for comment as an early draft for a
> > second vector execution proposal, FlexiVec.
> appreciated! it... i... it's difficult to truly express to people how
> hard this stuff really is. IBM's internal engineers tried expressing
> it to Hugh Blemings, many years ago. he thought he understood it when
> they said, simply, "Hugh: hardware design is HARD". several years
> later he realised they were talking several orders of magnitude out of
> sync with what he'd imagined :)
This does not change my views on Simple-V; just that Simple-V is too far
along in development to meaningfully change at this point.
> > On another note, having had a bit more time to examine the Simple-V
> > document, I propose splitting the additional scalar operations in Part
> > III into a separate "New Instructions for Parallel Applications"
> > proposal. Most (maybe all?) of them should be able to stand on their
> > own, without requiring the Simple-V pipeline extensions.
> yes. that's stated (in pretty much exactly those words) right at the
> top. section 3:
> i repeat it 4 times, i have just altered the wording slightly. they
> *are* very much with almost no exceptions at all designed for
> scalar-only use, oh and by a not-coincidence-at-all happen to have
> uses when Vectorised.
> ah. i know. Part III has no introductory preamble. i think what you're
> missing is that the *entirety* of Part III is Scalar-independent and
> does not require any of SVP64!
> i'll add a preamble chapter.
I suggest splitting the document. Put Simple-V and its instructions in
one document and the SVP64-independent instructions in a separate
proposal -- or multiple proposals. Break the huge block into more
manageable pieces.
> > Consider it this way: SVP64 significantly alters the execution pipeline
> no, it sits *between* issue and decode. ah sorry, i think you used
> "execution" to refer to the whole chain, where normally "execute" is
> used to refer to one phase of that chain.
OK, "processing" pipeline. Yes, you are correct that "Execution" is
also normally a pipeline stage, so that was a poor choice of words.
> > Having just now obtained a copy of the OpenPOWER v3.1B spec and having
> > barely begun reading its almost 1600 pages, I have already found
> > something that might be a problem for Simple-V: all but the lowest
> > compliance levels already have VMX/VSX as a required feature.
> deep breath.
> yes. RV64GC is 96 instructions. Linux Compliancy Level (equivalent) is
> 950. this is just plain stupid, no need to sugar coat it.
> IBM is only now, after 2 years of me banging on about it, and, some
> confidential stuff i can't tell you about until it's published, just
> beginning to realise that their incremental 25 *YEARS* of lead on
> Power ISA gives them an accidental and unintentional myopic view of
> how much is really truly involved in Power ISA implementation.
> putting it bluntly they *genuinely* thought it was perfectly fine to
> smash 1,000 instructions in peoples' faces and expect them to get on
> with it. after all, they managed to do it, right? we have linux
> running perfectly fine, right? so what's the problem, again?
Eh, I am fairly sure that Linux was adapted to POWER, not the other way
around.
> what they completely overlooked was that they got to 1,000
> instructions in an *incremental* fashion over a 25 year period.
> (VSX was originally only VMX and that was added as far back as *2003*).
> nobody else has IBM's resources *and there are no reference
> implementations* [A2O and A2I are Scalar v2.06/8 from *15* years ago]
> if you put say a scant 2 days per instruction including unit tests and
> Compliance Test Suite Validation which honestly is barely enough then
> multiply that by 750 (total MANDATORY for Linux Compliancy with VSX on
> top of SFFS which is 214) it comes out to 1,500 **DAYS** just to add VSX.
> divide by 250 and you have a jaw-dropping six YEARS of development
> effort at a crash-course speed that will leave the engineers
> desperately exhausted, uncomfortable, underconfident and highly likely
> to leave the team long before those 750 instructions are complete.
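For what it is worth, the arithmetic above checks out; a trivial sketch
(the per-instruction and working-day figures are the message's
estimates, not measurements):

```python
# Rough development-effort estimate from the message, restated.
days_per_insn = 2            # estimate incl. unit tests and compliance
mandatory_vsx_insns = 750    # MANDATORY VSX instructions on top of SFFS
working_days_per_year = 250

total_days = days_per_insn * mandatory_vsx_insns     # 1,500 days
years = total_days / working_days_per_year           # 6 years
```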
> i made these insights very clear multiple times and the message seems
> to have gotten through.
> what probably did it though was the fact that Microwatt, A2O and A2I
> are all in exactly the same position: they are all on-track for SFS
> and SFFS Compliancy Level and absolutely nowhere near Linux Compliancy
Wait... are SFS and SFFS abbreviations for "Scalar Fixed" and "Scalar
Float"? If so, then BE is mandatory for those after all, and there is a
*different* (32-bit) Linux port that runs on the SFS/SFFS platform,
separate from the Linux Compliancy Subset, which is for 64-bit LE
Linux. (Confused yet?)
> IBM had *no idea* how serious a problem this is, and up until about a
> year ago was continuing a pathological submission of "#ifdef POWER9"
> patches upstream to libc6 and major software packages, putting in
> *even more* VSX dependency behind "#ifdef POWER9" all of which has to
> be ripped out.
> so what is happening instead (at last) is that the message has been
> finally understood, future patches will be "#ifdef VSX" and "#ifdef
> MMA" respecting the *Compliancy* Level, NOT the IBM product make/model
> number. Tulio's libc6 patch dealing with that has landed, finally,
> about 6 months ago.
Really, the GNU libc maintainers should have been pushing back on that
-- the GNU policy is supposed to be to test features rather than
specific processor models.
> also totally missing is official ABI documents for SFS and SFFS!
> (what's ~100% likely going to happen there is that automatically
> whatever "-mnovsx" does, that's what will end up in those documents)
There may also not *be* a standard ABI, if those are intended for
embedded systems, which do not need standard ABI.
> this message has also gotten through after banging on about it for 18
> months, but mostly it was the difficulty that LibreBMC had which
> brought the point home, there.
> the IBM LibreBMC team did recompile everything with "-mnovsx
> -mnoaltivec" and for the most part this worked except for QFP and
> except for when Tulio's patch wasn't added upstream.
Is this the POWER9 service processor I have heard about that runs its
very own Linux-based system, such that, after power-on, the machine
"plays dead" for about a minute while the service processor boots?
> Paul did add a subset of VSX to Microwatt (increasing the LUT4 count
> by a whopping 50%) in order to get libc6 to run when it was still
> "#ifdef POWER9" but every few months if he tried again he would find
> *yet more* patches had added *yet more* VSX instructions and in some
> rare cases i think gcc violated the "-mnovsx" rule!
> bottom line here is that the lack of enthusiastic adoption of Power
> ISA after its release (Nov 2019!) combined with insights from a lot of
> people all saying the same thing, has finally gotten through, and
> there is quiet scrambling going on behind the scenes which i don't
> know about so can't even tell you (we're not OPF Members), to fix this
> and make SFFS an actual proper Linux OS peer.
> without needing soft-emulation of VSX.
> i appreciate that's a hell of a lot of context and backstory, but this
> is a big project :)
I always appreciate the historical view.