[Libre-soc-dev] svp64 review

Mon Jul 25 06:02:17 BST 2022

Luke Kenneth Casson Leighton wrote:
> On Sun, Jul 24, 2022 at 10:11 PM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> (actually, jacob bachmeyer, a few days ago) wrote:
>
>   
>>> https://ftp.libre-soc.org/simple_v_spec.pdf
>>>       
>> A few comments from a quick partial review:
>>     
>
> always appreciated
>
>   
>>     In chapter 3, "vertical" vector mode as described is ridiculous --
>> that is exactly equivalent to a software loop and therefore a complete
>> waste to support in hardware.
>>     
>
> rright. ok. so some background here is (a) Mitch Alsup's VVM Extension
> for MyISA 66000. Mitch has spent something like the past... 3? years
> on comp.arch explaining to anyone prepared to listen about the benefits
> of Vertical-First loop constructs.
>
> in Mitch's Vertical-First LOOP system, it is assumed that high-performance
> systems will utilise GBOoO to store the entirety of the LOOP instructions
> in in-flight registers.
>
> thus it becomes possible, very easily, to go, "huh, we're doing yet another
> loop, let's merge all the identically-issued *scalar* operations from the
> previous loop in a zip-up with the new ones".  repeat, repeat, repeat, and
> you can blat an entire sequence of *scalar* instructions into *vector*
> (actually, SIMD) ALUs.
>
> there are some limitations:
>
> 1) the input and output "vectors" can only be LDs and STs respectively
> 2) you can only do one inner loop
> 3) if there are not enough in-flight Reservation Stations you have to
>     fall back to Scalar-only looping which is perfectly reasonable
>
> the LOOP preamble instruction helps identify all loop-invariant
> registers plus identifies the counter register.
>
> it's extremely neat.
>   

It is neat, and it forms a basis (using slightly different terms) for 
what my second message termed "FlexiVec" which essentially amounts to 
hardware loop unrolling across a vector ALU array.

The catch is that I suspect most of the OpenPOWER experts are likely to 
have a similar reaction to mine when they first read that.  By the end 
of chapter 4, I understood better (seeing "Vertical Vector Mode" as a 
form of hardware loop unrolling), but I chose to let that comment stand 
because I expect that first impression could be a problem for you if the 
OpenPOWER experts are less curious than I am.
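
As a rough illustration of why the "vertical" mode first reads as an 
ordinary software loop, the two orderings can be sketched in C.  This is 
not taken from the Simple-V document:  the two-instruction body and the 
function names are made up for the example, and only the iteration order 
is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Horizontal-first: each "instruction" sweeps every element before the
 * next instruction starts. */
void horizontal_first(const int *a, const int *b, int *tmp, int *out,
                      size_t vl)
{
    assert(a && b && tmp && out);
    for (size_t i = 0; i < vl; i++)     /* instruction 1: tmp = a + b */
        tmp[i] = a[i] + b[i];
    for (size_t i = 0; i < vl; i++)     /* instruction 2: out = tmp * 2 */
        out[i] = tmp[i] * 2;
}

/* Vertical-first: every instruction in the body runs for element i, and
 * only then does the element index step.  This has the same shape as a
 * plain "for" loop. */
void vertical_first(const int *a, const int *b, int *tmp, int *out,
                    size_t vl)
{
    assert(a && b && tmp && out);
    for (size_t i = 0; i < vl; i++) {
        tmp[i] = a[i] + b[i];           /* instruction 1, element i */
        out[i] = tmp[i] * 2;            /* instruction 2, element i */
        /* a step-and-branch instruction would close the body here */
    }
}
```

For a body with no cross-element dependencies both orderings produce the 
same results; what differs is what the hardware is allowed to overlap.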

>> Any optimizations that can be applied
>> there could also be applied to ordinary "for" loops and "svstep.bc" is
>> nothing more than a dedicated LOOP opcode (similar to the same
>> instruction from the original 8086).
>>     
>
> yes, isn't it great? a high-performance implementation can apply
> the same trick above, but in the case of SVP64 is not limited to
> Memory-only Vectors, it can use registers.
>   

From an ISA perspective, it is not so great:  here is a duplicate 
opcode that has effectively the same function but must be *different* to 
indicate vector operations.

> there's a hphint which when set ensures that up to that many
> Scalar Registers (actually, Vector elements) are "safe" to read/write
> in parallel.  example:
>
> for (i = 0; i < 100; i++)
>     a[i] = a[i+5]*a[i];
>
> this can be done only up to batches of 5, safely, and hphint would
> be set to 5 to make that clear to the underlying hardware which
> performs the in-flight-merging trick described by Mitch Alsup.
>   
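
To make the quoted hphint example concrete, here is a small C model.  It 
is only a sketch under stated assumptions:  run_batched() is a made-up 
helper, and "hardware may run the elements of a batch in any order" is 
modelled by deliberately running each batch in reverse.  With a batch of 
5 no element reads a location that another element of the same batch 
writes, so the result matches strict program order; with a batch of 6 it 
no longer does.

```c
#include <assert.h>
#include <stddef.h>

#define N    100        /* loop trip count from the quoted example */
#define DIST 5          /* a[i] reads a[i + DIST] */

/* Reference: strict program order. */
void run_scalar(long a[N + DIST])
{
    for (size_t i = 0; i < N; i++)
        a[i] = a[i + DIST] * a[i];
}

/* Elements are grouped into batches of `batch`.  Inside a batch they are
 * processed in reverse order, standing in for "any order the hardware
 * happens to choose". */
void run_batched(long a[N + DIST], size_t batch)
{
    assert(batch > 0);
    for (size_t base = 0; base < N; base += batch) {
        size_t end = (base + batch < N) ? base + batch : N;
        for (size_t i = end; i-- > base; )
            a[i] = a[i + DIST] * a[i];
    }
}
```

Filling three copies of the array with a[k] = k + 1 and comparing 
run_scalar() against run_batched(..., 5) and run_batched(..., 6) shows 
the first pair agreeing and the second pair diverging.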

The problem is that this in-flight-merging trick can only work in big, 
complex, OoO microarchitectures.  The alternate model that I am now 
tentatively calling "FlexiVec" due to its hardware flexibility can 
instead work all the way from actual scalar processors (that effectively 
have MAXVL=1) to simple in-order scalar units driving a vector unit 
chain to complex out-of-order systems that use multiple issue ports and 
parallel ALU pipelines to emulate a vector unit.

>>     In chapter 4, we finally start to get to the "meat" of the
>> proposal.  You have a serious misunderstanding of the x86 "REP" prefix.
>>     
>
> probably.  honestly it's a throw-away "analogous concept" comment.
> if people understand "the thing afterwards can be repeated" then
> that's enough to get them started. beyond that initial statement,
> getting people to understand "repeating" by seeing it from an
> existing ISA, there is absolutely no further use for x86 or REP
> of any kind.
>   

Then it is a very bad analogy for your use.  The x86 REP prefix does not 
introduce a secondary program counter:  it works by inhibiting the 
normal program counter advance until a condition is met, causing (at 
least notionally) the same opcode to be read and executed repeatedly.
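
A toy C model of that behaviour, loosely patterned on REP MOVSB.  The 
struct, field names and one-slot instruction length are hypothetical, and 
real hardware details (segment registers, the direction flag, interrupt 
windows) are left out.  The point is only that all loop state lives in 
ordinary registers and the program counter simply does not advance until 
the count reaches zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy machine state. */
struct cpu {
    size_t pc;          /* index of the current instruction */
    size_t cx;          /* repeat count */
    size_t si, di;      /* source and destination indices into memory */
};

/* Perform one step of a "REP MOVSB" located at cpu->pc.
 * Returns 1 if the instruction retired (pc advanced past it), or 0 if pc
 * was held so that the same instruction is fetched again. */
int step_rep_movsb(struct cpu *c, uint8_t *mem, size_t mem_len)
{
    if (c->cx != 0) {
        assert(c->si < mem_len && c->di < mem_len);
        mem[c->di++] = mem[c->si++];    /* one byte per step */
        c->cx--;
    }
    if (c->cx == 0) {
        c->pc += 1;                     /* condition met: advance */
        return 1;
    }
    return 0;                           /* pc advance inhibited */
}
```

An interrupt taken between two steps sees a pc that still points at the 
repeated instruction and a count in cx, so no secondary counter is needed 
to resume.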

>> The misunderstanding is that there is no "Sub-PC" in x86
>>     
>
> there's no connection to x86 at this point.  the "REP" analogy
> is already done and finished.
>
> *no* ISA except SVP64 has Sub-Program-Counters.
>   

This (introducing a secondary program counter) is likely to be a major 
sticking point with the OpenPOWER experts.

> [...]
>> I have a change to Simple-V that would allow you to throw most of the
>> current limits out of the proverbial window.
>>     
>
> ah... by this point in time (over 2 years of development) we're just about
> to put SVP64 into the OpenPOWER Foundation "External RFC Process",
> and have a Simulator, thousands of unit tests, a working HDL
> Reference Implementation, and 5 months of work on binutils.
>
> major changes at this point would be... difficult, shall we say.
> that said i'm happy to go through this because we have to demonstrate
> completeness.
>   

Fair enough.

>> Simple-V does *not* "march
>> across the register file", instead Simple-V *replaces* selected ISA
>> scalar registers with sliding windows onto the vector register memory
>> during a vector loop.
>>     
>
> so, a completely separate regfile / register-memory area for vectors
> from scalars?  is that right?
>   

Notionally yes, although the simplest implementation (that does not 
actually have vector hardware) uses the scalar registers to store the 
single element allowed in each vector and the scalar ALU to process the 
elements serially.
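
A minimal sketch of that scaling property, assuming the loop variable 
advances by the number of implemented lanes on each pass.  The function 
vec_add() and its `lanes` parameter are invented for the illustration and 
are not part of any proposal text; the inner loop stands in for work that 
vector hardware would do in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Add two arrays of n elements, `lanes` elements per pass.
 * lanes == 1 models the plain scalar implementation (MAXVL=1).
 * Returns the number of passes through the loop body. */
size_t vec_add(const int *x, const int *y, int *z, size_t n, size_t lanes)
{
    assert(lanes > 0);
    size_t passes = 0;
    for (size_t i = 0; i < n; i += lanes) {
        /* the last pass may be short if n is not a multiple of lanes */
        size_t vl = (n - i < lanes) ? n - i : lanes;
        for (size_t k = 0; k < vl; k++)
            z[i + k] = x[i + k] + y[i + k];
        passes++;
    }
    return passes;
}
```

The same call gives identical results for lanes = 1 and lanes = 4; only 
the number of passes changes (10 versus 3 for n = 10).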

> if you're proposing a separate vector regfile / register-memory-area
> the downside of that is that you then have to add inter-regfile transfer
> instructions, in between the scalar regfile and the [new] vector
> regfile/reg-mem-area.
>   

Nope!  Transfers between scalar and vector register files go through 
memory.  (There are a few instructions in x86 SSE(?) for direct 
transfers between the SIMD and general registers -- they turn out to be 
significantly slower than using LOAD and STORE operations.)

> with SVP64 being a bare-minimum RISC-paradigm extension
> of the Scalar Power ISA, one of its key strengths is that it only
> requires 5 (five) additional "management" instructions to turn
> that Scalar Power ISA into a Scalable Vector ISA.
>   

Not counting the vector setup (which would be CSRs on RISC-V and 
probably SPRs on OpenPOWER) I tentatively believe that FlexiVec could 
require *zero* additional instructions.

>>  (Your current pseudocode still describes marching
>> across the register file.)  This is very similar to the "vector tail"
>> model I was proposing as "RVP lanes" a few years ago.
>>     
>
> that was as far back as 2018, wasn't it? :)  i do remember you
> using the phrase "RVP lanes".
>   

Yes it was.  :)

>> The proposed "sub-PC" represents a problem for exception handling,
>>     
>
> ah!  no, amazingly, it doesn't!  i've been extremely strict about this,
> and designed both the Simulator and the HDL to be precise-exception
> capable.
>
> anything - anything at all - that prevents or prohibits exceptions
> in the middle of processing a Vector element batch is immediately
> rejected.  the "State" information is kept to:
>
> * SVSTATE (contains the element index sub-step counters)
> * SVLR (the SVSTATE equivalent of LR)
> * SVSHAPE0-3 (the "REMAP" SPRs for hardware index reforming)
>
> if the REMAP areas are zero you do not need to save/restore
> the four SVSHAPE SPRs.
>
> SVSTATE is saved/restored into SVSRR1, just as PC is saved/restored
> in SRR0 and MSR is saved/restored in SRR1.
>
> in other words, i took the concept of "Sub-PC" very seriously and
> treated it literally as part of the [absolutely] critical Context, aka
> a peer of PC and MSR.
>   

If nested traps are possible, the trap handler still must preserve 
SVSRR1 somewhere.
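
A simplified C model of why.  The register names follow this thread, but 
everything else is an assumption made for the sketch:  register contents 
are opaque 64-bit values, and the handler_save()/handler_restore() 
helpers are hypothetical stand-ins for whatever a real handler would 
spill to its stack.

```c
#include <assert.h>
#include <stdint.h>

/* Live state plus the single-level save registers. */
struct ctx {
    uint64_t pc, msr, svstate;
    uint64_t srr0, srr1, svsrr1;
};

/* What a handler must spill before it can tolerate a nested trap. */
struct saved {
    uint64_t srr0, srr1, svsrr1;
};

/* Trap entry: live state is copied into the save registers, overwriting
 * whatever an outer trap had left there. */
void trap_enter(struct ctx *c, uint64_t handler_pc)
{
    assert(c);
    c->srr0 = c->pc;
    c->srr1 = c->msr;
    c->svsrr1 = c->svstate;
    c->pc = handler_pc;
}

/* Return from trap: live state is reloaded from the save registers. */
void trap_return(struct ctx *c)
{
    assert(c);
    c->pc = c->srr0;
    c->msr = c->srr1;
    c->svstate = c->svsrr1;
}

struct saved handler_save(const struct ctx *c)
{
    assert(c);
    struct saved s = { c->srr0, c->srr1, c->svsrr1 };
    return s;
}

void handler_restore(struct ctx *c, struct saved s)
{
    assert(c);
    c->srr0 = s.srr0;
    c->srr1 = s.srr1;
    c->svsrr1 = s.svsrr1;
}
```

If the outer handler skips handler_save() and a second trap arrives, the 
interrupted program's SVSTATE in SVSRR1 is overwritten and cannot be 
recovered on the final return.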

>> but
>> the 8086 "REP" prefix provides precedent for an easy solution:  use a
>> general-purpose (scalar) register as the loop variable.  Actually, using
>> a (programmer-chosen) scalar register as the control-flow loop variable
>>     
>
> i thought about it, and realised that it made Register Hazard Management
> for Multi-Issue OoO designs really, *really* complicated.
>   

Actually, POWER already has a loop counter register CTR, so the 
incremental cost of using that cannot be too high.

> at least a separate SVSTATE SPR does not interfere with the
> RaW/WaR Hazard Management of the GPRs.  it can be cached
> and passed around at the peer-level of PC and MSR, which
> is *really* important in a Multi-Issue context.
>   

Effectively adding bits to PC widens those internal buses and registers; 
this may have far-reaching consequences in actual hardware, possibly 
extending critical propagation delays and therefore the minimum cycle time.

>> With a few restrictions on allowed operations related to inter-lane data
>> transfers, a vector loop can then, for most operations and on
>> appropriate hardware, be unrolled (by hardware) across however many
>> vector lanes are actually implemented, with the loop variable advanced
>> by N (number of implemented vector lanes) on each pass through the loop.
>>     
>
> deep breath: this is unfortunately a completely different design paradigm
> for which i would have to rethink the entire implementation strategy that
> i've been holding in my head for over 30 months.
>   

Agreed.  Consider the proposal to change the Simple-V execution model 
withdrawn and tentatively offered for comment as an early draft for a 
second vector execution proposal, FlexiVec.

>> If Simple-V really is intended to march across the register file, then I
>> propose an alternate "FlexiVec" as I previously described.  The
>> interesting possibility with "FlexiVec" is that it can scale all the way
>> down to the baseline scalar ISA (with MAXVL=1) and up to arbitrarily
>> large "hybrid GPU" designs with thousands of vector lanes driven by a
>> single control unit.
>>     
>
> IBM POWER 9 and IBM POWER 10 took a different strategy: 8-way
> multi-issue OoO execution.  POWER10 i think has two 128-bit SIMD
> ALU pipelines per core, which is completely mad.
>
> what i very much did not want to happen was IBM, who are (obviously)
> on the OPF ISA WG and who have now handed over control of the ISA
> to the OPF after being its custodians and designers for 25 years, to freak
> out.
>
> with IBM having already implemented such an astoundingly-powerful
> multi-issue engine, it made prudent sense to propose SVP64 as "merely
> leveraging what IBM already has".
>
> it is telling that IBM did *not* extend the VSX ISA to 256 or 512 bit: instead
> they increased the number of 128-bit multi-issue ALUs.
>   

Multiple-issue would be a perfectly acceptable way to implement 
FlexiVec, too.

On another note, having had a bit more time to examine the Simple-V 
document, I propose splitting the additional scalar operations in Part 
III into a separate "New Instructions for Parallel Applications" 
proposal.  Most (maybe all?) of them should be able to stand on their 
own, without requiring the Simple-V pipeline extensions.

Consider it this way:  SVP64 significantly alters the execution pipeline 
and adds to the processor context.  Most of the scalar instructions 
proposed are additional ALU operations, orthogonal to Simple-V proper.  
I suspect that the latter, having much less far-reaching effects on 
processor design, will be easier for the OpenPOWER experts to accept.

Having just now obtained a copy of the OpenPOWER v3.1B spec and having 
barely begun reading its almost 1600 pages, I have already found 
something that might be a problem for Simple-V:  all but the lowest 
compliance levels already have VMX/VSX as a required feature.

-- Jacob