[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Tue Aug 2 05:09:39 BST 2022

lkcl wrote:
> On Mon, Aug 1, 2022 at 5:35 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
> >
> > lkcl wrote:
>
> > > On Sun, Jul 31, 2022 at 2:57 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
>
> [...]
>
> > A multiple-issue
> > OoO implementation could use the same hardware pathways to perform a
> > horizontal sum, but that is ordinary OoO loop optimization, not FlexiVec.
>
> yes. it also means programs are larger. L1 cache size has to increase 
> to compensate. increases in L1 cache size have an O(N^2) effect on 
> power consumption (this is a well-known studied phenomenon, more below).

Yes, that would have that effect.  The exponent is perhaps not exactly 2, 
but certainly more than 1:  you are increasing the cache index arrays in 
both axes, and every CAM cell consumes power to check for a match on 
every access.  
The increase in columns is logarithmic, but the increase in rows is 
linear.  The actual number of CAM cells required depends on 
associativity, so that gets complicated.  Yes, power usage is definitely 
more than linear in cache size, but I think quadratic is still an 
exaggeration.

> > A severe limitation shared by every other current vector computing model
> > I have seen. (I am still considering Simple-V to be a "future" model as
> > there is no hardware yet.)
>
> indeed. please understand that SV (all Cray style ISAs) is by no
> means perfect. the mistake made by SIMD ISAs is to impose a
> particular lane structure and/or internal microarchitecture onto
> programmers.
>
> the beauty of VVM/FV/Cray Scalable Vectors is that they are more
> like a "Software API". the internal microarchitecture is NOT known
> by the assembler/compiler developer, it is a true black box.
>
> what that in turn means is that the *hardware architect* has one
> hell of a lot more work to do!
>
> the *hardware architect* has to provide the means to do
> subdivision of a vector into batches of elements, to be thrown
> at, ironically, the exact same SIMD ALU backends that would
> normally be *DIRECTLY* exposed to the programmer (in any
> SIMD ISA).
>
> and that includes Horizontal Summing and Parallel Prefix Sums.

Unfortunately, FlexiVec's implicit vectorization model makes explicit 
horizontal operations impossible.  Combining FlexiVec and VSX might 
allow reductions, if VSX has such instructions, but I have not checked 
yet.  (A calculation loading 4x32b into VSX, performing a 4x32b reducing 
sum and emitting 1x32b result at each step would be one way to work 
around this limitation in Power ISA, if VSX can do reducing sums.  
However, multiple cycles of this, complete with LOAD/STORE would be 
needed to eventually reduce to a final 32bit result.)
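
As a rough sketch of what that workaround costs (Python, purely 
illustrative):  the function below only models the arithmetic of summing 
groups of four 32-bit values per step and counts how many full 
LOAD/reduce/STORE sweeps are needed.  Whether VSX actually has a reducing 
sum is still unchecked, and the name reduce_in_passes is made up for this 
example.

```python
def reduce_in_passes(values, width=4):
    """Sum groups of `width` elements repeatedly until one value remains.

    Each pass stands for one LOAD / reducing-sum / STORE sweep over memory.
    Returns (total modulo 2**32, number_of_passes).
    """
    values = list(values)
    if not values:
        return 0, 0
    passes = 0
    while len(values) > 1:
        # One sweep: every group of `width` inputs produces one 32-bit output.
        values = [sum(values[i:i + width]) & 0xFFFFFFFF
                  for i in range(0, len(values), width)]
        passes += 1
    return values[0], passes


print(reduce_in_passes(range(64)))  # (2016, 3): 64 -> 16 -> 4 -> 1 elements
```

The number of sweeps grows with the logarithm (base 4) of the vector 
length, and every sweep goes through memory.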

There is a possibility of combining FlexiVec (for "string" operations) 
and Simple-V (for reductions), if Simple-V is also generalizable to FP 
and VSX.  This would greatly reduce Simple-V's register pressure by 
instead using FlexiVec for "bulk" computations.  The interaction could 
allow reductions to be performed at only the cost of an additional 
pipeline (literally a wide shift register to present vector elements to 
the scalar core serially) in SIMT implementations, with trivial 
additional cost in minimal or OoO implementations.

> > > exactly. which puts pressure on LDST. that is a severe limitation
> > > (one that SV does not have).
> > >
> >
> > Admitted; this limitation is necessary to ensure that a Hwacha-like SIMT
> > hardware implementation is possible. (That is the direction that I
> > believe maximum performance systems will ultimately need to go. There
> > are few hard limits on how large an SIMT array can get, only latency
> > tradeoffs.)
>
> the internal microarchitecture i have in my head for a high
> performance not-completely-insane-wiring design is to do
> standard crossbars and multi-porting regfiles for regs r0-r31
> but above that to do MODULO 4 "Lanes".
>
> Lane-crossing to be covered by a "cyclic buffer" (rather than a 
> massive crossbar).
>
> i.e. if you want to add r32 to r36 it is 1 clock cycle (excluding the 
> regfile read time itself)
>
> but if you want to add r32 to r33 it is 2 clocks (1 extra through the 
> cyclic buffer).
>
> it is a compromise, but the important thing is, it is *our choice*,
> absolutely bugger-all to do with the ISA itself. anyone else could
> *choose* to do better (or worse).

Now you have variable-latency vector lanes.  :-)
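
To make the quoted latency rule concrete, here is a toy model (Python, 
illustrative only).  The lane count and the r32 boundary come from the 
description above; treating any operand below r32 as costing no extra 
cycle is my assumption, not something stated in the quoted text.

```python
LANES = 4          # "MODULO 4" lanes, per the quoted description
FIRST_LANED = 32   # r0-r31 sit behind ordinary crossbars / multi-porting


def lane(reg):
    """Lane that a register at or above r32 belongs to."""
    return reg % LANES


def add_latency(ra, rb):
    """Cycles for an add, excluding the regfile read itself."""
    if ra < FIRST_LANED or rb < FIRST_LANED:
        return 1   # ASSUMPTION: crossbar-connected registers cost nothing extra
    # Same lane: direct.  Different lane: one extra trip through the cyclic buffer.
    return 1 if lane(ra) == lane(rb) else 2


print(add_latency(32, 36))  # 1
print(add_latency(32, 33))  # 2
```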

> [...] <https://m.youtube.com/watch?v=9zoQb9n0akM>
>
> > > each of those algorithms has, for decades, solutions that perform
> > > small subkernels (using available registers).
> > >
> >
> > Which introduces the same memory bandwidth pressure that plagues
> > FlexiVec, does it not? :-)
>
> interestingly, no! at least, not the same percentage. the ability
> to have and access pre-computed loop-invariant *vectors*
> of constants in actual (real) ISA-numbered registers reduces
> the memory bandwidth compared to FlexiVec/VVM.
>
> yes it is true that all of these algorithms are reliant on high
> performance LDST which is precisely why it is crucial not to
> overload it.

So you do have the same problem, but perhaps somewhat less severe.  High 
performance memory access is one of the reasons that the predictability 
of a FlexiVec loop is important.

> > > these are O(N^2) costs and they are VERY high.
> > >
> > > (unacceptably high)
> > >
> >
> > I disagree on the exponent there:
>
> unfortunately it isn't, for multiple reasons.
>
> 1) Multi-Issue LDST address conflict detection requires a triangular
> comparison of every address against every other address. this
> is by nature O(N^2-N) - N times (N-1) on every clock.

Not done in FlexiVec -- avoiding those conflicts is the programmer's 
responsibility.  (The hardware cannot reliably do that lookaround anyway 
-- the required lookaround varies depending on the implementation.)
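
For reference, the comparison count under discussion can be sketched as 
follows (Python, illustrative; both function names are invented here).  
N times (N-1) counts ordered pairs; a triangular comparison needs half of 
that, which is still quadratic in the number of in-flight accesses.

```python
from itertools import combinations


def conflict_checks(n):
    """Pairwise address comparisons among n in-flight LD/ST operations."""
    return n * (n - 1) // 2


def find_conflicts(addresses):
    """Return index pairs (i, j), i < j, whose addresses are equal."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(addresses), 2)
            if a == b]


print(conflict_checks(8))                       # 28
print(find_conflicts([0x100, 0x108, 0x100]))    # [(0, 2)]
```

This is exactly the check that hardware skips when avoiding conflicts is 
left to the programmer.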

> 2) SMP Cache Coherency (snooping) is O(N^2)

Once hardware has seen the first iteration, the strides are known (along 
with the total vector length in CTR) and any relevant remote cache 
"preflushes" can be initiated and the needed cachelines claimed 
locally.  Since the local cache now holds the vector working areas, this 
is a minimal cost, provided that other processors are not playing 
cacheline ping-pong, but cacheline ping-pong is a problem regardless of 
vectorization or lack thereof.

> 3) L1 data cache misses require increased L1 cache sizes to
> compensate for, and studies have shown O(N^2) power
> consumption increases with L1 cache size

The idea is that the vector unit probably has either its own L1 cache or 
is directly connected to L2.  (I believe the AMD K8 did that with SSE.)  
If an SIMT vector chain has its own L1, that cache is likely also 
divided into per-lane sub-caches.

> this is just to get data in and out: any other benefits of an ISA
> are irrelevant if the power consumption is through the roof.
>
> this experience which Mitch has in bucketloads is what has driven
> him to design VVM the way he designed it. i don't have all the
> answers but i am rapidly picking up enough to know that you
> *REALLY* need to have a full microarchitectural working knowledge
> of every type of Computing system under the frickin sun in order
> to avoid making catastrophic *ISA* mistakes.
>
> sigh.

So then the best route would be to abandon both Simple-V and FlexiVec 
and implement VVM, on the basis that Mitch likely knows (many) 
something(s) that we do not?  :-)

> [...]
>
> > >> Worse, that hard limit is
> > >> determined by the ISA because it is based on the architectural
> > >> register file size.
> > >>
> > >
> > > turns out that having an ISA-defined fixed amount means that
> > > binaries are stable. RVV and ARM with SVE2 are running smack into this
> > > one, and i can guarantee it's going to get ugly (esp. for SVE2).
> > >
> >
> > Interesting way to view that as a tradeoff. Precise programmer-level
> > optimization opportunities versus wide hardware scalability with fixed
> > program code...
>
> there is actually a key difference between FlexiVec and VVM.
>
> * VVM allows the microarchitect to choose MAXVL. in-flight
> RSes can and are used as the auto-vectorised registers.
>
> * FlexiVec specifies a MAXVL and forces the architect to create a
> regfile of sufficient size:
>
> (num_regs times MAXVL).
>
> the problem there is that if you allow MAXVL of say 512 and the
> number of regs is 32, 8 bytes each, that's a MASSIVE 128kbyte regfile. 
> even if you let MAXVL=8 that's still a 2kbyte regfile which is scarily 
> large, given that it has to be multi-ported.

Not so at all:  hardware commits to a MAXVL at each FVSETUP and MAXVL 
can vary with the vector configuration.  (The vector configuration 
persists after a FlexiVec loop, but the vector register contents are 
undefined.  Another loop with the same configuration does not require a 
second FVSETUP.  Turning FlexiVec off requires an explicit FVSETUP, but 
perhaps "FVSETUP R0" could be special-cased for this purpose if R0 is 
the hardwired zero that it appears to be in Power ISA, or other 
operations such as "branch to link" could implicitly clear FlexiVec if 
function calls in a vector loop are to be considered programming 
errors.)  An expected implementation would be able to variably partition 
the vector register file (no more complex than is required for Simple-V) 
such that configuring 3 vectors of 32bit elements yields a higher MAXVL 
than 8 vectors of 64bit elements.  Exactly how much higher is 
unspecified -- vector register partitioning is not required to be 
entirely efficient.
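
A back-of-envelope sketch of both views (Python, illustrative).  The 
first function reproduces the fixed-regfile arithmetic from the quoted 
text.  The second assumes an ideal, waste-free partitioning of a fixed 
pool of vector storage, which real hardware is explicitly not required 
to achieve; the function names and the 2 KiB pool size are hypothetical.

```python
def regfile_bytes(maxvl, num_regs, elem_bytes):
    """Storage needed if every register gets MAXVL elements."""
    return maxvl * num_regs * elem_bytes


def maxvl_for(storage_bytes, num_vectors, elem_bytes):
    """MAXVL under ideal partitioning (ASSUMPTION: no wasted storage).

    Never returns less than 1, since MAXVL must be greater than zero.
    """
    return max(1, storage_bytes // (num_vectors * elem_bytes))


print(regfile_bytes(512, 32, 8))   # 131072 bytes = 128 KiB
print(regfile_bytes(8, 32, 8))     # 2048 bytes = 2 KiB

pool = 2048                        # hypothetical pool of vector storage
print(maxvl_for(pool, 3, 4))       # 170: 3 vectors of 32-bit elements
print(maxvl_for(pool, 8, 8))       # 32:  8 vectors of 64-bit elements
```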

Larger vector processors are expected to use the SIMT topology, where 
the entire vector register set is not stored in a single physical 
register file, but is instead distributed across the vector lanes.

Remember that FlexiVec-Null is simply the scalar unit, treated as 
MAXVL=1 in all cases.  The only architectural bound on MAXVL is that it 
is greater than zero.

> the problem with VVM is that the autovectorisation is
> arbitrarily chosen by the architect: above a certain loop
> size you *have no choice* but to fall back to scalar operation.
>
> which seems perfectly fine and logical until you get different
> hardware running the same binary. one is great, the other is
> shit performance. investigation shows that the algorithm was
> designed ASSUMING that the inflight RSes would allow loops
> of around (say) 60 instructions but the lower-spec hardware
> only has enough RSes to handle 50.
>
> at which point you have run into exactly the same type of problem
> as SVE2 is going to suffer from: dependence on the hardware
> exposed to the programmer forcing them to make multiple
> assembly-level implementations.
>
> back to FlexiVec: having to have such a large and fixed size of 
> autovec Regfile, this makes me very nervous.

FlexiVec avoids this problem by taking a leading horizontal chunk 
(determined by hardware capabilities) of each vector on each iteration 
of the loop.  For FlexiVec-Null, that chunk is one element and it is an 
ordinary scalar loop.  The length of the loop does not matter in 
FlexiVec because hardware has no (strict) need to look beyond the 
instruction at PC /right/ /now/.  Optimizations considering more of the 
loop are possible, but FlexiVec, unlike VVM, does *not* require the 
entire loop to be simultaneously in-flight.
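
A minimal model of that chunking (Python, illustrative; hw_chunk stands 
in for whatever the hardware can process per iteration and is not an 
architectural name).  The same loop gives the same result for any chunk 
size, including the FlexiVec-Null case of one element.

```python
def flexivec_loop(a, b, hw_chunk):
    """c[i] = a[i] + b[i], consuming a leading chunk on every loop iteration.

    Returns (c, number_of_loop_iterations).
    """
    n = len(a)
    c = [0] * n
    done = 0                 # elements already processed
    iterations = 0
    while done < n:
        step = min(hw_chunk, n - done)       # hardware picks the chunk size
        for i in range(done, done + step):   # one pass over the loop body
            c[i] = a[i] + b[i]
        done += step
        iterations += 1
    return c, iterations


a, b = list(range(10)), list(range(10, 20))
for chunk in (1, 4, 512):
    print(chunk, flexivec_loop(a, b, chunk)[1])   # 10, 3 and 1 iterations
```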

> > > see the comparison table on p2 (reload latest pdf). or footnote 21
> > > https://libre-soc.org/openpower/sv/comparison_table/
> > >
> >
> > ...and ARM managed to bungle both of those with SVE2 if that is correct.
>
> uh-huhn. all they had to do was add a SETVL instruction which,
> in its implementation, creates an auto-predicate-mask
>
> (1<<VL)-1
>
> thus if SETVL sets VL to 5, the predicate mask would be:
>
> 1<<5 = 32, -1 =31 -> 0b0000000011111
>
> SETVL could even be a 32-bit instruction which if you look at
> the post on github ("quote" in the comparison table) it is explained
> that was a high priority for ARM when designing SVE2 and it has
> made one hell of a mess. half the instructions don't take a
> predicate mask at all making them a PackedSIMD nightmare just
> when you need predication the most.

Predication in FlexiVec for Power ISA would require the addition of 
general predication to the scalar ISA.  Power ISA currently has 
branches, but not predication.  Using a short forward branch to imply 
predication /might/ be possible.
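
For concreteness, the (1<<VL)-1 mask from the quoted text and what a 
predicated element-wise add would do with it (Python, illustrative).  
Letting bit i of the mask enable element i is my assumption for the 
sketch; it is not taken from any ARM or Power ISA documentation.

```python
def vl_mask(vl):
    """Auto-predicate mask with the low `vl` bits set."""
    return (1 << vl) - 1


def masked_add(dst, a, b, mask):
    """Element i becomes a[i] + b[i] if mask bit i is set, else keeps dst[i].

    ASSUMPTION: bit i (counting from the least significant bit) controls
    element i.
    """
    return [x + y if (mask >> i) & 1 else d
            for i, (d, x, y) in enumerate(zip(dst, a, b))]


print(bin(vl_mask(5)))                                    # 0b11111
print(masked_add([0] * 8, [1] * 8, [2] * 8, vl_mask(5)))  # [3, 3, 3, 3, 3, 0, 0, 0]
```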

> > FlexiVec and RVV align on the issue of VL-independence, however -- I see
> > that as an important scalability feature.
>
> unfortunately for RVV it means that 100% of portable algorithms
> *have* to contain a loop. i note that in the comparison table.
> at least with VVM/FlexiVec the auto-vectorisation is a hardware
> abstraction.

No, FlexiVec absolutely requires a loop unless the vector length is 
exactly one element.  The autovectorization is that the hardware is able 
to run multiple iterations of that loop in parallel and the programmer 
promises not to do things that would result in such parallel execution 
varying from serial execution, like overlapping inputs and outputs.
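
A small demonstration of why that promise matters (Python, illustrative; 
the model of "all loads of a chunk, then all stores" is a simplification 
I am assuming for the example).  With output overlapping input, the 
result depends on how many iterations the hardware runs at once.

```python
def run(mem, n, chunk):
    """mem[i+1] = mem[i] + 1 for i in 0..n-1, `chunk` iterations at a time."""
    mem = list(mem)
    for start in range(0, n, chunk):
        idx = range(start, min(start + chunk, n))
        loaded = [mem[i] for i in idx]     # every load of the chunk first
        for i, v in zip(idx, loaded):      # then every store of the chunk
            mem[i + 1] = v + 1
    return mem


print(run([0] * 5, 4, 1))   # [0, 1, 2, 3, 4]  serial execution
print(run([0] * 5, 4, 4))   # [0, 1, 1, 1, 1]  four iterations in parallel
```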

> [...]
>
> > This scalability is an important feature for FlexiVec: the programmer
> > will get the optimal performance from each hardware implementation
> > (optimal for that hardware) with the /exact/ /same/ /loop/, including
> > the null case where the loop simply runs on the scalar unit.
>
> yes... except i just realised what the problem is that was nagging
> at me on the fixed regfile allocation.
>
> let us assume:
>
> * a loop of 100,000 instructions
> * a MAXVL of 512 [autovec SRAM of 128k]
> * hardware of literally 512-wide SIMD

In this case that 128k autovec SRAM is split into 512 pieces of 256 
bytes each (32 registers of 8 bytes), one per lane.  Each of those 
pieces is much more reasonable, but this is one of the reasons that 
FlexiVec calculations absolutely cannot cross lanes.

> **OR** as in the
> case of Broadcom VideoCore IV the ability to make it
> *look* like you have 512-wide SIMD elements by doing
> say 4-per-cycle in a micro-coded hardware for-loop over
> 512/4 clock cycles.

This was used in my original example of FlexiVec execution:  hardware 
repeats the instruction (akin to the x86 REP prefix effects) using the 
physical scalar register to track the incremental position in the 
corresponding vector.  When the entire vector is processed, PC advances 
to the next instruction.

> * an interrupt occurs at instruction 50,000
>
> the problem with the design of FlexiVec is that once you
> are committed to any given loop you ABSOLUTELY MUST
> complete it.

Nope!  You do not even have to complete any one instruction either, 
since you can use the slot in the scalar PRF to track the current vector 
progress.  FlexiVec is not available to privileged code, so the only 
requirement is that the interrupt handler restore any registers it 
changes, which any existing interrupt handler already meets.  (To 
privileged code, the scalar registers will appear to contain the 
internal vector offsets when problem state is running a vector 
computation; as long as that is restored properly the vector computation 
will resume correctly.)
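
A toy model of that resumability (Python, illustrative).  The dictionary 
entry "r3" stands in for the scalar register slot holding the current 
vector offset; the register choice and function names are hypothetical.  
The "interrupt handler" clobbers the register and restores it, and the 
vector instruction then carries on from where it stopped.

```python
def vector_add_step(state, a, b, c, hw_chunk):
    """Advance one vector add by one hardware chunk; True when complete."""
    pos = state["r3"]                     # progress lives in the scalar slot
    end = min(pos + hw_chunk, len(a))
    for i in range(pos, end):
        c[i] = a[i] + b[i]
    state["r3"] = end
    return end == len(a)


def run_with_interrupt(a, b, hw_chunk, interrupt_after):
    """Run the add, taking one interrupt after `interrupt_after` steps."""
    c = [0] * len(a)
    state = {"r3": 0}
    steps = 0
    while not vector_add_step(state, a, b, c, hw_chunk):
        steps += 1
        if steps == interrupt_after:
            saved = dict(state)      # handler saves what it will change
            state["r3"] = 0xDEAD     # handler uses the register freely
            state.update(saved)      # and restores it before returning
    return c


print(run_with_interrupt(list(range(10)), [1] * 10, 4, 1))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```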

> ... or....
>
> you must provide a means and method of context-switching
> that entire architecturally-hidden auto-vec regfile.
>
> all 128k of it.

FlexiVec has that too:  privileged vector context save/restore 
instructions that allow the system software to preserve the entire 
FlexiVec state across task-switching.  Support for live migration to a 
possibly different microarchitecture requires that the system also be 
able to trap at the end of a FlexiVec loop, where the vector register 
contents are undefined, in order to catch the program at a point where 
there is no dependency on the vector implementation that it was using.  
Microarchitectural independence requires hardware to handle 
saving/restoring the vector state, but this also means those operations 
can use the widest path to memory available in the vector unit.

> this is ultimately why Mitch designed VVM the way he did,
> with "best" implementations being on GBOoO, *and* why there
> is an explicit end-loop instruction, because at least with OoO
> you can Shadow-cancel all in-flight instructions in order to
> service an interrupt, and having the explicit end-loop op allows
> HW to identify (in full) the entire extent of the ops that are *going*
> to be in-flight, and if that is not possible fall back to Scalar.

None of that is needed in FlexiVec:  each instruction is independently 
resumable, always, at every step.

> [...]
>
> > Also, since when has power consumption /ever/ been a concern for IBM? :-)
>
> since forever. if you have 160,000 POWER9 cores in the top500.org
> supercomputer it is a BIG damn deal to be able to say, as i
> think they do, "we piss over x86 by 10% performance/watt"

Performance per watt is not power consumption:  11x performance for 10x 
power would meet that.  :-)  I suspect either IBM or Cray Research 
likely pioneered liquid-cooled computing back in the day.

> [...]
>
> Matrix Multiply Assist, part of v3.1, is a whopping 2.5x power
> reduction in POWER10.

That is impressive.  The issue with dissipation is that we are hitting 
the point where the thermal resistance of the silicon substrate itself 
can be a problem.

> [...]
> > > yes. see
> > > https://libre-soc.org/openpower/sv/SimpleV_rationale/
> > >
> > > the concept you describe above is on the roadmap, after learning
> > > of Snitch and EXTRA-V. it's the next logical step and i'd love to talk
> > > about it with you at a good time.
> > >
> >
> > I have been turning an outline for a "Libre-SOC VPU ISA strawman" over
> > in my head for a while. Are you suggesting that I should write it down
> > after all?
>
> do read the SimpleV_rationale first, i think you'll find i have already
> written it, given what you describe below.
>
> the key thing i would like to know is, would you be interested to
> properly investigate this, funded by an NLnet EUR 50,000 Grant,
> to put Snitch, EXTRA_V, SVP64 and OpenCAPI all together in
> one pot and create Coherent Distributed Compute?

That sounds very interesting, but I will need to do quite a bit of 
reading before I will be able to say with confidence that I /can/ do that.

Are we confident that Simple-V will actually be accepted at OPF?  (That 
is actually what I see as the biggest risk right now -- that Simple-V 
will end up needing too many architectural resources and being rejected 
on those grounds.  I suppose that FlexiVec could be considered a 
contingency strategy for that case; its design specifically minimizes 
required architectural resources.)

The VPU concept was a little different than the distributed Processing 
Elements; the idea was based on the CDC6600 architecture, with a VPU 
that would probably be larger than its host processor, with a 1024 slot 
register file, 48-bit instructions (to accommodate the large register 
file), and 2-issue VLIW (to align 48-bit instructions with the basic 
32-bit wordsize used in Power ISA).  Interestingly, the use of VLIW also 
means that you can use one side of the VLIW pair to provide an extended 
immediate to the other, and possibly a fourth word in each VLIW packet 
to provide loop control or other functions.

A few instructions, initially suggested as a Custom Extension, provide 
the control interface (or trap to the supervisor if problem state 
attempts to use the VPU while a previous task is still using it).

And how would that funding be arranged?  A stipend to work on the 
problem?  Payment upon reaching milestones?  Something else?

Lastly, why does this vaguely feel like we are busily reinventing the 
Transputer?  :-)

> > > read the paper above and the links to other academic works Snitch,
> > > ZOLC and EXTRA-V. the novel bit if there is one is the "remote Virtual
> > > Memory Management" over OpenCAPI.
> > >
> >
> > I was thinking of just having a VPU interrupt hit the Power core when
> > the VPU encounters a TLB miss. The Power hypervisor (or hardware) then
> > installs the relevant mapping and sends the VPU back on its way.
>
> over OpenCAPI. yes. this is *exactly* what i envisaged and
> describe in https://libre-soc.org/openpower/sv/SimpleV_rationale/
>
> each PE (Processing Element, the more common industry standard
> term for what you call VPU) would still *have* a RADIX MMU,
> still *have* a TLB, but it would lack the ability (in hardware)
> to cope with a TLB miss. yes, precisely and exactly the same
> design concept.

If I understand correctly, the PE model is more like SIMT (on the 
proverbial wheel-o-reincarnation) but with each processor more 
independent (individual instruction fetch and decode) but still with 
each processor handling a slice of the overall calculation, much like 
SIMT.  If this is so, then simplicity in the PEs is the watchword, lest 
the problem quickly blow right back up to full SMP.

Would OpenCAPI provide the means for each PE to have a local TLB and 
*no* other MMU capabilities at all?  TLB miss at PE -> host processor 
interrupt -> hypervisor (or hardware) provides PE TLB entry?  Could the 
TLB be preloaded before starting a computation?  (Software should be 
able to accurately predict the page mappings needed for this kind of 
calculation in advance.)
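
To pin down the control flow I am asking about, a purely hypothetical 
model (Python).  It says nothing about what OpenCAPI can or cannot do; 
PETlb, host_lookup and the 4 KiB page size are all invented for the 
illustration, and host_lookup merely stands in for "interrupt the host 
and let the hypervisor supply the entry".

```python
class PETlb:
    """A PE with a TLB and no ability to walk page tables itself."""

    def __init__(self, host_lookup):
        self.entries = {}               # virtual page -> physical page
        self.host_lookup = host_lookup  # stands in for the host interrupt path
        self.misses = 0

    def preload(self, vpages):
        """Install mappings before the computation starts."""
        for vp in vpages:
            self.entries[vp] = self.host_lookup(vp)

    def translate(self, vaddr, page_bits=12):
        vp, offset = vaddr >> page_bits, vaddr & ((1 << page_bits) - 1)
        if vp not in self.entries:
            self.misses += 1            # TLB miss: ask the host for the entry
            self.entries[vp] = self.host_lookup(vp)
        return (self.entries[vp] << page_bits) | offset


page_table = {0x10: 0x80, 0x11: 0x81}   # hypothetical host-side mappings
tlb = PETlb(page_table.__getitem__)
tlb.preload([0x10, 0x11])
print(hex(tlb.translate(0x10123)), tlb.misses)   # 0x80123 0
```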

Actually, if I understand the slides I found at 
<URL:https://www.snia.org/sites/default/files/SDC/2018/presentations/General_Session/Jeff_Stuechelli_OpenCAPI.pdf> 
correctly, OpenCAPI may not be usable for this purpose, as OpenCAPI 
devices deal only in virtual addresses and access memory /through/ the 
host processor.

Further, "OpenCAPI" is not what I would consider open at all; read 
<URL:https://opencapi.org/license/> and barf.  (It explicitly excludes 
producing an implementation.  Here we go again.)  Take a look at their 
membership levels, too.  I see the RISC-V Foundation mess here all over 
again, although at least RISC-V was not offering "study-only" licenses.  
 >:-<

> my feeling is that it is very important for PEs to be able to
> execute the same ISA, btw. in case the PEs are too busy,
> the *main core* can execute its programs as well!

Probably best to keep a *subset* of the main processor's ISA.  To use 
Power ISA for this would likely require (again) some special approval 
from OPF because the PEs will /not/ meet any compliancy subset.

> [...]
> > > plus, Multi-Issue reg-renaming can help reuse regs. for certain
> > > computationally heavy workloads that doesn't work. we just have to
> > > live with it for now.
> > >
> >
> > Extend the main pipeline with a multi-segment register file;
>
> mmm if i understand the concept correctly this doesn't help
> with the latency on access to the (large) physical regfile SRAM.
> it was fine to even have external SRAM as the regfile in Cray
> days, when CPU speed was the same order as SRAM (100mhz)
> but now we are up to 4.8 ghz you simply cannot activate the
> tree cascade of address MUXes in time.
>
> in theory you could have the segment hold say 50% of the address
> MUXes, this is the exact trick used in "Burst Mode" of DDR3/4/5
> ICs, but it is making me nervous (again).

The idea is exactly that:  split the register file into N segments 
(where N is a power of 2), split the register number into "bank select" 
and "register select" lines, and have each segment match the bank select 
lines against its bank number.  On a match, the segment provides the 
value read from that register in that segment, otherwise it passes the 
value that arrived on its input port.  A second match comparing the 
entire writeback bus register number to each forward register number 
allows the forward value to be replaced if the processor has executed a 
write to that register in the interim.

This trades latency for a larger register file.  I suppose you could 
call this a register cache; the cache being the values in-flight in the 
pipeline latches.
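
A behavioural sketch of one read travelling down such a segment chain 
(Python, illustrative only).  Using the high bits of the register number 
as the bank select and placing one segment per pipeline stage are my 
assumptions for the model, as is the writebacks timing argument.

```python
def read_through_segments(segments, regnum, writebacks=None):
    """Pass one read request through every segment, one per stage.

    segments:   segments[bank][index] holds a register value.
    writebacks: optional {stage: (regnum, value)} of writes retiring while
                the request is in flight (hypothetical timing model).
    """
    regs_per_seg = len(segments[0])
    bank, index = divmod(regnum, regs_per_seg)   # bank select, register select
    value = None
    for stage, seg in enumerate(segments):
        wb = (writebacks or {}).get(stage)
        if wb is not None:
            wbank, windex = divmod(wb[0], regs_per_seg)
            segments[wbank][windex] = wb[1]      # the write lands in storage
        if stage == bank:
            value = seg[index]                   # bank match: supply the value
        if wb is not None and wb[0] == regnum:
            value = wb[1]                        # forwarded value was stale
    return value


segs = [[10, 11], [20, 21], [30, 31], [40, 41]]
print(read_through_segments(segs, 5))                # 31
print(read_through_segments(segs, 1, {3: (1, 99)}))  # 99, not the stale 11
```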

> > Yes, aliasing between inputs and outputs is a programming error in
> > FlexiVec. Aliasing between /inputs/ is fine, as long as none of them
> > are written in the loop. (Such aliasing is an error because the
> > visibility of the updates depends on VL.)
>
> indeed. loop invariant vectors using LDs. remember the bandwidth
> penalty that causes?

The question, relevant to large systems but less so to embedded SoCs, is 
whether improved performance can make up for that bandwidth penalty.  On 
an embedded SoC, possibly in a battery-powered device, this tradeoff is 
less valid -- you /need/ the lower power usage to save battery charge.

I think that this may be the fundamental issue where we have been 
talking past each other.  I have been tacitly assuming a desktop-like 
environment where power is fed from the AC line and a nice big heatsink 
with forced air takes care of dissipation.  What has your assumed 
environment been?

> > >> Is there another bit to allow REX-prefixing without changing the
> > >> meaning of the prefixed opcode or would this also fill the entire
> > >> prefixed opcode map in one swoop?
> > >>
> > >
> > > we're requesting 25% of the v3.1 64-bit EXT001 space already.
> > > one more bit would require 50% and i would expect the OPF ISA WG
> > > to freak out at that.
> > >
> >
> > Then I suspect they may freak out either way: if I am reading the
> > opcode maps correctly, simply adding REX prefixes without that
> > additional "REX-only" bit would cause /collisions/ between existing
> > 64-bit opcodes and REX-prefixed versions of some 32-bit opcodes.
>
> no it very much doesn't.
>
> > Perhaps I am misreading the Power ISA opcode maps?
>
> take a closer look at the table:
> https://libre-soc.org/openpower/sv/svp64/#index6h1
>
> jacob lifshay came up with this trick. we could have used
> rows 001--- and 101--- such that bits 7&8 equal to "01"
> indicate "SVP64". it's quite neat but needs a little thought
> as to how and why it works, without interfering with the rest
> of EXT001.

Those are indeed two blocks in the PREFIX map currently available for 
new instructions.  Why do you need that many slots for a REX-form on the 
existing 32-bit instructions?

To cut through all the fog here, how do I encode "ADDI R67, R123, 8" as 
a scalar operation, not using Simple-V?

I suppose I should also clarify that I believe the register file 
extension proposal should also be cleanly separated from Simple-V.  Put 
another way, asking for a big chunk of the PREFIX space like this is 
less likely to freak out the ISA WG if we can show quickly (at the 
overview level) how that chunk is actually partitioned for "this" 
subproposal, "that" subproposal, etc.  Break it down, instead of just 
proposing an "SVP64" /blob/ in the opcode space.

Side note:  in my copy of OpenPOWER v3.1B, opcode map table 12 is on 
page 1372, PDF page 1398.

> to understand it, write out in binary *all* of those SVP64
> entries in the EXT001 table above.
>
> then, search down the binary numbers looking for which bits
> do not change. you will find that the two bits which do not
> change correspond to bits 7 and 9 (in MSB0 numbering)
> of the EXT001 32-bit prefix.

I am familiar with Karnaugh maps and those tables are a similar structure.
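
The "which bits do not change" exercise is mechanical enough to sketch 
(Python, illustrative).  The four 4-bit patterns in the example are made 
up and are NOT the real EXT001 table entries; the function simply 
reports, in MSB0 numbering, every bit position that has the same value 
in all inputs.

```python
def invariant_bits(values, width):
    """Return {msb0_bit_position: bit_value} for bits equal in every value."""
    all_ones = (1 << width) - 1
    always_one = all_ones
    always_zero = all_ones
    for v in values:
        always_one &= v                   # survives only if 1 everywhere
        always_zero &= ~v & all_ones      # survives only if 0 everywhere
    result = {}
    for lsb0 in range(width):
        msb0 = width - 1 - lsb0           # MSB0: bit 0 is the leftmost bit
        if (always_one >> lsb0) & 1:
            result[msb0] = 1
        elif (always_zero >> lsb0) & 1:
            result[msb0] = 0
    return result


# Hypothetical patterns, for illustration only.
patterns = [0b0101, 0b0111, 0b1101, 0b1111]
print(sorted(invariant_bits(patterns, 4).items()))   # [(1, 1), (3, 1)]
```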

-- Jacob