[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Mon Aug 1 17:48:02 BST 2022

On Mon, Aug 1, 2022 at 5:35 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
>
> lkcl wrote:

> Since you evidently know more about what VVM is than I do;

strictly speaking, i do and i don't.  i immediately recognised the underlying concept, which absolutely has to be the same

> I will take
> your word for it that FlexiVec is my quasi-independent (by way of
> Simple-V) reinvention of VVM.

yes, the devil is in the details, Mitch chose for very good reasons to have a specific loop-end-marker instruction for example.  finding out *why* he did that instead of using bc-with-CTR, well, the answer to that illustrates why i am reluctant to include VVM/FlexiVec in SV.

to find out *why* Mitch added a loop-end-marker opcode would require several days of full-time research to get a cursory explanation.  then, on understanding why, it has to be confirmed.  that basically means *actually implementing* VVM/FlexiVec in a hardware-cycle-accurate Simulator.

that is literally months of work, as the only hardware-cycle-accurate Simulator at the moment is gem5-experimental and it is not properly maintained.

> > On Sun, Jul 31, 2022 at 2:57 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:

> > it is not possible to perform anything but the most rudimentary horizontal sums (a single scalar being an alias)
> >  
>
> Strictly speaking, FlexiVec cannot perform horizontal sums *at* *all*.
> A horizontal sum requires ordinary scalar computation. 

which would not be anywhere near good enough for a VPU/GPU.
we designed a Parallel Prefix Schedule for SV, which can run
on top of any instruction.
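
as a rough illustration of why a parallel schedule matters for horizontal sums, here is a minimal sketch of a generic pairwise tree reduction. it is NOT the actual SV Parallel Prefix Schedule; the function name `tree_reduce` and the step counting are assumptions made purely for illustration. the point is only that N elements can be combined in about log2(N) steps with any associative binary operation, rather than N-1 sequential scalar steps.

```python
from operator import add


def tree_reduce(values, op=add):
    """Reduce a vector with an associative binary op in ~log2(N) steps.

    Hypothetical illustration only: a generic pairwise tree reduction,
    not the real Simple-V Parallel Prefix Schedule.
    Returns (result, number_of_parallel_steps).
    """
    vals = list(values)
    n = len(vals)
    if n == 0:
        raise ValueError("cannot reduce an empty vector")
    steps = 0
    stride = 1
    while stride < n:
        # all pairs combined within one step are independent of each
        # other, so hardware could in principle issue them together
        for i in range(0, n - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
        steps += 1
    return vals[0], steps


total, steps = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(total, steps)
```

because `op` is a parameter, the same schedule works for add, max, multiply and so on, which is the sense in which a schedule can sit "on top of" an instruction.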

> A multiple-issue
> OoO implementation could use the same hardware pathways to perform a
> horizontal sum, but that is ordinary OoO loop optimization, not FlexiVec.

yes.  it also means programs are larger. L1 cache size has to increase to compensate.  increases in L1 cache size have an O(N^2) effect on power consumption (this is a well-known studied phenomenon, more below).

> A severe limitation shared by every other current vector computing model
> I have seen.  (I am still considering Simple-V to be a "future" model as
> there is no hardware yet.)

indeed.  please understand that SV (like all Cray-style ISAs) is by no
means perfect. the mistake made by SIMD ISAs is to impose a
particular lane structure and/or internal microarchitecture onto
programmers.

the beauty of VVM/FV/Cray Scalable Vectors is that they are more
like a "Software API". the internal microarchitecture is NOT known
by the assembler/compiler developer, it is a true black box.

what that in turn means is that the *hardware architect* has one
hell of a lot more work to do!

the *hardware architect* has to provide the means to do 
subdivision of a vector into batches of elements, to be thrown
at, ironically, the exact same SIMD ALU backends that would
normally be *DIRECTLY* exposed to the programmer (in any
SIMD ISA).

and that includes Horizontal Summing and Parallel Prefix Sums.

> > exactly.  which puts pressure on LDST. that is a severe limitation (one that SV does not have).
> >  
>
> Admitted; this limitation is necessary to ensure that a Hwacha-like SIMT
> hardware implementation is possible.  (That is the direction that I
> believe maximum performance systems will ultimately need to go.  There
> are few hard limits on how large an SIMT array can get, only latency
> tradeoffs.)

the internal microarchitecture i have in my head for a high
performance not-completely-insane-wiring design is to do
standard crossbars and multi-porting regfiles for regs r0-r31
but above that to do MODULO 4 "Lanes".

Lane-crossing to be covered by a "cyclic buffer" (rather than a massive crossbar).

i.e. if you want to add r32 to r36 it is 1 clock cycle (excluding the regfile read time itself)

but if you want to add r32 to r33 it is 2 clocks (1 extra through the cyclic buffer).

it is a compromise, but the important thing is, it is *our choice*,
absolutely bugger-all to do with the ISA itself.  anyone else could
*choose* to do better (or worse).
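
the modulo-4 lane compromise above can be sketched as a tiny latency model. the names `lane_of` and `add_latency` are hypothetical, and the model only covers registers r32 and above, since the text says r0-r31 would get standard crossbars instead. it is one microarchitectural choice, not anything the ISA mandates.

```python
NUM_LANES = 4          # "MODULO 4" lanes, as described in the text
FIRST_LANED_REG = 32   # r0-r31 are assumed to use full crossbars instead


def lane_of(reg):
    """Lane that a register at or above r32 lives in (register number mod 4)."""
    return reg % NUM_LANES


def add_latency(ra, rb):
    """Clock cycles for an add, excluding the regfile read time itself.

    Hypothetical model: 1 cycle if both operands sit in the same lane,
    plus 1 extra cycle through the cyclic buffer if they do not.
    """
    if ra < FIRST_LANED_REG or rb < FIRST_LANED_REG:
        raise ValueError("this sketch only models registers r32 and above")
    return 1 if lane_of(ra) == lane_of(rb) else 2


print(add_latency(32, 36))  # same lane (32 % 4 == 36 % 4)
print(add_latency(32, 33))  # different lanes, goes through the cyclic buffer
```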

> > you missed the point completely that some intermediary results may remain in registers, saving power consumption by not hitting L1 Cache or TLB Virtual Memory lookups/misses.
> >  
>
> I believe that we are talking past each other here:  FlexiVec can have
> intermediate values that are never written to memory, as in the complex
> multiply-add sample below.

i focussed specifically on the LDST because that's where the
brown stuff hits the spinning bladed rotational device.
(blue, if you've seen the mythbusters episode
https://m.youtube.com/watch?v=9zoQb9n0akM)

> > each of those algorithms has, for decades, solutions that perform small subkernels (using available registers).
> >  
>
> Which introduces the same memory bandwidth pressure that plagues
> FlexiVec, does it not?  :-)

interestingly, no! at least, not the same percentage.  the ability
to have and access pre-computed loop-invariant *vectors*
of constants in actual (real) ISA-numbered registers reduces
the memory bandwidth compared to FlexiVec/VVM.

yes it is true that all of these algorithms are reliant on high
performance LDST which is precisely why it is crucial not to
overload it.

> > these are O(N^2) costs and they are VERY high.
> >
> > (unacceptably high)
> >  
>
> I disagree on the exponent there:  

unfortunately it isn't, for multiple reasons.

1) Multi-Issue LDST address conflict detection requires a triangular
    comparison of every address against every other address.  this
    is by nature O(N^2): N times (N-1), halved, on every clock.

2) SMP Cache Coherency (snooping) is O(N^2)

3) L1 data cache misses require increased L1 cache sizes to
    compensate for, and studies have shown O(N^2) power
    consumption increases with L1 cache size
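
to make point 1 concrete, here is a minimal sketch that counts the pairwise comparisons for N in-flight LDST addresses. the function names are hypothetical, and matching on exact address equality is a simplifying assumption: real hardware would compare overlapping byte ranges or cache lines. the number of unordered pairs is half of N times (N-1), which still grows as O(N^2).

```python
from itertools import combinations


def conflict_comparisons(n):
    """Number of unordered address pairs to compare for n in-flight LDSTs."""
    return n * (n - 1) // 2


def find_conflicts(addresses):
    """Return index pairs whose addresses collide.

    Simplification: a "conflict" here is exact address equality.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(addresses), 2)
        if a == b
    ]


for n in (2, 4, 8, 16):
    print(n, conflict_comparisons(n))

print(find_conflicts([0x100, 0x108, 0x100]))
```

doubling the issue width roughly quadruples the comparator count, and all of those comparators are active on every clock.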

this is just to get data in and out: any other benefits of an ISA
are irrelevant if the power consumption is through the roof.

this experience which Mitch has in bucketloads is what has driven
him to design VVM the way he designed it.  i don't have all the
answers but i am rapidly picking up enough to know that you
*REALLY* need to have a full microarchitectural working knowledge
of every type of Computing system under the frickin sun in order
to avoid making catastrophic *ISA* mistakes.

sigh.

> unit is calculating.  Further, prefetch buffers for FlexiVec can be
> independent of the regular caches, aside from assuring coherency.

ok, 100% cache bypassing is where Snitch went.
see https://libre-soc.org/openpower/sv/SimpleV_rationale/
it contains the link to the eth zurich research.

100% cache bypassing does save massively on power. but yes,
you need to make it explicitly known to the programmer what
the hell's going on.

> How do existing products avoid that power consumption?  (I expect that
> they do not, therefore FlexiVec would indeed be competitive.)

ok, so pre-SV: answer, they don't. which is why VVM/FlexiVec
are attractive and compelling.... *if* SV had not been created.

> >>  Worse, that hard limit is
> >> determined by the ISA because it is based on the architectural register
> >> file size.
> >>    
> >
> > turns out that having an ISA-defined fixed amount means that binaries are stable.  RVV and ARM with SVE2 are running smack into this one, and i can guarantee it's going to get ugly (esp. for SVE2).
> >  
>
> Interesting way to view that as a tradeoff.  Precise programmer-level
> optimization opportunities versus wide hardware scalability with fixed
> program code...

there is actually a key difference between FlexiVec and VVM.

* VVM allows the microarchitect to choose MAXVL.  in-flight
   RSes can and are used as the auto-vectorised registers.

* FlexiVec specifies a MAXVL and forces the architect to create a   
   regfile of sufficient size:

     (num_regs times MAXVL).

the problem there is that if you allow MAXVL of say 512 and the
number of regs is 32, 8 bytes each, that's a MASSIVE 128kbyte regfile.  even if you let MAXVL=8 that's still a 2kbyte regfile which is scarily large, given that it has to be multi-ported.
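
the sizes quoted above follow directly from (num_regs times MAXVL) times the element width. a minimal sketch of the arithmetic, with a hypothetical function name and the 8-bytes-per-element figure taken from the text:

```python
def autovec_regfile_bytes(num_regs, maxvl, bytes_per_element=8):
    """Size of a regfile holding MAXVL elements for every ISA register."""
    return num_regs * maxvl * bytes_per_element


for maxvl in (8, 512):
    size = autovec_regfile_bytes(32, maxvl)
    print(f"MAXVL={maxvl}: {size} bytes ({size // 1024} KiB)")
```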

the problem with VVM is that the autovectorisation limit is
arbitrarily chosen by the architect: above a certain loop
size you *have no choice* but to fall back to scalar operation.

which seems perfectly fine and logical until you get different
hardware running the same binary.  one is great, the other is
shit performance.  investigation shows that the algorithm was
designed ASSUMING that the inflight RSes would allow loops
of around (say) 60 instructions but the lower-spec hardware
only has enough RSes to handle 50.

at which point you have run into exactly the same type of problem
as SVE2 is going to suffer from: dependence on the hardware
exposed to the programmer forcing them to make multiple
assembly-level implementations.
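
the portability problem can be sketched with a deliberately crude model. everything here is an assumption for illustration: the function name, and especially the rule that a loop vectorises only if each of its instructions gets its own in-flight RS. actual VVM rules are more subtle than this.

```python
def execution_mode(loop_instructions, reservation_stations):
    """Crude model: vectorise only if the whole loop fits in the RSes.

    Assumes one RS per loop instruction, which is a simplification.
    """
    if loop_instructions <= reservation_stations:
        return "vector"
    return "scalar"


LOOP_SIZE = 60  # the loop length the algorithm was designed around
for name, num_rs in (("high-spec", 64), ("low-spec", 50)):
    print(name, execution_mode(LOOP_SIZE, num_rs))
```

same binary, same loop, but the lower-spec part silently drops to scalar, which is the hardware dependence described above.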

back to FlexiVec: having to have such a large, fixed-size autovec Regfile makes me very nervous.

> > see the comparison table on p2 (reload latest pdf). or footnote 21
> > https://libre-soc.org/openpower/sv/comparison_table/
> >  
>
> ...and ARM managed to bungle both of those with SVE2 if that is correct.

uh-huhn.  all they had to do was add a SETVL instruction which,
in its implementation, creates an auto-predicate-mask

    (1<<VL)-1

thus if SETVL sets VL to 5, the predicate mask would be:

    1<<5 = 32, -1 =31 -> 0b0000000011111
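
the same auto-predicate-mask calculation as a minimal runnable sketch. the function name is hypothetical; the formula (1<<VL)-1 is the one given above.

```python
def vl_to_predicate_mask(vl):
    """Predicate mask with the lowest `vl` bits set: (1 << vl) - 1."""
    if vl < 0:
        raise ValueError("VL must be non-negative")
    return (1 << vl) - 1


mask = vl_to_predicate_mask(5)
print(mask, format(mask, "#015b"))
```

with VL=0 the mask is zero, so no elements are active, and each increment of VL enables exactly one more element.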

SETVL could even be a 32-bit instruction.  if you look at the
post on github ("quote" in the comparison table), it explains that
32-bit encoding was a high priority for ARM when designing SVE2,
and it has made one hell of a mess.  half the instructions don't
take a predicate mask at all, making them a PackedSIMD nightmare
just when you need predication the most.

> FlexiVec and RVV align on the issue of VL-independence, however -- I see
> that as an important scalability feature.

unfortunately for RVV it means that 100% of portable algorithms
*have* to contain a loop. i note that in the comparison table.
at least with VVM/FlexiVec the auto-vectorisation is a hardware
abstraction.

> hardware resources permit."  The use of an ISA-provided loop counter in
> the Power ISA probably helps here, since it allows hardware to predict
> the loop exactly.

yes.

> This scalability is an important feature for FlexiVec:  the programmer
> will get the optimal performance from each hardware implementation
> (optimal for that hardware) with the /exact/ /same/ /loop/, including
> the null case where the loop simply runs on the scalar unit.

yes... except i just realised what the problem is that was nagging
at me on the fixed regfile allocation.

let us assume:

* a loop of 100,000 instructions
* a MAXVL of 512 [autovec SRAM of 128k]
* hardware of literally 512-wide SIMD **OR** as in the
   case of Broadcom VideoCore IV the ability to make it
   *look* like you have 512-wide SIMD elements by doing
   say 4-per-cycle in a micro-coded hardware for-loop over
   512/4 clock cycles.
* an interrupt occurs at instruction 50,000

the problem with the design of FlexiVec is that once you
are committed to any given loop you ABSOLUTELY MUST
complete it.

... or....

you must provide a means and method of context-switching
that entire architecturally-hidden auto-vec regfile.

all 128k of it.

this is ultimately why Mitch designed VVM the way he did,
with "best" implementations being on GBOoO, *and* why there
is an explicit end-loop instruction, because at least with OoO
you can Shadow-cancel all in-flight instructions in order to
service an interrupt, and having the explicit end-loop op allows
HW to identify (in full) the entire extent of the ops that are *going*
to be in-flight, and if that is not possible fall back to Scalar.

these kinds of details which make or break a system... the amount
of time it takes to analyse them... it explains why there are in
excess of 50 messages a day on comp.arch

> Also, since when has power consumption /ever/ been a concern for IBM?  :-)

since forever.  if you have 160,000 POWER9 cores in the top500.org
supercomputer it is a BIG damn deal to be able to say, as i
think they do, "we piss over x86 by 10% performance/watt"

given that total power consumption is somewhere north of, i think,
100 Megawatts for those 160,000 cores a 10% saving is a f***
of a lot of money. 10 MW @ $0.20/kWh = i think USD 50,000 a DAY saving on electricity.  i.e. they spend half a MILLION dollars a day on powering those supercomputers not even including aircon cooling or the HDDs and network switches.
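
the back-of-envelope figures above can be checked with a few lines. the function name is hypothetical; the 100 MW total, the 10% saving and the $0.20/kWh price are the rough guesses from the text, not measured data.

```python
def daily_cost_usd(power_mw, usd_per_kwh=0.20):
    """Electricity cost of running a constant load for 24 hours."""
    kilowatts = power_mw * 1000
    return kilowatts * 24 * usd_per_kwh


total_mw = 100                 # rough total, per the text
saving_mw = total_mw * 0.10    # a 10% performance/watt advantage

print(f"saving per day: ${daily_cost_usd(saving_mw):,.0f}")
print(f"total per day:  ${daily_cost_usd(total_mw):,.0f}")
```

so "USD 50,000 a day" and "half a million a day" are both rounded slightly up from what these inputs give.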

Matrix Multiply Assist, part of v3.1, is a whopping 2.5x power
reduction in POWER10.

> Is the problem here that you are trying to minimize memory accesses
> categorically?

yes.  given the O(N^2) power consumption and given that Jeff Bush
showed if you get this wrong it can kill your product as far as
commercial competitiveness is concerned it is of show-stopping
importance.

> > yes.  see
> > https://libre-soc.org/openpower/sv/SimpleV_rationale/
> >
> > the concept you describe above is on the roadmap, after learning of Snitch and EXTRA-V. it's the next logical step and i'd love to talk about it with you at a good time.
> >  
>
> I have been turning an outline for a "Libre-SOC VPU ISA strawman" over
> in my head for a while.  Are you suggesting that I should write it down
> after all?

do read the SimpleV_rationale first, i think you'll find i have already
written it, given what you describe below.

the key thing i would like to know is, would you be interested to
properly investigate this, funded by an NLnet EUR 50,000 Grant,
to put Snitch, EXTRA-V, SVP64 and OpenCAPI all together in
one pot and create Coherent Distributed Compute?

> > read the paper above and the links to other academic works Snitch, ZOLC and EXTRA-V. the novel bit if there is one is the "remote Virtual Memory Management" over OpenCAPI.
> >  
>
> I was thinking of just having a VPU interrupt hit the Power core when
> the VPU encounters a TLB miss.  The Power hypervisor (or hardware) then
> installs the relevant mapping and sends the VPU back on its way.  

over OpenCAPI.  yes.  this is *exactly* what i envisaged and
describe in https://libre-soc.org/openpower/sv/SimpleV_rationale/

each PE (Processing Element, the more common industry standard
term for what you call VPU) would still *have* a RADIX MMU,
still *have* a TLB, but it would lack the ability (in hardware)
to cope with a TLB miss.  yes, precisely and exactly the same
design concept.

my feeling is that it is very important for PEs to be able to
execute the same ISA, btw.  in case the PEs are too busy,
the *main core* can execute its programs as well!

also needed is the ability to remote-manage context-switching
(dump and restore all regs) and by a not-coincidence-at-all
there is the DMI Interface of Microwatt which, if put over
OpenCAPI, allows precisely and exactly that.

> (The
> VPU can run in a different address space from its host processor's
> problem state, allowing the Power core's OS to task-switch a different
> program in when a thread blocks waiting for the VPU.)

ta-daaa, and OpenCAPI even allows neighbouring PEs to not
only read/write each other's VM but the main core as well.

> > plus, Multi-Issue reg-renaming can help reuse regs.  certain computationally heavy workloads that doesn't work.  we just have to live with it for now.
> >  
>
> Extend the main pipeline with a multi-segment register file;

mmm if i understand the concept correctly this doesn't help
with the latency on access to the (large) physical regfile SRAM.
it was fine to even have external SRAM as the regfile in Cray
days, when CPU speed was the same order as SRAM (100 MHz)
but now we are up to 4.8 GHz you simply cannot activate the
tree cascade of address MUXes in time.

in theory you could have the segment hold say 50% of the address
MUXes, this is the exact trick used in "Burst Mode" of DDR3/4/5
ICs, but it is making me nervous (again).

> Yes, aliasing between inputs and outputs is a programming error in
> FlexiVec.  Aliasing between /inputs/ is fine, as long as none of them
> are written in the loop.  (Such aliasing is an error because the
> visibility of the updates depends on VL.)

indeed.  loop invariant vectors using LDs. remember the bandwidth
penalty that causes?

> >> Is there another bit to allow REX-prefixing without changing the meaning
> >> of the prefixed opcode or would this also fill the entire prefixed
> >> opcode map in one swoop?
> >>    
> >
> > we're requesting 25% of the v3.1 64-bit EXT001 space already.
> > one more bit would require 50% and i would expect the OPF ISA WG to freak out at that.
> >  
>
> Then I suspect they may freak out either way:  if I am reading the
> opcode maps correctly, simply adding REX prefixes without that
> additional "REX-only" bit would cause /collisions/ between existing
> 64-bit opcodes and REX-prefixed versions of some 32-bit opcodes.

no it very much doesn't.

> Perhaps I am misreading the Power ISA opcode maps?

take a closer look at the table:
https://libre-soc.org/openpower/sv/svp64/#index6h1

jacob lifshay came up with this trick.  we could have used
rows 001--- and 101--- such that bits 7&8 equal to "01"
indicate "SVP64".  it's quite neat but needs a little thought
as to how and why it works, without interfering with the rest
of EXT001.

to understand it, write out in binary *all* of those SVP64
entries in the EXT001 table above.

then, search down the binary numbers looking for which bits
do not change.  you will find that the two bits which do not
change correspond to bits 7 and 9 (in MSB0 numbering)
of the EXT001 32-bit prefix.

neat trick :)
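
the "look for the bits that do not change" exercise can be automated. a minimal sketch: the helper names are hypothetical, and the sample words are NOT the real table entries. they are built on the assumption stated above (primary opcode 1 in bits 0-5, bits 7 and 9 set, everything else free), so this shows the procedure rather than independently confirming the encoding.

```python
WIDTH = 32
# assumed free positions (MSB0): bit 6, bit 8 and bits 10-31
FREE_POSITIONS = [6, 8] + list(range(10, 32))


def msb0_mask(bit, width=WIDTH):
    """Mask for a bit in MSB0 numbering (bit 0 is the most significant)."""
    return 1 << (width - 1 - bit)


def invariant_bits(words, width=WIDTH):
    """Return {msb0_bit: value} for every bit identical across all words."""
    all_ones = (1 << width) - 1
    always_1 = all_ones
    always_0 = all_ones
    for w in words:
        always_1 &= w
        always_0 &= ~w & all_ones
    fixed = {}
    for bit in range(width):
        m = msb0_mask(bit, width)
        if always_1 & m:
            fixed[bit] = 1
        elif always_0 & m:
            fixed[bit] = 0
    return fixed


def make_prefix(free_bits):
    """Hypothetical sample prefix: opcode 1, bits 7 and 9 set, rest free."""
    word = msb0_mask(5) | msb0_mask(7) | msb0_mask(9)
    for i, pos in enumerate(FREE_POSITIONS):
        if (free_bits >> i) & 1:
            word |= msb0_mask(pos)
    return word


samples = [make_prefix(f) for f in (0x000000, 0xFFFFFF, 0x555555, 0xAAAAAA)]
fixed = invariant_bits(samples)
# skip the primary opcode field (bits 0-5), which is fixed by definition
print([bit for bit in sorted(fixed) if bit > 5])
```

feeding the real table entries into `invariant_bits` instead of the synthetic samples is the check the text describes doing by hand.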

l.