[Libre-soc-dev] svp64 review and "FlexiVec" alternative

lkcl luke.leighton at gmail.com
Tue Aug 2 17:24:23 BST 2022


On Tue, Aug 2, 2022 at 5:09 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:

> linear.  The actual number of CAM cells required depends on
> associativity, so that gets complicated.  Yes, power usage is definitely
> more than linear in cache size, but I think quadratic is still an
> exaggeration.

i know it is surprising, but it is the cache misses and the associated
hammering of L2 and the data pathways between L1 and L2
(reloading of cache lines) that contribute to the N^2 behaviour.

this effect *is* well-documented, well-researched, replicable,
and often quoted.  it just doesn't sound logical.

> > and that includes Horizontal Summing and Parallel Prefix Sums.
>
> Unfortunately, FlexiVec's implicit vectorization model makes explicit
> horizontal operations impossible.

i have a feeling that Mitch worked out how to do it.  FMAC
having in effect a Scalar accumulator (src==dest) whilst
other operands get tagged as vectors, HW can detect that and
go "ah HA! what you *actually* want here is a horizontal
sum, let me just microcode that for you".
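
a quick illustrative sketch of that detection rule (Python; the
tagging scheme and field names are hypothetical, this is not a claim
about how Mitch's hardware actually encodes it):

    # Power ISA fmadd is FRT = (FRA * FRC) + FRB.  if the addend FRB
    # is scalar-tagged AND is also the destination FRT, with FRA and
    # FRC tagged as vectors, the "loop" is really a horizontal sum.
    def is_horizontal_sum(opname, frt, fra, frc, frb, tags):
        return (opname == "fmadd"
                and tags[frb] == "scalar"
                and frb == frt               # src==dest accumulator
                and tags[fra] == "vector"
                and tags[frc] == "vector")

    # e.g. fmadd f0,f8,f16,f0 with f0 scalar, f8/f16 vector-tagged
    tags = {"f0": "scalar", "f8": "vector", "f16": "vector"}
    assert is_horizontal_sum("fmadd", "f0", "f8", "f16", "f0", tags)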

>  Combining FlexiVec and VSX might

nooooo.  *really*, no.  ok anyone else can do that but i have
put my foot down and said "no" on PackedSIMD.  the lesson
from the sigarch article is very much understated

https://www.sigarch.org/simd-instructions-considered-harmful/

in the case of Power ISA, although it was amazingly powerful
for its time and for its purpose (Banking, supercomputing), VSX
is an astounding *750* additional instructions, which even at
say 2-3 days per instruction for implementation, unit tests and
Compliance Suite comes out to an eye-popping six YEARS of
development effort.

no, no, and hell no :)

but, see below...

> around this limitation in Power ISA, if VSX can do reducing sums.
> However, multiple cycles of this, complete with LOAD/STORE would be
> needed to eventually reduce to a final 32bit result.)

another way is to use VVM/FlexiVec to do a parallel reduction with
an explicit outer loop, but honestly it is easier to detect the
use of the FMA (or madd) and micro-code it.

last resort you can always fall back to Scalar.
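
for reference, the shape that the explicit-outer-loop parallel
reduction takes (a log2 tree; minimal illustrative Python):

    # halve the active length each pass: log2(n) passes, with each
    # pass's partial sums independent of each other (parallelisable)
    def tree_reduce(v):
        n = len(v)
        while n > 1:
            half = (n + 1) // 2
            for i in range(n - half):
                v[i] += v[i + half]   # pairwise partial sums
            n = half
        return v[0]

    assert tree_reduce(list(range(8))) == 28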

> There is a possibility of combining FlexiVec (for "string" operations)
> and Simple-V (for reductions),

if you recall the LD/ST Fault-First capability of RVV, which was
inspired by ARM SVE FFirst, i added *Data-Dependent* Fail-First
to SVP64 as well as LDST FFirst, to help with strings and other
sequential data-dependence.

when an Rc=1 test fails (using the same BO encoding as branches:
one bit inverts, the other 2 bits say which of EQ LT GT or SO
to test from the CR Field) then the Vector Length VL is truncated
at that point.

an extra bit in the RM Prefix (VLi) specifies whether the truncation
*includes* the element that failed or *excludes* it when setting VL.
this means for example that you can do

    sv.cmpi/ff=eq/ew=8/vli *r8, 0 # ew=8 to do 8-bit cmp to zero

and the truncation of VL will include the null-termination zero of
a string.
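
in other words (a Python model of the truncation rule as described
above, not the actual spec pseudocode):

    # Data-Dependent Fail-First: stop at the first element whose
    # Rc=1 test triggers truncation, setting VL to include (VLi=1)
    # or exclude (VLi=0) that element.
    def ddffirst_vl(elements, test, VLi):
        for i, el in enumerate(elements):
            if test(el):
                return i + 1 if VLi else i
        return len(elements)          # never triggered: VL unchanged

    # strlen-style: with /vli, VL includes the null terminator
    data = list(b"hello\0leftover")
    assert ddffirst_vl(data, lambda b: b == 0, VLi=True) == 6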

no pissing about.

> if Simple-V is also generalizable to FP
> and VSX.

yes to Scalar FP.  SV can be applied to anything: it actually doesn't
care about the operation (per se).  you can do something as dumb
as apply SV to the PackedSIMD operations of VSX if you feel so
inclined [i don't].

what *does* make sense is applying SV to the 128-bit *scalar*
parts of VSX, such as the Quad-Precision FP128 and maaaybe
even the 128-bit Logical ops (xxland etc).

(btw Jacob - Lifshay - this would be how we best get 256 registers
 because VSX already has 64 regs.  adding 2 bits to that expands
 to 256)
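
(spelling out that arithmetic:)

    assert 2**6 == 64           # VSX register fields are already 6 bits
    assert 2**(6 + 2) == 256    # two more bits per field -> 256 regs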

but the analysis and workload is so high on VSX, i don't want to
go *anywhere near it* until we have everything else in place.

> > it is a compromise, but the important thing is, it is *our choice*,
> > absolutely bugger-all to do with the ISA itself. anyone else could
> > *choose* to do better (or worse).
>
> Now you have variable-latency vector lanes.  :-)

yyep.  not a problem for an OoO microarchitecture, in the least.
any in-order architect will be freaking out and crying home to
momma, but an OoO one no problem.

very interestingly for an OoO system the MOD4 "laning" creates
massive holes in the Hazard Dependency Matrices, which are O(N^2)
where N would be 128 if you allowed full crossbars.

that would result in several MILLION gates and be so large that
a signal would be unlikely to get from one side to the other,
even at the speed of light, in time to keep the DM clock-coherent.

(translation: massive DMs limits the max clock freq)

the "holes" (a sparse Dep-Matrix) would cut the sizes by 75%.

> So you do have the same problem, but perhaps somewhat less severe.

basically yes.

>  High
> performance memory access is one of the reasons that the predictability
> of a FlexiVec loop is important.

Mitch pointed out that for the Samsung GPU he was involved in
only a few years ago the Texture Interpolation required *TEN* LD
ops per clock cycle and i think 2 STs.

per clock cycle - sustained!

Texture interpolation is a bitch.  you need the contents of 4 pixels
plus the X and Y offsets between 0.0 and 1.0 in order to do a 2D
LERP and then write out the result.  this is all as one instruction!
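
for anyone unfamiliar, here is what that one instruction has to
compute (2D bilinear interpolation; illustrative Python, obviously
nothing like the actual GPU datapath):

    def lerp(a, b, t):
        return a + (b - a) * t

    # 4 texel loads (p00..p11) plus fractional offsets x, y in [0.0, 1.0]
    def texture_bilerp(p00, p10, p01, p11, x, y):
        top = lerp(p00, p10, x)       # interpolate along X, top edge
        bot = lerp(p01, p11, x)       # interpolate along X, bottom edge
        return lerp(top, bot, y)      # then along Y: one result to store

    assert texture_bilerp(0.0, 1.0, 0.0, 1.0, 0.5, 0.5) == 0.5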


> Not done in FlexiVec -- avoiding those conflicts is the programmer's
> responsibility.

to date i have been extraordinarily careful to emphasise that
SV does not break programmer expectations in this type of way.

> The idea is that the vector unit probably has either its own L1 cache or
> is directly connected to L2.  (I believe the AMD K8 did that with SSE.)
> If an SIMT vector chain has its own L1, that cache is likely also
> divided into per-lane sub-caches.

this starts to get completely out of the realm of normal ubiquitous
compute, needing specialist programmer knowledge and hardware
level intricate internal knowledge.

it is, i have to say reluctantly, the antithesis of the entire design
paradigm that SV comes from.  the reason for that is so as not to
freak out IBM: if one of SV's founding principles had been that you
throw away cross-core memory coherency and have the programmer
deal with keeping consistency, they would be unlikely ever to
speak with us again.

there are Architects within IBM who have been responsible for
certain specialist areas for *over 20 years* because that is just
what it takes to deal with some types of complexity.

if we lose their respect by designing something that breaks their
expectations, potentially "damaging" the Power ISA, then we're
done.

> So then the best route would be to abandon both Simple-V and FlexiVec
> and implement VVM, on the basis that Mitch likely knows (many)
> something(s) that we do not?  :-)

sigh, i wish :)

no but seriously, we're committed to SV and Power ISA, now.  2
years on SVP64 (so far), we have to see it through.

but the limitations on VVM - the hardware-dependent point at which
it falls back to Scalar, combined with the LDST dependence and
increased power - are show-stoppers.

> Not so at all:  hardware commits to a MAXVL at each FVSETUP and MAXVL
> can vary with the vector configuration.

whatever amount that is, once the loop is started you are
committed to finishing it with no possibility of a context-switch
(unless you save the entire state which, by design, has to include
 the behind-the-scenes hidden vector SRAM contents).


> Larger vector processors are expected to use the SIMT topology, where

[checking: SIMT = "synchronously broadcast an instruction to
Cores that have their own regfiles caches LDST access just no PC"
is that accurate in this case, yes it is. ok i'm good. i say "good",
i am not a fan of SIMT. at all. although the opportunity to
entirely hide its very existence behind autovectorisation is
pretty compelling i have to admit]

> the entire vector register set is not stored in a single physical
> register file, but is instead distributed across the vector lanes.

this just means that every SIMT core [standard-core-in-every-respect
other-than-receiving-broadcast-instructions] has to save the
hidden SRAM used for the behind-the-scenes vector auto-registers
on a context-switch.

it moves the problem, it doesn't solve the problem.

> Predication in FlexiVec for Power ISA would require the addition of
> general predication to the scalar ISA.

and that's not happening. as in: it will be a cold day in hell when
the Power ISA duplicates its scalar instructions to add predicated
variants of all 32 bit opcodes.

it should have been designed in right from the start and there is
no longer any 32 bit opcode space to duplicate 200 instructions.

> Power ISA currently has
> branches, but not predication.  Using a short forward branch to imply
> predication /might/ be possible.

a better bet would be some form of "tagging". "if you use r5
then please consider the instruction to be predicated, please
use Condition Code {n} as the predicate".  or, just,
"the next instruction is predicated" which is pretty much exactly
what a short branch would be, yes.
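
a sketch of that last idea (entirely hypothetical, nothing like this
exists in Power ISA today):

    # hypothetical decode-stage fusion: a conditional branch that
    # skips exactly one 4-byte instruction becomes a predicate on
    # that instruction, and the branch itself is elided.
    def fuse_short_branch(br_cond, br_offset, shadowed_insn):
        if br_offset == 8:            # branch over exactly one insn
            # predicate is the *inverse* of the branch condition:
            # the insn executes only when the branch is NOT taken
            return [("predicated", ("not", br_cond), shadowed_insn)]
        return [("branch", br_cond, br_offset), shadowed_insn]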

> > unfortunately for RVV it means that 100% of portable algorithms
> > *have* to contain a loop. i note that in the comparison table.
> > at least with VVM/FlexiVec the auto-vectorisation is a hardware
> > abstraction.
>
> No, FlexiVec absolutely requires a loop unless the vector length is
> exactly one element.

you mean CTR=1 i assume?

regardless, the very fact that loops are required (for hardware
parallelism either automatically for VVM/FlexiVec or
explicitly in the case of RVV) i consider to be a limitation.

with SV, although you have to call SETVL, you can use sv.ld / sv.st
with predication to do a stack save/restore: effectively a
Compressed LDST-multi.

[we did give consideration to setting VL in the prefix but there are
just not enough bits]

unlike in RVV and FlexiVec/VVM you *do not* need a loop
construct to do that.  it is one instruction [okok two including setvl]
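
a model of why that works (illustrative Python: the predicate mask
selects which regs get saved, packed contiguously on the stack):

    # predicated sv.st as a compressed store-multi: only regs whose
    # mask bit is set are stored, at densely-packed offsets
    def sv_st_predicated(regs, mask, mem, base):
        offs = 0
        for i, r in enumerate(regs):
            if (mask >> i) & 1:
                mem[base + offs] = r   # packed: no holes for masked-out
                offs += 8              # 64-bit elements
        return offs                    # stack bytes actually used

    mem = {}
    used = sv_st_predicated([10, 20, 30, 40], 0b1010, mem, 0x100)
    assert used == 16 and mem == {0x100: 20, 0x108: 40}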

> > the problem with the design of FlexiVec is that once you
> > are committed to any given loop you ABSOLUTELY MUST
> > complete it.
>
> Nope!  You do not even have to complete any one instruction either,
> since you can use the slot in the scalar PRF to track the current vector
> progress.  FlexiVec is not available to privileged code, so the only
> requirement is that the interrupt handler restore any registers it
> changes, which any existing interrupt handler already meets.  (To
> privileged code, the scalar registers will appear to contain the
> internal vector offsets when problem state is running a vector
> computation; as long as that is restored properly the vector computation
> will resume correctly.)

unfortunately i am running out of time to verify whether that's
correct or not. i know it will take weeks and potentially months to do
a hardware-cycle-accurate simulator and unfortunately i have to draw a
line under FlexiVec and eliminate it from SV.

when we had time was when SV was in development. ironically if the
discussion had taken place 2 years ago when we started SV for Power i
would not have known what you were talking about :) Vertical-First was
only added about a year ago and that was only when i finally recognised
VVM as being also Vertical-First. what i am going to do however is create
a comp.arch thread referring to this discussion. i think people there
will be interested to share insights esp. on FlexiVec.

> > you must provide a means and method of context-switching
> > that entire architecturally-hidden auto-vec regfile.
> >
> > all 128k of it.
>
> FlexiVec has that too:

i was referring specifically to FlexiVec.

> privileged vector context save/restore
> instructions that allow the system software to preserve the entire
> FlexiVec state across task-switching.

the need to save the hidden (otherwise completely inaccessible) SRAM
containing the auto-created vectors is a bit of a problem. basically
by this point what i am saying is that the analysis takes weeks if not
months and we are under time and resource pressure. to do proper due
diligence for the quality needed for an ISA as serious as Power requires
specific funding and time allocation that we don't have right now. i have
been going through this with you in order to do due diligence for *SV*,
but unlike when we were talking on RV-isa-dev there are now deadlines,
a list of tasks, and future roadmaps a mile long.

> > Matrix Multiply Assist, part of v3.1, is a whopping 2.5x power
> > reduction in POWER10.
>
> That is impressive.

it's nowhere near Snitch's 6x reduction in
power consumption though, which is astounding.


> > the key thing i would like to know is, would you be interested to
> > properly investgate this, funded by an NLnet EUR 50,000 Grant,
> > to put Snitch, EXTRA_V, SVP64 and OpenCAPI all together in
> > one pot and create Coherent Distributed Compute?
>
> That sounds very interesting, but I will need to do quite a bit of
> reading before I will be able to say with confidence that I /can/ do that.
>
> Are we confident that Simple-V will actually be accepted at OPF?

it'll be proposed.  if it is rejected then RED Semiconductor will,
to the strict letter, activate the provisions set out on page xii
of the Power ISA Specification, to use EXT022.
*we* will have acted "in-good-faith", in other words, and it becomes
neither our problem nor allows any OPF Member to complain
[including IBM] if we have, in fact, followed precisely and exactly
the proposal procedures set out.

one company _did_ in fact attempt to blithely drive a coach and
horses through the OpenPOWER ISA EULA and the ISA WG,
even telling me that they actually intended to go ahead and design
3D and VPU instructions, create a mass-volume processor, then
expect IBM and other OPF Members to "accept a fait accompli".
which if they'd bothered to read the EULA they'd have found that
approach to be a direct violation.

they're not around any more.

>  (That
> is actually what I see as the biggest risk right now -- that Simple-V
> will end up needing too many architectural resources and being rejected
> on those grounds.

it was designed such that it could be explained to IBM that there exists
an *option* - not, repeat, NOT a REQUIREMENT - to leverage and
exploit *their* pre-existing IBM POWER proprietary architecture.  which,
when you try to compare it against any FOSSHW design, is definitely
already well into "too many architectural resources".

this is already causing adoption problems, because, well, firstly,
950 mandatory instructions - which seems perfectly reasonable to IBM,
who has had 20 years to incrementally build those up - is of course
flat-out "too many resources" for everyone else.

i do not want to say "we will give IBM a taste of its own medicine" with
SV because that would just piss them off.  the best i can come up with is
"leveraging existing micro-architecture in a nondisruptive fashion", and
rely on the fact that IBM has always, traditionally, properly done its
research.

>  I suppose that FlexiVec could be considered a
> contingency strategy for that case; its design specifically minimizes
> required architectural resources.)

i'm estimating a delay of approx 8 to 12 months, putting RED
Semiconductor's entire future "on hold", in order to do proper
due diligence on VVM/FlexiVec, including Simulator, unit tests,
documentation. compared to SV for which that work (2x over) has already
been done, giving me confidence and evidence that SV is sound, i have
to rationally conclude "no" on VVM/FlexiVec.

it is what it is.

> The VPU concept was a little different than the distributed Processing
> Elements; the idea was based on the CDC6600 architecture, with a VPU
> that would probably be larger than its host processor, with a 1024 slot
> register file, 48-bit instructions (to accommodate the large register
> file), and 2-issue VLIW (to align 48-bit instructions with the basic
> 32-bit wordsize used in Power ISA).  Interestingly, the use of VLIW also
> means that you can use one side of the VLIW pair to provide an extended
> immediate to the other, and possibly a fourth word in each VLIW packet
> to provide loop control or other functions.

VLIW makes me nervous. i saw how that went with SGI and with
TI. i had to watch a colleague do the entirety of CEDAR Audio's
DSP realtime processing in pure assembler because the compilers
were that bad.

with IBM having created SMP coherency and atomics clean over
even a hundred thousand uniform cores, there is precedent for
keeping to a uniform ISA and not going VLIW.

my feeling is that it is critically important that the main CPU(s)
be capable in full of executing the satellite core programs.

thus if VLIW is to be executed on the satellite cores then VLIW
must become part of mainstream Power ISA. at which point
IBM is going to go on "high alert" (probably a metaphorical
DEFCON 3 i would guess).

but let us take a step back.

1. Power ISA v3.1 now has 64-bit prefixing. 34 bit pc-relative branches are
    possible. 32-bit GPR immediates are possible.

2. the primary algorithms of Video are, oddly enough, DCT and FFT.
    this was why i spent over 8 weeks on the DCT and FFT REMAP
    Schedules.  VLIW is *not* required, despite TI's TMS320
    and Qualcomm's Hexagon both being VLIW DSPs specialising
    in FFT (inner loop only).  SV's REMAP covers the *entire*
    triple loop
    https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD

3. an entire NLnet Grant has been dedicated to Audio/Video
    https://bugs.libre-soc.org/show_bug.cgi?id=137

    in this Grant we have performed significant analysis of algorithms
    and instructions, resulting in extracting a number of A/V opcodes
    and confirming that SVP64 will cover the gaps.  Horizontal-Sum
    Schedules, Parallel-Prefix Schedules and so on. a starting point:
    https://libre-soc.org/openpower/sv/av_opcodes/

4. a demo of MP3 decode showed a *75%* reduction in assembler.
    450 instructions crashed down to only 100.  at no time was VLIW
    considered.
    https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/audio/mp3/mp3_0_apply_window_float_basicsv.s;hb=HEAD

5. in the ZOLC paper the motion-time-estimation algorithm was SPECIFICALLY
   chosen as one of the toughest examples traditionally causing massive
   problems for DSPs (and VLIW).  despite *SIX* nested for-loops he
   achieved an astounding 45% reduction in the number of instructions
   executed.

   that's on an almost entirely-coherent-deterministic schedule.

   absolutely astonishing.

   .... oh... and it was a simple in-order core.

in Video you picked the *one* area where we've already done a SHED
load of work :)

VLIW did not in any way shape or form even remotely enter my head
as a possible candidate for any of that, and it hasn't for any of the
other research either.

> And how would that funding be arranged?  A stipend to work on the
> problem?  Payment upon reaching milestones?  Something else?

NLnet does payment-on-milestones and we're good at subdividing
those so it's not "Do 4 Months Work Only Then Do You Get 15k".
also there's a bit in the RED Semi bucket now - you'd have to submit
subcontractor invoices.


> Lastly, why does this vaguely feel like we are busily reinventing the
> Transputer?  :-)

m-hm :)

> > each PE (Processing Element, the more common industry standard
> > term for what you call VPU) would still *have* a RADIX MMU,
> > still *have* a TLB, but it would lack the ability (in hardware)
> > to cope with a TLB miss. yes, precisely and exactly the same
> > design concept.
>
> If I understand correctly, the PE model is more like SIMT (on the
> proverbial wheel-o-reincarnation)

no.  ok, it's not actually specified.  PE is a general term.  Samsung's
"PEs" for near-memory compute have a measly *nine* instructions (!)

> but with each processor more
> independent (individual instruction fetch and decode)

yes.  the definition of SIMT is "one shared broadcast fetch and decode
per otherwise-absolutely-normal core".  only AMD, Intel and NVIDIA have
the kind of money to throw at the insanely-complex compilers that result.

we cannot in any way be that dumb as to try to replicate billion-dollar
corporations' results using only EUR 50,000 budgets at a time.  a little
more intelligence, a lot less monetary brute-force is required.

> but still with
> each processor handling a slice of the overall calculation, much like
> SIMT.

more like plain SMP but with workloads that are sufficiently
separate so as not to break the paradigm.

>  If this is so, then simplicity in the PEs is the watchword, lest
> the problem quickly blow right back up to full SMP.

have a look at the ZOLC and EXTRA-V papers.  the Coherent
Deterministic Schedules allow for significant avoidance of clashes.

plus, if you've got a shed-load of parallel processors with their
own Memory connected directly to them, yet you're still trying
to get them to execute sequential algorithms, you're Doing Something
Wrong :)

> Would OpenCAPI provide the means for each PE to have a local TLB and
> *no* other MMU capabilities at all?  TLB miss at PE -> host processor
> interrupt -> hypervisor (or hardware) provides PE TLB entry?

i was planning to work out how to extend OpenCAPI to do exactly that.
given the expectation that the binaries being executed would only be
around the 1-8k mark all-in (appx size of 3D Shader binaries) i would
not expect thrashing.

>  Could the
> TLB be preloaded before starting a computation?

i don't see why not.  it's sensible.

> Actually, if I understand the slides I found at
> <URL:https://www.snia.org/sites/default/files/SDC/2018/presentations/General_Session/Jeff_Stuechelli_OpenCAPI.pdf>
> correctly, OpenCAPI may not be usable for this purpose, as OpenCAPI
> devices deal only in virtual addresses and access memory /through/ the
> host processor.

no it's entirely independent i.e. 100% peer-level distributed coherent
compute protocol.  similar to GenZ i believe. and the OpenPITON
protocol.

> Further, "OpenCAPI" is not what I would consider open at all; read
> <URL:https://opencapi.org/license/> and barf.  (It explicitly excludes
> producing an implementation.  Here we go again.)  Take a look at their
> membership levels, too.  I see the RISC-V Foundation mess here all over
> again, although at least RISC-V was not offering "study-only" licenses.
>  >:-<

raised it with OPF a couple of times.  done so again.

> > my feeling is that it is very important for PEs to be able to
> > execute the same ISA, btw. in case the PEs are too busy,
> > the *main core* can execute its programs as well!
>
> Probably best to keep a *subset* of the main processor's ISA.  To use
> Power ISA for this would likely require (again) some special approval
> from OPF because the PEs will /not/ meet any compliancy subset.

you've missed that it is possible to go for a lower Compliancy Level
then "step up" by adding optional instructions.  as long as you meet
the lower level nobody cares what else you added.  but yes, it is
something to keep an eye on.

> This trades latency for a larger register file.  I suppose you could
> call this a register cache; the cache being the values in-flight in the
> pipeline latches.

my feeling is, it would be better just to have an L1 register cache.
another name for that is a "Physical Register File", which allows
opportunities for built-in WaW Hazard avoidance (reg-renaming).
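
a toy model of that renaming opportunity (illustrative only; a real
PRF also has freeing, checkpointing and so on):

    # each architectural write allocates a fresh physical register,
    # so a newer write never clashes with an older in-flight one (WaW)
    rename_map = {}                    # architectural -> physical
    free_list = list(range(128))       # physical register pool

    def rename_write(arch_reg):
        phys = free_list.pop(0)        # fresh dest: WaW hazard gone
        rename_map[arch_reg] = phys
        return phys

    def rename_read(arch_reg):
        return rename_map[arch_reg]    # newest in-flight value

    p1 = rename_write("r5")            # first write to r5
    p2 = rename_write("r5")            # second write: distinct phys reg
    assert p1 != p2 and rename_read("r5") == p2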

> I think that this may be the fundamental issue where we have been
> talking past each other.  I have been tacitly assuming a desktop-like
> environment where power is fed from the AC line and a nice big heatsink
> with forced air takes care of dissipation.  What has your assumed
> environment been?

everything.  near-memory PEs operating at only 150mhz, 3.5 watt
quad-core SoCs, 8-core 4.8 ghz i9 killers, 64-core Supercomputer
chiplets.

everything.

> Those are indeed two blocks in the PREFIX map currently available for
> new instructions.  Why do you need that many slots for a REX-form on the
> existing 32-bit instructions?

24 bit prefix.  6 for Major Op EXT001 + 2 for karnaugh-map-ID + 24 for SVRM
== 32.
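
spelled out:

    prefix = 6 + 2       # EXT001 Major Op + karnaugh-map-ID bits
    svrm   = 24          # SVP64 RM (remap/mode) field
    assert prefix + svrm == 32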

> To cut through all the fog here, how do I encode "ADDI R67, R123, 8" as
> a scalar operation, not using Simple-V?

you don't.  Power ISA 3.0 GPRs are restricted to r0-r31.

> I suppose I should also clarify that I believe the register file
> extension proposal should also be cleanly separated from Simple-V.

already done. https://libre-soc.org/openpower/sv/compliancy_levels/

> Put
> another way, asking for a big chunk of the PREFIX space like this is
> less likely to freak out the ISA WG if we can show quickly (at the
> overview level) how that chunk is actually partitioned for "this"
> subproposal, "that" subproposal, etc.  Break it down, instead of just
> proposing an "SVP64" /blob/ in the opcode space.

hence the SV Compliancy Levels.  thought it through already :)
a hell of a lot got done in the past 3 years.

> > then, search down the binary numbers looking for which bits
> > do not change. you will find that the two bits which do not
> > change correspond to bits 7 and 9 (in MSB0 numbering)
> > of the EXT001 32-bit prefix.
>
> I am familiar with Karnaugh maps and those tables are a similar structure.

basically yes.

l.


