[Libre-soc-dev] svp64 review and "FlexiVec" alternative

lkcl luke.leighton at gmail.com
Sat Aug 6 20:04:47 BST 2022



On Wed, Aug 3, 2022 at 5:50 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
>
> lkcl wrote:


> I had not realized that an accumulating ADD allows to indicate a
> horizontal sum.  Hardware can do that in varying ways; the complexities
> are manageable even for an SIMT implementation.

you simply need the appearance of Program Order: deterministic behaviour equivalent to sequential execution.  ORing into an accumulator is blindingly obviously any-order paralleliseable; int add likewise.  non-commutative/non-associative ops (and FP) not so much.
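
a tiny illustrative sketch in plain C (not from the SV spec, just
demonstrating the associativity point) of why integer accumulation
may be split across lanes in any order while FP may not:

    #include <stdio.h>

    int main(void) {
        /* integer add (and OR) is associative and commutative: any
           lane ordering, any tree of partial sums, gives the same
           accumulator, so hardware only needs the *appearance* of
           Program Order */
        int a[4] = {1, 2, 3, 4};
        int fwd = 0, rev = 0;
        for (int i = 0; i < 4; i++)  fwd += a[i];
        for (int i = 3; i >= 0; i--) rev += a[i];

        /* FP add is not associative: a parallel tree reduction can
           round differently from strict sequential accumulation */
        float f[3] = {1e20f, -1e20f, 1.0f};
        float seq = (f[0] + f[1]) + f[2];   /* 1.0f */
        float alt = f[0] + (f[1] + f[2]);   /* 0.0f */

        printf("%d %d %g %g\n", fwd, rev, seq, alt);
        return 0;
    }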

> My use of the term "string" here was a bit unclear.  FlexiVec does not
> deal in C strings; all vectors have a counted length (in CTR), but are
> intended to be arbitrarily-long "strings" of elements. 

as in Aspex Microelectronics' "Array String Processor", so called because of its left-right neighbour connectivity between 4096 2-bit ALUs.

> The idea is that
> FlexiVec can handle long parallel operations, while Simple-V is used for
> shorter operations.

no: again, please do read the Snitch, EXTRA-V and ZOLC papers.  sorry for having to repeat it (and for getting slightly annoyed: this is the third time i have referred you to them), but those papers give the context of the roadmap for breaking out of that "limitation" (a false one).

the papers are complex so i do not wish to spend time spelling them out; i have a hell of a lot of ground to cover.

once you have read those papers you will see the possibilities intuitively and understand that what is in the current SV Spec is by no means the final word.


>  Since Simple-V introduces its own iteration counter
> (in SVSTATE if I understand correctly), what would prevent a Simple-V
> inner loop inside a FlexiVec outer loop? 

see ZOLC, Snitch, EXTRA-V and the Simple-V Rationale whitepaper

> Earlier, you mentioned that
> some algorithms have relatively simple repeatable sub-kernels

yes.  massive matrices are subdivided into tiles and done as 6 loops: the outer 3 cover "which tile", the inner 3 cover "contents of tile".  independent parallel processing can deterministically schedule non-overlapping tiles.
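
a minimal C sketch of that 6-loop structure (illustrative only;
the matrix size and tile size here are made up):

    #define N 512   /* matrix dimension, assumed a multiple of T */
    #define T 8     /* tile size */

    /* C must be zero-initialised by the caller */
    void matmul_tiled(const float A[N][N], const float B[N][N],
                      float C[N][N])
    {
        /* outer 3 loops: "which tile".  different (ii,jj) output
           tiles never overlap, so they can be deterministically
           scheduled across independent PEs */
        for (int ii = 0; ii < N; ii += T)
         for (int jj = 0; jj < N; jj += T)
          for (int kk = 0; kk < N; kk += T)
           /* inner 3 loops: "contents of tile", the small
              repeatable sub-kernel that a vector ISA targets */
           for (int i = ii; i < ii + T; i++)
            for (int j = jj; j < jj + T; j++)
             for (int k = kk; k < kk + T; k++)
              C[i][j] += A[i][k] * B[k][j];
    }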

likewise DCT and FFT can be done with convolutions to combine smaller FFTs, or you can do small ones within regs initially and then move to a "standard" recursive FFT algorithm for the larger combining stages, still using SV...

> If the
> individual applications of those sub-kernels are independent, Simple-V
> could express the per-group computation while FlexiVec expands that
> across multiple concurrent groups.

again, please look at ZOLC, EXTRA-V and Snitch: it is a far more powerful combination even than SVP64 or VVM/FlexiVec.

> The main reason for wanting implicit vector operations to generalize to
> VSX is orthogonality, lack of which would likely bother the OPF ISA WG
> severely.

there is nothing to prevent anyone from doing the work.
the reason i refuse to tackle it is that attempting to
jump straight into 128-bit arithmetic and architectures
on EUR 50,000 NLnet Grant budgets, when you have never
even done a 32-bit or 64-bit processor, is a sure-fire way to
fall flat on your face.

> > [...
> >>> it is a compromise, but the important thing is, it is *our choice*,
> >>> absolutely bugger-all to do with the ISA itself. anyone else could
> >>> *choose* to do better (or worse).
> >>>      
> >> Now you have variable-latency vector lanes.  :-)
> >>    
> >
> > yyep.  not a problem for an OoO microarchitecture, in the least.
> > any in-order architect will be freaking out and crying home to
> > momma, but an OoO one no problem.
> >  
>
> That would make Simple-V dependent on a specific microarchitectural
> strategy, which is probably very bad in an ISA.

which would indeed be precisely why that exact dumbness has
in fact been very specifically avoided, yes.  to clarify: if your
statement "SV depends on a microarchitecture" were factually
correct then the conclusion would likewise be correct.

as it is not factually correct, the conclusion is invalidated.

> > no but seriously, we're committed to SV and Power ISA, now.  2
> > years on SVP64 (so far), we have to see it through.
> >  
>
> So Libre-SOC is committed to Simple-V at this point and FlexiVec must be
> left as a possible future option.

in a word, yes.  we would jeopardise funding and business
opportunities by trying to take it on.

> > VVM as being also Vertical-First. what i am going to do however is create
> > a comp.arch thread referring to this discussion. i think people there
> > will be interested to share insights esp. on FlexiVec.
> >  
>
> It is worth noting that I could not have proposed FlexiVec prior to
> those developments.  :-/

funny, isn't it? it takes such a lot of time to synthesise thoughts.

> Also, as mentioned below, OpenCAPI has to be excluded from that mix at
> this time if I am involved.

you'll love this. OpenCAPI has been absorbed into CXL.
Intel controls that and the licensing is even more laughable.

so.. uhnn... ya :)

> This is the main reason I would have wanted FlexiVec for Power ISA
> ("Flexible Vector Facility" to put it in quasi-IBM-speak) accepted as a
> Contribution *before* even /beginning/ an implementation.

ok so there are procedures being developed which allow you
to do that.  PLEASE NOTE that IBM internal employees have been
terrified that they will be overwhelmed with [time-wasting] RFCs.
*please be mindful* of the consequences of putting forward ideas:
you need to think "can i commit to this and see it through".

we went the other way: we sought NLnet funding to *prepare*
the information to be presented rather than just expect IBM to
cough up internal resources (which they have to justify).

now that that's mostly been done, i have put in *another* Grant
request to cover the actual cost of submitting the RFC and the
associated follow-up:

    https://libre-soc.org/nlnet_2022_opf_isa_wg/

> Read the definition of "architectural resources" in the OpenPOWER spec
> license terms.  In this case, mostly opcode assignments.

ah.  i was talking about implementation resources, microarchitectural
design resources; sorry, misunderstood.

going back to architectural resources: nah.  it's 5 instructions
with 5/6-bit XO (like addpcis or the crand/or/xor group) and
25% of EXT001.  that is in no way a "lot of opcode resources". 

now, we do also have to add 100 *scalar* instructions but this
is because the *Scalar* Power ISA v3.0 is anaemic and totally
lacking compared to ARM, AMD and x86.  for example, the entire
BMI1 set is missing from Power ISA, yet, fascinatingly, cntlz
is there.

    https://www.chessprogramming.org/BMI1

but as we went over earlier, these have absolutely nothing to
do with SV.  IBM's heavy focus on Banking etc. customers left
the Scalar ISA pretty much frozen in time for 12 years, with
small elegant maintenance additions such as ldbrx.

    https://ftp.libre-soc.org/RFC02601.r02%20Byte-Reverse%20Instructions.public.pdf


> The other option would be to have the VPU run 64-bit instructions and
> drop VLIW, since 64-bit instructions inherently align with Power ISA's
> 32-bit words.  (VLIW was suggested purely to resolve the misalignment
> between the 32-bit words Power ISA uses and the suggested 48-bit VPU
> instructions.)

the moment you add 48-bit instructions, the variable-length
encoding massively complicates multi-issue detection and starts
to interfere with the achievable parallelism.
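
a rough C sketch (hypothetical, not any real decoder) of the
difference: with fixed 32-bit words every decode slot knows its
instruction's start address immediately, whereas a mixed 32/48-bit
stream forces a serial length-resolution pass first:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* fixed 32-bit words: instruction k always starts at byte 4*k,
       so an n-wide decoder can fetch all n slots in parallel */
    static uint32_t fetch_fixed(const uint8_t *buf, size_t k)
    {
        uint32_t insn;
        memcpy(&insn, buf + 4 * k, sizeof insn);
        return insn;
    }

    /* mixed 32/48-bit encoding (length_of is a hypothetical
       callback): the start of instruction k depends on the lengths
       of instructions 0..k-1, a serial dependency chain that must
       be resolved (or speculated through) before parallel decode
       can even begin */
    static size_t start_of_variable(const uint8_t *buf, size_t k,
                                    size_t (*length_of)(const uint8_t *))
    {
        size_t off = 0;
        for (size_t i = 0; i < k; i++)
            off += length_of(buf + off);
        return off;
    }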


> I referenced the CDC6600 architecture.  The Power core would be the
> /peripheral/ processor that handles I/O and the OS.  The VPU would
> handle only bulk compute.  

again, see the SimpleV rationale whitepaper.

> > in Video you picked the *one* area where we've already done a SHED
> > load of work :)
> >  
>
> Eh, "VPU" was intended as "Vector Processing Unit" not "Video Processing
> Unit".

oh whoops :)

> > NLnet does payment-on-milestones and we're good at subdividing
> > those so it's not "Do 4 Months Work Only Then Do You Get 15k".
> > also there's a bit in the RED Semi bucket now - you'd have to submit
> > subcontractor invoices.
> >  
>
> The project in question is to be done either way, correct?  (Such that
> milestones will need to be devised whether I do it or someone else does
> it, right?)  

correct.

> (Asking to see the milestones/roadmap before committing
> either way is reasonable, no?)

see the SimpleV Rationale whitepaper.  that defines the roadmap,
helps determine what words need to go into the Grant request, and
then also helps define the actual milestones approx 10-12 weeks
later if it is accepted.

> OK, what am I possibly getting into?

:)  basically an opportunity to define and shape the future of
computing.  save power, reduce complexity, reduce time to
market, oh, and be entirely FOSSHW to the maximum extent
practical.

> > plus, if you've got a shed-load of parallel processors with their
> > own Memory connected directly to them, yet you're still trying
> > to get them to execute sequential algorithms, you're Doing Something
> > Wrong :)
> >  
>
> It sounds like the main issue then is partitioning the work out to the PEs.

and creating and defining the protocols necessary to direct
that partitioning, and throwing together some Simulators to demo
its feasibility... yes.


> 8KiB == 2 4KiB pages.  Could we limit PE programs to 16KiB and specify
> that the PE has 4 instruction TLB entries, controlled by host software?

see the SimpleV rationale: that's pretty much exactly what is
written in the whitepaper.

> Until those license issues are fixed, I am not touching OpenCAPI with
> the proverbial ten-foot pole.

with it being transferred to CXL i 100% agree: it is a lost cause
at this point.  perhaps registering the domain name "ClosedCAPI"
and offering it to them as a gift might help get the message
across.


> No, I mean the PEs might not meet the /lowest/ level, thus the
> requirement for special approval.  Or, perhaps in combination with a
> hypervisor running on the host processor, they /do/ meet the minimal
> level, even though the actual PE hardware does /not/ meet it?

there is *another* whitepaper:

    https://libre-soc.org/openpower/sv/microcontroller_power_isa_for_ai/

where i put forward the idea of having Compliancy Levels that
allow regfiles and ALUs to default to 16- or even 8-bit ops, with
sharing of those reg entries to get back up to 32 or 64 bit
if needed.  similar to load-quad except scaled riiight down.
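
a toy C sketch of that reg-entry-sharing idea (purely
illustrative; nothing here reflects the actual proposed encoding):
a regfile whose native entries are 16-bit, with adjacent entries
pairing up whenever a 32-bit value is needed:

    #include <stdint.h>

    static uint16_t regfile[32];   /* native 16-bit entries */

    /* pair entry r (even) with r+1 to present one 32-bit register */
    static uint32_t read32(unsigned r)
    {
        return (uint32_t)regfile[r] | ((uint32_t)regfile[r + 1] << 16);
    }

    static void write32(unsigned r, uint32_t v)
    {
        regfile[r]     = (uint16_t)v;
        regfile[r + 1] = (uint16_t)(v >> 16);
    }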

the opportunity exists at the same time to define and propose
what would go into such Compliancy Levels.

> > everything.  near-memory PEs operating at only 150mhz, 3.5 watt
> > quad-core SoCs, 8-core 4.8 ghz i9 killers, 64-core Supercomputer
> > chiplets.
> >
> > everything.
> >  
>
> Actually, that prompts another idea:  perhaps we have been looking at
> Moore's Law the wrong way.  Instead of asking how high we can push
> f_CLK, perhaps we should take another look at that 150MHz DRAM sweet
> spot and ask how much logic we can pack into a 3.25ns half-period?

again: look at the SV rationale for the link to the Snitch paper;
they suggest synchronous time-division multiplexing and achieve
an 85% power reduction as a result.

>  This
> leads to a possible VLIW /microarchitecture/ fed from a parallel Power
> instruction decoder.  What is the statistical distribution of the
> lengths of basic blocks in Power machine code?  Could chainable ALUs
> allow a low-speed Power core to transparently execute instructions in
> groups?

that's exactly the kind of brilliantly "right" question that i'd like
the R&D to investigate... from the Snitch, EXTRA-V, ZOLC plus
SVP64 perspective.

instructions that make FIFO queues between ALUs the primary
building blocks.
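
a throwaway C analogue of that building block (again purely
illustrative, no relation to any actual proposal): a small FIFO
sitting between a producer ALU stage and a consumer ALU stage, so
that neither needs to know the other's timing:

    #include <stdint.h>

    #define FIFO_DEPTH 8   /* power of two */

    typedef struct {
        uint64_t buf[FIFO_DEPTH];
        unsigned head, tail;       /* free-running counters */
    } fifo_t;

    /* producer stage: push a result, or stall (return 0) when full */
    static int fifo_push(fifo_t *f, uint64_t v)
    {
        if (f->head - f->tail == FIFO_DEPTH)
            return 0;
        f->buf[f->head % FIFO_DEPTH] = v;
        f->head++;
        return 1;
    }

    /* consumer stage: pop an operand, or stall (return 0) when empty */
    static int fifo_pop(fifo_t *f, uint64_t *v)
    {
        if (f->head == f->tail)
            return 0;
        *v = f->buf[f->tail % FIFO_DEPTH];
        f->tail++;
        return 1;
    }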

> > [...]
> >> To cut through all the fog here, how do I encode "ADDI R67, R123, 8" as
> >> a scalar operation, not using Simple-V?
> >>    
> >
> > you don't.  Power ISA 3.0 GPRs are restricted to r0-r31.
> >  
>
> This would break orthogonality in the Power ISA and I expect this to be
> likely to cause the OPF ISA WG to "freak out" as you describe it.

you forgot to ask the corollary question: how *do* you do it?
and the answer is:

    sv.addi r67, r123, 8

>  Are
> there any other cases of general registers not available to every
> fixed-point instruction in Power ISA?

i assume you mean "if you extend the GPRs to 128 entries, are
there circumstances where scalar 32-bit non-prefixed ops can't
get at them", and the answer is "of course, but you always
just use the prefixed version of the exact same op to do so".
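
roughly speaking (the real SVP64 EXTRA-field mapping is more
involved, so treat the bit positions here as made up purely for
illustration), the 64-bit prefix supplies extra register-number
bits that concatenate with the 5-bit field in the 32-bit suffix:

    #include <stdint.h>

    /* hypothetical sketch only: two extra bits from the prefix
       widen the suffix's 5-bit GPR field from r0-r31 to r0-r127 */
    static unsigned extended_gpr(uint32_t field5, uint32_t extra2)
    {
        return (extra2 << 5) | (field5 & 0x1f);
    }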

this is in no way different from a ton of examples of ISAs for
40 years being extended with escape-sequences or prefixes.
it's nothing new.

even VSX was expanded that way: the 32 FPRs and the 32 VMX
registers were overlaid into a single 64-entry x 128-bit VSR
file, doubling both register count and width relative to the FPRs.

exactly the same when SSE's 128-bit XMM registers were doubled to
AVX's 256-bit YMM, and again to AVX-512's 512-bit ZMM (which also
doubled the register count).

the "lower" stationed version of the ISA has access to a *subset*
of regs.  this is how it is, people understand it.

> This comes back to the problem exposed above.  The register file
> extension proposal should be available entirely independent of Simple-V,
> such that a processor could implement the extended register file and
> *not* implement Simple-V or vice versa.

without SV they are inaccessible, therefore there is no point.
it is just like Intel expanding to 64-bit: it would have been
ridiculous to expand the regfile but then tell people that the
instructions to use it are optional.  a nonstarter, that one.

l.

