[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Sun Aug 7 00:19:49 BST 2022

lkcl wrote:
> On Wed, Aug 3, 2022 at 5:50 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
> >
> > lkcl wrote:
>
> [...]
>
> no: again, please do read the Snitch and EXTRA-V and ZOLC papers 
> because, sorry for having to repeat it, getting slightly annoyed 
> because it's three times i referred you to them, those give the 
> context of the roadmap to break out of the "limitation" (the false one).
>
> the papers are complex so i do not wish to spend time spelling them 
> out, i have such a hell of a lot of ground to cover.
>
> once you have read those papers you will see the possibilities 
> intuitively and understand that what is in the current SV Spec is by 
> no means the final word.

Indeed they are, and I am still digesting them.

> [...]
> > > no but seriously, we're committed to SV and Power ISA, now. 2
> > > years on SVP64 (so far), we have to see it through.
> > >
> >
> > So Libre-SOC is committed to Simple-V at this point and FlexiVec must be
> > left as a possible future option.
>
> in a word, yes. we will jeopardise funding and business
> opportunities to try to take it on.

That is fine; consider it shelved.

> > > VVM as being also Vertical-First. what i am going to do however is
> > > create a comp.arch thread referring to this discussion. i think people
> > > there will be interested to share insights esp. on FlexiVec.
> > >
> >
> > It is worth noting that I could not have proposed FlexiVec prior to
> > those developments. :-/
>
> funny, isn't it? takes such a lot of time to synthesise thoughts

Yes.

> > Also, as mentioned below, OpenCAPI has to be excluded from that mix at
> > this time if I am involved.
>
> you'll love this. OpenCAPI has been absorbed into CXL.
> Intel controls that and the licensing is even more laughable.
>
> so.. uhnn... ya :)

[...]

> > This is the main reason I would have wanted FlexiVec for Power ISA
> > ("Flexible Vector Facility" to put it in quasi-IBM-speak) accepted as a
> > Contribution *before* even /beginning/ an implementation.
>
> ok so there are procedures being developed which allow you
> to do that. PLEASE NOTE that IBM internal employees have been
> terrified that they will be overwhelmed with [time-wasting] RFCs.
> *please be mindful* of the consequences of putting forward ideas,
> you need to think "can i commit to this to see it through".
>
> we went the other way: we sought NLnet funding to *prepare*
> the information to be presented rather than just expect IBM to
> cough up internal resources (which they have to justify).
>
> now that's mostly been done i put in *another* Grant request
> to cover the actual cost of submitting the RFC and associated
> followup
>
> https://libre-soc.org/nlnet_2022_opf_isa_wg/

I wish you luck in your endeavors here.

> > Read the definition of "architectural resources" in the OpenPOWER spec
> > license terms. In this case, mostly opcode assignments.
>
> ah. i was talking implementation resources, microarchitectural
> design resources, sorry, misunderstood.
>
> going back to architectural resources: nah. it's 5 instructions
> with 5/6-bit XO (like addpcis or the crand/or/xor group) and
> 25% of EXT001. that is in no way a "lot of opcode resources".

Unless I badly misread the opcode maps, EXT001 is the /vast/ majority of 
the remaining opcode space.

> [...]
>
> > The other option would be to have the VPU run 64-bit instructions and
> > drop VLIW, since 64-bit instructions inherently align with Power ISA's
> > 32-bit words. (VLIW was suggested purely to resolve the misalignment
> > between the 32-bit words Power ISA uses and the suggested 48-bit VPU
> > instructions.)
>
> the moment you add 48 bit the variable length encoding
> massively complexifies multi issue detection and starts
> to interfere with the parallelism achievable.

Not variable length:  the VPU would have its own instruction set 
optimized for vector calculations using Simple-V.  With the 48-bit 
instructions, you have a 8-bit major opcode field, three 10-bit register 
numbers, and either a 10-bit minor opcode field or a fourth register 
number.  The VPU would *not* implement Power ISA at all.  The VPU 
interface in the Power core would be a Custom Extension.
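
To check that the field widths above add up, here is a minimal sketch of 
packing and unpacking such a 48-bit word.  The field order (major opcode 
in the top bits, then the three register numbers, then the minor 
opcode/fourth register) and the names pack48/unpack48 are assumptions 
made for illustration only; nothing has been specified beyond the widths.

```python
# Hypothetical layout, most significant bits first:
#   major(8) | ra(10) | rb(10) | rc(10) | last(10)
# "last" is either a 10-bit minor opcode or a fourth register number.
MAJOR_BITS = 8
FIELD_BITS = 10

# The widths given in the text must fill the word exactly.
assert MAJOR_BITS + 4 * FIELD_BITS == 48


def pack48(major, ra, rb, rc, last):
    """Pack one 48-bit VPU instruction word from its five fields."""
    if not 0 <= major < (1 << MAJOR_BITS):
        raise ValueError("major opcode does not fit in 8 bits")
    word = major
    for field in (ra, rb, rc, last):
        if not 0 <= field < (1 << FIELD_BITS):
            raise ValueError("field does not fit in 10 bits")
        word = (word << FIELD_BITS) | field
    return word


def unpack48(word):
    """Split a 48-bit word back into (major, ra, rb, rc, last)."""
    mask = (1 << FIELD_BITS) - 1
    fields = []
    for _ in range(4):
        fields.append(word & mask)  # peel off the lowest 10-bit field
        word >>= FIELD_BITS
    last, rc, rb, ra = fields
    return word, ra, rb, rc, last  # what remains is the major opcode
```

Three 10-bit register numbers give each operand a 1024-entry register 
name space, and every instruction is the same length, so no length 
decoding is needed in the VPU front end.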

> > I referenced the CDC6600 architecture. The Power core would be the
> > /peripheral/ processor that handles I/O and the OS. The VPU would
> > handle only bulk compute.
>
> again, see the SimpleV rationale whitepaper.

Wait... a /different/ paper?  URL?

> [...]
>
> > Until those license issues are fixed, I am not touching OpenCAPI with
> > the proverbial ten-foot pole.
>
> with it being transferred to CXL i 100% agree, it is a lost cause
> at this point. perhaps registering the domain name "ClosedCAPI"
> and offering it to them as a gift might help get the message
> across.

What about some form of packetized serial Wishbone on DisplayPort PHY?

> > No, I mean the PEs might not meet the /lowest/ level, thus the
> > requirement for special approval. Or, perhaps in combination with a
> > hypervisor running on the host processor, they /do/ meet the minimal
> > level, even though the actual PE hardware does /not/ meet it?
>
> there is anooother whitepaper
> https://libre-soc.org/openpower/sv/microcontroller_power_isa_for_ai/
>
> where i put forward the idea of having Compliancy Levels that
> allow regfiles and ALUs to default to 16 or even 8 bit ops and
> for sharing of those reg entries to get back up to 32 or 64 bit
> if needed. similar to load-quad except scaled riiight down.
>
> the opportunity exists at the same time to define and propose
> what would go into such Compliancy Levels.

Now you are suggesting a return to the "sub-ISAs" that IBM already tried 
once and rolled back into a single mainline, if I understand the history 
of Power ISA correctly.

> > > everything. near-memory PEs operating at only 150mhz, 3.5 watt
> > > quad-core SoCs, 8-core 4.8 ghz i9 killers, 64-core Supercomputer
> > > chiplets.
> > >
> > > everything.
> > >
> >
> > Actually, that prompts another idea: perhaps we have been looking at
> > Moore's Law the wrong way. Instead of asking how high we can push
> > f_CLK, perhaps we should take another look at that 150MHz DRAM sweet
> > spot and ask how much logic we can pack into a 3.33ns half-period?
>
> again: look at the SV rationale, for the link to the Snitch paper,
> they suggest synchronous time-division multiplexing and
> achieve 85% power reduction as a result.

Which is basically the CDC 6600's "barrel" peripheral processor.

> > This
> > leads to a possible VLIW /microarchitecture/ fed from a parallel Power
> > instruction decoder. What is the statistical distribution of the
> > lengths of basic blocks in Power machine code? Could chainable ALUs
> > allow a low-speed Power core to transparently execute instructions in
> > groups?
>
> that's exactly the kind of brilliantly "right" question that i'd like
> the R&D to investigate... from the Snitch, EXTRA-V, ZOLC plus
> SVP64 perspective.
>
> instructions that make FIFO queues between ALUs the primary
> building blocks.

No queues, not if ALU latency is short enough and f_CLK low enough to 
push the data through multiple ALUs within a single half-period.  In 
other words, executing instructions in groups as they can be collected 
in-order.
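
As a back-of-envelope sketch of that budget:  at 150 MHz the half-period 
works out to about 3.33 ns.  The ALU latency and the latch/routing 
overhead below are made-up placeholder figures, not measurements of any 
real process, and the function names are hypothetical.

```python
def half_period_ns(f_clk_hz):
    """Half of one clock period, in nanoseconds."""
    return 0.5e9 / f_clk_hz


def chainable_alus(f_clk_hz, alu_latency_ns, overhead_ns=0.0):
    """How many ALUs fit back-to-back in one half-period.

    overhead_ns stands in for latch setup and routing delay; both it and
    alu_latency_ns are assumed values supplied by the caller.
    """
    budget = half_period_ns(f_clk_hz) - overhead_ns
    if budget <= 0:
        return 0
    return int(budget // alu_latency_ns)


# Example with assumed numbers: 1.0 ns per ALU, 0.3 ns of overhead.
print(round(half_period_ns(150e6), 2))   # half-period at 150 MHz
print(chainable_alus(150e6, 1.0, 0.3))   # ALUs that fit in that window
```

The count is the largest in-order group that could execute without any 
intermediate queue or pipeline register under those assumptions.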

> > > [...]
> > >> To cut through all the fog here, how do I encode "ADDI R67, R123, 8"
> > >> as a scalar operation, not using Simple-V?
> > >>
> > >
> > > you don't. Power ISA 3.0 GPRs are restricted to r0-r31.
> > >
> >
> > This would break orthogonality in the Power ISA and I expect this to be
> > likely to cause the OPF ISA WG to "freak out" as you describe it.
>
> you forgot to ask the corollary question, how *do* you do it?
> and the answer is:
>
> sv.addi r67, r123, 8

Why should this be a separate assembler mnemonic?  Why would the 
assembler not simply accept "ADDI R67, R123, 8" and produce the prefixed 
opcode?

> > Are
> > there any other cases of general registers not available to every
> > fixed-point instruction in Power ISA?
>
> i assume you mean "if you extend the GPRs to 128 entries are
> there circumstances where scalar 32-bit nonprefixed ops can't
> get at them" and the answer is "of course, but you always
> just use the prefixed version of the exact same op to do so".

No, I mean "are there any existing fixed-point instructions that are 
only applicable to a subset of the fixed-point registers in Power ISA?"

> this is in no way different from a ton of examples of ISAs for
> 40 years being extended with escape-sequences or prefixes.
> it's nothing new.
>
> even VSX was expanded from VMX which was 32xFP overlaid
> on FPR, to 64x128 by doubling length and doubling numbers.
>
> exactly the same when AVX doubled to AVX2 and again to
> AVX-512 and again and again.
>
> the "lower" stationed version of the ISA has access to a *subset*
> of regs. this is how it is, people understand it.

All of your examples are PackedSIMD instruction sets that were 
subsequently extended to wider SIMD.

Consider the REX prefixes in AMD64 for a better example.  They replaced 
a group of 1-byte INC/DEC instructions and simply supply the high bits 
for extending the registers used in an instruction.  An exact analogy 
does not work for Power ISA because x86 was a giant mess even before AMD 
extended it like that, but I think this illustrates the issue at hand here.
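
To make the REX mechanism concrete, here is a small sketch that decodes 
a REX prefix and combines its R and B bits with the 3-bit fields of a 
register-direct ModRM byte to form 4-bit register numbers.  The bit 
positions are those of the AMD64 encoding (0100WRXB); the function names 
are mine, and memory operands with SIB bytes are deliberately left out.

```python
def decode_rex(byte):
    """Split a REX prefix byte (0x40-0x4F) into its W, R, X and B bits."""
    if byte & 0xF0 != 0x40:
        raise ValueError("not a REX prefix")
    return {
        "W": (byte >> 3) & 1,  # 64-bit operand size
        "R": (byte >> 2) & 1,  # high bit for ModRM.reg
        "X": (byte >> 1) & 1,  # high bit for SIB.index (unused here)
        "B": byte & 1,         # high bit for ModRM.rm
    }


def extended_regs(rex, modrm):
    """Return (reg, rm) as 4-bit register numbers.

    Only valid for register-direct forms (ModRM.mod == 0b11).
    """
    bits = decode_rex(rex)
    reg = (bits["R"] << 3) | ((modrm >> 3) & 7)
    rm = (bits["B"] << 3) | (modrm & 7)
    return reg, rm


# 4D 01 C8 is "add r8, r9": REX=0x4D, opcode 0x01, ModRM=0xC8.
print(extended_regs(0x4D, 0xC8))  # (9, 8): reg = r9, rm = r8
```

The opcode byte is untouched; the prefix only widens the register 
fields of whatever instruction follows it.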

> > This comes back to the problem exposed above. The register file
> > extension proposal should be available entirely independent of Simple-V,
> > such that a processor could implement the extended register file and
> > *not* implement Simple-V or vice versa.
>
> without SV they are inaccessible therefore there is no point.
> just as Intel expanding to 64 bit it would be ridiculous to
> expand the regfile to 64 bit but then tell people the instructions
> to use them are optional. a nonstarter that one.

Then there should be a subset of SV, orthogonal to the rest of SV, that 
only extends the register file.

-- Jacob