[Libre-soc-dev] compressed instructions state requirements

Wed Nov 25 20:07:43 GMT 2020

On Wed, Nov 25, 2020, 06:07 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Tue, Nov 24, 2020 at 9:07 PM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > My idea of how 16-bit instructions work is that they should be usable
> > anywhere (like RVC), no special pages needed. The extra info is
> > conceptually part of the PC (or some decode status register).
>
> right.  "some decode status register".  let's walk through it.  let me
> know (explicitly) if you agree or disagree with the answers.
>
> Q: how often will that status register need to be set/reset?
> A: on every call into a standard PPC64LE ABI v3.0B formatted function
>

Actually, on every instruction. Just like PC is changed by every
instruction.

> Q: how do you return *to* the enhanced mode?
>

All the time -- whenever you want to encode a 16-bit instruction.

> A: on *entry* to every function encoded in enhanced-variable-encoding
> mode, instructions are required to set the new mode
> Q: what is the cost of setting a status register?
>

One 16-bit instruction -- it is set by every instruction executed. It is not
intended to be normally written through an SPR.
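
A minimal sketch of the idea that the decode mode is carried alongside PC and
updated by every instruction, rather than written through an SPR. The `Insn`
record, its `next_mode` field and the two mode names are hypothetical, made up
for this illustration; nothing here is an actual encoding.

```python
from dataclasses import dataclass

# Hypothetical mode names, for illustration only.
STANDARD = "standard"
COMPRESSED = "compressed"


@dataclass(frozen=True)
class Insn:
    length_bytes: int  # 2, 4, 6 or 8
    next_mode: str     # decode mode in effect after this instruction


@dataclass(frozen=True)
class FetchState:
    pc: int
    mode: str


def step(state, insn):
    """Advance PC and decode mode together, as one piece of state."""
    return FetchState(pc=state.pc + insn.length_bytes, mode=insn.next_mode)


def run(state, insns):
    """Apply step() to each instruction in order."""
    for insn in insns:
        state = step(state, insn)
    return state


program = [
    Insn(4, STANDARD),    # ordinary 32-bit instruction
    Insn(2, COMPRESSED),  # one 16-bit instruction that enters compressed mode
    Insn(2, COMPRESSED),  # 16-bit instruction, stays in compressed mode
    Insn(2, STANDARD),    # 16-bit instruction that returns to standard mode
]
final = run(FetchState(pc=0x1000, mode=STANDARD), program)
```

In this model a mode change costs no separate instruction: it is a side effect
of an instruction that was going to execute anyway.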

> Q: how many "batches" of those 5, 6, 7, 8 instructions are required:
> A: one batch on EVERY single call to a PPC64LE ABI v3.0B function
>     one batch at the start of every single enhanced-variable-encoding
> mode function
>     one batch at the exit of every single enhanced-variable-encoding
> mode function
>
> this latter so that the enhanced-encoded function "looks like" a
> standard PPC64LE ABI v3.0B function at all times.
>
> > Unlike VLE,
> > it works fine with any combination of 32/64-bit mode and LE/BE mode,
>
> ... but, we established yesterday, not with 16/48.
>

Umm, the whole point of the extra mode is to be able to encode 16-bit
instructions. 32/48/64-bit instructions only need the extra bit in PC. Part
of the reason I'm designing this to mesh well with the existing Power ISA
is that it will allow future OpenPower processors to also benefit from
supporting compressed instructions even for scalar programs without having
to switch the processor to "GPU mode" (which most others won't support).

> > in a
> > way such that the bytes in memory needed to encode pre-existing
> > instructions are completely unchanged -- you can run pre-existing
> > PowerPC[64][LE] programs entirely unmodified even if the processor
> supports
> > 16/48-bit instructions since the program always starts executing in
> > Standard Mode.
>
> the caveat being: if there is no VLE mode-page the cost is such a high
> quantity of mode-setting instructions that it jeopardises the entire
> purpose of the exercise.
>

You are *waay* overestimating the cost.

>
> even if we add just the one instruction that allows a compact
> mode-switching (to alter the "decode status register" with only a
> 32-bit instruction), that's still one instruction too many given that
> it's literally going to be at the start and end of absolutely every
> single function call, and added just before literally every single
> call to a ppc64le v3.0B ABI function.
>
> ... or...
>
> ... we could very simply mark pages with ONE (quantity 1) single bit.
>
> > > if however a mix is permitted within a "marked" 64k page (16/32/48/64)
> > > then the
> > >
> >
> > Truncated sentence?
>
> doh.  yes.  i believe i completed what i wanted to point out, above.
>
> > We need to contact the OpenPower Foundation and get permission to
> implement
> > v3.1.
>
> before doing that i'd reaaally like to know that it's worthwhile.
> below you outline that it may well be the case.  however applying
> SVPrefix to v3.1B 64-bit instructions: this is well... it'll be...
> exciting.
>

The 64-bit instructions are often more useful as scalar instructions than
vectorized instructions (loading immediates and addresses), so I don't see
limiting vectorizable scalar instructions to 16/32/48-bit instructions as
much of a problem.

>
> > I disagree: having code that's compatible with v3.1 means getting a speed
> > bump from better support for larger immediates (34-bits instead of 16) as
> > well as PC-relative addressing. This could mostly eliminate the need for
> a
> > TOC, since shared libraries can generally be assumed to be less than 8GB
> in
> > size. This should also reduce code size somewhat. Though that's all true
> > only once compilers catch up.
>
> indeed.  i'd really prefer to see numbers on code-size reduction that
> results.  from what you're saying they could be quite significant: the
> amount of extra work involved is something that should not be taken
> lightly.
>
> > The way I'm envisioning it, SVP64 instructions share the PowerISA v3.1
> > prefix encoding space with PowerISA v3.1 64-bit instructions (more than
> > half that space is available),
>
> ah no.  absolutely not.  no way.  the entirety of the SV-P64 needs to
> be completely and 100% free and clear.  and it also needs to be not
> one but two EXTNNNs.  this was established 18+ months ago from the
> work done on the original SV-P64 (RV) encoding.
>
> i really, *really* do not want to have yet more time spent doing yet
> another total redesign of the SVP formats.  we simply do not have
> time.
>

Sorry, that's necessary anyway: Power is sufficiently different that a
redesign is needed.

>
> when i said that we need to accelerate the development, i really meant it.
>
> we *have not* got time to try to "desperately work out how to cram in
> a mix of two completely different encodings".  plus, there is no
> guarantee that IBM will not extend EXT001 in the future, jeopardising
> the entirety of SV-P64.
>

That's what the OpenPower ISA design workgroup solves. Once that is set up,

>
> logically, therefore (and particularly given that they're
> mutually-exclusively incompatible)
>
> > SVP48 instructions use the same 48-bit
> > encoding space as all other 48-bit instructions (probably using primary
> > opcode 0)
>
> two primary opcodes.  11 bits are required for the small prefix (and i
> reiterate: i *do not* want us wasting yet more time to redesign
> something that's already had months of work gone into it)
>
> > and SVP32 instructions use other 32-bit encoding space (possibly
> > shared using primary opcode 0).
>
> not a chance in hell of it being a single primary opcode, or shared.
> two separate primary opcodes are required, those two being completely
> separate and distinct from all other prefix-identifying opcodes.
>
> please read this page
> https://libre-soc.org/openpower/sv/major_opcode_allocation/
> and, reminder: the original page:
> https://libre-soc.org/simple_v_extension/sv_prefix_proposal/
>
> it's 11 bits for the short version (applied to SV-P48) and 27 bits for
> the long version (applied to SV-P64).
>
> i emphasise again: i really, *really* do not want the time wasted
> doing yet another redesign of something that took several months to
> write.
>
> we have not got time.
>
> > Yup, that can be done by (in Standard Mode) decoding the primary opcode
> as
> > well as (for opcode 0) one bit of the extended opcode field (the 256
> place)
> > for compatibility with the "Service Processor Attention" instruction,
> which
> > needs to be 32-bit. That should be sufficiently trivial to satisfy your
> > worries about decode issues with multi-issue.
>
> the only reason i can think of where this would be reasonable is if
> there was a genuine legitimate reason to halt the processor during the
> first-stage (length/mode-identifying) phase.  given that that would
> require considerable additional gates (identifying the *full* 32-bit
> pattern 0x00000080) i would also be very reluctant to suggest even
> doing that.
>

All you need is to decode it as a trap instruction. Also, very important:
only the primary opcode and extended opcode fields determine if the
instruction is a "Service Processor Attention" instruction, all other bits
in the instruction encoding are *don't cares* meaning that *we have to read
the second half* of the 32-bit instruction to even determine that it is ok
to use it as a non-32-bit instruction, *completely ruling out 16-bit
instructions* using primary opcode 0, unless you want a very complex
instruction encoding. Note that reserving space for "Service Processor
Attention" is *non-optional*, the PowerISA spec requires it (it's in the
appendix defining which encodings are reserved).
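
A small sketch of why the half of the word that holds the primary opcode is
not enough to recognise "Service Processor Attention". It assumes the usual
Power ISA field positions for a 32-bit instruction word (primary opcode in the
top 6 bits, extended opcode in bits 21:30, i.e. `(word >> 1) & 0x3FF`) and an
extended opcode of 256 as discussed above; the helper names are made up for
the example.

```python
ATTN_PO = 0    # primary opcode of "Service Processor Attention"
ATTN_XO = 256  # its extended opcode (the "256 place" mentioned earlier)


def primary_opcode(word):
    """Top 6 bits of the 32-bit instruction word."""
    return (word >> 26) & 0x3F


def extended_opcode(word):
    """10-bit extended opcode field (bits 21:30 in IBM bit numbering)."""
    return (word >> 1) & 0x3FF


def is_attn(word):
    """Only PO and XO are checked; every other bit is a don't-care."""
    return primary_opcode(word) == ATTN_PO and extended_opcode(word) == ATTN_XO


def opcode_half(word):
    """The 16 bits that contain the primary opcode."""
    return (word >> 16) & 0xFFFF


plain_attn = 0x00000200  # PO 0, XO 256, all don't-care bits clear
noisy_attn = 0x03FFFA01  # PO 0, XO 256, all don't-care bits set
all_zeros = 0x00000000   # PO 0, XO 0: not attn
```

`plain_attn` and `all_zeros` have identical opcode halves, yet only one of them
is attn, so a decoder that has seen just that half cannot tell them apart.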

>
> given that it is such a rare occurrence ("halt processor") i'd advocate
> that the benefits of conforming to standard conventions outweigh
> the "cost" of moving one (single, extremely rare) instruction.
>
>
> > This causes no issues with needing all 0s to be illegal, since, in
> Standard
> > Mode, the first 32-bits in memory being all 0s would be an illegal 48-bit
> > instruction and in Compressed Mode all 0s would be an illegal 16-bit
> > instruction. No need to use Primary Opcode 0 for 16-bit instructions in
> > order to achieve that, so I think 16-bit instructions in Standard Mode
> > should use Primary Opcode 5 since that is entirely unallocated,
>
> you're missing the fact that *two* primary opcodes are required for C.
> i repeat-documented this 2 weeks ago, based on an analysis that i did
> almost a year ago when we first began the move of SV to OpenPOWER
>

Only 1 is required, primary opcode 5, because that only works for a subset
of compressed instructions. The full compressed instruction set with
16-bits available is only available in compressed mode, which is switched
into using one 16-bit instruction.

>
> nggggh :)
>
> two contiguous primary opcodes means that in the critical stage of
> identifying the length/mode, extra gates are not required to recognise
> two separate non-contiguous patterns then AND those results together.
>

Even if we required 2 primary opcodes, by picking well, we can pick primary
opcodes that have only 2 bits differing between them, reducing the number
of gates required to 2-3.
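
To make the comparison concrete, here is a rough Python model of the two
matching strategies on a 6-bit primary opcode. The opcode numbers used (4/5
and 5/6) are purely illustrative, not a proposed allocation, and the model
says nothing about actual gate counts; it only shows which bits each matcher
has to look at.

```python
PO_MASK = 0x3F  # primary opcodes are 6 bits wide


def differing_bits(a, b):
    """Number of bit positions in which two primary opcodes differ."""
    return bin((a ^ b) & PO_MASK).count("1")


def pair_matcher(a, b):
    """Match exactly opcodes a and b: one shared compare on the bits where
    they agree, plus a small check on the bits where they differ."""
    diff = (a ^ b) & PO_MASK
    care = ~diff & PO_MASK

    def matches(po):
        same_common_bits = (po & care) == (a & care)
        valid_diff_bits = (po & diff) in (a & diff, b & diff)
        return same_common_bits and valid_diff_bits

    return matches


def adjacent_pair_matcher(base):
    """Match the even-aligned pair (base, base + 1) by dropping the lowest
    bit and comparing only the remaining 5 bits."""
    assert base % 2 == 0

    def matches(po):
        return (po >> 1) == (base >> 1)

    return matches


two_bit_pair = [po for po in range(64) if pair_matcher(5, 6)(po)]
adjacent_pair = [po for po in range(64) if adjacent_pair_matcher(4)(po)]
```

The aligned adjacent pair needs no check at all on the dropped bit, while a
pair differing in two bits needs the extra test on just those two bits.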

>
> instead by having two contiguous EXTNNNs you can *drop* one bit from
> the detection logic.
>
> > no need for
> > annoying workarounds to get "Service Processor Attention" to still be
> > 32-bit since the Extended Opcode field which encodes "Service Processor
> > Attention" is outside of 16-bits and all the other bits are don't-cares
> for
> > "Service Processor Attention", meaning using PO 0 would require always
> > reading 32-bits to check -- very messy.
>
> i may be missing something: is "attn" an extremely common instruction?
>

no, but the spec still requires us to reserve space for it.

> (what is PO 0?)
>

primary opcode 0.

>
> also, i'm having difficulty parsing the above paragraph.
>

sorry, I tried to explain more clearly above in the sentence with all the
emphasis.

>
> > >
> > > > and 48-bit (no spec yet) instructions.
> > >
> > > TBD when we get to SV Prefixing.  remember also that we have SV-P64
> > > (32-bit SV prefix plus a 32-bit instruction) and we have SV-C64
> > > (32-bit prefix plus a 16-bit swizzle prefix plus a 16-bit Compressed)
> > >
> >
> > I propose that we limit the maximum possible instruction length to
> 64-bits
> > (kinda like x86's 15-byte limit)
>
> errr yeah.  i kinda instinctively rebelled against going beyond 64 bits.
>
> > allowing the encoding I described above to
> > be sufficient.
>
> no, unfortunately.  if you're referring to a fundamental assumption
> that a single major opcode is sufficient for SV-Prefix encodings, that
> is.  right back as far as the very first days where we discussed
> potentially moving to OpenPOWER the very first thing that i did was:
> analyse SV Prefix encodings.
>
> i found that the SV Prefix encodings *only* worked if they were
> allocated 2x EXTNNN opcodes each.
>

Remember that the shorter encodings should only be a compressed version of
the full SVP64 encoding, so if there's not enough space, we can always put
the less common operations only in larger instructions.

We should have enough space, since we can use a little more than half of
primary opcode 1 for all of SVP64. We'd only need to use the msb 7-8 bits
as part of the opcode since the available space is not contiguous (though
nearly so).

primary opcode 5 is used for 16-bit instructions (switch to 16-bit mode),
primary opcode 0 is evenly split between 32 and 48-bit instructions. The
32-bit half can be split between SVP32 and new 32-bit instructions (like
sin/cos).
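
The Standard Mode length decode described in this thread could be sketched
roughly as below. Assumptions, none of which are settled: `word` is the first
32 bits of the instruction viewed as one value with the primary opcode in the
top 6 bits; primary opcode 1 means a 64-bit (prefixed) instruction; and for
primary opcode 0 the "256 place" of the extended opcode being set selects the
32-bit half. That polarity is inferred from attn having to stay 32-bit and from
all-zeros being an illegal 48-bit instruction.

```python
def insn_length_bits(word):
    """Length in bits of a Standard Mode instruction, from its first 32 bits."""
    po = (word >> 26) & 0x3F
    if po == 5:
        return 16  # 16-bit instruction (switch to 16-bit mode)
    if po == 1:
        return 64  # 64-bit prefixed space (assumed shared with SVP64)
    if po == 0:
        # XO occupies word bits 1..10, so its "256 place" is word bit 9.
        xo_256_place = (word >> 9) & 1
        return 32 if xo_256_place else 48
    return 32  # every other primary opcode: ordinary 32-bit instruction


lengths = {
    "all zeros": insn_length_bits(0x00000000),
    "attn": insn_length_bits(0x00000200),
    "PO 5": insn_length_bits(5 << 26),
    "PO 1": insn_length_bits(1 << 26),
    "PO 14": insn_length_bits(14 << 26),
}
```

Only the primary opcode plus one extra bit is examined, which is the point of
keeping the split this simple for multi-issue decode.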

>
> this has therefore been the *fundamental* assumption of the entire
> development and discussion of SV-OpenPOWER right from the very start.
>

I think we can make it work just fine. 32-bits of extra fields (in SVP64)
should be enough to encode subvectors, element width, predication, extra
register field bits, and all the other stuff. If things don't fit (super
complex swizzles), the less important parts can be simplified or removed
entirely (swizzles in all instructions, then maybe subvectors would be
first on the chopping block for me).

>
> >In particular, this means 64-bit instructions can't be
> > further prefixed.
>
> whewww :)  yeah 72 and 96 bit... yeah.
>
> honestly my feeling is that given that we are 99.99% likely to have to
> use a VLE-style page-bit marker, we're kinda "free and clear" to move
> Major opcodes around.  i'd therefore advocate that it should be *v3.1B
> P64* that's moved to EXT005, leaving C free and clear to need only 5
> bits to identify at the gate-critical level of identifying
> mode/length.
>

I disagree, we should try to have as much available as possible without
requiring per-page mode switches or switching to "GPU mode". That way,
other OpenPower processors are much more likely to actually implement some
of the new instructions, giving a benefit to scalar code everywhere, not
just on our processor in the GPU driver.

Jacob