[Libre-soc-bugs] [Bug 238] POWER Compressed Formal Standard writeup

bugzilla-daemon at libre-soc.org
Mon Nov 23 20:24:56 GMT 2020


https://bugs.libre-soc.org/show_bug.cgi?id=238

--- Comment #72 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
ok coming back to this...

(In reply to Alexandre Oliva from comment #60)
> (In reply to Luke Kenneth Casson Leighton from comment #57)
> 
> >> I can't help the feeling that we're wasting precious encoding bits to remain
> >> in 16-bit mode.  We could easily double the number of accessible registers
> >> using them, and having an encoding similar to that of nop to switch out.
> 
> > this was discussed in comment #8
> 
> It was, but from the perspective of "let's try to get some of the most
> common opcodes duplicated in this compressed form", which reminds me of
> Thumb v1.  What I'm thinking is of a slightly different perspective, of
> making the compressed form so extensive that you'd hardly ever have a
> need to switch back to 32-bit mode, which is more like later versions of
> Thumb.
> 
> Bear in mind that Thumb nowadays is a very complete instruction set.
> Why not aim for that?

simple answer: time (as in: we're under time pressure).  additional answer: it
becomes CISC, and is harder to justify to the OPF ISA Working Group.

consider these options:

* a complex (CISC-like) decoder that duplicates the entirety of the OpenPOWER
v3.0B ISA
* a greatly simplified (RISC-like) decoder that rides off the back of the
existing v3.0B ISA

the former is a huge amount of work (both to design and implement) being
effectively an embedding of an entirely new ISA within OpenPOWER

the latter is far less work, can be implemented as a no-state "mapping table"
(every 16bit opcode maps directly and cleanly to a 32bit equivalent with very
few gates) and requires near-trivial initial changes to gcc and binutils.
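
to make that "mapping table" concrete, here is a minimal sketch of such a
stateless expander, in python.  the field positions and the table entries
are invented purely for illustration (the real 16-bit encoding is still
being worked out):

    # hypothetical 16-to-32-bit expander: purely combinatorial,
    # no SPRs, no history.  field layout and opcode table are
    # placeholders, NOT the actual proposed encoding.
    C_TO_V30B = {
        0b0000: (0b011111, 266),   # add  (X-form: PO=31, XO=266)
        0b0001: (0b011111, 40),    # subf (X-form: PO=31, XO=40)
        # ... remaining 16-bit opcodes
    }

    def expand16(insn16):
        cop = insn16 & 0xF            # compressed opcode
        rt = (insn16 >> 4) & 0x7      # 3-bit register fields
        ra = (insn16 >> 7) & 0x7
        rb = (insn16 >> 10) & 0x7
        po, xo = C_TO_V30B[cop]
        # reassemble a standard v3.0B X-form instruction, which
        # the existing 32-bit decoder then consumes unmodified
        return (po << 26) | (rt << 21) | (ra << 16) | (rb << 11) | (xo << 1)

the expander holds no state whatsoever: length and meaning are fully
determined by the 16 bits themselves, and the existing v3.0B decoder does
the rest.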

yet both provide a high compression ratio.

which, by the assessment criteria that Paul Mackerras kindly informed us would
be used by the OpenPOWER ISA WG, would stand the higher chance of being
accepted?

the idea here is, like RISCV RVC, to go after the "low hanging fruit" with the
highest bang-per-buck: minimise effort and changes whilst maximising benefit.

a new (full) embedded CISC-like encoding maximises benefit but also maximises
changes and complexity, as well as costs us time (and NLnet funding) which we
really don't have.



> > if we knew in advance that instruction streams were to remain in 16bit mode
> > consistently for considerable durations i would be much more in favour of
> > schemes that had to rely on nops to transition between modes.
> 
> That's what I'm aiming for with my suggestions.

i do see where it's going.  the idea is, stay as long as possible in 16bit mode
so that 32bit v3.0B is not needed.

unfortunately, if we had started this a year ago it might have been viable
within the budget and time to explore.

rather than go the ARM thumb route i feel our efforts are better spent going
the RISCV RVC route, which is based around letting 16 and 32 bit opcodes
interleave even down to single-digit quantities.

if however it can be shown clearly that, even with the fixed overhead of the 10
effectively-wasted bits, the compression ratio is still better, i will be more
than happy with it.
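
as a toy model of that comparison (every number in it is invented: the
point is only the shape of the calculation):

    # expected size of a block of n instructions where a fraction
    # p16 fits the 16-bit form (the rest stay 32-bit), plus any
    # fixed per-run overhead in bytes (e.g. mode-switch opcodes)
    def avg_bytes(n, p16, fixed=0):
        return n * (p16 * 2 + (1 - p16) * 4) + fixed

    n = 20
    print(avg_bytes(n, 0.60))           # interleaved: mode bits eat
                                        # encoding space, lower p16
    print(avg_bytes(n, 0.75, fixed=4))  # switch-based: richer 16-bit
                                        # encoding but fixed overhead

with these made-up numbers the switch-based scheme wins (54 vs 56 bytes);
nudge p16 or the run length and it loses.  which is exactly why this needs
measuring rather than arguing.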


> >> and follow the practice of other compact encodings of
> >> using a single register as input and output operand. 
> 
> > this was discussed in comment #18
> 
> Yeah, but dismissed without data.

really delighted you were able to get some.

> Consider that 22% of the instructions that take 3 registers as operands
> in the ppc-gcc binary I'm using for tests actually use the output
> register as one of the input operands, without any effort by the
> compiler to make them so:
> 
> $ ./objdump -d ppc-gcc | grep -c ' r[0-9]*,r[0-9]*,r[0-9]*$' # x <- y op z
> 5731
> $ ./objdump -d ppc-gcc | grep -c ' \(r[0-9]*\),\1,r[0-9]*$'  # x <- x op z
> 673
> $ ./objdump -d ppc-gcc | grep -c ' \(r[0-9]*\),r[0-9]*,\1$'  # x <- y op x
> 630

rrright.  ok.  so the next most critical questions are:

* of the x <- y op z type, how many of these do and do not fit into the
16-bit compressed format?
* of the x <- x op z and x <- y op x types, likewise?
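
one way to gather that breakdown, sketched in python below.  note that the
constraints of the 16-bit format (which opcodes it encodes, a 3-bit
RVC-style register range) are pure assumptions at this stage:

    import re, subprocess

    # which 3-register insns would fit a 16-bit form?  assumes
    # (RVC-style) that the compressed form only reaches r0-r7 and
    # only encodes a handful of opcodes -- both are placeholders.
    COMPRESSIBLE = {"add", "subf", "and", "or", "xor"}   # a guess
    RX = re.compile(r'(\w+)\s+r(\d+),r(\d+),r(\d+)$')

    counts = {"fits": 0, "regs too high": 0, "op not encoded": 0}
    dump = subprocess.run(["./objdump", "-d", "ppc-gcc"],
                          capture_output=True, text=True).stdout
    for line in dump.splitlines():
        m = RX.search(line)
        if not m:
            continue
        op = m.group(1)
        regs = [int(r) for r in m.group(2, 3, 4)]
        if op not in COMPRESSIBLE:
            counts["op not encoded"] += 1
        elif max(regs) > 7:
            counts["regs too high"] += 1
        else:
            counts["fits"] += 1
    print(counts)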

my concern is this: it's only around 20%.  therefore even if 100% of those are
candidates for 16bit compression, the maximum achievable compression ratio is:

   (0.8 * 4bytes + 0.2 * 2bytes) / 4bytes

i.e. 90%: a saving of only 10%.

whereas if we have a full 3-op 16bit compression system and by some
fantastically weird fluke, 100% of those are compressible, it is:

   1.0 * 2bytes / 4bytes

i.e. 50%.

so we definitely need a further breakdown, because, ironically, without further
data, intuition is in favour of 3-operand!
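
spelling the arithmetic out with the actual counts from comment #60 (note
that the first grep matches the repeated-register cases too, so it is the
total):

    total  = 5731          # all  x <- y op z  patterns
    two_op = 673 + 630     # destination doubles as a source
    print(two_op / total)  # ~0.227: the "roughly 20%"

    # ceiling for a 2-op-only 16-bit format (every 2-op case hits):
    print((0.8 * 4 + 0.2 * 2) / 4)   # 0.9  -> only a 10% saving
    # ceiling for a full 3-op 16-bit format (the weird fluke):
    print(1.0 * 2 / 4)               # 0.5  -> a 50% saving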

> We (well, I :-) can easily reconfigure the compiler to prefer such a
> form, without ruling out 3-different-register alternatives, and see how
> far that gets us,

ahh now *that* is valuable.   you're talking about actually changing ppc64 gcc
to get it to emit a greater proportion of 2-op instructions on standard v3.0B
Power ISA?

then if those proportions tip to say 20% 3op 80% 2op, the analysis can be done
again.

the only thing is: if the 32bit recompilation increases executable size we have
to do the comparison against the *unmodified* version of gcc.

if however it actually decreases executable size then apart from laughing so
hard we forget to submit an upstream patch, the smaller executable size is our
comparison.

:)


> >> Most (*) 3-operands
> >> insns x := y op z can be turned into x := y ; x op= z, which is no worse and
> >> probably better than switching to 32-bit mode,
> 
> > except: look at the amount of space used to do so.  it's still 32 bit, isn't
> > it?
> 
> Point was, even in the (to be exceptional) case of NOT getting an insn
> that could be encoded as a single compressed insn from the compiler, you
> could still remain in compressed mode.

... and this is what we need to check as a tradeoff.  got it.
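
for concreteness, here is the rewrite being traded off, as a python
sketch.  the premise (an assumption) is that both emitted lines fit a
2-op 16-bit form, so the pair costs 32 bits without leaving compressed
mode:

    # turn  rt := ra op rb  into destructive (2-op) shape, which is
    # what a 2-operand 16-bit encoding needs.  mnemonics are v3.0B.
    def rewrite_2op(op, rt, ra, rb):
        if rt == ra:                  # x <- x op z: fits as-is
            return [f"{op} {rt},{ra},{rb}"]
        # x <- y op z: copy first, then operate in place
        return [f"mr {rt},{ra}", f"{op} {rt},{rt},{rb}"]

    print(rewrite_2op("add", "r3", "r4", "r5"))
    # ['mr r3,r4', 'add r3,r3,r5'] -- two 16-bit insns, 32 bits total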

> And by making the switching out of compressed mode rare enough, we'd
> strengthen the case for using the 2/16 bits for something more useful
> than switching modes: the switch out would be an exception, thus a
> special opcode rather than 1/8th of every insn.

this does mean some significant gcc and llvm changes to the compilation
strategies.  we had better be damn sure it's a good idea before committing
resources to it.


> >> and we could have an
> >> extend-next pseudo-insn to supply an extra operand and/or extra
> >> immediate/offset bits to the subsequent insn.
> 
> > i very much do not want to go down this particular route, although it is very
> > tempting to do so.
> 
> Think of it as a 32-bit insn representation if you must ;-)
> 
> Thumb does that.  We're looking at 48-bit instructions that are hardly
> different from that.

i know.  alarm bells are ringing at how much work is involved given everything
else that needs doing, and that it is the "CISC" design route, for which we
would get some serious pushback from the OPF ISA WG.

remember, this has to be acceptable to IBM for inclusion in a future POWER11 or
POWER12 core.

> I don't really feel that an extend-next opcode that supplies an extra
> operand and/or an extended immediate range for the subsequent insn that
> would otherwise take it, respectively, from a repeat operand, or from a
> sign-extension of a narrow immediate range is such a huge increase in
> complexity, and I believe the benefits could be huge in enabling us to
> remain in compressed mode for much longer.

versus being able to flip between the two without a fixed overhead penalty (at
least, not a massive one): that is the crux of the debate.

prefixes on top of prefixes make me very nervous as far as the decoder hardware
is concerned.

the FSM for the 10/16/32 is really quite straightforward.  entering the 10bit
mode is simple: check whether the 32bit Major Opcode == 0b000000.  after that,
only 2 bits are involved in the decision-making.
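
that FSM, sketched below (the position of the 2 decision bits within the
16-bit word is a placeholder):

    # 10/16/32 length decision.  entry into compressed mode is a
    # single equality test on the v3.0B Major Opcode (PO) field;
    # once in 16-bit mode, only 2 bits steer the FSM.
    def next_state(state, word):
        if state == "32bit":
            po = (word >> 26) & 0x3F      # v3.0B PO, bits 0-5
            # PO == 0 means the remaining bits carry the first
            # (10-bit) compressed instruction: enter 16-bit mode
            return "16bit" if po == 0b000000 else "32bit"
        # in 16-bit mode: 2 bits decide stay vs switch back
        mode = word & 0b11                # placeholder position
        return "16bit" if mode != 0b00 else "32bit"

no SPR, no carried immediate: the decision for each instruction depends
only on that instruction's own bits.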

what you are proposing is far more complex, involving the use of a
"next-marker" plus adding a Special Purpose Register to store the state
information (the immediate that hasn't yet been applied) and/or macro-op fusion
analysis.

it involves analysing far more bits to ascertain the length, and it means
basically replacing the entirety of v3.0B, because the incentive is there to
do so.

it's a huge amount of work, and it's CISC, and it goes strongly against
everything that is good about RISC microarchitectures.

we intend to do a multi-issue core and that means that the identification of
which instruction is which needs to be kept below a threshold of complexity.

CISC encoding makes multi-issue *really* challenging, and we have yet to
include SV Prefixing (Vectorisation).

without having full implementation details at the hardware level available to
show you, i really, really do not want to take us down the CISC encoding route.

if the extra bits for the regs can fit into a *uniform* 16 bit encoding, that
is the RISC paradigm and it is infinitely preferable to the
escape-sequence-based CISC one.
