[Libre-soc-bugs] [Bug 238] POWER Compressed Formal Standard writeup

Mon Nov 23 23:18:17 GMT 2020

https://bugs.libre-soc.org/show_bug.cgi?id=238

--- Comment #76 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Alexandre Oliva from comment #74)

>   when I speak of expanding the
> scope of the compressed isa into a fuller isa, I'm not talking of inventing
> new insns not present in 3.0B, nor borrowing insns from Thumb proper, I'm
> talking of getting more coverage of 3.0B insns.

yes.  sorry for not making it clear that i understood that this was the
context.

VLE, i gather from looking at it, does implement something similar to what
you propose (extending 16bit opcodes by another 16 bits which contain an
immediate), but it *might* (i stress might) not map onto v3.0B opcodes (or,
in VLE's case, v2.06B, because it was designed 10+ years ago).

> e.g. 8 registers would be too little even if r0, r1 and r2 weren't reserved
> to begin with. 

it may surprise you to learn that RISCV RVC is highly effective, achieving
something like a 25% reduction in code size.  also, hilariously, someone did
an experimental 16bit re-encoding called RV16 (or something like it) that they
demonstrated achieved a whopping 40% reduction.

> it's a bit like going back to 32-bit x86, but with plenty of
> registers there, just not as usable because we're wasting representation
> bits with something else, so they require switching to uncompressed mode
> which we'd rather not do.

you mean, "the proposed alternative encoding (v2) has as its premise the
avoidance of switching as a driving design goal".

> as for multi-issue, having encoding bits in every insn to tell how the very
> next one is to be interpreted doesn't help much with that.  

if they are very simple and do not involve the HDL equivalent of "deep packet
inspection" then yes, they do.

the moment that this inspection becomes a complex layered FSM, with multiple
unavoidable gate dependencies, this automatically and inherently means that the
top clock speed is limited.

example: to achieve a 5 GHz clock rate you must have no more than around 16
gates in any given "combinatorial cascade" before capturing partial results in
"latches" that are then passed on to the next "stage" in the "pipeline".
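
as a back-of-envelope check on that figure, here is a rough sketch of the
per-gate delay budget.  it takes the "around 16 gates" number at face value
and ignores latch setup time and clock skew (both of those are simplifying
assumptions, and the function name is made up for illustration):

```python
# rough per-gate delay budget for a given clock rate.
# assumption: the whole clock period is available to the combinatorial
# cascade (no allowance for latch setup/hold or clock skew).

def gate_delay_budget_ps(clock_hz, gates_per_stage):
    """picoseconds available per gate in one pipeline stage."""
    period_ps = 1e12 / clock_hz          # clock period in picoseconds
    return period_ps / gates_per_stage   # shared equally across the chain


print(1e12 / 5e9)                        # 200.0 (ps per cycle at 5 GHz)
print(gate_delay_budget_ps(5e9, 16))     # 12.5  (ps per gate)
```

every extra layer of nested decode logic eats into that budget, which is why
the depth of the length-identification logic matters so much.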

in other words, the 10/16/32bit FSM, involving only 2 bits and *not* having
to go further in and analyse any more bits, can indeed identify "packets" very
easily, precisely because it is only 2 bits.

> mode transitions
> had better be the exception rather than the rule.

you'll need to trust the 4 years i've spent doing HDL development, here:

* 2-bit FSM: fine for multi-issue
* op-next chained CISC encoding: not fine

:)

> OTOH, I can hardly tell the complexity difference between one bit that says
> "return to compressed mode after the next 32-bit insn" and one that says
> "use this register as the first input operand instead of the output one in
> the next insn".  

the first one is a pre-analyser that needs to know absolutely nothing about
the rest of the bits of the instruction.  the remaining bits, having been
identified, can be passed to secondary (parallel) processing analysis units.

that secondary processing branches out, based on information it received from
the 1st phase, "this is 10 bit" or 16 or 32.

the other one *cannot do that*, it has to do "deep packet inspection",
analysing far more bits before being able to "hand off" to other processing
analysis units.
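
to make the two-phase split concrete, here is a minimal software sketch (not
HDL, and not the proposed encoding).  the 2-bit prefix values in `LENGTHS`,
and the names `split_packets` and `decode`, are all hypothetical, picked
purely for illustration.  phase 1 only ever looks at 2 bits per packet; phase
2 gets handed complete packets and never has to work out lengths:

```python
# hypothetical 2-bit length prefixes: NOT the real proposal's encoding.
# "11" is deliberately left unassigned in this sketch.
LENGTHS = {"00": 32, "01": 16, "10": 10}


def split_packets(bits):
    """phase 1: find packet boundaries by looking at 2 bits per packet."""
    packets = []
    pos = 0
    while pos < len(bits):
        length = LENGTHS[bits[pos:pos + 2]]   # the ONLY bits inspected
        packets.append(bits[pos:pos + length])
        pos += length
    return packets


def decode(packet):
    """phase 2 stand-in: each packet is independent of the others."""
    return (len(packet), packet[2:])          # (length, payload bits)


# a 32-bit, a 16-bit and a 10-bit packet back to back
stream = "00" + "1" * 30 + "01" + "0" * 14 + "10" + "1" * 8
decoded = [decode(p) for p in split_packets(stream)]
print([length for length, _ in decoded])      # [32, 16, 10]
```

the "deep packet inspection" alternative would be the equivalent of
`split_packets` needing to call `decode` before it could work out where the
next packet starts.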

> it's just that with this one, we ensure that *every* occurrence of a
> 3-operand insn (among the selected opcodes) can be represented in 2-operand
> compressed mode, even those that can't be made up for with a register-copy
> insn.

if only the PowerISA were that simple.

we have some pipelines in the LibreSOC codebase with as high as **SEVEN**
incoming and **FIVE** outgoing registers.

not all of those are active at the same time, however it's pretty close.

please, i love the idea however i also have a better "feel" for how much
combined design, HDL and toolchain work is involved in the CISC escape-sequence
approach, and i really, really do not think it is a good idea to commit the
available resources from NLnet to it.

> also consider that, unless I'm missing something about the ppc encoding,
> using the full 16 bits and 2 5-bit operands makes the mapping much easier: 6
> bits for the EXT opcode, plus the 2 operands makes 16; 

the point is, here: this is effectively shuffling the v3.0B encoding space
around, just to reach 32bit parity with an existing long-established encoding
(v3.0B).

the time taken would be enormous. remember: OpenPOWER has 300+ integer
instructions, 300+ FP ones and a staggering 700+ SIMD ones.

then there are at least *SIX* separate and distinct register files (!): INT,
FP, CR, STATE, MSR, XER and other SPRs.

i am trying to give you some idea, without going into too much detail, of the
implications time-wise of the "simple-sounding" idea of re-encoding the
entirety of the OpenPOWER ISA.

it is a massive rabbithole and timesink that would easily justify its own EUR
50,000 NLnet Grant, and could take most of a year to complete.

whereas a subset encoding we can just about justify: select the top 10-15% of
instructions, demonstrate that this gives us a 25% (whatever) compression
ratio, implement it, declare the NLnet Grant milestone "complete" and be in a
position to apply for another one.

> in many cases, the
> mapping could just directly copy 11 bits directly, without any mapping
> whatsoever; the intelligence would have to go into where/how to copy the
> remaining bits, and in whether to duplicate the first operand into one of
> the input fields.
> 
> is that not much much simpler, efficient and likely to be accepted than not
> just one, but two new encodings?  (namely 10- and 14-bit AKA 16-bit?)

that's what i would like to determine... but *not* by re-encoding the entirety
of the OpenPOWER ISA, and definitely not by using CISC-style variable-length
encodings...

...*unless* those variable-length encodings use, as an absolute maximum, one
or maybe 2 *uniform* bits to identify the length.

* acceptable:

     if instr[0:2] == 0b00 then
          length = 32

   gate chain depth here is around 2

* also acceptable:

     if FSM.mode==X & instr[0]==0b1 then
          FSM.next.mode = Y
          length = 10
     elif FSM.mode==Y ....

   gate chain depth is also around 2

* not acceptable:

     if instr[0:4] == 0b00110 then
         if something else then
              if something else from
                 somewhere else:
                     length = 32

   gate chain depth here is very high and will
   jeopardise chances of high performance
   multi-issue.
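
for the "also acceptable" FSM case, here is a small software model of the
idea.  only the rule "mode X, first bit set: go to mode Y, length 10" comes
from the pseudocode above; the mode-Y behaviour is elided there, so every
other transition below is invented purely so that the sketch runs:

```python
# hypothetical two-mode length FSM (software model, not HDL).
# each step depends on the current mode plus ONE instruction bit only.

def step(mode, first_bit):
    """return (next_mode, length) for one instruction."""
    if mode == "X" and first_bit == 1:
        return "Y", 10       # from the pseudocode in the text
    if mode == "X":
        return "X", 16       # assumed for illustration
    if first_bit == 1:       # mode Y from here on
        return "Y", 10       # assumed: stay in Y
    return "X", 32           # assumed: drop back to X


def lengths(first_bits, mode="X"):
    """run the FSM over the first bit of each successive instruction."""
    out = []
    for bit in first_bits:
        mode, length = step(mode, bit)
        out.append(length)
    return out


print(lengths([0, 1, 1, 0, 0]))   # [16, 10, 10, 32, 16]
```

the point is that `step` never needs anything beyond the mode and a single
bit, which is what keeps the gate chain shallow.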
