[Libre-soc-bugs] [Bug 238] POWER Compressed Formal Standard writeup

Tue Nov 24 19:08:33 GMT 2020

https://bugs.libre-soc.org/show_bug.cgi?id=238

--- Comment #84 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Alexandre Oliva from comment #80)
> re: comment 78
> 
> Though I discussed the topic, I missed explicitly stating it as a mostly
> orthogonal proposal:
> 
> 4. have an insn kind that provides additional operands for the subsequent
> insn

(i'll come back to the other ones, which are good / worthwhile)

this one is so detrimental to the efficiency of a hardware implementation, that
the detriment is sufficient to eliminate it entirely.

the absolute top priority job of an instruction hardware decoder is:

     identify the instruction length.

this job above all others is the absolute, without a shadow of doubt, the
absolute top highest priority above and beyond all and any other.

in a standard 32-bit RISC instruction set such as OpenPOWER v3.0B this is
an absolute brain-dead task:

     word0 word1 word2 word3 word4 ....

maps to:

     instr0 instr1 instr2 instr3 instr4 ...

however now let us introduce OpenPOWER v3.1B which includes 64-bit
instructions.  let us also presuppose that we want a 4-way issue.

that means that we can either have at the minimum or the maxium the following:

* all 32-bit opcodes
* all 64-bit prefixed opcodes

that means we need to have a buffer of up to 8x 32-bit words in memory.

now let us have a series of "analysers" which tell us, for each instruction,
the answer to the following question:

    does word[N] have a Major Opcode EXT001?  (this is the "prefix" identifier)

note: we haven't actually determined if it's *actually* an instruction.  the
only one we KNOW is an instruction is:

    word[0]

so, we have an array of 8 words:

    word[0] [1] [2] [3] [4] [5] [6] [7]

and we have an array of 8 bits which tell us "bits 0..5 equals 0b000001"

     P0 P1 P2 P3 P4 P5 P6 P7

now let us suppose that we have those set to the following values:

     0 0 0 0 0 0 0 0

what does that tell us?  it means, ta-daaa, the 1st 4 words in our buffer are
*NOT* 64-bit prefixed instructions.

therefore, in this case, we can take the 1st 4 words, shove them into
instruction-processing-units 0, 1, 2 and 3 respectively, along with their
lengths: 32-32-32-32

now let us suppose that we have the following values for P0-P7:

     1 0 1 0 1 0 1 0

what that tell us?  it means, ta-daaaa, this is 4x 64-bit prefixed
instructions.
in fact, anything according to this pattern is *also* 4x 64-bit prefixed:

     1 X 1 X 1 X 1 X

because the "data" in the 2nd word of each 64-bit instruction could contain the
pattern "0b000001" in bits 32-37

therefore, anything matching that pattern "1x1x1x1x" we shove straight into the
4 buffers, along with length-identification pattern:

    64-64-64-64

you can see where that's going.  you basically have a big "Switch" statement
covering *all* possible patterns, producing all the permutations:

    32-32-32-32
    64-32-32-32
    32-64-32-32
    64-64-64-64

and additionally the *INDEX*, on a per-target-buffer-basis, of where the 32/64
packet is to be routed to, that target being the multi-issue "instruction"
decoder, indexed 0,1,2 or 3.

note:

    AT NO TIME WAS THERE ANY FURTHER INSPECTION OF THE PACKETS OTHER THAN
    "IS BITS 0 THRU 5 EQUAL to 0b000001"

now let's f*** with this utterly simple brain-dead algorithm.  let's throw in
some variable-length encoding.  let's add the proposed 16-32 encoding.

let's say that if the 16-bit opcode (in let's say bits 0 thru 3) are equal to
0b1111, that this indicates "escape sequence to decode as a 32-bit instruction"

our detection pattern now comprises:

   * an array of 16x identifiers of the 1st 6 bits of every
     16-bit chunk to the pattern 0b00000
   * an additional array of 16x identifiers of the 1st 4 bits of every
     16-bit chunk to the pattern 0b1111

followed by a MASSIVELY complex permutation network of Switch/Case statements,
potentially several thousand in size, that is at least two orders of magnitude
larger than our simple brain-dead identification algorithm in terms of gate
count.

if we try to do this as a chained detection system (a for-loop in hardware),
the gate chain will be SO LONG that our maximum clock rate even in 7nm would
probably be limited to somewhere around the 800mhz mark.

so i have to reiterate: it is CRIICALLY important that we DO NOT do a CISC
encoding that involves "deep packet inspection".

the FSM that i came up with for the 10/16-bit encoding is a **POST**
length-analysis algorithm that is isolated to that particular local
instruction, that does *NOT* in any way interfere with the detection of the
packet "chunk" length.

it so happens that there *is* chaining between each of the packets (after the
independent length-based post-processing) however each 10/16-bit FSM, because
it is only 2 bits, is a really, REALLY simple decoding that involves 1 gate per
"level" of dependence.

1 gate delay between each is perfectly acceptable, and an optimising HDL
compiler would likely come up with a better solution that turns the analysis of
the chain-of-FSMs into only 3 maybe even as low as 2 gates of delay... for the
*entire* 4-way multi-issue identification of 4 10/16-bit instructions.

...

i just realised, through writing this out (which is always good) that the above
isn't strictly true.  we do actually need 1 bit to tell us that the next
instruction is back to "32 bit mode".

however - and this is the kicker - *it's one bit*.  it's not "4 bits"
or "2-3 permutations of 4 possible lots of bits"

that means, darn-it, we may have to actually prototype this in HDL (very
simple HDL) to see what it actually looks like.

-- 
You are receiving this mail because:
You are on the CC list for the bug.