[Libre-soc-bugs] [Bug 238] POWER Compressed Formal Standard writeup
bugzilla-daemon at libre-soc.org
Tue Nov 24 19:08:33 GMT 2020
--- Comment #84 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Alexandre Oliva from comment #80)
> re: comment 78
> Though I discussed the topic, I missed explicitly stating it as a mostly
> orthogonal proposal:
> 4. have an insn kind that provides additional operands for the subsequent
(i'll come back to the other ones, which are good / worthwhile)
this one is so detrimental to the efficiency of a hardware implementation
that the detriment alone is sufficient to eliminate it entirely.
the absolute top priority job of an instruction hardware decoder is:
identify the instruction length.
this job, above all others, is without a shadow of doubt the absolute top
priority, above and beyond any other.
in a standard 32-bit RISC instruction set such as OpenPOWER v3.0B this is
an absolute brain-dead task:
word0 word1 word2 word3 word4 ....
instr0 instr1 instr2 instr3 instr4 ...
however now let us introduce OpenPOWER v3.1B which includes 64-bit
instructions. let us also presuppose that we want a 4-way issue.
that means that, at the minimum and the maximum respectively, we can have:
* all 32-bit opcodes
* all 64-bit prefixed opcodes
that means we need to have a buffer of up to 8x 32-bit words in memory.
now let us have a series of "analysers" which tell us, for each instruction,
the answer to the following question:
does word[N] have a Major Opcode EXT001? (this is the "prefix" identifier)
note: we haven't actually determined if it's *actually* an instruction. the
only one we KNOW is an instruction is word 0, because the buffer always
starts at an instruction boundary.
so, we have an array of 8 words:
word0 word1 word2 word3 word4 word5 word6 word7
and we have an array of 8 bits which tell us "bits 0..5 equal 0b000001":
P0 P1 P2 P3 P4 P5 P6 P7
now let us suppose that we have those set to the following values:
0 0 0 0 0 0 0 0
what does that tell us? it means, ta-daaa, the 1st 4 words in our buffer are
*NOT* 64-bit prefixed instructions.
therefore, in this case, we can take the 1st 4 words, shove them into
instruction-processing-units 0, 1, 2 and 3 respectively, along with their
length-identification: "32-bit".
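the prefix-detection step above can be sketched in a few lines of python (a
sketch only: the function name and the sample opcode are mine, not from the
spec). PowerISA numbers bits MSB-first, so "bits 0..5" of a 32-bit word are
its top six bits:

```python
# assumption: MSB-first (PowerISA) bit numbering, so bits 0..5 of a
# 32-bit word w are (w >> 26) & 0b111111.
EXT001 = 0b000001  # the Major Opcode that identifies a 64-bit prefix

def prefix_bits(words):
    """words: list of 32-bit integers -> list of P flags (1 = EXT001 prefix).
    this is the *entire* per-word inspection: a single 6-bit compare."""
    return [int(((w >> 26) & 0b111111) == EXT001) for w in words]

# four ordinary 32-bit opcodes (major opcode 31 here) give P = 0 0 0 0 ...
print(prefix_bits([0x7C000214] * 8))   # [0, 0, 0, 0, 0, 0, 0, 0]
```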
now let us suppose that we have the following values for P0-P7:
1 0 1 0 1 0 1 0
what does that tell us? it means, ta-daaaa, this is 4x 64-bit prefixed
instructions.
in fact, anything according to this pattern is *also* 4x 64-bit prefixed:
1 X 1 X 1 X 1 X
because the "data" in the 2nd word of each 64-bit instruction could contain the
pattern "0b000001" in bits 32-37
therefore, anything matching that pattern "1x1x1x1x" we shove straight into the
4 buffers, along with the length-identification: "64-bit".
you can see where that's going. you basically have a big "Switch" statement
covering *all* possible patterns of P0-P7, producing, for each pattern, the
length-identification of each instruction, and additionally the *INDEX*, on a
per-target-buffer basis, of where each 32/64-bit packet is to be routed, that
target being one of the multi-issue "instruction" decoders, indexed 0, 1, 2
or 3.
AT NO TIME WAS THERE ANY FURTHER INSPECTION OF THE PACKETS OTHER THAN
"IS BITS 0 THRU 5 EQUAL to 0b000001"
now let's f*** with this utterly simple brain-dead algorithm. let's throw in
some variable-length encoding. let's add the proposed 16-32 encoding.
let's say that if the 16-bit opcode (in let's say bits 0 thru 3) are equal to
0b1111, that this indicates "escape sequence to decode as a 32-bit instruction"
our detection pattern now comprises:
* an array of 16x comparators matching the 1st 6 bits of every
16-bit chunk against the pattern 0b000001
* an additional array of 16x comparators matching the 1st 4 bits of every
16-bit chunk against the pattern 0b1111
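as a sketch (function name mine), the two comparator arrays themselves are
still cheap; the combinatorial explosion comes afterwards, when every
combination of these 32 flags has to be turned into chunk lengths:

```python
def detect_16bit_chunks(chunks):
    """chunks: sixteen 16-bit integers -> (ext001_flags, escape_flags).
    MSB-first numbering again: bits 0..5 of a 16-bit chunk are
    (c >> 10) & 0b111111, bits 0..3 are (c >> 12) & 0b1111."""
    ext001 = [int(((c >> 10) & 0b111111) == 0b000001) for c in chunks]
    escape = [int(((c >> 12) & 0b1111) == 0b1111) for c in chunks]
    return ext001, escape

ext001, escape = detect_16bit_chunks([0xF123] + [0x0000] * 15)
print(escape[0], ext001[0])   # 1 0  (a 0b1111 escape, not an EXT001 prefix)
```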
followed by a MASSIVELY complex permutation network of Switch/Case statements,
potentially several thousand in size, that is at least two orders of magnitude
larger than our simple brain-dead identification algorithm in terms of gate
count.
if we try to do this as a chained detection system (a for-loop in hardware),
the gate chain will be SO LONG that our maximum clock rate even in 7nm would
probably be limited to somewhere around the 800 MHz mark.
so i have to reiterate: it is CRITICALLY important that we DO NOT do a CISC
encoding that involves "deep packet inspection".
the FSM that i came up with for the 10/16-bit encoding is a **POST**
length-analysis algorithm that is isolated to that particular local
instruction, that does *NOT* in any way interfere with the detection of the
packet "chunk" length.
it so happens that there *is* chaining between each of the packets (after the
independent length-based post-processing) however each 10/16-bit FSM, because
it is only 2 bits, is a really, REALLY simple decoding that involves 1 gate per
"level" of dependence.
1 gate delay between each is perfectly acceptable, and an optimising HDL
compiler would likely come up with a better solution that turns the analysis of
the chain-of-FSMs into only 3 maybe even as low as 2 gates of delay... for the
*entire* 4-way multi-issue identification of 4 10/16-bit instructions.
i just realised, through writing this out (which is always good) that the above
isn't strictly true. we do actually need 1 bit to tell us that the next
instruction is back to "32 bit mode".
however - and this is the kicker - *it's one bit*. it's not "4 bits"
or "2-3 permutations of 4 possible lots of bits"
that means, darn-it, we may have to actually prototype this in HDL (very
simple HDL) to see what it actually looks like.