[Libre-soc-bugs] [Bug 238] POWER Compressed Formal Standard writeup

Wed Nov 25 23:52:51 GMT 2020

https://bugs.libre-soc.org/show_bug.cgi?id=238

--- Comment #88 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #87)

> > for a 4-way multi-issue, which will have up to 4x8=32 bytes of data in
> > the "raw" shift register, that would imply an absolutely insane 8x32-way 
> > 8-bit parallel multiplexer to get any one of the 32 bytes into their
> > respective targetted 4x 64-bit parallel 2nd-stage decoder buffers.
> 
> Do we need to decode 4 64-bit instructions per cycle?

mmm good question.  let's think it through

* SV-P64 - the probability is quite high that it will be chucking out
  a large batch of (multi-issue, OoO) operations all on its own.  i.e.
  VL=4 or VL=8

* still SV-P64 - however if those are VL=8 and elwidth=8-bit, actually
  it'll fit into a single 64-bit SIMD pipeline.

  therefore, answer: yes, we do want to be able to decode multiple SV-P64
  instructions, otherwise there's an unfair penalty to SV-P64 that will
  discourage its use.

* v3.1B 64-bit prefix - these are "just like 32-bit except happen to have
  larger immediates" - and consequently are one-only

  therefore, answer: yes, really, there is a strong incentive.

* SV-C64-swizzled - this is the concatenation of:
  - same prefix as SV-P48 plus
  - 16-bit swizzle data plus
  - any arbitrary (appropriate) 16-bit Compressed instruction

  here it can be VL-looped (because the SV-P48 11-bit prefix can mark
  things as VL-Vectorised) but in cases where that doesn't happen it's
  just "yet another scalar instruction"

  therefore, answer: yes, again, scalar SV-C64-swizzled has an incentive
  to be multi-issue

in all cases i'd conclude that there's a strong incentive to allow multi-issue
execution of 64-bit encoded instructions.

>  can we just have it
> decode up to 4 instructions or up to 16 bytes of instructions, which ever is
> smaller per cycle?

16 bytes being able to fit 2x 64-bit, 4x 32-bit or 8x 16-bit, yeah that would
keep the gate-count down to "a little less insane" and would be much more
manageable.
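
to put rough numbers on that: a back-of-envelope sketch, assuming the aligner
is a full crossbar (every destination slot can pick any source slot) and
taking the quoted guess of 5 gates per 1-bit MUX.  the crossbar model and the
function name are assumptions made for illustration, not a description of any
actual design; it is simply one reading that reproduces the quoted 8192
figure for a 32-byte window.

```python
# Rough aligner cost model (an assumption, for illustration only):
# each destination slot selects one of `slots` source slots, and every
# bit of the slot needs its own selection path.
GATES_PER_MUX = 5  # the "what... 5 gates each?" guess from the quoted text

def aligner_cost(window_bytes, granule_bytes):
    """Return (mux_count, gate_count) for a full crossbar over the window."""
    slots = window_bytes // granule_bytes   # number of source/dest slots
    bits = granule_bytes * 8                # bits moved per slot
    muxes = slots * slots * bits
    return muxes, muxes * GATES_PER_MUX

for window, granule in ((32, 1), (16, 1), (16, 2)):
    muxes, gates = aligner_cost(window, granule)
    print(f"{window}-byte window, {granule}-byte granule: "
          f"{muxes} MUXes, {gates} gates")
```

under this model, halving the window from 32 to 16 bytes cuts the MUX count
by 4x, and moving from byte to 16-bit alignment halves it again.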

the only thing being, we'd have to make sure to keep an eye on the IPC
(instructions per clock), to make sure it's acceptable.  if certain encodings
get penalised, that's a definite "minus" because it will strongly discourage
their uptake / usage.
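
a minimal sketch of the proposed limit ("up to 4 instructions or up to 16
bytes, whichever is smaller"), to show how it treats each encoding length.
the constants and the function name are illustrative only, and the model
ignores alignment and fetch effects entirely.

```python
# Sketch of the per-cycle decode limit proposed in the quoted text.
MAX_INSNS = 4    # at most 4 instructions per cycle
MAX_BYTES = 16   # at most 16 bytes of instruction stream per cycle

def decoded_this_cycle(lengths):
    """lengths: byte length of each queued instruction, in program order.
    Returns how many of them are decoded in a single cycle."""
    count = 0
    used = 0
    for n in lengths:
        # stop at whichever limit is reached first
        if count == MAX_INSNS or used + n > MAX_BYTES:
            break
        used += n
        count += 1
    return count

for name, size in (("16-bit", 2), ("32-bit", 4), ("64-bit", 8)):
    print(name, decoded_this_cycle([size] * 8))
```

in this model a stream of pure 64-bit encodings decodes 2 per cycle where
16-bit and 32-bit encodings decode 4, which is exactly the kind of
per-encoding difference the IPC check would need to watch.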

> > that's 8192 MUXes which are.. what... 5 gates each?  that's a whopping
> > 40,000 gates which, to give you some idea of scale, alexander, is 3x times
> > larger than a 64-bit multiplier.
> > 
> > even a 16-bit encoding (aligned at 16 bits) is going to be 4x16-way 16-bit
> > parallel multiplexing however with the reduction in both dimensions that's
> > an O(N^2) drop (down to "only" 10k gates) which is "just about borderline
> > acceptable".
> 
> Another way we could structure the decoder is: rather than using a giant
> shifter/aligner to compact the decoded instructions into a contiguous list,
> instead, we could just have overlapping decoders that start at each 16-bit
> offset and we just ignore the decoded instructions for those decoders that
> didn't start at the right offset. This totally avoids the need to have the
> aligning matrix at the cost of having much wider issue, which might be worth
> it.

i like the principle: sadly though, the PowerDecoder2 for OpenPOWER opcodes is
so insanely large (4k gates and that's just for the 150-or-so integer
operations) that it's not really feasible.

i'd have to double-check given that it's now subdivided into "Subset Decoders"
but i'd be really *really* surprised if it was small enough to be multipliable.

OpenPOWER is not like MIPS or RISC-V, the encoding switch-case cascade is
unfortunately absolutely massive.

plus, the algorithm for which instructions go into which actual "buckets" still
has yet to apply SimpleV!  we have *another* phase to add into the middle, in
between the length/mode decoder and the "multi-issue buckets" which is the VL
for-loop!

and that for-loop, remember, needs to be subdivided by (64/elwidth) to actually
determine the SIMD quantity and the corresponding auto-predication-mask (which
deals with cases where VL=5 or VL=3).

once the SIMD-if-i-cation is carried out *finally* actual "element"
instructions can be dropped into the "real" multi-issue buckets.
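
the subdivision step can be sketched as follows.  this is a simplified model
under stated assumptions: a 64-bit SIMD width, mask bit i meaning "lane i is
active", and no user-supplied predicate.  the function name and the
(start, mask) tuple layout are made up for the example.

```python
def simd_ops(vl, elwidth, simd_width=64):
    """Split a VL-long element loop into SIMD-width operations.

    Returns a list of (first_element_index, lane_mask) tuples, one per
    SIMD operation.  The mask of the final operation is shortened when
    VL is not a multiple of the lane count (e.g. VL=5 or VL=3).
    """
    lanes = simd_width // elwidth          # elements per SIMD operation
    ops = []
    for start in range(0, vl, lanes):
        active = min(lanes, vl - start)    # lanes still covered by VL
        mask = (1 << active) - 1           # auto-predication mask
        ops.append((start, mask))
    return ops

print(simd_ops(8, 8))    # VL=8, elwidth=8: fits one 64-bit SIMD operation
print(simd_ops(5, 16))   # VL=5, elwidth=16: one full op plus a 1-lane tail
print(simd_ops(3, 32))   # VL=3, elwidth=32: one full op plus a 1-lane tail
```

each tuple in the result corresponds to one "element" operation that would
then be dropped into a multi-issue bucket.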
