[Libre-soc-dev] Load-Store Offset and shift

Tue Jun 22 18:59:17 BST 2021

Paul et al,

bit of background:
https://news.ycombinator.com/item?id=24459041

i did a presentation at ICS2021 last week, basically emphasising that
OpenPOWER is a supercomputer-grade ISA.

however the one key thing missing from it is the Load/Store indexed
plus shift pattern.  ARM assembler:

  ldr x0, [x1, x2, lsl #3]

pseudocode:

  EA = RA+(RB<<imm)

this lack turns out to be quite costly in inner loops, where an extra
shift instruction is needed to get RB multiplied by the size of a word
(4) or dword (8).
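
to make the cost concrete, here is a minimal sketch (python, purely illustrative, not any real simulator API) comparing the two-step "shift then indexed load" address calculation with the proposed fused form.  the function names and the 64-bit wrap-around mask are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit effective addresses


def ea_two_instructions(ra, rb, shift):
    """what an inner loop does today: a separate shift, then RA+RB."""
    tmp = (rb << shift) & MASK64   # the extra shift instruction
    return (ra + tmp) & MASK64     # EA of the ordinary indexed load


def ea_indexed_shift(ra, rb, imm):
    """hypothetical fused form: EA = RA + (RB << imm), one instruction."""
    return (ra + (rb << imm)) & MASK64


base = 0x1000
for index in range(4):
    # both forms must agree; imm=3 scales the index by a dword (8 bytes)
    assert ea_two_instructions(base, index, 3) == ea_indexed_shift(base, index, 3)
    print(hex(ea_indexed_shift(base, index, 3)))
```

the two functions always produce the same address; the only difference is that the first one costs an extra instruction per loop iteration.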

with all RA+RB operations being in opcode 31, packing duplicated
variants of LD/ST into XO (shift by 1, 2 and 3, i.e. multiply by 2, 4
and 8) is quite an expensive use of opcode space, but it is
technically doable.

looking at the appendix map for Minor 31, six free columns would be needed:

* LD << 1, LD << 2, LD << 3
* ST ...

turns out there are that many free unused columns, two more marked
"reserved" (no info given as to why).

however this is not all.  we would like to do in-place FFT in SVP64
using what are called "Zero Overhead Loops", first seen in Texas
Instruments VLIW DSPs.  they can pack 14 micro-instructions into one
instruction, which is enough to keep the twin FP pipelines 100% full
for an entire FFT of huge size.  amazing design by TI.

we would like to do the same but it involves bit-reversed logic on
Loads a la the Cooley-Tukey algorithm:

https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm#Data_reordering,_bit_reversal,_and_in-place_algorithms
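
for readers unfamiliar with the reordering, here is a minimal sketch of the standard bit-reversal permutation described on that page.  it only illustrates the index pattern; the helper names are made up for the example and nothing here is SVP64-specific.

```python
def bit_reverse(index, width):
    """reverse the low `width` bits of `index`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (index & 1)  # move lowest bit to the top
        index >>= 1
    return result


def bit_reversed_order(n):
    """element order for an n-point in-place transform (n a power of two)."""
    width = n.bit_length() - 1
    assert 1 << width == n, "n must be a power of two"
    return [bit_reverse(i, width) for i in range(n)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```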

we need *this* EA computation:

EA = RA + ((i*imm)<<RC)

where i is the VECTOR loop element index, and RC is a register value
(yes, RC not RB).  unlike the shift in offset-with-shift, it really
does have to be a register, not an immediate.
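
a minimal sketch of that per-element computation follows, reading the formula as EA = RA + ((i*imm) << RC), which is an assumption about the intended grouping.  `vl` (the vector length) and the function name are hypothetical, chosen only for the example, and the 64-bit mask is likewise assumed.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit effective addresses


def vector_eas(ra, imm, rc, vl):
    """one EA per vector element i, assuming EA = RA + ((i*imm) << RC).

    imm is an immediate from the instruction, rc is a register *value*
    (so the shift amount can change at runtime), vl is the loop length.
    """
    return [(ra + ((i * imm) << rc)) & MASK64 for i in range(vl)]


# with imm=8 and RC=1 each element is 16 bytes further on
print([hex(ea) for ea in vector_eas(0x1000, 8, 1, 4)])
```

because RC is read from a register, the same instruction can be reused with a different stride simply by changing the register contents.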

ironic that they are swapped.  i thought initially that the two could
be merged into one operation but the ranges and needs simply do not
match.

the FFT use we can bury in a special SVP64 Mode, even with a new Form
(SVD-Form instead of D-Form): it does not need to be part of a future
scalar v3.N.

but the scalar offset-with-shift, EA=RA+(RB<<imm), has some merit as
part of a scalar v3.NinTheFuture.

i wondered what people's general reaction to it might be, and was
curious if there was any background to why it was not added years ago.

l.
