[Libre-soc-dev] Load-Store Offset *and shift*

Luke Kenneth Casson Leighton lkcl at lkcl.net
Tue Jun 22 18:59:17 BST 2021


Paul et al,

bit of background:
https://news.ycombinator.com/item?id=24459041

i did a presentation at ICS2021 last week, basically emphasising that
OpenPOWER is a supercomputer-grade ISA.

however the one key thing missing from it is the Load/Store indexed
plus shift pattern.  ARM assembler:

  ld ra, rb#8

pseudocode:

  EA = RA+(RB<<imm)

this lack turns out to be quite costly in inner loops, where an extra
shift instruction is needed to get RB multiplied by the size of a word
(4) or dword (8).
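
to make the cost concrete, here is a minimal sketch (python used as pseudocode, not anything from the ISA spec) comparing the existing EA=RA+RB form, which needs a separate shift first, with the EA=RA+(RB<<imm) form.  the function names and the 64-bit mask are illustrative assumptions:

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit effective addresses


def ea_indexed(ra, rb):
    """existing indexed form: EA = RA + RB"""
    return (ra + rb) & MASK64


def ea_indexed_shift(ra, rb, imm):
    """indexed-plus-shift form: EA = RA + (RB << imm)"""
    return (ra + (rb << imm)) & MASK64


# walking an array of dwords (8 bytes each): base in RA, element index in RB
base = 0x1000
for idx in range(4):
    # today: an extra shift instruction in the inner loop, then the load
    tmp = (idx << 3) & MASK64
    ea_two_ops = ea_indexed(base, tmp)
    # with shift folded into the load: one instruction
    ea_one_op = ea_indexed_shift(base, idx, 3)
    assert ea_two_ops == ea_one_op
```

both paths give the same address, the difference is only the one extra instruction per load inside the loop.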

with all RA+RB operations being in opcode 31 it is quite an expensive use
of opcode space to try packing into XO some duplicated variants of
LD/ST which scale RB by 2, by 4, by 8 (shift by 1, 2, 3), but it is
technically doable.

looking at the appendix map for Minor 31, six free columns would be needed:

* LD << 1, LD << 2, LD << 3
* ST ...

turns out there are that many free unused columns, plus two more marked
"reserved" (no info given as to why).

however this is not all.  we would like to do in-place FFT in SVP64 in
what is called "Zero Overhead Loops", first seen in Texas Instruments
VLIW DSPs: they can pack 14 micro-instructions into one instruction,
and it's enough to be able to spam the twin FP pipelines 100% full for
an entire FFT of huge size.  amazing design by TI.

we would like to do the same but it involves bit-reversed logic on
Loads a la the Cooley-Tukey algorithm:

https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm#Data_reordering,_bit_reversal,_and_in-place_algorithms
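
for reference, a minimal sketch of the bit-reversal reordering described at that link (python as pseudocode; this is the textbook permutation, not the proposed instruction, and the function names are made up for illustration):

```python
def bit_reverse(i, nbits):
    """reverse the low nbits bits of i"""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reversed_order(n):
    """element order for an n-point in-place FFT, n a power of two"""
    nbits = n.bit_length() - 1
    return [bit_reverse(i, nbits) for i in range(n)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

doing this reordering as part of the Load address computation is what avoids a separate shuffling pass over the data.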

we need *this* EA computation:

EA = RA + (i*imm)<<RC

where i is the VECTOR loop element index, and RC is a register value
(yes, RC not RB).  unlike the shift in offset-with-shift, here the
shift amount really does have to be a register, not an immediate.

ironic that they are swapped.  i thought initially that the two could
be merged into one operation but the ranges and needs simply do not
match.
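
a rough sketch of the per-element address generation for the vector form (python as pseudocode).  assumptions, loudly: the shift is read as applying to (i*imm) only, i.e. EA = RA + ((i*imm) << RC); the parameter name vl for the vector length and the 64-bit mask are illustrative; and no bit-reversal of i is shown, only the formula as written above:

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit effective addresses


def ea_vector_element(ra, i, imm, rc):
    """EA = RA + ((i * imm) << RC): imm is an immediate, RC a register value"""
    return (ra + ((i * imm) << rc)) & MASK64


def element_addresses(ra, imm, rc, vl):
    """addresses for element index i = 0 .. vl-1 (vl is a hypothetical name)"""
    return [ea_vector_element(ra, i, imm, rc) for i in range(vl)]


# same immediate, two different register shift amounts
print([hex(a) for a in element_addresses(0x2000, 8, 0, 4)])
print([hex(a) for a in element_addresses(0x2000, 8, 1, 4)])
```

compared with the scalar EA=RA+(RB<<imm), the roles are reversed: here the multiplier is the immediate and the shift amount comes from a register, so the stride can change at runtime.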

the FFT use we can bury in a special SVP64 Mode, even a new Form
(SVD-Form instead of D-Form): it does not need to be part of a future
v3.N scalar.

but the scalar offset-with-shift (EA=RA+(RB<<imm)) has some merit as
part of a scalar v3.NinTheFuture.

i wondered what people's general reaction to it might be, and was
curious if there was any background to why it was not added years ago.

l.


