[Libre-soc-dev] GPR-to-FPR and FPR-to-GPR move operations

Sat May 29 10:04:58 BST 2021

links:
https://bugs.libre-soc.org/show_bug.cgi?id=230#c71
https://libre-soc.org/openpower/sv/int_fp_mv/

Lauri is kindly investigating MP3 in SVP64 assembler and it's turning out to
be a good test of what opcodes are needed.  in the bi-weekly meeting last
week, Paul, we briefly mentioned the need for GPR-to-FPR and FPR-to-GPR
mv operations (straight bit-wise copies) given that VSX/SIMD will not be
added to Libre-SOC as a GPU / VPU.

Jeff Bush's Nyuzi paper makes it clear that transferring GPU-style
workloads through L1/L2 cache is hugely expensive, and describes the
lengths he went to in order to reduce power consumption:
https://www.researchgate.net/publication/282269512_Nyami_A_synthesizable_GPU_architectural_model_for_general-purpose_and_graphics-specific_workloads

additionally, Lauri points out that just getting zero into an FPR is also
costly: it requires a LD operation, which takes up data segment space
and unnecessarily activates memory as well as the L2 and L1 data
cache paths, when compared to a MV-from-GPR operation.

in addition to that, in an Out-of-Order system the cycle latency of the
path through L1 cache will be much higher than a straight MV operation
(which in some micro-architectures may be a macro-op-fused operation).

* this in turn requires a larger number of "in-flight" operations
* this in turn increases the number of Reservation Stations
* this in turn increases the size of the Dependency Matrices as O(N^2)
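
to put rough numbers on the O(N^2) point above, here is a minimal
Python sketch.  it assumes (purely for illustration) a square matrix
with one cell per ordered pair of in-flight operations; the function
name is hypothetical and the real matrices are organised differently
in detail.

```python
def dependency_matrix_cells(in_flight: int) -> int:
    """Cell count of a square N x N matrix (illustrative assumption:
    one cell per ordered pair of in-flight operations)."""
    return in_flight * in_flight


if __name__ == "__main__":
    # doubling the number of in-flight operations quadruples the cells
    for n in (8, 16, 32):
        print(n, dependency_matrix_cells(n))
```

under that assumption, going from 16 to 32 in-flight operations takes
the matrix from 256 to 1024 cells.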

using the LD-ST path is therefore extremely costly, all of which
points to a straight bit-copy between GPR and FPR being
necessary.
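
for anyone unsure what "straight bit-copy" means here, a minimal
Python model follows.  it assumes a 64-bit FPR holding an IEEE 754
double; the function names gpr_to_fpr and fpr_to_gpr are made up for
this sketch and are not proposed mnemonics.

```python
import struct

MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def gpr_to_fpr(bits: int) -> float:
    """Reinterpret a 64-bit integer bit pattern as a double.
    No int-to-float conversion takes place: the bits are copied as-is."""
    return struct.unpack("<d", struct.pack("<Q", bits & MASK64))[0]


def fpr_to_gpr(value: float) -> int:
    """Reinterpret a double as its 64-bit integer bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


if __name__ == "__main__":
    # an all-zero GPR gives +0.0 with no load from the data segment
    print(gpr_to_fpr(0))
    # the bit pattern of 1.0, viewed as an integer
    print(hex(fpr_to_gpr(1.0)))
```

the zero case is the one Lauri raised: an integer zero already has the
bit pattern of +0.0, so a bit-wise move is all that is needed.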

in some micro-architectures the MV may end up being a macro-op
fused operation: it may be removed from the pipelines entirely,
instead being used to mark the source or destination
of INT or FP operations as targeting the *other* regfile:

     fmv2int  fp5, r3
     addi r3, 0x5

becomes (macro-fused):

     addi fp5, 0x5
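
the rewrite above can be sketched as a tiny peephole pass in Python.
this is only a model of the textual substitution shown in the example:
the tuple encoding and the fuse_moves name are assumptions of the
sketch, and it deliberately ignores whether the GPR value is still
needed later, which a real decoder would have to establish.

```python
def fuse_moves(program):
    """Fold ('fmv2int', fpN, rM) followed by an op whose first operand
    is rM into a single op that names fpN instead."""
    fused = []
    i = 0
    while i < len(program):
        insn = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if (insn[0] == "fmv2int" and len(insn) == 3
                and nxt is not None and len(nxt) >= 2
                and nxt[1] == insn[2]):
            # drop the move, retarget the following op at the FPR
            fused.append((nxt[0], insn[1]) + tuple(nxt[2:]))
            i += 2
        else:
            fused.append(insn)
            i += 1
    return fused


if __name__ == "__main__":
    program = [("fmv2int", "fp5", "r3"), ("addi", "r3", 0x5)]
    print(fuse_moves(program))
```

instructions that do not match the pattern pass through unchanged.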

it should be clear that when adding bitmanip operations as well, the
possibilities expand to include bit manipulation on FPRs.
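
as one example of what that could look like, here is a Python sketch
of fabs done purely with an integer AND on the bit pattern.  it
assumes IEEE 754 doubles with the sign in bit 63; the helper names are
hypothetical and only model the idea, not any actual opcode.

```python
import struct

ABS_MASK = 0x7FFF_FFFF_FFFF_FFFF  # every bit except the sign bit


def fp_bits(value: float) -> int:
    """64-bit pattern of a double, as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def bits_fp(bits: int) -> float:
    """Double whose bit pattern is the given 64-bit integer."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def fabs_via_bitmanip(value: float) -> float:
    """Clear the sign bit with an integer AND: no FP arithmetic used."""
    return bits_fp(fp_bits(value) & ABS_MASK)


if __name__ == "__main__":
    print(fabs_via_bitmanip(-2.5))
    print(fabs_via_bitmanip(3.0))
```

the same pattern (mask, shift, OR, XOR) covers sign flips and
exponent or mantissa extraction.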

l.