[Libre-soc-isa] [Bug 1056] questions and feedback (v2) on OPF RFC ls010

Fri Jun 2 16:34:57 BST 2023

https://bugs.libre-soc.org/show_bug.cgi?id=1056

--- Comment #51 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Paul Mackerras from comment #41)

> So every instruction whose behaviour is modified by vectorization has a
> SVP64 prefix? 

has to, yes.  HOWEVER... and this is waaay into the future: due to
the startling similarity to ZOLC i have long-term plans to *SEPARATE*
the 24-bits into a SEPARATE (3rd) L1 Cache.

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.301.4646&rep=rep1&type=pdf

that will be a huge research project on its own.

> I haven't seen a clear and unambiguous answer as to whether
> that is true or not. (You do seem to say it is true below, except that each
> such statement seems to have some sort of caveat on it.)

it is.  an Embedded Finite State Machine (and Libre-SOC's TestIssuer
does this) would:

* read the PO9-word
* cache the 24-bit RM area and prohibit interrupts
* read the next 32-bit word
* throw {24}{32} at decode+issue+execute and re-enable interrupts

and that is an important Micro-Architecture to have (minimum resources).
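
the four steps above can be sketched roughly as follows. this is only
an illustrative model, not TestIssuer's actual code: the names
`primary_opcode` and `issue_stream` are made up, and the prefix layout
is an ASSUMPTION (PO9 in the top 6 bits, RM taken as the low 24 bits,
any other identifying bits of the prefix ignored).

```python
PO9 = 9  # primary opcode assumed to mark the prefix word


def primary_opcode(word):
    """Return the top 6 bits of a 32-bit word (the primary opcode)."""
    return (word >> 26) & 0x3F


def issue_stream(words):
    """Walk a list of 32-bit words and return (rm, insn, irq_ok) tuples.

    rm is the cached 24-bit RM field for a prefixed instruction, or None
    for a bare (unprefixed) instruction. irq_ok records whether
    interrupts were permitted while that word was being fetched.
    ASSUMPTION: simplified layout, RM = low 24 bits of the PO9 word.
    """
    issued = []
    pending_rm = None          # state held between prefix and suffix
    interrupts_enabled = True
    for word in words:
        if pending_rm is None and primary_opcode(word) == PO9:
            # steps 1+2: read the PO9-word, cache RM, prohibit interrupts
            pending_rm = word & 0xFFFFFF
            interrupts_enabled = False
            continue
        # step 3: this word is the 32-bit instruction itself
        # step 4: throw {24}{32} (or just {32}) at decode+issue+execute
        issued.append((pending_rm, word, interrupts_enabled))
        pending_rm = None
        interrupts_enabled = True  # re-enable after issue
    return issued
```

the point of the sketch is that a bare word never picks up any RM
state: `pending_rm` is only ever set by the word immediately before it.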

i considered at some point having an actual SPR to store the state in
between the two (the PO9-word and Defined-word-instruction) but i feel
it is a tiny bit overkill.

attitudes on that vary, certainly the use of the same technique in
RISC-V does not make people happy (the 18-bit in one instruction being
concatenated with a 12-or-so bit immediate in the following instruction)

> It did seem like a "bare" addi (without SVP64 prefix) in a vertical-first
> loop might be subject to register index modification, 

no absolutely not.  ok, i considered it, it is called "register tagging"
which historically has been left by the wayside but is making a
comeback in "Vector Streaming" in ARM SVE, Eth-Zurich Snitch, and
the European Processor Initiative.

the problem with tagging is that it becomes part of the Architectural
State (an SPR or in this case *group* of SPRs), which massively
complexifies simulators, debuggers etc.
but also context-switch becomes absolute hell.

there are 3 bits needed per QTY 32-of GPR, FPR, CR, and QTY 64-of VR.
4 is better. that's something like... what.... 32 64-bit SPRs? (!!!)
jacob and i went through a LOT of compression schemes on that, but
they were barely workable and involved a high instruction overhead.

also, imagine me thinking ahead and going "what would the ISA WG
accept?" - i don't bother with things that would not pass that filter :)

> element-width
> overrides, saturation, etc., from the VF loop. Does that happen, or is it
> the case that an addi without SVP64 prefix is never subject to any
> modification (i.e. it only ever accesses the GPRs specified by RA and RT in
> the instruction word)?

correct. [future-SVP64-Single on the other hand is an entirely different
matter, best left for another time]

what that means - and this is really neat and innovative - you can
use an *entire chain* of temporary intermediate Scalar instructions
in what otherwise is a Vector Loop

(!?!?!)

the classic example is Complex Number FFT.  i implemented that as
a Vertical-First Loop

what *that* means is that you are not wasting massive amounts
of temporary Vector Registers just because all the Vector Arithmetic
is "Horizontal" (elements-first), you can *mix and match*.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;h=fceb6b38#l612

 612             "sv.fmuls 24, *0, *16",    # mul1_r = r*cos_r
 613             "sv.fmadds 24, *8, *20, 24",  # mul2_r = i*sin_i
 ...
 620             "sv.ffadds *0, 24, *0",    # vh/vl +/- tpre

here it is:

* reading *vector-indexed* sources but the destination r24 is scalar
* r24 the *scalar* goes into FP mul-add-sub producing r24 *scalar*
* r24 *scalar* goes into twin +/- butterfly taking
  *0 as one side of the input-output and
  *(0+MAXVL) as the other

and if the loop is small enough to fit into Multi-Issue Reservation
Stations then WaW register-renaming may AUTO-VECTORIZE r24 and place
it into the exact same massive wide SIMD back-end ALUs as the other
(explicit) Vector registers.

Any Horizontal-only ISA whether Vector or SIMD *has* to allocate
an entire *Vector* r24 because there is no other option but to
work EXPLICITLY at the width of the SIMD/Vector register,
element-for-element.
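
a rough way to see the difference in temporary storage. this is a plain
Python model, NOT the SVP64 simulator: the function names are invented,
the butterfly shape (tpre = r*cos + i*sin, then lo +/- tpre) is an
assumption based on the comments in the listing above, and fused
multiply-add is modelled as an ordinary multiply followed by an add.

```python
def horizontal(rl, rh, ih, cos, sin):
    """Each operation runs across ALL elements before the next starts,
    so every intermediate needs a whole vector of storage."""
    n = len(rl)
    mul1 = [rh[k] * cos[k] for k in range(n)]            # vector temp
    tpre = [ih[k] * sin[k] + mul1[k] for k in range(n)]  # vector temp
    out_l = [rl[k] + tpre[k] for k in range(n)]
    out_h = [rl[k] - tpre[k] for k in range(n)]
    return out_l, out_h, 2 * n   # elements of temporary storage used


def vertical_first(rl, rh, ih, cos, sin):
    """All operations run for ONE element, then the loop steps to the
    next element, so a single scalar temporary (r24 in the listing)
    is reused every time round."""
    n = len(rl)
    out_l, out_h = [0.0] * n, [0.0] * n
    for k in range(n):
        t = rh[k] * cos[k]       # scalar temp, like "sv.fmuls 24, ..."
        t = ih[k] * sin[k] + t   # same scalar, like "sv.fmadds 24, ..."
        out_l[k] = rl[k] + t     # twin add/sub writes vector results
        out_h[k] = rl[k] - t
    return out_l, out_h, 1       # one scalar of temporary storage
```

both functions perform the same arithmetic in the same order per
element, so the results match; only the amount of temporary storage
differs (2 x VL elements versus one scalar in this model).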

> I was concerned with the case where there is no SVP64 prefix before an
> instruction. In that case, is it correct to say that it is guaranteed to
> behave exactly in all respects as specified in the current architecture,
> regardless of any values in SVSTATE or any other SPR?

absolutely correct. can you imagine the freaking-out that would occur?
i can :)

(and save/restore context-switch would become a nightmare, you'd have
no idea if you could safely use even one GPR!)
