[Libre-soc-dev] SimpleV development discussion (behind that "libresoc mode" bit)

Thu Oct 8 21:59:12 BST 2020

(hi folks i am sending this to the openpower-hdl-cores list so that it
is recorded for when the OPF ISA WG mailinglist is formed)

https://bugs.libre-soc.org/show_bug.cgi?id=213#c37
https://libre-soc.org/openpower/opcode_regs_deduped/

for anyone interested, we started the discussion of how to leverage
the OpenPOWER scalar ISA in combination with prefixing (similar to
v3.1B prefixes) to create a vector ISA... *with no
additions to the underlying scalar ISA*.

the advantage this gives us (in time-saving alone) becomes especially
clear when examining the number of opcodes for VSX, compared to the
rest of the ISA.

not only that but from a microarchitectural perspective a preliminary
and correct implementation of SimpleV can be had by literally
inserting a hardware for-loop from 0 to VL-1 after instruction decode
and before instruction issue.
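
as a rough illustration of that for-loop, here is a small python
sketch.  it is not the LibreSOC implementation: the ScalarOp class, the
expand function and the rt_vec / ra_vec / rb_vec flags are names made
up for this example, and the "vector register steps by one per
element" rule is an assumption for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarOp:
    # hypothetical decoded scalar instruction: mnemonic plus register numbers
    name: str
    rt: int
    ra: int
    rb: int


def expand(op, vl, rt_vec=True, ra_vec=True, rb_vec=True):
    """Yield vl element-level scalar ops for one prefixed instruction.

    A register marked as a vector steps by one per element; a register
    marked as scalar stays fixed (assumed behaviour, illustration only).
    """
    for i in range(vl):
        yield ScalarOp(op.name,
                       op.rt + (i if rt_vec else 0),
                       op.ra + (i if ra_vec else 0),
                       op.rb + (i if rb_vec else 0))


# one "add r8, r16, r24" with VL=4, RB left as a scalar
ops = list(expand(ScalarOp("add", 8, 16, 24), vl=4, rb_vec=False))
for o in ops:
    print(o)
```

everything downstream of the loop only ever sees ordinary scalar
operations, which is why the scalar ISA itself needs no additions.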

the (long, prior, detailed) similar development of SV using RISC-V
(since abandoned) gave some extremely valuable features that we would
definitely like to see included. these are:

1) predication
2) data-dependent fail-on-first

the latter is outlined in this paper and its strategic importance in a
modern ISA cannot be overstated:

https://arxiv.org/pdf/1803.06185.pdf

predication is something that, unfortunately, has to be "built-in" to
an ISA at the start... *or* retrofitted with careful analysis using
"prefixes".

my feeling is that a PowerISA v3.2 *might* be able to use major opcode
1 prefixing on VSX to introduce predication, if there are spare bits,
and if so i thoroughly advocate investigating ffirst retrofitting at
the same time.

the difference that these two things make to an ISA is just
staggering.  the strncpy VSX patch to glibc6 is a whopping 250
assembly instructions, and the current proposed patch as of last
submission *segfaults if called at the top of a page boundary*!

there's not even a way you can detect if it *would* segfault, not
without introducing quite serious Spectre-grade security holes into
the ISA or OS by allowing userspace to probe valid / invalid memory
pages.

by contrast, RVV and SV vector-based ffirst-based strncpy? 14
instructions.  the entire function.  no SIMD setup.  no SIMD cleanup.
no SIMD conditional preparatory memory segfault testing.  14
instructions for the *whole* implementation of vectorised strncpy.
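
to show why ffirst removes the page-boundary problem, here is a
behavioural model in python.  it is a sketch under assumptions, not the
14-instruction routine: PAGE, Fault, load_byte, ffirst_load and
copy_until_nul are hypothetical names, memory is a plain dict, and the
copy stops after the first NUL without doing strncpy's zero-padding.

```python
PAGE = 4096  # assumed page size for the model


class Fault(Exception):
    """Stands in for a memory access fault."""


def load_byte(mem, valid_pages, addr):
    if addr // PAGE not in valid_pages:
        raise Fault(addr)
    return mem.get(addr, 0)


def ffirst_load(mem, valid_pages, addr, vl):
    """Load up to vl bytes; return (data, new_vl)."""
    out = []
    for i in range(vl):
        try:
            out.append(load_byte(mem, valid_pages, addr + i))
        except Fault:
            if i == 0:
                raise  # element 0 faults for real
            break      # later element: truncate VL instead of trapping
    return out, len(out)


def copy_until_nul(mem, valid_pages, src, n, maxvl=8):
    """Copy at most n bytes from src, stopping after the first NUL."""
    result = []
    while n > 0:
        data, vl = ffirst_load(mem, valid_pages, src, min(n, maxvl))
        if 0 in data:
            vl = data.index(0) + 1  # data-dependent truncation of VL
        result.extend(data[:vl])
        if result[-1] == 0:
            break
        src += vl
        n -= vl
    return bytes(result)


# "hi\0" sits in the last three bytes of page 0; page 1 is unmapped
mem = {PAGE - 3: ord("h"), PAGE - 2: ord("i"), PAGE - 1: 0}
copied = copy_until_nul(mem, {0}, PAGE - 3, 16)
print(copied)
```

the load asks for 8 bytes, runs into the unmapped page at the fourth,
and simply comes back with a shorter VL.  no probing of the next page
is needed.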

our first task has been to identify the "categories" of OpenPOWER ops:

* 2xGPRin-1xGPRout
* 1xGPRin-1xCRout

and so on, then to group instructions that match those "patterns" and
analyse them in depth.

this then allows us to allocate and prioritise bits in the "prefix"
which will say "actually RA is vectorised" or, "actually BA Condition
Register is vectorised".
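
a minimal python sketch of that grouping step follows.  the SAMPLE_OPS
table is a tiny hand-written sample made up for this example (it is not
read from the real CSV files), kind and profile are hypothetical helper
names, and CR fields and CR bits are lumped together as plain "CR" to
keep it short.

```python
from collections import Counter, defaultdict

# hypothetical sample: (mnemonic, input fields, output fields)
SAMPLE_OPS = [
    ("add",   ("RA", "RB"), ("RT",)),
    ("and",   ("RS", "RB"), ("RA",)),
    ("extsb", ("RS",),      ("RA",)),
    ("cmp",   ("RA", "RB"), ("BF",)),
    ("crand", ("BA", "BB"), ("BT",)),
]

GPR_FIELDS = {"RA", "RB", "RS", "RT"}


def kind(field):
    # simplification: anything that is not a GPR field is treated as CR
    return "GPR" if field in GPR_FIELDS else "CR"


def profile(ins, outs):
    """Build a label such as '2xGPRin-1xGPRout' from operand fields."""
    parts = []
    for fields, suffix in ((ins, "in"), (outs, "out")):
        counts = Counter(kind(f) for f in fields)
        for k, n in sorted(counts.items()):
            parts.append(f"{n}x{k}{suffix}")
    return "-".join(parts)


groups = defaultdict(list)
for name, ins, outs in SAMPLE_OPS:
    groups[profile(ins, outs)].append(name)

for label, names in sorted(groups.items()):
    print(label, names)
```

each distinct label is then one "category" that needs its own decision
about which prefix bits mark which operand as a vector.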

yes, really: vectorised Condition Registers.

although it is very early days we are finding that PowerISA is
extremely cool and lends itself well to useful vectorisation, and
that's mainly down to the CRs.

it goes like this: if CRs are vectorised and used to store the
comparisons from vector results, then follow-up manipulation with
vectorised crand/crnor is more powerful and flexible than what went
into VSX *and it's not limited to 16 bytes* because VL can be set to
any value from 1 to 64.

not only that, but we are going to add predication to the SV Prefix
as well, meaning that not just the Arithmetic ops can have elements
skipped: CR vectorised ops can have element operations skipped as
well.

by keeping each individual element's compare result in sequential CRs
we clearly have far more flexibility, which is made even more powerful
as further described below.
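
to make the vectorised-CR idea concrete, here is a small python model.
it is only a sketch: cmp_field, vec_cmp and vec_crand are hypothetical
names, a CR field is modelled as a dict of lt/gt/eq bits (SO left out),
and the bit-selection interface is an assumption, not the real crand
encoding.

```python
def cmp_field(a, b):
    """One CR field (SO omitted) for a signed compare of a with b."""
    return {"lt": a < b, "gt": a > b, "eq": a == b}


def vec_cmp(xs, ys):
    # one CR field per element, kept in sequential "CRs"
    return [cmp_field(x, y) for x, y in zip(xs, ys)]


def vec_crand(cra, bita, crb, bitb):
    # element-wise AND of one chosen bit from each CR vector
    return [fa[bita] and fb[bitb] for fa, fb in zip(cra, crb)]


VL = 5
x = [3, 9, 5, 0, 7]
lo = [1] * VL
hi = [8] * VL

above = vec_cmp(x, lo)   # gt set where x > 1
below = vec_cmp(x, hi)   # lt set where x < 8
in_range = vec_crand(above, "gt", below, "lt")
print(in_range)
```

the result is one bit per element saying "1 < x < 8", built purely from
compare and CR-logical operations, with VL free to be whatever was set
rather than a fixed 16 bytes.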

one crucial idea proposed so far: No to SO.  XER.SO and its
propagation to CR0 is the bugbear of even the scalar ISA (creating a
major RD-MOD-WR hazard in OoO engines) reducing performance so much
that nobody wants to use it.

this makes both SO and OE utterly dead (wasted space), and effectively
leaves us with two opportunities when flipping to "libresoc mode":

1) hypothetically the CR bit that mirrors SO is free to be
re-interpreted *as* a predicate bit

2) the OE bit in any given opcode can be re-interpreted to *be* the
indicator to enable predication for a given instruction.

the combination here makes sense for "applying" of predication, as follows:

if CPU in libresoc mode:
    for i in range(VL):
        if opcode.OE and CR(i).SObit == 0:
            continue  # predicate bit clear: skip this element
        GPR(RT+i) = ScalarOp(GPR(RA+i))
else:
    # do normal Power v3.0B OE stuff
    GPR(RT) = ScalarOp(GPR(RA))

by ensuring that, unlike OE SO, these are not set / used on a tight
interdependent READ-MODIFY-WRITE cycle but are instead parallel and
entirely independent elements *including* the predication, as shown in
the above VL loop, we get plenty of scope for high performance
microarchitectures.
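
for anyone who wants to play with the loop, here is a runnable python
model of it.  it is a sketch under assumptions: sv_execute and its
parameter names are made up for this example, the register file is a
plain list, cr_so holds just the re-purposed SO bit of each CR, and the
non-libresoc branch ignores the normal OE / SO behaviour entirely.

```python
def sv_execute(gpr, cr_so, rt, ra, vl, oe, scalar_op, libresoc_mode=True):
    """Apply scalar_op element by element, with optional predication."""
    if not libresoc_mode:
        # plain scalar behaviour (normal OE / SO handling not modelled)
        gpr[rt] = scalar_op(gpr[ra])
        return
    for i in range(vl):
        if oe and cr_so[i] == 0:
            continue  # predicate bit clear: leave GPR(RT+i) untouched
        gpr[rt + i] = scalar_op(gpr[ra + i])


gpr = list(range(32))      # r0..r31 initialised to their own index
cr_so = [1, 0, 1, 1]       # element 1 is masked out
sv_execute(gpr, cr_so, rt=16, ra=4, vl=4, oe=True,
           scalar_op=lambda v: v * 10)
print(gpr[16:20])
```

each iteration reads only its own predicate bit and its own source
element, so the iterations can be issued in any order or in parallel.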

the point of mentioning all of this is for a number of reasons.

firstly: to let people in general across the OpenPOWER community know
that this work has started, and the above gives an (unfinalised)
glimpse of what to expect.

secondly: to say that anyone interested is most welcome to participate
(at libre-soc.org)

thirdly: that this is sufficiently complex, and needs such in-depth
knowledge of the OpenPOWER ISA, that we know well in advance we need
help with review and commentary, and are not afraid to ask.

fourthly: a headsup that we intend to submit this to the OPF ISA WG
for consideration as an official extension to PowerISA, and that if
that were done without any kind of advance knowledge it would be
very.. how to say... "unkind" of us.  the reason being that we know
from experience when doing the RV version it was *18* months work (24
if including the ISA simulator and unit tests) and dropping that much
work "at" people with no advance notice is not reasonable.

feel free to drop by any time, on the dev list, isa list, or the bugtracker.

btw you'll like this: the page on the wiki for spotting duplicates was
auto-generated by a simple script that read the exact same
machine-readable CSV files that have gone into the LibreSOC ISA
decoder.  yes this was one of the planned strategic uses that made us
choose CSV in the first place :)

https://libre-soc.org/ikiwiki.cgi?do=goto&page=openpower%2Fsv_analysis.py

l.