[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Sun Aug 7 12:14:49 BST 2022

On Sun, Aug 7, 2022 at 12:19 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
>
> lkcl wrote:
> > once you have read those papers you will see the possibilities
> > intuitively and understand that what is in the current SV Spec is by
> > no means the final word.
>
> Indeed they are, and I am still digesting them.

Snitch is pretty easy - barrel processor with Deterministic Scheduled
FIFOs on LD/ST.  Extra-V is also easy, it's a Deterministic
Scheduled "pre-process-filter".  ZOLC is mind-blowing and
takes a lot to get but is also Deterministic Scheduled nested
looping.  and SVP64 is of course Deterministic
Scheduled hardware-level looping (esp. matrix, dct and fft).

the key phrase here: deterministic scheduling, deterministic
scheduling, deterministic scheduling.  the thing is, all of them
have absolutely astonishing power, bandwidth and execution
reduction figures.

ZOLC: 45% reduction in instructions executed for MPEG estimation.
Snitch: 85% power reduction.
Extra-V: bandwidth reduction equal to the sparseness of graphs.
SVP64: program size reduction 50% and complexity reduction.

these are not "pissing about" numbers.  other ISA designers are
absolutely delighted if they get a 5% reduction in anything.

> > https://libre-soc.org/nlnet_2022_opf_isa_wg/
>
> I wish you luck in your endeavors here.

appreciated. we will start with small RFCs (2 instructions), get the
process established.  then slowly ramp up.  interestingly everyone
else has to do the same (incl. IBM)

> > going back to architectural resources: nah. it's 5 instructions
> > with 5/6-bit XO (like addpcis or the crand/or/xor group) and
> > 25% of EXT001. that is in no way a "lot of opcode resources".
>
> Unless I badly misread the opcode maps, EXT001 is the /vast/ majority of
> the remaining opcode space.

i get the feeling simply from IBM's experience with multi-issue
that they're reluctant to use 64 bit ops.  they *know* that if all
ops go 64 bit then that basically cuts NUM_ISSUE_PER_CLOCK
in half.

    8 wide read => 32 bit => 8 wide issue
    8 wide read => 64 bit => **FOUR** wide issue

the only reason we get away with SVP64 is because

    8 wide read => 64 bit => MASSIVE spamming execution

and that's due to "times VL"
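
the arithmetic above can be put into a rough sketch. this is only a
toy model under stated assumptions: "8 wide read" is taken to mean
eight 32-bit words fetched per clock, the function names are made up
for illustration, and the "times VL" figure is element operations
handed to the back-end, not a claim about how many actually execute
per clock.

```python
# Toy model of fetch width vs issue width. All names here are
# hypothetical and only illustrate the arithmetic in the text.

FETCH_WORDS = 8  # assumption: "8 wide read" = eight 32-bit words per clock


def issue_per_clock(insn_bits, fetch_words=FETCH_WORDS):
    """Instructions that fit into one clock's worth of fetched words."""
    return (fetch_words * 32) // insn_bits


def element_ops_per_clock(insn_bits, vl, fetch_words=FETCH_WORDS):
    """Element operations handed to the back-end per clock of fetch,
    assuming every fetched instruction loops VL times."""
    return issue_per_clock(insn_bits, fetch_words) * vl


print(issue_per_clock(32))               # 8
print(issue_per_clock(64))               # 4
print(element_ops_per_clock(64, vl=16))  # 64
```

so going from 32 bit to 64 bit halves the instruction count per fetch,
but with VL=16 (picked arbitrarily here) the same fetch still describes
far more work than eight scalar ops would.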

> > the moment you add 48 bit the variable length encoding
> > massively complexifies multi issue detection and starts
> > to interfere with the parallelism achievable.
>
> Not variable length:  the VPU

(you mean Vector Processor)

> would have its own instruction set

which is exactly what i don't want to go *anywhere* near.
the whole purpose of the exercise here for nearly 4 years
has been to learn from the "hard knocks" lesson of other
non-uniform GPU/VPU (VPU==Video Processing Unit)
architectures.

the complexity of the software drivers for 3D is absolutely
bonkers.

> number.  The VPU would *not* implement Power ISA at all.

that's very very very VERY much such a nightmare in terms
of compilers, software ecosystem and bandwidth that i want
to go noooowhere near it :)

i have studied GPU design: it is beyond insane.  the only reason
GPUs are sold as products today is because multi billion dollar
companies are doing so. they throw hundreds of millions at compilers
alone.

even just developing binutils for it would be a from-scratch project
with zero possibility of leveraging any other ISA.

to give some idea of timescales it took TWELVE years for OR1K
to reach a linux boot.

>  The VPU
> interface in the Power core would be a Custom Extension.

yep, this is unfortunately GPU-nightmare-territory with 20+ man-year
timescales when the software ecosystem is brought in.
we did the analysis 2 years ago, when moving from RV. took a few
weeks/months to go over everything.

> Wait... a /different/ paper?  

no

> URL?

https://libre-soc.org/openpower/sv/SimpleV_rationale/

> What about some form of packetized serial Wishbone on DisplayPort PHY?

interestingly wishbone is actually not adequate on its own.
although once "tags" kick in then it could be leveraged.

the PHY is a separate matter and a separate discussion. for
early prototypes, staying with a parallel bus PHY is perfectly fine.

> Now you are suggesting a return to the "sub-ISAs" that IBM already tried
> once and rolled back into a single mainline, if I understand the history
> of Power ISA correctly.

and caused immense problems for everyone !IBM as a result.

that decision was of course made precisely because the
ISA was both ITU-style closed and there *was* no other
competitor product in *any* market. they basically went "well
screw this, we might as well drop all the software support for
everything but what *WE* need".

that's the point when they f*****d the EABI (1.9 makes VSX
mandatory), started submitting "#ifdef POWER9" to glibc6,
ripped out all of the hwcaps dynamic loading of alternative
library implementations.

they've done one hell of a lot of damage that now needs to
be undone.

> > again: look at the SV rationale, for the link to the Snitch paper,
> > they suggest synchronous time-division multiplexing and
> > achieve 85% power reduction as a result.
>
> Which is basically the CDC 6600's "barrel" peripheral processor.

the difference being here it is turned around: the *main core*
is a TDMx processor.  FIFOs bypassing L2 and L1 are used
between LDST and ALUs. the data doesn't even hit regfile.

> > instructions that make FIFO queues between ALUs the primary
> > building blocks.
>
> No queues, not if ALU latency is short enough and f_CLK low enough to
> push the data through multiple ALUs within a single half-period.  In
> other words, executing instructions in groups as they can be collected
> in-order.

yes fine under those conditions.  i am thinking inter-core and
Snitch-style between non-core subsystems (such as Memory or
more accurately DMA controllers) where the DMA controller
*might actually be* yet another full Power ISA core.

basically i want to take the brakes off the idea introduced in
Snitch where you have a FIFO connected directly to a register
by way of "tagging".

but for the *other end* of that FIFO to be connectable to literally
anything - a DMA controller, another register, another register
*on another core* - anything.
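
a minimal behavioural sketch of that idea, in python. everything in
it is hypothetical (the class, the method names, the use of a plain
deque as the FIFO): it is not how Snitch or any SV proposal
actually specifies tagging, just a toy showing a register whose reads
or writes are redirected to a FIFO whose other end can be anything.

```python
from collections import deque


class TaggedRegFile:
    """Toy register file: a register may be tagged as the source or
    sink end of a FIFO. Hypothetical model, not taken from any spec."""

    def __init__(self, nregs=128):
        self.regs = [0] * nregs
        self.sources = {}  # regnum -> FIFO that reads pop from
        self.sinks = {}    # regnum -> FIFO that writes push into

    def tag_source(self, regnum, fifo):
        self.sources[regnum] = fifo

    def tag_sink(self, regnum, fifo):
        self.sinks[regnum] = fifo

    def read(self, regnum):
        fifo = self.sources.get(regnum)
        if fifo is not None:
            # real hardware would stall on empty; here popleft() raises
            return fifo.popleft()
        return self.regs[regnum]

    def write(self, regnum, value):
        fifo = self.sinks.get(regnum)
        if fifo is not None:
            fifo.append(value)  # data never lands in the regfile
        else:
            self.regs[regnum] = value


# one FIFO, two "cores": core A's r5 feeds core B's r7
link = deque()
core_a, core_b = TaggedRegFile(), TaggedRegFile()
core_a.tag_sink(5, link)
core_b.tag_source(7, link)

core_a.write(5, 42)
core_a.write(5, 43)
print(core_b.read(7), core_b.read(7))  # 42 43
```

swap core_b for a DMA model or another register on core_a and the
FIFO end does not care, which is the whole point.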

> > you forgot to ask the corollary question, how *do* you do it?
> > and the answer is:
> >
> > sv.addi r67, r123, 8
>
> Why should this be a separate assembler mnemonic?  Why would the
> assembler not simply accept "ADDI R67, R123, 8" and produce the prefixed
> opcode?

i did not want to get into stepping on the standard Scalar Power
ISA namespace.  an "sv." prefix makes it very clear that we fully
intend to be respectful and nondisruptive.

[there has already been a significant amount of bluster which has
had to be quelled].

if someone from *IBM* asks that question then the discussion
can take place safely.

> > i assume you mean "if you extend the GPRs to 128 entries are
> > there circumstance where scalar 32-bit nonprefixed ops can't
> > get at them" and the answer is "of course, but you always
> > just use the prefixed version of the exact same op to do so".
>
> No, I mean "are there any existing fixed-point instructions that are
> only applicable to a subset of the fixed-point registers in Power ISA?"

ah, okay.  only in VLE book (16 bit Compressed) which was
invented by the Motorola team that were tragically all killed
in a Malaysia plane crash 12 years ago.

everything else is 100% uniform. which is one of the reasons
why the MAJOR opcode space (which is 6 bit not 5) is under
such enormous pressure.

it does not help that IBM has secret processor designs
with unpublished ISA extensions... oh, *and* Power ISA is
part-compatible with z-Series mainframes!

urrr...

> All of your examples are PackedSIMD instruction sets that were
> subsequently extended to wider SIMD.

they extended QTY as well, but yes.

> does not work for Power ISA because x86 was a giant mess even before AMD
> extended it like that, but I think this illustrates the issue at hand here.

yes, agreed.

> Then there should be a subset of SV, orthogonal to the rest of SV, that
> only extends the register file.

can't.  really, this doesn't work.  the EXTRA encoding is intimately
tied into the prefix.

i mean, you _could_ define a (fifth) Compliancy Subset which
specifically and only permits (a) Scalar EXTRA setting and
(b) requires all-zeros for all other parts of the 24-bit RM prefix...

no, not useful.

you have to understand and appreciate that within that EXTRA
encoding there is just not enough space.  see EXTRA2/3
https://libre-soc.org/openpower/sv/svp64/#index13h1

now look at the table rows which are "Scalar", only.  ignore
the ones marked "Vector".  you see how in EXTRA2 the numbers
stop at 63? there are 128 regs but for any instruction with
EXTRA2 you can only access scalars up to 63?? moo?

CR Fields are "even worse", EXTRA2 scalar only goes up to CR15!
Even EXTRA3 scalar only goes up to CR31.
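
those limits fall out of simple bit-counting. the sketch below is a
simplified model, not the spec: it assumes that for *scalars* EXTRA2
contributes one extension bit and EXTRA3 two, and that those bits sit
above the existing 5-bit GPR field or 3-bit CR field. the function
names are made up for illustration; check the EXTRA2/3 tables for the
actual encoding.

```python
# Simplified model of scalar register reach under EXTRA2/EXTRA3.
# Assumption: scalar extension bits are prepended as high-order bits.

GPR_FIELD_BITS = 5  # RA/RB/RT fields in a 32-bit Power ISA instruction
CR_FIELD_BITS = 3   # BF/BFA-style CR Field numbers


def max_scalar_regnum(field_bits, extra_scalar_bits):
    """Highest scalar register number reachable."""
    return (1 << (field_bits + extra_scalar_bits)) - 1


def scalar_regnum(field, extra, field_bits):
    """Combine the instruction field with the scalar extension bits."""
    return (extra << field_bits) | field


print(max_scalar_regnum(GPR_FIELD_BITS, 1))  # 63  (EXTRA2 GPR)
print(max_scalar_regnum(GPR_FIELD_BITS, 2))  # 127 (EXTRA3 GPR)
print(max_scalar_regnum(CR_FIELD_BITS, 1))   # 15  (EXTRA2 CR Field)
print(max_scalar_regnum(CR_FIELD_BITS, 2))   # 31  (EXTRA3 CR Field)
print(scalar_regnum(field=3, extra=1, field_bits=GPR_FIELD_BITS))  # 35
```

under this model only EXTRA3 GPRs reach the full 128 as scalars.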

accessing these high numbered registers as Scalars *critically
relies* on having predicate masks (1<<r3 aka 2^r3 mode in
particular) and at least a vestigial understanding of this being
a Vector ISA Extension.

i really don't want to go anywhere near putting forward a sub-par
Compliancy Level that would cause people very reasonably to
reject it.

it would be much better to design a completely different extension
using (yet more of) EXT001 rather than damage SVP64 which
was very much designed with Scalable Vector Processing in mind.

a separate EXT001 (one of the 64 squares) for Scalar numbering
would work well because there will be 20 bits available, to easily
cover 4 bits per register (for CR Fields) even if there are 4 operands

the only downside being, as i explained right at the top, that's now
64 bit opcodes and you reduce multi-issue and/or have to
increase (double) fetch bandwidth to sustain the same issue
width. just to get scalar ops into execution.

if that was a standalone proposal i would expect the reaction to
be "meh".

but as a supplement to SVP64 it's really good.

l.