[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Jacob Bachmeyer jcb62281 at gmail.com
Wed Aug 3 05:50:39 BST 2022


lkcl wrote:
> On Tue, Aug 2, 2022 at 5:09 AM Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
>   
>> linear.  The actual number of CAM cells required depends on
>> associativity, so that gets complicated.  Yes, power usage is definitely
>> more than linear in cache size, but I think quadratic is still an
>> exaggeration.
>>     
>
> i know it is surprising, but it is the cache misses and associated
> hammering of L2 and data pathways between L1 and L2
> (reloading of cache lines) that contributes to N^2.
>
> this effect *is* well-documented, well-researched, replicable,
> and often quoted.  it just doesn't sound logical.
>   

Quadratic still seems high, but it is definitely more than linear and 
more than log-linear.  I think we can agree that it can be enough to 
cause problems.

>>> and that includes Horizontal Summing and Parallel Prefix Sums.
>>>       
>> Unfortunately, FlexiVec's implicit vectorization model makes explicit
>> horizontal operations impossible.
>>     
>
> i have a feeling that Mitch worked out how to do it.  FMAC
> having in effect a Scalar accumulator (src==dest) whilst
> other operands get tagged as vectors, HW can detect that and
> go "ah HA! what you *actually* want here is a horizontal
> sum, let me just microcode that for you".
>   

Well, now that I think about it, yes, FlexiVec *can* express a 
horizontal sum by accumulating into a scalar register.  Hardware 
recognizes this very simply:  an ADD targeting a scalar register RX, 
using that same RX and a vector register RY as sources.  This will also 
work with the null implementation.

>>  Combining FlexiVec and VSX might
>>     
>
> nooooo.  *really*, no.  ok anyone else can do that but i have
> put my foot down and said "no" on PackedSIMD.  the lesson
> from the sigarch article is very much understated
>
> https://www.sigarch.org/simd-instructions-considered-harmful/
>
> in the case of Power ISA although it was amazingly powerful
> for its time and for its purpose (Banking, supercomputing) VSX
> is an astounding *750* additional instructions which even if you
> put say 2-3 days per instruction, unit test, and Compliance Suite,
> comes out to an eye-popping six YEARs development effort.
>
> no, no, and hell no :)
>
> but, see below...
>   

FlexiVec is, well, flexible:  for Power ISA, Flexible Vectors on each of 
the register files would be a separate always optional feature, but 
flexible vectors on the fixed-point registers are a prerequisite for 
flexible vectors on either FP or VSX.

>> around this limitation in Power ISA, if VSX can do reducing sums.
>> However, multiple cycles of this, complete with LOAD/STORE would be
>> needed to eventually reduce to a final 32bit result.)
>>     
>
> another way is to use VVM/FlexVec to do a parallel reduction,
> an explicit outer loop, but honestly it is easier to detect the
> use of the FMA (or madd) and micro-code it.
>
> last resort you can always fall back to Scalar.
>   

I had not realized that an accumulating ADD can indicate a horizontal 
sum.  Hardware can handle that in varying ways; the complexities are 
manageable even for an SIMT implementation.

>> There is a possibility of combining FlexiVec (for "string" operations)
>> and Simple-V (for reductions),
>>     
>
> if you recall the LD/ST Fault-First capability of RVV, which was
> inspired by ARM SVE FFirst, i added *Data-Dependent* Fail-First
> to SVP64 as well as LDST FFirst, to help with strings and other
> sequential data-dependence.
>
> when an Rc=1 test fails (using the same BO encoding from
> branch: one bit inverts the other 2 bits say which of EQ LT GT or SO
> to test from the CR Field) then the Vector Length VL is truncated
> at that point.
>
> an extra bit in the RM Prefix (VLi) specifies whether the truncation
> *includes* the element that failed or *excludes* it in the setting VL.
> this means for example that you can do
>
>     sv.cmpi/ff=eq/ew=8/vli *r8, 0 # ew=8 to do 8-bit cmp to zero
>
> and the truncation of VL will include the null-termination zero of
> a string.
>
> no pissing about.
>   

My use of the term "string" here was a bit unclear.  FlexiVec does not 
deal in C strings; all vectors have a counted length (in CTR), but are 
intended to be arbitrarily-long "strings" of elements.  The idea is that 
FlexiVec can handle long parallel operations, while Simple-V is used for 
shorter operations.  Since Simple-V introduces its own iteration counter 
(in SVSTATE if I understand correctly), what would prevent a Simple-V 
inner loop inside a FlexiVec outer loop?  Earlier, you mentioned that 
some algorithms have relatively simple repeatable sub-kernels.  If the 
individual applications of those sub-kernels are independent, Simple-V 
could express the per-group computation while FlexiVec expands that 
across multiple concurrent groups.

>> if Simple-V is also generalizable to FP
>> and VSX.
>>     
>
> yes to Scalar FP.  SV can be applied to anything: it actually doesn't
> care about the operation (per se).  you can do something as dumb
> as apply SV to the PackedSIMD operations of VSX if you feel so
> inclined [i don't].
>   

The main reason for wanting implicit vector operations to generalize to 
VSX is orthogonality, lack of which would likely bother the OPF ISA WG 
severely.

> [...
>>> it is a compromise, but the important thing is, it is *our choice*,
>>> absolutely bugger-all to do with the ISA itself. anyone else could
>>> *choose* to do better (or worse).
>>>       
>> Now you have variable-latency vector lanes.  :-)
>>     
>
> yyep.  not a problem for an OoO microarchitecture, in the least.
> any in-order architect will be freaking out and crying home to
> momma, but an OoO one no problem.
>   

That would make Simple-V dependent on a specific microarchitectural 
strategy, which is probably very bad in an ISA.

> [...]
>   
>> Not done in FlexiVec -- avoiding those conflicts is the programmer's
>> responsibility.
>>     
>
> to date i have been extraordinarily careful to emphasise that
> SV does not break programmer expectations in this type of way.
>   

The (non-)aliasing requirements in FlexiVec are no different than the C 
runtime memcpy() requirements and similar common restrictions.  These 
should not be difficult in the slightest for programmers to understand.  
Loop unrolling has similar requirements.  In fact, since FlexiVec *is* 
loop unrolling, it has the exact same requirements.
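To illustrate (a hypothetical sketch, not FlexiVec itself): widening a loop changes its result exactly when source and destination overlap, which is the same hazard memcpy() documents.

```python
# Why overlapping regions are the programmer's responsibility: the scalar
# loop and the widened (vectorized/unrolled) loop disagree when the
# destination overlaps the source, just as with memcpy().

def scalar_copy_shifted(buf, n):
    # one element at a time: each read observes the previous write
    for i in range(n):
        buf[i + 1] = buf[i]

def widened_copy_shifted(buf, n, width=4):
    # 'width' elements per step, as a vectorized loop would do
    for i in range(0, n, width):
        chunk = buf[i:i + width][:n - i]   # all reads happen before any write
        buf[i + 1:i + 1 + len(chunk)] = chunk

a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 5]
scalar_copy_shifted(a, 4)   # -> [1, 1, 1, 1, 1]
widened_copy_shifted(b, 4)  # -> [1, 1, 2, 3, 4]: overlap changes the answer
```

With non-overlapping regions the two loops always agree, which is why the restriction costs well-behaved programs nothing.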

>> The idea is that the vector unit probably has either its own L1 cache or
>> is directly connected to L2.  (I believe the AMD K8 did that with SSE.)
>> If an SIMT vector chain has its own L1, that cache is likely also
>> divided into per-lane sub-caches.
>>     
>
> this starts to get completely out of the realm of normal ubiquitous
> compute, needing specialist programmer knowledge and hardware
> level intricate internal knowledge.
>   

None of these details of vector unit implementation are visible to the 
programmer.  I mention them only to demonstrate that solutions are 
possible.  Would a hardware architect not be expected to have that 
knowledge?

> [...]
>> So then the best route would be to abandon both Simple-V and FlexiVec
>> and implement VVM, on the basis that Mitch likely knows (many)
>> something(s) that we do not?  :-)
>>     
>
> sigh, i wish :)
>
> no but seriously, we're committed to SV and Power ISA, now.  2
> years on SVP64 (so far), we have to see it through.
>   

So Libre-SOC is committed to Simple-V at this point and FlexiVec must be 
left as a possible future option.

> [...]
>> Not so at all:  hardware commits to a MAXVL at each FVSETUP and MAXVL
>> can vary with the vector configuration.
>>     
>
> whatever amount that is, once the loop is started you are
> committed to finishing it with no possibility of contextswitch
> (unless saving the entire state which, by design, has to include
>  the behind-the-scenes hidden vector SRAM contents).
>   

FlexiVec has /exactly/ that:  two privileged save/restore instructions 
to support context switching.  However, since most vector loops are 
expected to complete uninterrupted and hardware need not actually write 
(or load) the entire vector SRAM if the vector register contents are 
currently undefined, the overall performance cost is expected to be low.

>> Larger vector processors are expected to use the SIMT topology, where
>>     
>
> [checking: SIMT = "synchronously broadcast an instruction to
> Cores that have their own regfiles caches LDST access just no PC"
> is that accurate in this case, yes it is. ok i'm good. i say "good",
> i am not a fan of SIMT. at all. although the opportunity to
> entirely hide its very existence behind autovectorisation is
> pretty compelling i have to admit]
>   

Having SIMT hidden behind a simple vector loop was /the/ major 
motivation for FlexiVec.

>> the entire vector register set is not stored in a single physical
>> register file, but is instead distributed across the vector lanes.
>>     
>
> this just means that every SIMT core [standard-core-in-every-respect
> other-than-receiving-broadcast-instructions] has to perform a save of the
> hidden SRAM used for behind-scenes vector autoregs, on a
> contextswitch.
>
> it moves the problem, it doesn't solve the problem.
>   

Each of them has its own (fixed) subset of the vector context.  An SIMT 
vector unit already has parallel memory access, so context save/restore 
is no more expensive than the largest possible user vector store/load 
operations.

> [...]
>> Power ISA currently has
>> branches, but not predication.  Using a short forward branch to imply
>> predication /might/ be possible.
>>     
>
> a better bet would be some form of "tagging". "if you use r5
> then please consider the instruction to be predicated, please
> use Condition Code {n} as the predicate".  or, just,
> "the next instruction is predicated" which is pretty much exactly
> what a short branch would be, yes.
>   

Predication is thus solved:  the null implementation executes the 
forward branch and produces the correct result, while vector-capable 
implementations must recognize forward branches and translate them to 
vector predication.

At this point:  "While in use, the Flexible Vector Facility overrides 
the Branch Facility. ..."
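A rough Python model of the branch-as-predication equivalence (the condition and operation are invented for illustration; computing an absolute value via a skipped negate):

```python
# Null implementation: the forward branch is simply taken or not, one
# element per iteration.  Vector implementation: the branch condition
# becomes a per-element predicate mask.  Both must produce the same result.

def null_impl_abs(elements):
    out = []
    for x in elements:
        # 'bge skip' -- forward branch over the next instruction when x >= 0
        if x < 0:
            x = -x          # predicated instruction: executes only if not skipped
        out.append(x)
    return out

def vector_impl_abs(elements):
    mask = [x < 0 for x in elements]              # branch condition as mask
    return [-x if m else x for x, m in zip(elements, mask)]

print(null_impl_abs([3, -5, 0]))    # -> [3, 5, 0]
print(vector_impl_abs([3, -5, 0]))  # -> [3, 5, 0]
```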

>>> unfortunately for RVV it means that 100% of portable algorithms
>>> *have* to contain a loop. i note that in the comparison table.
>>> at least with VVM/FlexiVec the auto-vectorisation is a hardware
>>> abstraction.
>>>       
>> No, FlexiVec absolutely requires a loop unless the vector length is
>> exactly one element.
>>     
>
> you mean CTR=1 i assume?
>   

Yes.

> regardless, the very fact that loops are required (for hardware
> parallelism either automatically for VVM/FlexiVec or
> explicitly in the case of RVV) i consider to be a limitation.
>   

The reason that a loop is absolutely required is that the null 
implementation only processes one element on each iteration of the 
loop.  Software cannot know what MAXVL will be in advance, only that 
MAXVL is greater than zero.
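A rough model (hypothetical, for illustration) of why the same software loop is correct for any hardware MAXVL, with MAXVL = 1 being the null implementation:

```python
# Software writes one loop over CTR elements; hardware chooses how many
# elements each iteration actually processes (up to its MAXVL).  The
# result must be identical for every MAXVL > 0.

def flexivec_style_loop(src, maxvl):
    dst = []
    ctr = len(src)             # element count, as in the CTR register
    i = 0
    while ctr > 0:
        step = min(maxvl, ctr)               # hardware-chosen batch size
        dst.extend(x * 2 for x in src[i:i + step])   # the loop body
        i += step
        ctr -= step
    return dst

data = [1, 2, 3, 4, 5]
print(flexivec_style_loop(data, 1))   # null implementation -> [2, 4, 6, 8, 10]
print(flexivec_style_loop(data, 4))   # wider hardware, same -> [2, 4, 6, 8, 10]
```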

> SV although you have to call SETVL you can use sv.ld / sv.st with
> predication to do a stack save/restore, effectively Compressed
> LDST multi
>   

This also works because Simple-V uses the main register file; other 
vector models define separate vector storage.

> [...]
>>> the problem with the design of FlexiVec is that once you
>>> are committed to any given loop you ABSOLUTELY MUST
>>> complete it.
>>>       
>> Nope!  You do not even have to complete any one instruction either,
>> since you can use the slot in the scalar PRF to track the current vector
>> progress.  FlexiVec is not available to privileged code, so the only
>> requirement is that the interrupt handler restore any registers it
>> changes, which any existing interrupt handler already meets.  (To
>> privileged code, the scalar registers will appear to contain the
>> internal vector offsets when problem state is running a vector
>> computation; as long as that is restored properly the vector computation
>> will resume correctly.)
>>     
>
> unfortunately i am running out of time to verify whether that's
> correct or not. i know it will take weeks and potentially months to do
> a hardware-cycle-accurate simulator and unfortunately i have to draw a
> line under FlexiVec and eliminate it from SV.
>
> when we had time was when SV was in development. ironically if the
> discussion had taken place 2 years ago when we started SV for Power i
> would not have known what you were talking about :) Vertical-First was
> only added about a year ago and that was only when i finally recognised
> VVM as being also Vertical-First. what i am going to do however is create
> a comp.arch thread referring to this discussion. i think people there
> will be interested to share insights esp. on FlexiVec.
>   

It is worth noting that I could not have proposed FlexiVec prior to 
those developments.  :-/

> [...]
>>> the key thing i would like to know is, would you be interested to
>>> properly investigate this, funded by an NLnet EUR 50,000 Grant,
>>> to put Snitch, EXTRA_V, SVP64 and OpenCAPI all together in
>>> one pot and create Coherent Distributed Compute?
>>>       
>> That sounds very interesting, but I will need to do quite a bit of
>> reading before I will be able to say with confidence that I /can/ do that.
>>     

Also, as mentioned below, OpenCAPI has to be excluded from that mix at 
this time if I am involved.

>> Are we confident that Simple-V will actually be accepted at OPF?
>>     
>
> it'll be proposed.  if it is rejected then RED Semiconductor will to the
> strict letter activate the provisions allowed to be followed, as set
> out on page xii of the Power ISA Specification, to use EXT022.
> *we* will have acted "in-good-faith", in other words, and it becomes
> neither our problem nor allows any OPF Member to complain
> [including IBM] if we have, in fact, followed precisely and exactly
> the proposal procedures set out.
>
> one company _did_ in fact attempt to blithely drive a coach and
> horses through the OpenPOWER ISA EULA and the ISA WG,
> even telling me that they actually intended to go ahead and design
> 3D and VPU instructions, create a mass-volume processor, then
> expect IBM and other OPF Members to "accept a fait-accomplit".
> which if they'd bothered to read the EULA they'd have found that
> approach to be a direct violation.
>
> they're not around any more.
>   

This is the main reason I would have wanted FlexiVec for Power ISA 
("Flexible Vector Facility" to put it in quasi-IBM-speak) accepted as a 
Contribution *before* even /beginning/ an implementation.

>>  (That
>> is actually what I see as the biggest risk right now -- that Simple-V
>> will end up needing too many architectural resources and being rejected
>> on those grounds.
>>     
>
> it was designed such that it could be explained to IBM that there exists
> an *option* - not repeat not repeat not a REQUIREMENT -to leverage and
> exploit *their* preexisting IBM POWER proprietary architecture. which is
> when you try to compare it against any FOSSHW design definitely already
> well into "too many architectural resources".
>   

Read the definition of "architectural resources" in the OpenPOWER spec 
license terms.  In this case, mostly opcode assignments.

> [...]
>   
>>  I suppose that FlexiVec could be considered a
>> contingency strategy for that case; its design specifically minimizes
>> required architectural resources.)
>>     
>
> i'm estimating a delay of approx 8 to 12 months, putting RED
> Semiconductor's entire future "on hold" in order to due proper
> due diligence on VVM/FlexiVec. including Simulator, unit tests,
> documentation. compared to SV for which that work (2x over) has already
> been done, giving me confidence and evidence that SV is sound, i have
> to rationally conclude "no" on VVM/FlexiVec.
>
> it is what it is.
>   

Fair enough -- you already have FlexiVec-Null.  :-)

>> The VPU concept was a little different than the distributed Processing
>> Elements; the idea was based on the CDC6600 architecture, with a VPU
>> that would probably be larger than its host processor, with a 1024 slot
>> register file, 48-bit instructions (to accommodate the large register
>> file), and 2-issue VLIW (to align 48-bit instructions with the basic
>> 32-bit wordsize used in Power ISA).  Interestingly, the use of VLIW also
>> means that you can use one side of the VLIW pair to provide an extended
>> immediate to the other, and possibly a fourth word in each VLIW packet
>> to provide loop control or other functions.
>>     
>
> VLIW makes me nervous. i saw how that went with SGI and with
> TI. i had to watch a colleague do the entirety of CEDAR Audio's
> DSP realtime processing in pure assembler because the compilers
> were that bad.
>   

The other option would be to have the VPU run 64-bit instructions and 
drop VLIW, since 64-bit instructions inherently align with Power ISA's 
32-bit words.  (VLIW was suggested purely to resolve the misalignment 
between the 32-bit words Power ISA uses and the suggested 48-bit VPU 
instructions.)

> with IBM having created SMP coherency and atomics clean over
> even a hundred thousand uniform cores there is precedent for
> keeping to a uniform ISA and not going VLIW.
>
> my feeling is that it is critically important that the main CPU(s)
> be capable in full of executing the satellite core programs.
>   

I referenced the CDC6600 architecture.  The Power core would be the 
/peripheral/ processor that handles I/O and the OS.  The VPU would 
handle only bulk compute.  The VPU interface would therefore be a Custom 
Extension.

> thus if VLIW is to be executed on the satellite cores then VLIW
> must become part of mainstream Power ISA. at which point
> IBM is going to go on "high alert" (probably a metaphorical
> DEFCON 3 i would guess).
>
> but let us take a step back.
>
> 1. Power ISA v3.1 now has 64-bit prefixing. 34 bit pc-relative branches are
>     possible. 32-bit GPR immediates are possible.
>
> 2. the primary algorithms of Video are oddly DCT and FFT.
>     this was why i spent over 8 weeks on the DCT and FFT REMAP
>     Schedules.  VLIW is *not* required despite both TI's TMS320
>     and Qualcomm's Hexagon both being VLIW DSPs specialising
>     in FFT (inner loop only).  SV's REMAP covers the *entire*
>     triple loop
>     https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD
>
> 3. an entire NLnet Grant has been dedicated to Audio/Video
>     https://bugs.libre-soc.org/show_bug.cgi?id=137
>
>     in this Grant we have performed significant analysis of algorithms
>     and instructions, resulting in extracting a number of A/V opcodes
>     and confirming that SVP64 will cover the gaps.  Horizontal-Sum
>     Schedules, Parallel-Prefix Schedules and so on. a starting point:
>     https://libre-soc.org/openpower/sv/av_opcodes/
>
> 4. a demo of MP3 decode showed a *75%* reduction in assembler.
>     450 instructions crashed down to only 100.  at no time was VLIW
>     considered.
>     https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/audio/mp3/mp3_0_apply_window_float_basicsv.s;hb=HEAD
>
> 5. in the ZOLC paper the motion-time-estimation algorithm was SPECIFICALLY
>    chosen as one of the toughest examples traditionally causing massive
>    problems for DSPs (and VLIW).  despite *SIX* nested for-loops he
>    achieved an astounding 45% reduction in the number of instructions
>    executed.
>
>    that's on an almost entirely-coherent-deterministic schedule.
>
>    absolutely astonishing.
>
>    .... oh... and it was a simple in-order core.
>
> in Video you picked the *one* area where we've already done a SHED
> load of work :)
>   

Eh, "VPU" was intended as "Vector Processing Unit" not "Video Processing 
Unit".  On the other hand, you seem to have accomplished the major goals 
needed fairly well.

> [...]
>   
>> And how would that funding be arranged?  A stipend to work on the
>> problem?  Payment upon reaching milestones?  Something else?
>>     
>
> NLnet does payment-on-milestones and we're good at subdividing
> those so it's not "Do 4 Months Work Only Then Do You Get 15k".
> also there's a bit in the RED Semi bucket now - you'd have to submit
> subcontractor invoices.
>   

The project in question is to be done either way, correct?  (Such that 
milestones will need to be devised whether I do it or someone else does 
it, right?)  (Asking to see the milestones/roadmap before committing 
either way is reasonable, no?)

OK, what am I possibly getting into?

> [...]
>>  If this is so, then simplicity in the PEs is the watchword, lest
>> the problem quickly blow right back up to full SMP.
>>     
>
> have a look at the ZOLC and EXTRA-V papers.  the Coherent
> Deterministic Schedules allow for significant avoidance of clashes.
>
> plus, if you've got a shed-load of parallel processors with their
> own Memory connected directly to them, yet you're still trying
> to get them to execute sequential algorithms, you're Doing Something
> Wrong :)
>   

It sounds like the main issue then is partitioning the work out to the PEs.

>> Would OpenCAPI provide the means for each PE to have a local TLB and
>> *no* other MMU capabilities at all?  TLB miss at PE -> host processor
>> interrupt -> hypervisor (or hardware) provides PE TLB entry?
>>     
>
> i was planning to work out how to extend OpenCAPI to do exactly that.
> given the expectation that the binaries being executed would only be
> around the 1-8k mark all-in (appx size of 3D Shader binaries) i would
> not expect thrashing.
>   

8KiB == 2 4KiB pages.  Could we limit PE programs to 16KiB and specify 
that the PE has 4 instruction TLB entries, controlled by host software?

> [...]
>> Further, "OpenCAPI" is not what I would consider open at all; read
>> <URL:https://opencapi.org/license/> and barf.  (It explicitly excludes
>> producing an implementation.  Here we go again.)  Take a look at their
>> membership levels, too.  I see the RISC-V Foundation mess here all over
>> again, although at least RISC-V was not offering "study-only" licenses.
>>  >:-<
>>     
>
> raised it with OPF a couple of times.  done so again.
>   

Until those license issues are fixed, I am not touching OpenCAPI with 
the proverbial ten-foot pole.

>>> my feeling is that it is very important for PEs to be able to
>>> execute the same ISA, btw. in case the PEs are too busy,
>>> the *main core* can execute its programs as well!
>>>       
>> Probably best to keep a *subset* of the main processor's ISA.  To use
>> Power ISA for this would likely require (again) some special approval
>> from OPF because the PEs will /not/ meet any compliancy subset.
>>     
>
> you've missed that it is possible to go for a lower Compliancy Level
> then "step up" by adding optional instructions.  as long as you meet
> the lower level nobody cares what else you added.  but yes, it is
> something to keep an eye on.
>   

No, I mean the PEs might not meet the /lowest/ level, thus the 
requirement for special approval.  Or, perhaps in combination with a 
hypervisor running on the host processor, they /do/ meet the minimal 
level, even though the actual PE hardware does /not/ meet it?

> [...]
>> I think that this may be the fundamental issue where we have been
>> talking past each other.  I have been tacitly assuming a desktop-like
>> environment where power is fed from the AC line and a nice big heatsink
>> with forced air takes care of dissipation.  What has your assumed
>> environment been?
>>     
>
> everything.  near-memory PEs operating at only 150mhz, 3.5 watt
> quad-core SoCs, 8-core 4.8 ghz i9 killers, 64-core Supercomputer
> chiplets.
>
> everything.
>   

Actually, that prompts another idea:  perhaps we have been looking at 
Moore's Law the wrong way.  Instead of asking how high we can push 
f_CLK, perhaps we should take another look at that 150MHz DRAM sweet 
spot and ask how much logic we can pack into a 3.25ns half-period?  This 
leads to a possible VLIW /microarchitecture/ fed from a parallel Power 
instruction decoder.  What is the statistical distribution of the 
lengths of basic blocks in Power machine code?  Could chainable ALUs 
allow a low-speed Power core to transparently execute instructions in 
groups?

> [...]
>> To cut through all the fog here, how do I encode "ADDI R67, R123, 8" as
>> a scalar operation, not using Simple-V?
>>     
>
> you don't.  Power ISA 3.0 GPRs are restricted to r0-r31.
>   

This would break orthogonality in the Power ISA, which I expect would 
likely cause the OPF ISA WG to "freak out", as you describe it.  Are 
there any other cases of general registers not available to every 
fixed-point instruction in Power ISA?

>> I suppose I should also clarify that I believe the register file
>> extension proposal should also be cleanly separated from Simple-V.
>>     
>
> already done. https://libre-soc.org/openpower/sv/compliancy_levels/
>   

This comes back to the problem exposed above.  The register file 
extension proposal should be available entirely independent of Simple-V, 
such that a processor could implement the extended register file and 
*not* implement Simple-V or vice versa.


-- Jacob


