[Libre-soc-isa] [Bug 1056] questions and feedback (v2) on OPF RFC ls010

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Tue Jun 6 15:15:48 BST 2023


https://bugs.libre-soc.org/show_bug.cgi?id=1056

--- Comment #64 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Paul Mackerras from comment #63)
> (In reply to Luke Kenneth Casson Leighton from comment #54)
> > 
> > jacob and i went to a LOT of trouble to ensure that SV is an
> > orthogonal consistent RISC paradigm.
> 
> Just as a side note, orthogonality does have an engineering cost,
> particularly in terms of verification. Sometimes it is pragmatically
> necessary to limit orthogonality in order to keep the verification state
> space manageable. In this case, that might mean having a defined set of
> scalar instructions which can be vectorized, rather than saying that almost
> any scalar instruction can be vectorized. I know that seems sub-optimal
> conceptually, but it may be necessary for practical reasons, particularly
> for an initial implementation. The set of vectorizable instructions can
> always be expanded later.

responding reverse-order, got to this point, needs a re-read a couple
of times more.

summary is: i agree with you but it cannot be a free-for-all
(hence the Compliancy Levels, which need review)

some design context first:

the bare minimum implementation is fetch-decode-{LOOP}-issue-execute.
the LOOP on register numbers goes directly into the exact same
Register-Hazard Management as if the looping did not exist.

in these naive implementations even elwidth overrides would be
single-issue

thus a simple naive first-implementation may extend by just one
pipeline stage, use byte-writeable regfiles, and call it a day.
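
a minimal sketch of that naive LOOP, in python, purely for illustration.
the names (ElementOp, expand_vector_op) are made up here and are not from
any Libre-SOC codebase, and it assumes both RT and RA are marked as
Vectors so that both register numbers step by one per element:

```python
from typing import Iterator, NamedTuple

class ElementOp(NamedTuple):
    """One scalar element-operation produced by the loop (hypothetical type)."""
    opcode: str
    rt: int
    ra: int

def expand_vector_op(opcode: str, rt: int, ra: int, vl: int) -> Iterator[ElementOp]:
    # Assumption: RT and RA are both Vectors, so both step per element.
    # Each yielded op is handed to issue exactly like a scalar instruction,
    # so the existing Register-Hazard Management sees nothing unusual.
    for i in range(vl):
        yield ElementOp(opcode, rt + i, ra + i)

ops = list(expand_vector_op("addi", rt=3, ra=8, vl=3))
for op in ops:
    print(op)
```

the point is only that the loop emits ordinary register numbers, one
element at a time, nothing more.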

the next advancement (ignoring REMAP entirely) is to do (sequential)
batching just like Multi-Issue. in fact exactly like Multi-Issue.
the sequential nature of the looping allows for extremely easy Hazard
Management as long as you convert binary reg#s into unary-encoding:

    rt=3, ra=8, VL=3
=>  rt=0b000111000, ra=0b00011100000000

then detecting Hazard overlaps involves simple AND gates not
massive multi-ported CAMs.
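
the unary-encoding above can be reproduced in a few lines of python.
unary_mask and hazard are hypothetical helper names for this sketch;
real hardware would do this as a shifter plus a row of AND gates:

```python
def unary_mask(reg: int, vl: int) -> int:
    """Set one bit per register touched: bits reg .. reg+vl-1."""
    return ((1 << vl) - 1) << reg

def hazard(mask_a: int, mask_b: int) -> bool:
    """Two register ranges overlap iff the bitwise AND is non-zero."""
    return (mask_a & mask_b) != 0

rt_mask = unary_mask(3, 3)   # registers 3, 4, 5
ra_mask = unary_mask(8, 3)   # registers 8, 9, 10

print(bin(rt_mask), bin(ra_mask), hazard(rt_mask, ra_mask))
```

with rt=3, ra=8, VL=3 the two masks share no bits, so there is no
overlap. a range starting at register 5 would overlap rt's range at
bit 5.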

elwidth overrides also end up with Hazard Management down at byte-level
but even here unary-encoding comes to the rescue.
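
the same trick at byte granularity, as a sketch only. it assumes 64-bit
registers (8 bytes each) and that elements are packed contiguously
starting at byte 0 of the first register; byte_mask is a hypothetical
helper name:

```python
BYTES_PER_REG = 8  # assumption: 64-bit GPRs

def byte_mask(reg: int, vl: int, elwidth_bytes: int) -> int:
    """One bit per regfile byte touched by vl elements of elwidth_bytes each."""
    nbytes = vl * elwidth_bytes
    return ((1 << nbytes) - 1) << (reg * BYTES_PER_REG)

# three 16-bit elements starting at r3: bytes 24..29, entirely inside r3
a = byte_mask(3, 3, 2)
# twelve 8-bit elements starting at r3: bytes 24..35, spilling into r4
b = byte_mask(3, 12, 1)
# one full 64-bit write to r4: bytes 32..39
c = byte_mask(4, 1, 8)

print((a & c) != 0, (b & c) != 0)
```

the overlap test is still a plain AND, the masks are just 8x wider.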

REMAP, the next complication, simply sits between decode and Hazard
Management, shuffling the offsets *before* dropping them into the Hazard
Read/Write tables. [this helps explain why i say that it has to
be Deterministic, as this is a critical gate-latency juncture, right
smack in Decode/Issue: if you look up Indexed REMAP you will see that
modifying the GPRs after the svindex instruction is UNDEFINED]

before Multi-Issue Hazard Management tables get so insanely large
(several million gates) that clock speeds above 500 MHz are unattainable
no matter the geometry, there are two things to the rescue:
Write-after-Write (aka "Register Renaming") combined with
SVSTATE.hphint.

hphint allows *intra-* batch Hazards to be utterly disregarded
*within the batch only* not *inter-* batch, and the renamed
batch gets thrown in a nice sequential order at the available
Function Units.
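
a rough python sketch of the batching itself. the assumption here (mine,
for illustration, not a statement of the spec) is that the batch width
is taken from the hint value; split_batches is a made-up helper name:

```python
from typing import List

def split_batches(vl: int, width: int) -> List[range]:
    """Split element indices 0..vl-1 into sequential batches of at most width."""
    return [range(start, min(start + width, vl))
            for start in range(0, vl, width)]

# Assumption: elements inside one batch are issued without checking
# Hazards against each other; only batch-to-batch ordering is checked.
batches = split_batches(10, 4)
for batch in batches:
    print(list(batch))
```

the last batch is simply shorter when VL is not a multiple of the
batch width.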

so that is the gamut / gauntlet of all possible (sane) implementations
based on industry-standard pre-existing Micro-Architectures.


now with that context in mind we may evaluate the proposal.

* the first insight that occurred to me is that the proposal *might*
  be from the perspective of a standard SIMD or standard Cray-Vector
  ISA. can i check whether or not you are thinking in terms of passing
  the entire Vector operation *including VL* down into the pipelines?

  this is a perfectly legitimate implementation, to use e.g. a FSM
  (like Microwatt's FP unit) with an additional for-loop *actually in*
  the FSM itself, and to set up a communications protocol with the
  regfile that not only contains the Reg# RT RA BA BB FRS etc but
  *also the offset index*.  thus when reading/writing to the
  regfile the Function Unit *itself* sends multiple (sequential)
  read/write requests in succession.  even potentially implements its
  own miniature Vector Chaining.
  https://en.m.wikipedia.org/wiki/Chaining_(vector_processing)

  but the key is that Hazard Management *still had to be done* even
  before issuing {Instruction}+{0..VL} down into the Function Unit
  (or {Instruction}+{0..3} {Instruction}+{4..7} {Instruction}+{8..VL}
   to multiple Function Units)

* thus logically the most complex part (not in naive implementations)
  is the Hazard Management and that has to be done anyway

* therefore in order to comply with the spec you *had* to do the hard
  bit (Dependency Matrices) and once done *every* Function Unit
  can use that.

* if a given instruction for any reason is too complex to parallelise
  with the combined context of Multi-Issue *and* Looping then there is
  no problem at all, just fall back to "naive" (single-issue) looping.
  if *really* a problem then the absolute bare-minimum fallback
  is that of single-step (like in debug mode): only allow one live
  instruction at a time.

* good examples where single-issue fallback would be strongly
  advised would be tdi and twi (yes they get Vectorized! they have
  RA and RB as sources!)

* in this light to *stop* specific instructions from being Vectorized
  it actually requires more complex Decoding!  ok, some
  implementations may fire an Illegal Instruction Trap.

* and this brings us neatly onto the SV Compliancy Levels, in effect,
  because there will be certain minimum levels of expected
  implementation performance within the anticipated categories
  (A/V DSP, GPU/HPC). given that trap-and-emulate will suck pretty
  badly on SVP64, end-users are highly likely to complain.

* bottom line, even if it is logical and sane from a hardware
  implementation perspective to not Vectorize some instructions
  it cannot become a free-for-all just as SFS and SFFS and all
  non-Vectorized Compliancy Levels cannot be a free-for-all,
  they exist for a reason and the exact same logic applies to
  Vectorized space.

* and bear in mind, just like in the Vulkan Spec managed by the Khronos
  Group, speed is *not* made mandatory: that is for implementors to
  decide, and compete on.  what the spec mandates is *what* is
  implemented, so that software developers do not go insane.

* thus the discussion becomes about the SV Compliancy Levels so that
  software (HWCAPS_SVP64_xxxxx) does not end up in total meltdown.

compliancy levels: happy to have constructive input on them
https://libre-soc.org/openpower/sv/compliancy_levels/

regarding Verification: we (RED Semiconductor Ltd) HAVE to have
Compliancy Suites, and they will be FOSS-Licensed (Libre-SOC).
the Test API allows plugging in alternative implementations,
including autogenerating standalone Makefiles for static build and
test.
