[Libre-soc-dev] scalar instructions and SVP64

Thu Mar 11 00:08:21 GMT 2021

On Wednesday, March 10, 2021, Jacob Lifshay <programmerjake at gmail.com>
wrote:

> On Wed, Mar 10, 2021, 12:49 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > ok i'll go through it.  here's the 3 FSMs, intended to be indicative for
> > future designs of pipeline stages and to make multi issue clear
> >
> >
> > https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/issuer.py;h=e0bd35951644d10041c7635d3ee0f252879639ee;hb=HEAD#l613
> >
> > they are named fetch, issue and execute.
> >
> > * fetch performs fetch and also identifies length through bare minimum
> > identification of SVP64. it also reads PC MSR and SVSTATE.
> >
> > it is also supposed to (apart from trap and branch) be the only place that
> > updates PC but hey.
> >
> > * issue receives PC MSR and SVSTATE.  it also receives SVP64 RM and insn.
> >
> > it is responsible for going "if 32 bit fire execute immediately" else "if
> > SVP64 run a loop firing one instruction per SVSTATE.srcstep".
> >
>
> And all that changes is it gets changed to "if 32-bit or SVP64-scalar then
> execute immediately else loop..."

"and all that".

*think it through* jacob.

i have said this about three times in 12 hours and you still haven't got it.

*where does the detection of bits required to detect that operands are
marked scalar come from*
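
For reference, the issue behaviour described in the quoted text ("if 32 bit fire execute immediately", else loop per SVSTATE.srcstep) can be sketched roughly as below. This is a plain-Python model, not the nmigen FSM in issuer.py; the names `issue`, `fire_execute` and the `vl` argument are assumptions made for illustration only.

```python
def issue(insn, is_svp64, vl, fire_execute):
    """Hypothetical model of the issue decision (not the real FSM).

    insn:         the 32-bit instruction (opaque here)
    is_svp64:     True if fetch identified an SVP64 prefix
    vl:           assumed loop length for the SVP64 case
    fire_execute: callback standing in for handing off to execute
    Returns the number of times execute was fired.
    """
    if not is_svp64:
        # plain 32-bit instruction: fire execute once, immediately
        fire_execute(insn, srcstep=0)
        return 1
    fired = 0
    for srcstep in range(vl):
        # SVP64: one execute per srcstep value
        fire_execute(insn, srcstep=srcstep)
        fired += 1
    return fired
```

Note that the only input deciding "loop or not" in this sketch is `is_svp64`, which fetch already provides; nothing here looks at per-operand scalar/vector markings.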

> >
> > here is where the core state PC MSR and SVSTATE are passed into a "global"
> > PowerDecoder2 which performs the addition of srcstep onto RA, RB, RS,
> > RT, all CRs and all SPR numbers (TODO, that one)
> >
>
> Aside: I'd argue that SPR numbers shouldn't be incremented, that'd be more
> like incrementing the opcode than a register number, since every SPR does
> something totally different. That's a different discussion though...

briefly, it's for saving batches of SPRs in context switches for SV state
information.  those *will* be contiguously allocated.

> >
> > PowerDecoder2 has just enough to identify which pipeline should decode and
> > process the instruction.
> >
>
> This is where the decoder has enough info to identify the number of
> register fields in the SVP64 prefix, so we just add the few extra gates to
> OR the vector/scalar bits here.

and it is TOO LATE.  the forward chain (pipeline) has ALREADY COMMITTED to
the decision.

pipelines CANNOT GO BACKWARDS IN TIME.

what you are asking requires the MERGING of two already-close-to-the-limit
pipeline stages.

do you understand?  it is quite surreal to repeat this three times.
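
To make the point of contention concrete: the quoted proposal amounts to OR-ing a per-operand vector/scalar bit across however many register fields the instruction has. A minimal sketch follows, assuming a hypothetical packed layout (fields packed LSB-first, top bit of each field meaning "vector"). This is NOT the real EXTRA2/EXTRA3 encoding, and the function name is made up for illustration.

```python
def any_operand_vector(extra, n_fields, field_width):
    """OR together the assumed 'vector' bit of each operand field.

    extra:       integer holding the packed per-operand fields
                 (hypothetical layout: field 0 in the lowest bits)
    n_fields:    how many operand fields this instruction uses
    field_width: bits per field (e.g. 2 or 3)
    """
    mask = (1 << field_width) - 1
    result = False
    for i in range(n_fields):
        field = (extra >> (i * field_width)) & mask
        # assumed: the top bit of each field marks the operand as vector
        is_vector = bool(field >> (field_width - 1))
        result = result or is_vector
    return result
```

In this sketch `n_fields` and `field_width` are inputs: they are only known once the instruction has been identified, so the result cannot be produced any earlier than that identification.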

> >
> > * execute is where (because this is a Test Issuer) one and ONLY one
> > pipeline receives the instruction.
> >
> > by this point it is PURELY a 32 bit instruction, register data has already
> > been read.  a SATELLITE PowerDecoderSubset performs decoding UNIQUE and
> > SPECIFIC to that Function Unit.
> >
> >
> > now let's do that again, this time in a multi issue environment
> >
> > * multiple instructions are fetched.  they are all length-decoded in
> > parallel (using that superb carry-lookahead-like algorithm you devised,
> > Jacob)
> >
> > any 32 bit instructions are sent through to the next phase along with an
> > incremented "PC+0" PC+4 PC+8 etc.
> >
> > when a 64 bit instruction is encountered it has to be the last one sent on
> > (for now, optimisations come later)
> >
>
> all the 32-bit vs. 64-bit here gets changed to 32-bit/64-bit scalar vs.
> 64-bit vector

i have no idea what you are referring to here.
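
For reference, the multi-issue fetch rule in the quoted text (32 bit instructions go on with PC+0, PC+4, PC+8 and so on, and a 64 bit instruction has to be the last one sent on) can be sketched as below. This is a sequential plain-Python model under assumed names (`assign_pcs`, lengths given in bytes); the real design length-decodes in parallel.

```python
def assign_pcs(lengths, pc):
    """Hypothetical model of which fetched instructions are sent on.

    lengths: byte length (4 or 8) of each fetched instruction, in
             program order, as produced by length decode
    pc:      address of the first instruction
    Returns a list of (pc, length) pairs sent on this cycle.
    """
    sent = []
    for length in lengths:
        sent.append((pc, length))
        if length == 8:
            # a 64-bit (SVP64) instruction must be the last one sent on
            break
        pc += length  # next 32-bit instruction sits at PC+4
    return sent
```

For example, `assign_pcs([4, 4, 8, 4], 0x100)` sends on the instructions at 0x100, 0x104 and 0x108, and holds back the one after the 64 bit instruction.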

>
> >
> > * any 32 bit instructions get further decoded and sent to relevant
> > pipelines.
> >
> > however for 64 bit ones the SVSTATE.srcstep is autoincremented INSTEAD of
> > the PC, their PowerDecoder2s then have all the information they need, and
> > proceed just like the 32 bit ones.
> >
> > * all pipelines receive ONLY 32 bit instructions just like in the FSM case.
> >
> >
> > now.
> >
> > can you see that by adding in a BACKWARDS dependency between the
> > PowerDecoder2s, which are the ONLY PLACES where the EXTRA2/3 information
> > may be decoded, and where there are MASSIVE mux cascades, the above forward
> > structure which is otherwise completely independent and (apart from PC and
> > setvl changes which use precise speculation and branch prediction to
> > solve), is completely compromised?
> >
> > the only way to get what you are advocating is to combine two of the 3
> > stages above, introduce huge latency, which completely compromises high
> > performance.
> >
>
> If we instead go with the alternative encoding

NO.  far, far too late.

raise it as a bugreport, document it, then please drop it and help with
implementation.

i have said this a number of times, we are under time and funding pressure
and need to get the implementation done.

> described in my previous
> email: "...scalar/vector-bit for the first/dest reg...as a
> whole-instruction scalar/vector-bit", since that encoding has the
> whole-instruction-level scalar/vector bit in every SVP64 instruction and
> it's always in the same place,

this is far too late! this should have been raised 20 months ago!

certainly not right smack in the middle of implementation!

certainly not when i have made it repeatedly clear that we are under both
time and funding pressure *and have code already written*

we cannot request twice the money from NLnet for throwing away code and
implementing something else at this late stage.

> that allows us to trivially change the
> vector vs. scalar determination to vector 64-bit vs. scalar 32/64-bit
> instead of what we currently have -- vector 64-bit vs. scalar 32-bit.
>
> >
> > can you see that?
> >
>
> I can see what you're fearing, *however*:
>
> the alternative scheme with the whole-instruction-level scalar/vector bit
> working just fine is clearly visible :)
>
> Also, the non-alternative scheme I'm proposing with OR-ing together
> vector/scalar-bits will work just fine: the fetch pipeline

i have no idea what you are talking about because the focus is on
implementation and on getting the spec implemented.

spec design phase ended MONTHS ago.

new spec design ideas *actively* interfere with the completion of
implementation because they require abandoning working knowledge that is in
people's heads.

it is too much, Jacob.

i have repeated this many times and you do not seem to respect it.

> (everything
> before instructions are added to the dep. matrixes) *has* to at some point
> decode the instructions enough to know which registers are read/written --
> I'm saying we just move that decoding to sometime after length decode and

no!  this completely destroys SV by creating a massive gate chain!

this would completely destroy SV's chances by creating a hard limit on the
maximum clock rate regardless of geometry.

how many times do i have to say NO!

> before SV looping because that *exact* information that we need to decode
> anyway *is the same info.* that's required to decide which SVP64 bits to OR
> together to form the whole-instruction scalar/vector bit, which then tells
> the SV looping stage to pass the instruction unvectorized (scalar
> 32/64-bit) or to loop VL*SUBVL times. Basically, we re-order the stages
> somewhat to get a trivially-small dependency graph,

no.  again, no.  this is *far* too late Jacob.

we are in IMPLEMENTATION mode.  NOT specification total redesign mode.

*please* help with the IMPLEMENTATION so that we can fulfil our obligations
and promises to customers and to NLnet.

i have asked quietly several times now.

we are under an obligation to NLnet under their MoU to work on this full
time, and there are no source code commits or status update reports from
you in several weeks.

this has had me concerned for some time.

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68