[Libre-soc-dev] scalar instructions and SVP64

Wed Mar 10 23:35:42 GMT 2021

On Wed, Mar 10, 2021, 12:49 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> ok i'll go through it.  here's the 3 FSMs, intended to be indicative for
> future designs of pipeline stages and to make multi issue clear
>
>
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/issuer.py;h=e0bd35951644d10041c7635d3ee0f252879639ee;hb=HEAD#l613
>
> they are named fetch, issue and execute.
>
> * fetch performs fetch and also identifies length through bare minimum
> identification of SVP64. it also reads PC MSR and SVSTATE.
>
> it is also supposed to (apart from trap and branch) be the only place that
> updates PC but hey.
>
> * issue receives PC MSR and SVSTATE.  it also receives SVP64 RM and insn.
>
> it is responsible for going "if 32 bit fire execute immediately" else "if
> SVP64 run a loop firing one instruction per SVSTATE.srcstep".
>

And all that changes is that the rule becomes "if 32-bit or SVP64-scalar
then execute immediately, else loop..."
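
To make the changed rule concrete, here is a minimal Python sketch of the
issue-stage decision. It is only a behavioural model, not the issuer.py
code: the FetchedInsn class, its is_vector flag and the issue_steps helper
are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FetchedInsn:
    is_svp64: bool   # True for a 64-bit SVP64-prefixed instruction
    is_vector: bool  # hypothetical whole-instruction scalar/vector flag


def issue_steps(insn: FetchedInsn, vl: int) -> List[int]:
    """Return the srcstep values the issue stage would fire for insn."""
    # 32-bit and SVP64-scalar instructions are fired exactly once.
    if not insn.is_svp64 or not insn.is_vector:
        return [0]
    # Only SVP64-vector instructions enter the srcstep loop.
    return list(range(vl))
```

The only difference from the current behaviour is the extra
"or not insn.is_vector" term in the first test.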

>
> here is where the core state PC MSR and SVSTATE are passed into a "global"
> PowerDecoder2 which will perform the addition of srcstep onto RA, RB, RS,
> RT, all CRs and all SPR numbers (TODO, that one)
>

Aside: I'd argue that SPR numbers shouldn't be incremented, that'd be more
like incrementing the opcode than a register number, since every SPR does
something totally different. That's a different discussion though...

>
> PowerDecoder2 has just enough to identify which pipeline should decode and
> process the instruction.
>

This is where the decoder has enough info to identify the number of
register fields in the SVP64 prefix, so we just add the few extra gates to
OR the vector/scalar bits here.
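
As a rough illustration of how small that logic is, here is a Python sketch
of the OR. It assumes the per-register vector/scalar bits have already been
extracted from the prefix into a list in register-field order; that list
layout and the function name are my assumptions for the example, not the
actual SVP64 bit layout.

```python
from typing import Sequence


def whole_insn_is_vector(vector_bits: Sequence[bool],
                         num_reg_fields: int) -> bool:
    """OR together the vector/scalar bits of the register fields in use.

    vector_bits: hypothetical per-register vector/scalar bits taken from
        the SVP64 prefix, in register-field order.
    num_reg_fields: how many of those fields this instruction actually
        has, which is what the decoder knows at this point.
    """
    # Bits belonging to unused fields are ignored.
    return any(vector_bits[:num_reg_fields])
```

In hardware this would be one OR gate over the selected bits, with the
field count acting as a mask.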

>
> * execute is where (because this is a Test Issuer) one and ONLY one
> pipeline receives the instruction.
>
> by this point it is PURELY a 32 bit instruction, register data has already
> been read.  a SATELLITE PowerDecoderSubset performs decoding UNIQUE and
> SPECIFIC to that Function Unit.
>
>
> now let's do that again, this time in a multi issue environment
>
> * multiple instructions are fetched.  they are all length-decoded in
> parallel (using that superb carry-lookahead-like algorithm you devised,
> Jacob)
>
> any 32 bit instructions are sent through to the next phase along with an
> incremented "PC+0" PC+4 PC+8 etc.
>
> when a 64 bit instruction is encountered it has to be the last one sent on
> (for now, optimisations come later)
>

All the 32-bit vs. 64-bit distinctions here get changed to 32-bit/64-bit
scalar vs. 64-bit vector.

>
> * any 32 bit instructions get further decoded and sent to relevant
> pipelines.
>
> however for 64 bit ones the SVSTATE.srcstep is autoincremented INSTEAD of the
> PC, their PowerDecoder2s then have all the information they need, and
> proceed just like the 32 bit ones.
>
> * all pipelines receive ONLY 32 bit instructions just like in the FSM case.
>
>
> now.
>
> can you see that by adding in a BACKWARDS dependency between the
> PowerDecoder2s, which are the ONLY PLACES where the EXTRA2/3 information
> may be decoded, and where there are MASSIVE mux cascades, the above forward
> structure which is otherwise completely independent and (apart from PC and
> setvl changes which use precise speculation and branch prediction to
> solve), is completely compromised?
>
> the only way to get what you are advocating is to combine two of the 3
> stages above, introduce huge latency, which completely compromises high
> performance.
>

If we instead go with the alternative encoding described in my previous
email ("...scalar/vector-bit for the first/dest reg...as a
whole-instruction scalar/vector-bit"), every SVP64 instruction has the
whole-instruction-level scalar/vector bit, always in the same place. That
allows us to trivially change the vector vs. scalar determination from what
we currently have (vector 64-bit vs. scalar 32-bit) to vector 64-bit vs.
scalar 32/64-bit.

>
> can you see that?
>

I can see what you're fearing, *however*:

the alternative scheme with the whole-instruction-level scalar/vector bit
working just fine is clearly visible :)

Also, the non-alternative scheme I'm proposing, OR-ing together the
vector/scalar bits, will work just fine. The fetch pipeline (everything
before instructions are added to the dependency matrices) *has* to decode
the instructions at some point, enough to know which registers are
read/written. I'm saying we just move that decoding to sometime after length
decode and before SV looping, because that *exact* information that we need
to decode anyway *is the same info* that's required to decide which SVP64
bits to OR together to form the whole-instruction scalar/vector bit. That
bit then tells the SV looping stage to pass the instruction through
unvectorized (scalar 32/64-bit) or to loop VL*SUBVL times. Basically, we
re-order the stages somewhat to get a trivially-small dependency graph,
rather than adding massive dependency messes because we used the wrong order.
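
The stage ordering I mean can be sketched in Python as below. This is a
software model under my own assumptions, not a description of the existing
hardware: the Decoded tuple, its field names and the three helper functions
are hypothetical, and length decode is taken as already done (each entry
already knows whether it is SVP64).

```python
from typing import List, NamedTuple, Sequence, Tuple


class Decoded(NamedTuple):
    name: str
    is_svp64: bool                 # result of the earlier length decode
    vector_bits: Tuple[bool, ...]  # hypothetical per-register vector bits
    num_reg_fields: int            # known once register fields are decoded


def register_decode(insn: Decoded) -> bool:
    """Stage after length decode: form the whole-instruction vector bit."""
    return insn.is_svp64 and any(insn.vector_bits[:insn.num_reg_fields])


def sv_loop(insn: Decoded, is_vector: bool,
            vl: int, subvl: int) -> List[Tuple[str, int]]:
    """SV looping stage: pass scalars through once, loop vectors."""
    steps = vl * subvl if is_vector else 1
    return [(insn.name, step) for step in range(steps)]


def front_end(insns: Sequence[Decoded],
              vl: int, subvl: int) -> List[Tuple[str, int]]:
    """Run register decode, then SV looping, strictly in that order."""
    issued = []
    for insn in insns:
        is_vector = register_decode(insn)
        issued.extend(sv_loop(insn, is_vector, vl, subvl))
    return issued
```

Note that information only flows forwards: sv_loop consumes a single bit
produced by register_decode, and nothing later feeds back into it.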

Jacob