[Libre-soc-dev] scalar instructions and SVP64

Wed Mar 10 20:48:55 GMT 2021

ok i'll go through it.  here are the 3 FSMs, intended to be indicative for
future designs of pipeline stages and to make multi issue clear

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/issuer.py;h=e0bd35951644d10041c7635d3ee0f252879639ee;hb=HEAD#l613

they are named fetch, issue and execute.

* fetch performs fetch and also identifies length through bare minimum
identification of SVP64. it also reads PC MSR and SVSTATE.

it is also supposed to (apart from trap and branch) be the only place that
updates PC but hey.

* issue receives PC MSR and SVSTATE.  it also receives SVP64 RM and insn.

it is responsible for going "if 32 bit fire execute immediately" else "if
SVP64 run a loop firing one instruction per SVSTATE.srcstep".

here is where the core state PC MSR and SVSTATE are passed into a "global"
PowerDecoder2 which performs the addition of srcstep onto RA, RB, RS,
RT, all CRs and all SPR numbers (TODO, that one)

PowerDecoder2 has just enough to identify which pipeline should decode and
process the instruction.

* execute is where (because this is a Test Issuer) one and ONLY one
pipeline receives the instruction.

by this point it is PURELY a 32 bit instruction, register data has already
been read.  a SATELLITE PowerDecoderSubset performs decoding UNIQUE and
SPECIFIC to that Function Unit.
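
as a rough illustration of the issue step above, here is a minimal python sketch. it is NOT the code in issuer.py: the Insn class, the issue() function and the vl parameter are all hypothetical names made up for this example, and it ignores EXTRA2/3 decoding entirely by treating every register as a vector. the only point it makes is that execute never sees anything but plain 32 bit operations, with srcstep already added onto the register numbers.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Insn:
    """Toy instruction model (hypothetical, not the real decoder record)."""
    op: str
    rt: int
    ra: int
    rb: int
    is_svp64: bool = False  # True if fetch identified an SVP64 prefix


def issue(insn: Insn, vl: int) -> Iterator[Insn]:
    """Yield the plain 32 bit operations that would be handed to execute.

    Assumption: every register is treated as a vector, so srcstep is
    added to all of them. No scalar/vector marking is modelled here.
    """
    if not insn.is_svp64:
        # 32 bit: fire execute immediately, exactly once
        yield insn
        return
    # SVP64: loop, firing one operation per srcstep
    for srcstep in range(vl):
        yield Insn(insn.op,
                   insn.rt + srcstep,
                   insn.ra + srcstep,
                   insn.rb + srcstep)


if __name__ == "__main__":
    for element in issue(Insn("add", 8, 16, 24, is_svp64=True), vl=3):
        print(element)
```

note that the operations coming out of the loop carry is_svp64=False: by the time they reach a pipeline they are indistinguishable from scalar ones.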

now let's do that again, this time in a multi issue environment

* multiple instructions are fetched.  they are all length-decoded in
parallel (using that superb carry-lookahead-like algorithm you devised,
Jacob)

any 32 bit instructions are sent through to the next phase along with an
incremented "PC+0" PC+4 PC+8 etc.

when a 64 bit instruction is encountered it has to be the last one sent on
(for now, optimisations come later)

* any 32 bit instructions get further decoded and sent to relevant
pipelines.

however for 64 bit ones the SVSTATE.srcstep is autoincremented INSTEAD of
the PC. their PowerDecoder2s then have all the information they need, and
proceed just like the 32 bit ones.

* all pipelines receive ONLY 32 bit instructions just like in the FSM case.
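
the multi issue fetch rule above (32 bit instructions go through with PC+0, PC+4, PC+8 and so on, and a 64 bit instruction has to be the last one sent on) can be sketched as below. big caveat: this is a plain sequential walk over the fetch window, NOT the parallel carry-lookahead-like length decoder. group_for_issue() and the is_prefix predicate are hypothetical names, and the predicate used in the example is a toy stand-in rather than the real SVP64 identification. dropping a prefix whose suffix is not in the window is also just an assumption of this sketch.

```python
from typing import Callable, List, Tuple


def group_for_issue(words: List[int], pc: int,
                    is_prefix: Callable[[int], bool]) -> List[Tuple[int, int]]:
    """Return (pc, length_in_bytes) for each instruction sent on this cycle.

    words     -- the 32 bit words in the fetch window, in program order
    pc        -- address of words[0]
    is_prefix -- caller-supplied test for "this word is a 64 bit prefix"
    """
    sent = []
    i = 0
    while i < len(words):
        addr = pc + 4 * i
        if is_prefix(words[i]):
            # only send the 64 bit instruction if its suffix was fetched too
            if i + 1 < len(words):
                sent.append((addr, 8))
            # a 64 bit instruction is the last one sent on (for now)
            break
        sent.append((addr, 4))
        i += 1
    return sent


def toy_is_prefix(word: int) -> bool:
    """Toy stand-in: top 6 bits equal to 1. NOT the real SVP64 check."""
    return (word >> 26) == 1


if __name__ == "__main__":
    window = [0x7C000000, 0x7C000000, 0x04000000, 0x7C000000, 0x7C000000]
    for addr, length in group_for_issue(window, 0x100, toy_is_prefix):
        print(hex(addr), length)
```

everything after the 64 bit instruction in the window simply waits for a later cycle, which is the "optimisations come later" part.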

now.

can you see that by adding in a BACKWARDS dependency between the
PowerDecoder2s, which are the ONLY PLACES where the EXTRA2/3 information
may be decoded, and where there are MASSIVE mux cascades, the above forward
structure, which is otherwise completely independent (apart from PC and
setvl changes, which use precise speculation and branch prediction to
solve), is completely compromised?

the only way to get what you are advocating is to combine two of the 3
stages above, which introduces huge latency and completely compromises
high performance.

can you see that?

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68