[Libre-soc-dev] [RFC] SVP64 on branch instructions

Sun Aug 8 20:11:29 BST 2021

On Sun, Aug 8, 2021 at 6:20 PM Richard Wilbur <richard.wilbur at gmail.com> wrote:
>
>
> > On Aug 8, 2021, at 06:32, lkcl <luke.leighton at gmail.com> wrote:
> >
> > i've started on the 24 bit RM decoder for BC, combining bits into 2 bit enums with only 3 entries in most cases, quite annoying that, but it is what it is.
>
> Indeed, realizing that it is not as densely packed as if all the possibilities were used can be vexing, but it leaves room to accommodate one more option if we realize later that something could be an immense improvement with an additional mode.

as a last resort, yes.  the complexity involved of first spotting
those brownfield encodings then chaining on them, it gets... yeah.

> > if we were designing something that was specifically intended for non-supercomputer non-multi-issue uses, adding critical dependencies between SVSTATE and the decoder would be perfectly fine.
>
> So I’m envisioning the supercomputer multi-issue decoder loading something
> like a cache line at a time from memory/cache,

yes, and pushing that into a queue.  (often this is not a shift
register, just an SRAM but where the address is what moves.  exactly
like how static-sized queues get implemented in software, with a
pointer to head and a pointer to tail: you move them on)
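
to illustrate the head/tail pointer idea, here is a minimal software sketch of
a static-sized queue.  the class name `FixedQueue` and its methods are
hypothetical, made up for this example only, and the python list simply stands
in for the SRAM: it is not a description of any actual hardware.

```python
class FixedQueue:
    """Fixed-size FIFO: the storage never moves, only the indices do."""

    def __init__(self, depth):
        self.mem = [None] * depth  # stands in for the SRAM
        self.head = 0              # index of the next entry to read
        self.tail = 0              # index of the next entry to write
        self.count = 0             # number of valid entries

    def push(self, item):
        if self.count == len(self.mem):
            return False           # full: the producer has to stall
        self.mem[self.tail] = item
        self.tail = (self.tail + 1) % len(self.mem)  # move the pointer on
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None            # empty: nothing to hand to the decoders
        item = self.mem[self.head]
        self.head = (self.head + 1) % len(self.mem)  # move the pointer on
        self.count -= 1
        return item
```

the data is written once and never shifted: only `head` and `tail` advance,
wrapping round at the end of the storage.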

> starting the decode by determining instruction boundaries (left-to-right cascade,
> but pretty quick/simple to determine 32-bit or 64-bit),

yes. jacob had a great idea there to use a standard carry-save-propagation
algorithm.
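
a sketch of the plain left-to-right cascade (not the carry-propagation
variant, which is not shown here).  it assumes that a 64-bit instruction is
identified by its first 32-bit word having primary opcode 1 in the top 6 bits,
as with Power ISA v3.1 prefixed instructions; the function names are
hypothetical.

```python
def is_prefix(word):
    # Assumption: primary opcode (top 6 bits of the 32-bit word) == 1
    # marks the first half of a 64-bit instruction.
    return (word >> 26) & 0x3F == 1


def instruction_starts(words):
    """Return the indices of the words that begin an instruction."""
    starts = []
    skip = False  # True when this word is the second half of a 64-bit op
    for i, word in enumerate(words):
        if skip:
            skip = False  # suffix word: never examined as an opcode
            continue
        starts.append(i)
        skip = is_prefix(word)
    return starts
```

the cascade is visible in `skip`: whether word N starts an instruction depends
on what word N-1 turned out to be, which is why a faster (carry-style) scheme
is attractive.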

> then parallel decode can start on each instruction up to dispatch

correct.

> when hazards from interactions with resources modified by previous instructions need to be taken into account.

this (dispatch) is where, if you have dependencies on SVSTATE (such
as the VerticalFirst bit, or the idea of having VL==0 mean something
completely different as far as what those 64-bits *actually* mean), it
all goes to hell.

one of the prior instructions in the current "batch" might *change*
VL, or *change* to VerticalFirst Mode.

now you want every one of those parallel decoders to be critically dependent
on something that was in a previous slot??

oink.

that's no longer a parallelisable decoder, is it?

> It is a very cool picture—even cooler because, to the extent they are used,
> the horizontal and vertical loop/vector modes will relieve a large amount of
> instruction cache and decoder activity!

Vertical-First in "batch" mode - i.e. when the hardware has set the
VF "Hint" to a value other than 1, yes.

or if, as in Mitch Alsup's My 66000 ISA, the hardware can determine
through lookahead that it can parallelise a whole batch (automatically
determining the number of elements in a loop that can be done entirely
in parallel), then yes as well.

> I suppose dispatch will need to depend on/have a hazard on-SVSTATE
> (at least VL?)

yes.  in Horizontal-First mode, definitely.  the actual relationship between
parallel-decoded instructions and the issued elements-which-may-be-batched
is *not* a linear one.

decoder1  decoder2  decoder3  decoder4  decoder5  decoder6
sv.add    sv.mul    setvli 5  sv.sub    ...
VL=4      VL=4                VL=5      VL=5

the instructions that get issued will be:

decoder1: QTY 4x ADDs
decoder2: QTY 4x MULs
decoder3: QTY 1x change of SVSTATE
decoder4: QTY 5x SUBs
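
the expansion above can be sketched as follows.  the function name `expand`
and the tuple representation of instructions are hypothetical, and the sketch
only models VL: every other part of SVSTATE is ignored.

```python
def expand(batch, vl):
    """Turn decoded instructions into (name, element_count) pairs.

    batch: list of (name, arg) in program order; arg is only used by setvli.
    vl:    the value of VL in SVSTATE before the batch starts.
    """
    issued = []
    for name, arg in batch:
        if name == "setvli":
            vl = arg                   # later instructions see the new VL
            issued.append((name, 1))   # one change of SVSTATE
        else:
            issued.append((name, vl))  # one element operation per VL
    return issued, vl


batch = [("sv.add", None), ("sv.mul", None), ("setvli", 5), ("sv.sub", None)]
issued, final_vl = expand(batch, vl=4)
```

note that `vl` is carried serially through the loop: the element count for
sv.sub cannot be known until setvli, in an earlier slot, has been processed.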

**ONLY** in the circumstance where all 4 ADDs may be passed straight
through to **ONE** ALU in *ONE* clock cycle will it be possible to
also consider some of the MUL operations.

in the case where that is not possible, let us assume e.g. that there
are 8 potential issue slots, we may issue QTY4 ADDs to the first
4 slots and QTY4 MULs to the next 4.

... errr.... but we have 8-way multi-issue and 8-way parallel decode?
errr what happened to the other 6 decoded instructions?

answer: the issue slots are all full, just from the first two instructions.
the rest have to wait.

this is not a bad thing per se, because execution has just been spammed
and is 100% occupied.
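
a rough sketch of how the issue slots fill up, under simplifying assumptions
that are mine and not part of any design: issue is strictly in-order, every
element operation takes exactly one slot, the SVSTATE change also takes one
slot, and hazards are ignored entirely.  `issue_cycles` is a hypothetical name.

```python
def issue_cycles(element_counts, slots=8):
    """Greedily pack (name, qty) element operations into issue cycles."""
    cycles = []
    current = []
    free = slots
    for name, qty in element_counts:
        while qty > 0:
            if free == 0:              # slots all full: the rest wait
                cycles.append(current)
                current, free = [], slots
            n = min(qty, free)
            current.append((name, n))
            qty -= n
            free -= n
    if current:
        cycles.append(current)
    return cycles


cycles = issue_cycles([("add", 4), ("mul", 4), ("setvli", 1), ("sub", 5)])
```

with 8 slots the first cycle is taken up entirely by the 4 ADDs and 4 MULs,
and everything from the third decoder onwards lands in the next cycle.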

> in order to possibly parallelize some vector operations in an implementation-dependent fashion?  It seems likely that if VL <= the number of ALU’s that the initial multiplications of a vector dot product could be dispatched in parallel.

even if VL >= the number of ALUs, the multiplications can still be issued
in parallel.  it's just that the decoders sit there "zzzzz" and yet we're
perfectly happy with that situation because back-end execution is
100% occupied.

l.