[Libre-soc-dev] svp64 review
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Sun Jul 24 22:11:43 BST 2022
hi jacob after a bit of thought and because you tried subscribing
i figured you'd find it reasonable if i forwarded your review to
the list on your behalf, then replied. if you (and everyone) cc
whilst sorting out the filters on gmail so it stops going to spam
it should work out. you'll 100% guaranteed find the invite i
sent in spam, as well.
---------- Forwarded message ---------
From: Jacob Bachmeyer <jcb62281 at gmail.com>
Date: Fri, Jul 22, 2022 at 4:05 AM
To: Luke Kenneth Casson Leighton <lkcl at lkcl.net>
A few comments from a quick partial review:
In chapter 3, "vertical" vector mode as described is ridiculous --
that is exactly equivalent to a software loop and therefore a complete
waste to support in hardware. Any optimizations that can be applied
there could also be applied to ordinary "for" loops, and "svstep.bc" is
nothing more than a dedicated LOOP opcode (similar to the instruction of
that name on the original 8086).
In chapter 4, we finally start to get to the "meat" of the
proposal. You have a serious misunderstanding of the x86 "REP" prefix.
That prefix can only be used with the "string" opcodes, which are
actually memory-to-memory instructions, using various (specific and
hardwired) registers as they work. If I remember correctly, "MOVS" (for
example) copies an element from *%esi to *%edi and advances the pointers
used. (On the original 8086, that element was a byte from DS:[SI] to
ES:[DI].) Instruction pointer (the x86 program counter) advance past
the "REP MOVS" is inhibited until %ecx is zero; this makes the operation
interruptible and restartable, since the pointers are adjusted after
each element is processed.
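The restartable behaviour described above can be illustrated with a small
model. This is only a sketch, assuming byte-sized elements and forward
copying; the names `Regs`, `rep_movsb` and the `budget` parameter (a stand-in
for an interrupt arriving after some number of elements) are hypothetical and
not part of any real emulator or of the x86 specification.

```python
from dataclasses import dataclass


@dataclass
class Regs:
    esi: int  # source pointer
    edi: int  # destination pointer
    ecx: int  # remaining element count


def rep_movsb(mem: bytearray, regs: Regs, budget: int) -> bool:
    """Model "REP MOVSB", copying at most `budget` bytes before stopping.

    Returns True when the copy has finished (ecx == 0) and False when it
    was "interrupted". All progress lives in `regs`, so calling the
    function again simply resumes where it left off.
    """
    while regs.ecx != 0:
        if budget == 0:
            return False  # interrupted: the PC has not advanced
        mem[regs.edi] = mem[regs.esi]  # copy one element
        regs.esi += 1                  # pointers are adjusted after
        regs.edi += 1                  # each element is processed
        regs.ecx -= 1
        budget -= 1
    return True  # only now would the PC move past the instruction


mem = bytearray(b"hello world.....")
regs = Regs(esi=0, edi=8, ecx=5)
rep_movsb(mem, regs, budget=2)    # interrupted after two bytes
# regs is now Regs(esi=2, edi=10, ecx=3)
rep_movsb(mem, regs, budget=100)  # re-executed: finishes the copy
print(mem)                        # bytearray(b'hello wohello...')
```

No state other than the three general registers is needed to resume, which
is the point being made about the absence of a "Sub-PC".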
The misunderstanding is that there is no "Sub-PC" in x86 -- repeatable
operations update the relevant general registers as they proceed, and
the saved PC value on an interrupt or exception will point to the
I have a change to Simple-V that would allow you to throw most of the
current limits out of the proverbial window. Simple-V does *not* "march
across the register file"; instead, Simple-V *replaces* selected ISA
scalar registers with sliding windows onto the vector register memory
during a vector loop. (Your current pseudocode still describes marching
across the register file.) This is very similar to the "vector tail"
model I was proposing as "RVP lanes" a few years ago.
The proposed "sub-PC" represents a problem for exception handling, but
the 8086 "REP" prefix provides precedent for an easy solution: use a
general-purpose (scalar) register as the loop variable. Actually, using
a (programmer-chosen) scalar register as the control-flow loop variable
and recognizing Simple-V as *overlaying* elements of the vector storage
into the scalar register file would allow the *physical* scalar register
corresponding to a vector to be used to store the current offset for
that vector. The simplest implementation would be a multi-port (2R1W)
scratchpad SRAM for the "vector unit", with registers in "vector mode"
actually containing (programmer-invisible) pointers into that scratchpad
during a vector loop.
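A toy model may make the overlay idea concrete. This is a sketch of one
possible reading of the paragraph above, not a description of any existing
Simple-V implementation; the class `VectorOverlay` and its methods `bind`,
`read`, `write` and `step` are hypothetical names chosen for illustration.

```python
class VectorOverlay:
    """Scalar register file whose registers can be overlaid on a scratchpad."""

    def __init__(self, num_regs=32, scratch_words=64):
        self.scalar = [0] * num_regs        # physical scalar registers
        self.scratch = [0] * scratch_words  # "vector unit" scratchpad SRAM
        self.vector_mode = set()            # registers currently overlaid

    def bind(self, reg, base):
        """Put `reg` in vector mode: its physical cell now holds an offset."""
        self.scalar[reg] = base
        self.vector_mode.add(reg)

    def read(self, reg):
        if reg in self.vector_mode:
            return self.scratch[self.scalar[reg]]  # follow hidden pointer
        return self.scalar[reg]

    def write(self, reg, value):
        if reg in self.vector_mode:
            self.scratch[self.scalar[reg]] = value
        else:
            self.scalar[reg] = value

    def step(self, n=1):
        """Slide every overlaid register's window forward by n elements."""
        for reg in self.vector_mode:
            self.scalar[reg] += n


m = VectorOverlay()
m.scratch[0:4] = [1, 2, 3, 4]
m.scratch[8:12] = [10, 20, 30, 40]
m.bind(3, 0)    # r3 -> first source vector
m.bind(4, 8)    # r4 -> second source vector
m.bind(5, 16)   # r5 -> destination vector
m.write(1, 4)   # r1: ordinary scalar register used as the loop variable

while m.read(1) != 0:
    m.write(5, m.read(3) + m.read(4))  # plain scalar "add r5, r3, r4"
    m.step()
    m.write(1, m.read(1) - 1)

print(m.scratch[16:20])  # [11, 22, 33, 44]
```

Each modelled instruction performs two scratchpad reads and one write, which
is why a 2R1W memory is the simplest fit, and the loop state that must
survive an exception is held entirely in registers.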
With a few restrictions on allowed operations related to inter-lane data
transfers, a vector loop can then, for most operations and on
appropriate hardware, be unrolled (by hardware) across however many
vector lanes are actually implemented, with the loop variable advanced
by N (number of implemented vector lanes) on each pass through the loop.
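The hardware unrolling described above can be sketched as follows. This is
an illustrative model only, assuming an element-wise operation with no
inter-lane data transfer; `vector_loop` and its parameters are hypothetical
names, and the inner "for" stands in for lanes that real hardware would run
in parallel.

```python
def vector_loop(op, a, b, lanes):
    """Apply `op` element-wise, handling up to `lanes` elements per pass.

    Returns the result list and the number of passes through the loop.
    """
    n = len(a)
    out = [None] * n
    i = 0        # the loop variable
    passes = 0
    while i < n:
        width = min(lanes, n - i)   # final pass may be partial
        for lane in range(width):   # parallel in real hardware
            out[i + lane] = op(a[i + lane], b[i + lane])
        i += width                  # advanced by N on each full pass
        passes += 1
    return out, passes


a = list(range(10))
b = [10] * 10
add = lambda x, y: x + y

scalar_result, scalar_passes = vector_loop(add, a, b, lanes=1)
wide_result, wide_passes = vector_loop(add, a, b, lanes=4)
print(scalar_passes, wide_passes)    # 10 3
print(scalar_result == wide_result)  # True
```

The same program produces the same result whether one lane or several are
implemented; only the number of passes changes.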
If Simple-V really is intended to march across the register file, then I
propose an alternate "FlexiVec" as I previously described. The
interesting possibility with "FlexiVec" is that it can scale all the way
down to the baseline scalar ISA (with MAXVL=1) and up to arbitrarily
large "hybrid GPU" designs with thousands of vector lanes driven by a
single control unit.