[Libre-soc-bugs] [Bug 1116] evaluate, spec, and implement Vector-Immediates in SVP64 Normal
bugzilla-daemon at libre-soc.org
Sun Jun 11 14:26:35 BST 2023
--- Comment #4 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #2)
> that said, if we're going to have vector immediates at all, they should also
> account for subvl.
*click* yes of course. hmmm is that in 2 bits that is not affected by
anything? (as in: can it be picked up from the prefix and *guaranteed*
to be easy to get? yes it can!)
> it would also be very nice to account for elwid too since
> you have to decode a bunch of the prefix and suffix anyway (see note below):
that's exactly why i'm *not* recommending elwidth be part of it,
precisely because it requires the prefix-suffix combination.
the only "decode" needed is "is this instruction Arithmetic type"
and a special (small) PowerDecode can be used (in our implementation).
it's dead-easy to do: just put filters onto a PowerDecode instance
(in this case "unit" from the CSV files) and voila "if unit == ALU/LOGICAL"
you have the information needed (right at the critical point)
at which point some *further* decode of elwidth is needed.
subvl on the other hand is dead-easy.
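a rough sketch of the "small filtered decode" idea described above. the
table and its rows are made up for illustration: this is not the real
PowerDecode API and not the real CSV contents, just the shape of the check.

```python
# hypothetical mini decode table: mnemonic -> "unit" column.
# in the real implementation this information comes from the CSV files;
# the rows below are illustrative only.
UNIT_TABLE = {
    "addi": "ALU",
    "ori": "LOGICAL",
    "and": "LOGICAL",
    "lwz": "LDST",
    "bc": "BRANCH",
}

# the only question the early decode has to answer
ARITH_UNITS = {"ALU", "LOGICAL"}


def is_arithmetic(mnemonic):
    """True if the instruction's unit is ALU or LOGICAL.

    unknown mnemonics are treated as "not arithmetic".
    """
    return UNIT_TABLE.get(mnemonic) in ARITH_UNITS
```

the point is that the filter answers one yes/no question. anything that
depends on elwidth would need further decode after this step.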
in the interests of not hampering max CPU speed i'm quite happy for
space to be "wasted" here.
which would make it:
extra_immediates = MAXVL - 1
extra_bytes = extra_immediates * 2 * subvl
extra_words = -((-extra_bytes) // 4) # ceil div
NIA = CIA + 8 + 4 * extra_words
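the formula above as a small runnable sketch. assumptions (not stated in
the thread): maxvl is at least 1, subvl is the element count 1..4 rather
than its 2-bit encoding, and the function name is made up here.

```python
def next_insn_addr(cia, maxvl, subvl):
    """NIA for a vector-immediate instruction, following the formula above.

    assumes: maxvl >= 1, subvl in 1..4 (element count, not the encoding),
    2 bytes per extra immediate per subvector element, and 8 bytes for
    the prefix + suffix.
    """
    extra_immediates = maxvl - 1
    extra_bytes = extra_immediates * 2 * subvl
    extra_words = -((-extra_bytes) // 4)  # ceil div: pad up to a whole word
    return cia + 8 + 4 * extra_words


if __name__ == "__main__":
    # MAXVL=1 has no extra immediates: plain 64-bit prefixed instruction
    print(hex(next_insn_addr(0x10, maxvl=1, subvl=1)))
    # MAXVL=2, subvl=1: 2 extra bytes, rounded up to one word
    print(hex(next_insn_addr(0x10, maxvl=2, subvl=1)))
```

note that only MAXVL and subvl are inputs: elwidth does not appear, which
is the whole point of "wasting" the space.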
> XLEN = max(sw, dw) # TODO: account for f16/bf16 being 16/16-bit not 8/16-bit
exactly the kind of nightmare that will punish multi-issue :)
that would need *even more* decoding - now detecting FP-Arithmetic
as separate from Logical/ALU - to work out even how to get the elwidth
there is enough dependency already between prefix and suffix,
making both me (and the ISA WG) jittery.
> this allows trivially loading a vector of 64-bit immediates in one
> instruction -- better than any fli proposed so far.
remember that it is absolutely critical that the scalar instructions
remain orthogonal to "when Vectorised".
we *cannot* have "if Scalar then instruction means something else
if Vector it's different".
this is a HARD inviolate rule (where sv.bc is seriously pushing our luck
on that one, and the only way i can think to sell it is that bc is
"subset" behaviour of sv.bc)
what you are suggesting would involve *different* pseudocode for
*all* impacted instructions:
    if "sv.addi" then
        do something different from addi because the immediate is bigger
    else
        the v3.0/v3.1 addi pseudocode here
i just went through that with Paul: it took ages to work out what i meant.
changes to the "meaning" of an instruction - requiring "if sv.xxx else" -
i am putting my foot down HARD on that.
the consequence is that with neither operands nor scalar-instruction
being any different the *same Decode* may be used for both scalar and
vector, and that's absolutely critical when it comes to high-performance
(speculative) parallel decode.
> decoding note: i expect cpus to generally treat a vector load immediate as a
> unconditional jump --
yyep. which means it has to be REALLY simple.
> this means they don't try to read instructions after
> the load immediate in the same cycle as the load immediate so taking longer
> to decode the length is perfectly fine since the instruction start
> prefix-sum tree can just treat it as a 64-bit instruction and clear out all
> attempted instructions after it, leaving time for the full decoder to decode
> the correct length and redirect fetch to the correct location.
a neat trick:
parallel speculative decode can be carried out, and if constants are
misinterpreted as "instructions" they are skipped over once they are
identified as constants. everything can be done in parallel and the
actual decision deferred.
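a toy model of that deferred decision. it assumes a hypothetical
length_of(words, i) callback that gives the length (in 32-bit words) of
whatever starts at index i *if* index i really is an instruction start.
the toy encoding in the demo is invented for illustration only.

```python
def true_instruction_starts(words, length_of):
    """Speculatively decode every word, then keep only the real starts.

    length_of(words, i) is a hypothetical callback and must return >= 1.
    """
    # step 1: "parallel" speculative decode, one length per word position.
    # positions holding constants produce garbage lengths, which is fine.
    spec_len = [length_of(words, i) for i in range(len(words))]

    # step 2: deferred decision. walk from the first real instruction,
    # hopping by each real length: skipped positions were constants.
    starts = []
    i = 0
    while i < len(words):
        starts.append(i)
        i += spec_len[i]
    return starts


def toy_length(words, i):
    """Invented toy encoding: low 3 bits hold the length, minimum 1."""
    return max(1, words[i] & 0x7)


if __name__ == "__main__":
    # index 1 claims length 4, so indices 2, 3, 4 are constants that
    # merely *look* like instructions of length 3, 2 and 1.
    words = [1, 4, 0xAAA3, 0xBBB2, 0xCCC1, 1]
    print(true_instruction_starts(words, toy_length))
```

step 1 has no dependency between positions, which is what lets it run in
parallel. only step 2 is sequential, and it uses nothing but the lengths.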
if fetch is in 64-byte aligned chunks and performs some parallel
decode, then we have to be careful crossing that boundary:
Power v3.1 public Book I Section 1.6 p11 :
Prefixed instructions do not cross 64-byte instruction
address boundaries. When a prefixed instruction
crosses a 64-byte boundary, the system alignment
error handler is invoked.
so, assuming that the vector-immediate instruction is within such blocks,
if VL is ever greater than 31 we're "in trouble" and at *that* point
the scheme you describe would be activated, but otherwise some speculative
decode is perfectly fine.
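a minimal sketch of the boundary check, reusing the length formula from
earlier in this comment. the function names are made up, and subvl is
assumed to be the element count 1..4.

```python
def vector_imm_length_bytes(maxvl, subvl):
    """Total bytes: 8 for prefix + suffix, plus word-padded immediates."""
    extra_bytes = (maxvl - 1) * 2 * subvl
    extra_words = -((-extra_bytes) // 4)  # ceil div
    return 8 + 4 * extra_words


def crosses_64_byte_boundary(cia, length_bytes):
    """True if the bytes [cia, cia + length_bytes) span two 64-byte blocks."""
    first_block = cia // 64
    last_block = (cia + length_bytes - 1) // 64
    return first_block != last_block


if __name__ == "__main__":
    length = vector_imm_length_bytes(maxvl=9, subvl=1)
    # the same instruction is fine at 0x10 but not at 0x30
    print(length, crosses_64_byte_boundary(0x10, length))
    print(length, crosses_64_byte_boundary(0x30, length))
```

whether a given MAXVL fits therefore depends on where in the block the
instruction starts, not only on its length.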
> it can be treated like a jump so the next instruction address gets added to
> the branch target buffer and the next-pc logic will speculatively fetch from
> the correct location on the next cycle, even before decoding has started.
> demo program:
> 0x08: ori r10, r10, 5
> 0x0c: and r10, r11, r10
> 0x10: sv.addi/w=32 *r3, 0, [0x12345678, 0x9abcdef0] # vector immediate
> 0x20: sv.add/w=32 *r3, *r3, *r3
this is a really nice illustrative example. needs expanding so that
it's clear that the 2nd immediate is in 0x18. and setvl MAXVL=4? 8?
0x10: PO9 sv prefix
0x14: addi (prefixed, contains SI=0x12345678)
0x18: 0x0000_0000 0x0000_0000 0x0000_0000 0x9abcdef0
0x1c: 0x0000_0000 0x0000_0000 0x0000_0000 0x0000_0000
0x20: PO9 sv prefix
0x24: add *r3, *r3, *r3
ok so there's room there for up to 8 additional constants.
so MAXVL=9 is perfectly fine (on this example).
> demo pipeline with 64-bit fetch width
IBM has been doing 64 *byte* wide decode!!
(likely a clean aligned chunk of a L1 cache line)
fetch-width will be mad: the POWER9 and POWER10
have those OpenCAPI 25 gigabit SERDES (quantity lots!)