[Libre-soc-bugs] [Bug 1116] evaluate, spec, and implement Vector-Immediates in SVP64 Normal

bugzilla-daemon at libre-soc.org
Sun Jun 11 14:26:35 BST 2023


https://bugs.libre-soc.org/show_bug.cgi?id=1116

--- Comment #4 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #2)

> that said, if we're going to have vector immediates at all, they should also
> account for subvl.

*click* yes of course.  hmmm, is subvl in 2 bits that are not affected by
anything?  (as in: can it be picked up from the prefix and *guaranteed*
to be easy to get? yes it can!)

> it would also be very nice to account for elwid too since
> you have to decode a bunch of the prefix and suffix anyway (see note below):

that's exactly why i'm *not* recommending elwidth be part of it,
precisely because it requires the prefix-suffix combination.
the only "decode" needed is "is this instruction Arithmetic type"
and a special (small) PowerDecode can be used (in our implementation).
it's dead-easy to do: just put filters onto a PowerDecode instance
(in this case "unit" from the CSV files) and voila "if unit == ALU/LOGICAL"
you have the information needed (right at the critical point)

at which point some *further* decode of elwidth is needed.

subvl on the other hand is dead-easy.

in the interests of not hampering max CPU speed i'm quite happy for
space to be "wasted" here.

which would make it:

   extra_immediates = MAXVL - 1
   extra_bytes = extra_immediates * 2 * subvl
   extra_words = -((-extra_bytes) // 4)  # ceil div
   NIA = CIA + 8 + 4 * extra_words
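
as a sanity check, the calculation above can be written as a small
runnable Python sketch.  the function and argument names are illustrative
only.  it assumes (reading the "* 2" above) that each extra immediate is
2 bytes per subvector element, that the first immediate stays in the
suffix, and that subvl is in the range 1 to 4.

```python
def next_insn_addr(cia, maxvl, subvl):
    """Sketch of the NIA calculation above (names are illustrative)."""
    # assumption: subvl is 1..4, taken straight from the prefix
    if maxvl < 1 or not 1 <= subvl <= 4:
        raise ValueError("expected maxvl >= 1 and subvl in 1..4")
    extra_immediates = maxvl - 1          # first immediate is in the suffix
    extra_bytes = extra_immediates * 2 * subvl
    extra_words = -((-extra_bytes) // 4)  # ceil div to whole 32-bit words
    return cia + 8 + 4 * extra_words      # 8 = 32-bit prefix + 32-bit suffix


if __name__ == "__main__":
    # one extra 2-byte immediate still rounds up to one whole word
    print(hex(next_insn_addr(0x10, 2, 1)))
```

nothing here needs the suffix: only MAXVL, subvl and CIA are inputs,
which is the point of leaving elwidth out.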

> XLEN = max(sw, dw)  # TODO: account for f16/bf16 being 16/16-bit not 8/16-bit

exactly the kind of nightmare that will punish multi-issue :)
that would need *even more* decoding - now detecting FP-Arithmetic
as separate from Logical/ALU - to work out even how to get the elwidth

there is enough dependency already between prefix and suffix,
making both me (and the ISA WG) jittery.

> this allows trivially loading a vector of 64-bit immediates in one
> instruction -- better than any fli proposed so far.

remember that it is absolutely critical that the scalar instructions
remain orthogonal to "when Vectorised".

we *cannot* have "if Scalar then instruction means something else
if Vector it's different".

this is a HARD inviolate rule (where sv.bc is seriously pushing our luck
on that one, and the only way i can think to sell it is that bc is
"subset" behaviour of sv.bc)

what you are suggesting would involve *different* pseudocode for
*all* impacted instructions:

   if "sv.addi" then
       do something different from addi because the immediate is bigger
   else
       the v3.0/v3.1 addi pseudocode here

i just went through that with Paul, took ages to work out what i meant
https://bugs.libre-soc.org/show_bug.cgi?id=1056#c69

changes to the "meaning" of an instruction - requiring "if sv.xxx else"
i am putting my foot down HARD on that.

the consequence is that with neither operands nor scalar-instruction
being any different the *same Decode* may be used for both scalar and
vector, and that's absolutely critical when it comes to high-performance
(speculative) parallel decode.


> decoding note: i expect cpus to generally treat a vector load immediate as a
> unconditional jump --

yyep.  which means it has to be REALLY simple.

> this means they don't try to read instructions after
> the load immediate in the same cycle as the load immediate so taking longer
> to decode the length is perfectly fine since the instruction start
> prefix-sum tree can just treat it as a 64-bit instruction and clear out all
> attempted instructions after it, leaving time for the full decoder to decode
> the correct length and redirect fetch to the correct location.

a neat trick:

parallel speculative decode can be carried out, and if constants are
misinterpreted as "instructions" they are skipped over once they are
identified.
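
a rough sketch of that skip-over step, in Python.  this is not taken from
any implementation: the function name and the per-slot length list are
hypothetical.  it assumes every 32-bit slot has already been decoded "as
if" an instruction started there, giving a speculative length in words,
and that the first slot is a real instruction start.

```python
def resolve_starts(spec_len_words):
    """Pick the real instruction starts out of per-slot speculative decodes.

    spec_len_words[i] is the length (in 32-bit words) that slot i would
    have *if* an instruction really started there.  Slots that turn out
    to be suffixes or constants are simply stepped over.
    """
    starts = []
    i = 0
    while i < len(spec_len_words):
        length = spec_len_words[i]
        if length < 1:
            raise ValueError("speculative length must be at least 1 word")
        starts.append(i)
        i += length  # skips the suffix and any trailing constants
    return starts


if __name__ == "__main__":
    # made-up lengths: two 1-word instructions, a 4-word vector-immediate
    # (prefix, suffix, 2 constant words), then a 2-word prefixed instruction
    print(resolve_starts([1, 1, 4, 1, 1, 1, 2, 1]))
```

the per-slot decodes are independent of each other; only the final walk
over the lengths is sequential.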

everything can be done in parallel and the actual decision deferred.
if fetch is in 64-byte aligned chunks and performs some parallel
decode, then we have to be careful crossing that boundary:

Power v3.1 public Book I Section 1.6 p11 :

    Prefixed instructions do not cross 64-byte instruction
    address boundaries. When a prefixed instruction
    crosses a 64-byte boundary, the system alignment
    error handler is invoked.

so, assuming that the vector-immediate instruction is within such blocks,
if VL is ever greater than 31 we're "in trouble" and at *that* point
the scheme you describe would be activated, but otherwise some speculative
decode is perfectly fine.
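
the "within such blocks" condition can be sketched as a simple check in
Python.  the function name is illustrative, and treating the whole
vector-immediate instruction (prefix, suffix and extra words) as the thing
that must not cross the boundary is the assumption made in the paragraph
above, not something the v3.1 text itself says about vector immediates.

```python
def fits_in_64_byte_block(cia, length_bytes):
    """True if [cia, cia + length_bytes) stays inside one 64-byte block."""
    if length_bytes < 1:
        raise ValueError("length_bytes must be positive")
    first_block = cia // 64
    last_block = (cia + length_bytes - 1) // 64
    return first_block == last_block


if __name__ == "__main__":
    print(fits_in_64_byte_block(0x10, 16))  # well inside the first block
    print(fits_in_64_byte_block(0x3c, 8))   # straddles the 0x40 boundary
```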


> it can be treated like a jump so the next instruction address gets added to
> the branch target buffer and the next-pc logic will speculatively fetch from
> the correct location on the next cycle, even before decoding has started.

awesome :)

> demo program:
> 0x08: ori r10, r10, 5
> 0x0c: and r10, r11, r10
> 0x10: sv.addi/w=32 *r3, 0, [0x12345678, 0x9abcdef0]  # vector immediate
        sv.addi/vi/w=32 ...
> 0x20: sv.add/w=32 *r3, *r3, *r3
> ...

this is a really nice illustrative example. needs expanding so that
it's clear that the 2nd immediate is in 0x18.  and setvl MAXVL=4? 8?

    0x10: PO9 sv prefix
    0x14:     addi (prefixed, contains SI=0x12345678)
    0x18: 0x0000_0000 0x0000_0000 0x0000_0000 0x9abcdef0
    0x1c: 0x0000_0000 0x0000_0000 0x0000_0000 0x0000_0000
    0x20: PO9 sv prefix
    0x24:     add *r3, *r3, *r3

ok so there's room there for up to 8 additional constants.
so MAXVL=9 is perfectly fine (on this example).

> demo pipeline with 64-bit fetch width

IBM has been doing 64 *byte* wide decode!!
(likely a clean aligned chunk of a L1 cache line)
fetch-width will be mad: the POWER9 and POWER10
have those OpenCAPI 25 gigabit SERDES (quantity lots!)


