[Libre-soc-bugs] [Bug 1116] evaluate, spec, and implement Vector-Immediates in SVP64 Normal

Sun Jun 11 11:52:22 BST 2023

https://bugs.libre-soc.org/show_bug.cgi?id=1116

Jacob Lifshay <programmerjake at gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |programmerjake at gmail.com

--- Comment #2 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #1)
>      NIA = CIA + CEIL((MAXVL-1) * 16, 4)

that's wrong. if you want a vector of 16-bit immediates you want:
extra_immediates = MAXVL - 1  # first immediate is in the 32-bit insn
extra_bytes = extra_immediates * 2  # 16 bits = 2 bytes each
extra_words = -((-extra_bytes) // 4)  # ceil div
NIA = CIA + 8 + 4 * extra_words  # 8 = sv prefix + 32-bit insn
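
as a quick sanity check, the calculation above can be wrapped in a small python
function. this is only a sketch: the function name and the error check are made
up for illustration and are not part of any spec.

```python
def nia_16bit_immediates(cia: int, maxvl: int) -> int:
    """NIA for a prefixed insn followed by a vector of 16-bit immediates.

    Hypothetical helper, following the formula above.
    """
    if maxvl < 1:
        raise ValueError("maxvl must be at least 1")
    extra_immediates = maxvl - 1          # first immediate is in the 32-bit insn
    extra_bytes = extra_immediates * 2    # 16 bits = 2 bytes each
    extra_words = -((-extra_bytes) // 4)  # ceiling division by 4
    return cia + 8 + 4 * extra_words      # 8 = sv prefix + 32-bit insn


if __name__ == "__main__":
    for maxvl in (1, 2, 3, 4):
        print(maxvl, hex(nia_16bit_immediates(0x10, maxvl)))
```

with MAXVL=1 there are no extra immediates, so the instruction is a plain 8-byte
prefixed instruction. MAXVL=2 and MAXVL=3 both need one extra word.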

that said, if we're going to have vector immediates at all, they should also
account for subvl. it would also be very nice to account for elwid too since
you have to decode a bunch of the prefix and suffix anyway (see note below):

XLEN = max(sw, dw)  # TODO: account for f16/bf16 being 16/16-bit not 8/16-bit
bytes = (XLEN // 8) * subvl * MAXVL
bytes -= 2  # first 2 bytes potentially encoded in 32-bit insn
bytes += 8  # sv prefix + 32-bit insn
bytes = (bytes + 3) & ~3  # round up to words
NIA = CIA + bytes

this allows trivially loading a vector of 64-bit immediates in one instruction
-- better than any fli proposed so far.
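
the elwid/subvl-aware length calculation above, written as a runnable python
sketch. the function and parameter names are assumptions for illustration, and
it ignores the f16/bf16 TODO in the same way the pseudocode does.

```python
def vector_immediate_nia(cia: int, sw: int, dw: int,
                         subvl: int, maxvl: int) -> int:
    """NIA for a vector-immediate insn (hypothetical helper).

    sw/dw are the source/destination element widths in bits.
    """
    xlen = max(sw, dw)
    nbytes = (xlen // 8) * subvl * maxvl  # total immediate bytes
    nbytes -= 2                  # first 2 bytes potentially in the 32-bit insn
    nbytes += 8                  # sv prefix + 32-bit insn
    nbytes = (nbytes + 3) & ~3   # round up to whole words
    return cia + nbytes


if __name__ == "__main__":
    # two 32-bit immediates at 0x10, as in the demo program further down
    print(hex(vector_immediate_nia(0x10, 32, 32, 1, 2)))
    # four 64-bit immediates
    print(vector_immediate_nia(0, 64, 64, 1, 4))
```

for four 64-bit immediates this gives 32 - 2 + 8 = 38 bytes, rounded up to 40.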

decoding note: i expect cpus to generally treat a vector load immediate as an
unconditional jump. this means they don't try to read instructions after the
load immediate in the same cycle as the load immediate, so taking longer to
decode the length is perfectly fine: the instruction-start prefix-sum tree
can just treat it as a 64-bit instruction and clear out all attempted
instructions after it, leaving time for the full decoder to decode the correct
length and redirect fetch to the correct location.

it can be treated like a jump so the next instruction address gets added to the
branch target buffer and the next-pc logic will speculatively fetch from the
correct location on the next cycle, even before decoding has started.

demo program:
0x08: ori r10, r10, 5
0x0c: and r10, r11, r10
0x10: sv.addi/w=32 *r3, 0, [0x12345678, 0x9abcdef0]  # vector immediate
0x20: sv.add/w=32 *r3, *r3, *r3
...

demo pipeline with 64-bit fetch width:

| cycle | next-pc/BTB  | fetch    | len-decode/tree | decode                |
|-------|--------------|----------|-----------------|-----------------------|
| 0     | 0x08         |          |                 |                       |
|       |              |          |                 |                       |
| 1     | 0x10         | 0x08 ori |                 |                       |
|       | BTB has 0x20 | 0x0c and |                 |                       |
| 2     | 0x20         | 0x10 sv. | 0x08 ori len=4  |                       |
|       |              | 0x14 addi| 0x0c and len=4  |                       |
| 3     | 0x28         | 0x20 sv. | 0x10 sv. len=8  | 0x08 ori     NIA=0x0c |
|       |              | 0x24 add | 0x14 addi len=4 | 0x0c and     NIA=0x10 |
| 4     | ...          | ...      | 0x20 sv. len=8  | 0x10 sv.addi NIA=0x20 |
|       |              |          | 0x24 add len=4  |                       |
| 5     | ...          | ...      | ...             | 0x20 sv.add  NIA=0x28 |
|       |              |          |                 |                       |
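
the next-pc/BTB column of the table above can be reproduced with a toy model.
this is a rough sketch under simplifying assumptions: the BTB is a plain dict
keyed by fetch address, there are no mispredictions or stalls, and all names
are made up for illustration.

```python
def next_pc_sequence(start: int, btb: dict, fetch_bytes: int,
                     cycles: int) -> list:
    """Fetch addresses chosen by a toy next-pc stage, one per cycle."""
    pcs = []
    pc = start
    for _ in range(cycles):
        pcs.append(pc)
        # a BTB hit redirects fetch, otherwise fall through to the next block
        pc = btb.get(pc, pc + fetch_bytes)
    return pcs


if __name__ == "__main__":
    # the sv.addi at 0x10 has NIA=0x20, already recorded in the BTB
    btb = {0x10: 0x20}
    print([hex(pc) for pc in next_pc_sequence(0x08, btb, 8, 4)])
```

with the 0x10 -> 0x20 entry present, fetch skips the immediate data at
0x18..0x1f without waiting for the decoder.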
