[Libre-soc-misc] How not to design instruction sets

Mon Dec 14 22:25:47 GMT 2020

On 12/14/20, Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
> On 12/14/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
>> On Mon, Dec 14, 2020, 11:43 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
>> wrote:
>>
>>> On 12/14/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
>>> > I found this on the RISC-V mailing lists, looks interesting:
>>> > https://player.vimeo.com/video/450406346
>>> >
>>> > it's a talk by one of the x86 AVX512 instruction set designers
>>> > covering the benefits and mistakes of AVX512.
>>>
>>> oo fascinating.  nice find.
>>>
>>
>> One important thing they mentioned is that swizzles should not be
>> combined with ALU ops. Combining them with load/stores is fine though.
> yes i noted that.  expecting people to port SSE code except they just
> chucked it out.
>
>> I'm thinking that if we have the realignment network on the input of the
>> ALUs anyway to handle packing the ALUs fuller when doing predicated ops
>> (AVX512 doesn't do that), then swizzles might be fine to combine with ALU
>> ops anyway.

btw the LDSTBuffer system may be... ahh... different from anything
intel has done, because of design input that originally came from
Mitch.

all LDSTs are independent, each has their own shift-mask.  there's no
vector version of LDST because they're broken down into multi-issue
element LDSTs.

misaligned gets split into *two* 64 bit LDSTs each with their own
byte-level mask, and a funnel checks the alignment at bit 4 of the
address, splitting into odd and even 16-byte requests.  each has its
own mask.
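
a rough python sketch of that split, to make the masks concrete.  this
is only an illustration: the name split_ldst, the 16-bit-per-block mask
layout and the tuple format are all assumptions made for the example,
not the actual LDSTBuffer code.

```python
def split_ldst(addr, nbytes):
    """Split one element LDST (1..8 bytes) into one or two requests,
    each confined to a single 16-byte aligned block.

    Returns a list of (block_addr, is_odd, mask16) tuples, where
    mask16 has one bit per byte of the 16-byte block and is_odd is
    bit 4 of the block address (hypothetical layout)."""
    assert 1 <= nbytes <= 8
    offset = addr & 0xF                         # byte position inside the block
    full_mask = ((1 << nbytes) - 1) << offset   # may spill past bit 15
    lo = full_mask & 0xFFFF                     # bytes in the first block
    hi = full_mask >> 16                        # bytes spilling into the next block
    base = addr & ~0xF
    reqs = [(base, (base >> 4) & 1, lo)]
    if hi:                                      # misaligned: second request needed
        nxt = base + 16
        reqs.append((nxt, (nxt >> 4) & 1, hi))
    return reqs
```

with this layout an 8-byte access at 0x100D becomes an "even" request
for the top three bytes of block 0x1000 and an "odd" request for the
bottom five bytes of block 0x1010.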

L0CacheBuffer has *multiple* of those come in, each with its own mask.

those masks are bit-merged and for every request with the same address
at bits 5 and above there is ONE 128 bit request per cycle to get the
entire cache line.
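
the merge step can be sketched like this.  it assumes that the odd/even
bank (bit 4) together with bits 5 and above identifies one 16-byte
block, so the example simply keys on addr >> 4.  merge_requests and the
(addr, mask) pair format are made up for illustration.

```python
from collections import defaultdict

def merge_requests(reqs):
    """reqs: iterable of (addr, mask16) pairs from independent LDSTs.

    Returns {block_number: merged_mask}, one entry per 16-byte block,
    i.e. one 128-bit request per block however many LDSTs hit it."""
    merged = defaultdict(int)
    for addr, mask in reqs:
        merged[addr >> 4] |= mask   # bit-merge the byte-level masks
    return dict(merged)
```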

once complete *every* LD receives a copy of that same 16 byte request,
they all independently make the relevant shift/mask, they all
independently make a write req to the regfile.
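
as a sketch of the per-LD shift/mask on the returned data: each LD takes
the same 16 bytes and pulls out only its own element.  little-endian
byte order is an assumption here, and extract is a hypothetical name.

```python
def extract(block, offset, nbytes):
    """block: the 16 bytes returned for one request.
    Returns the nbytes-wide value at byte offset, little-endian (assumed)."""
    assert len(block) == 16 and offset + nbytes <= 16
    value = int.from_bytes(block, "little")
    # shift down to the element, then mask off everything above it
    return (value >> (8 * offset)) & ((1 << (8 * nbytes)) - 1)
```

several LDs can call this on the same block with different offsets and
widths, and none of them needs to know about the others.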

so the 512 bit shift-mask engine you may be expecting: it does not
exist.  there are only the independent per-LDST shift-masks.

in other words if we want swizzled LDST vec2/3/4 (with elwidth
overrides) it will have to be done as a reg renaming and
pre/post-processing phase.

each of the LDSTs would be analysed to see which "actual" 64 bit reg
they fit into, and an "alternative" (no swizzle) suite of adjusted
element based LDSTs issued, with completely different shift/mask
offsets.

LDSTs that happen to fit into the same 64 bit reg, no problem, their
shifts/masks can be merged before issue, or they can be left to merge
at the L0CacheBuffer.  it would be best not to burden the
L0CacheBuffer though.
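
a sketch of that analysis phase, under the assumption that elements are
packed contiguously into 64 bit regs starting at a base reg.  the names
element_slot and plan_swizzled_load, and the swizzle encoding (a list
giving the memory element index for each destination element), are
invented for this example only.

```python
def element_slot(base_reg, index, elwidth_bytes):
    """Which 64-bit reg, and which byte offset in it, element `index` lands in."""
    byte_pos = index * elwidth_bytes
    return base_reg + byte_pos // 8, byte_pos % 8

def plan_swizzled_load(base_reg, swizzle, elwidth_bytes):
    """swizzle[i] is the memory element that should end up in dest element i.

    Returns {reg: [(reg_byte_offset, mem_byte_offset, nbytes), ...]}:
    the adjusted, swizzle-free element LDSTs grouped by destination reg.
    Entries sharing a reg are the ones whose shifts/masks could be
    merged before issue."""
    plan = {}
    for dest_index, src_index in enumerate(swizzle):
        reg, reg_off = element_slot(base_reg, dest_index, elwidth_bytes)
        mem_off = src_index * elwidth_bytes
        plan.setdefault(reg, []).append((reg_off, mem_off, elwidth_bytes))
    return plan
```

for a vec4 of 32 bit elements at reg 8 with swizzle [2, 1, 0, 3], this
gives two element LDSTs into reg 8 and two into reg 9, each with its own
memory offset.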

l.