[Libre-soc-dev] svp64 review and "FlexiVec" alternative

Tue Aug 2 20:55:55 BST 2022

just fyi, i subscribe to a few of the threads on comp.arch (only a few,
otherwise it's easy to be overwhelmed). this one came up:

https://groups.google.com/g/comp.arch/c/18_MJmat9_M/m/6xwLndpgDwAJ

excerpt from the conversation:

My 66000 allows vectorized loops to INValidate cache lines and write
over them when it can be determined that the entire cache line will
be written. I use the term "allocate" to denote this capture of a cache
line.
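To illustrate the "entire cache line will be written" condition, here is a minimal sketch for the simplest case, a contiguous store. The 64-byte line size and the function name fully_written_lines are assumptions for illustration only; the excerpt does not say how My 66000 actually makes this determination.

```python
LINE_BYTES = 64  # assumed cache-line size; not stated in the excerpt


def fully_written_lines(base, nbytes, line_bytes=LINE_BYTES):
    """Return the base addresses of cache lines that a contiguous store of
    nbytes starting at byte address base covers completely.

    Only these lines could be "allocated" (overwritten without first being
    read); partially covered lines at either end still need their old data.
    """
    end = base + nbytes
    first = -(-base // line_bytes) * line_bytes  # round base up to a line boundary
    last = (end // line_bytes) * line_bytes      # round end down to a line boundary
    return list(range(first, last, line_bytes))


print(fully_written_lines(32, 128))
```

With these assumptions, a 128-byte store starting at address 32 touches three lines but fully covers only the one at address 64.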

Old Supercomputer applications are KNOWN to strip mine caches, and
for this reason have been mostly cache hierarchy free. Knowing that the
entire cache (or hierarchy) is going to be strip mined could enable the
STs to be write buffered and delivered directly to memory, while LDs
interrogate the cache (hierarchy) on the way to memory accepting hits,
but not replacing miss data; but buffering lines near the L1 for efficient
access. In these situations, the prefetch needs to be hundreds of cycles
in advance of the instantaneous instruction stream for data to arrive
before the machine stalls. So, a 5 GHz processor reading from 20ns
DRAM needs 100 cycles + interconnect latency to run stall free.
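The 100-cycle figure is just clock frequency times DRAM latency. A small sketch of that arithmetic follows; the function name and the interconnect_cycles parameter are hypothetical, since the excerpt gives no number for interconnect latency.

```python
def prefetch_lead_cycles(clock_mhz, dram_latency_ns, interconnect_cycles=0):
    """Cycles by which a prefetch must lead its use to hide DRAM latency.

    cycles = clock[MHz] * latency[ns] / 1000, rounded up, plus whatever
    the interconnect adds (left at 0 here because the excerpt gives no value).
    """
    dram_cycles = -(-(clock_mhz * dram_latency_ns) // 1000)  # ceiling division
    return dram_cycles + interconnect_cycles


# 5 GHz core, 20 ns DRAM, interconnect latency not included
print(prefetch_lead_cycles(5000, 20))
```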

In addition, when Supercomputer applications use gather/scatter, sending
entire lines around the interconnect wastes 7/8ths of the BW.
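The 7/8ths figure is consistent with, for example, one useful 8-byte element per 64-byte cache line. Those two sizes are an assumption made here to reproduce the ratio; the excerpt states only the fraction.

```python
from fractions import Fraction


def wasted_bandwidth(line_bytes=64, useful_bytes=8):
    """Fraction of interconnect bandwidth wasted when a whole cache line is
    moved but only useful_bytes of it are wanted by the gather/scatter.

    The defaults (64-byte line, 8-byte element) are assumed values.
    """
    return Fraction(line_bytes - useful_bytes, line_bytes)


print(wasted_bandwidth())
```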