[Libre-soc-dev] load/store quad and svp64

Tue Apr 12 11:07:11 BST 2022

On Tue, Apr 12, 2022 at 4:10 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> in
> https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/svp64/appendix.mdwn;h=eb9fb4cb158a9379f122d1a9d2042948a133a136;hb=HEAD#l75
>
> lq is stated to be excluded from svp64, imho it should be included since
> 128-bit atomic operations are very useful, even in vector code.

i know you're saying that as a programmer (only) and haven't thought
through the implications for hardware implementors. at all.

atomic 128-bit is fantastically, insanely complex.

not even microwatt does it.
https://github.com/antonblanchard/microwatt/blob/master/decode1.vhdl#L96
https://github.com/antonblanchard/microwatt/blob/master/decode2.vhdl#L423

instead it's done as a non-atomic pair of micro-coded 64-bit operations, and
atomicity is not in the least bit guaranteed.
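
a minimal sketch of what "a pair of micro-coded 64-bit operations" means, as a toy python model. this is NOT microwatt's actual VHDL: the micro-op tuples and the names `expand_quad_load`, `execute`, `half0` and `half1` are made up purely for illustration.

```python
def expand_quad_load(ea):
    """Expand one 128-bit load at address `ea` into two 64-bit micro-ops.

    Hypothetical model: each micro-op is (name, address, destination half).
    """
    return [("ld64", ea, "half0"), ("ld64", ea + 8, "half1")]


def execute(mem, micro_ops):
    """Run the micro-ops one at a time; nothing ties the two together."""
    regs = {}
    for _name, addr, dest in micro_ops:
        regs[dest] = mem[addr]  # a separate memory transaction per micro-op
    return regs


# Toy memory: byte address -> 64-bit doubleword.
mem = {0x1000: 0x0123456789ABCDEF, 0x1008: 0xFEDCBA9876543210}
regs = execute(mem, expand_quad_load(0x1000))
```

each micro-op is its own memory transaction, so anything else that touches memory between the two is free to do so.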

additionally, wishbone is simply not capable of handling data buses wider
than 64 bits, so we would be forced to implement WB burst-mode right the way
through the entire codebase down to the DRAM.

saying "just" implement lq etc. hides roughly FIVE months of work.

> The 128-bit atomic operations (from the OpenPower ISA spec. v3.1 book 2
> section 1.3):

(that's section 1.4, not 1.3)

> They can't be replaced by svp64 vectorized 64-bit load/stores unless svp64
> is modified to additionally guarantee atomicity at the 128-bit size, which
> I don't think is appropriate for 64-bit operations.

thought about vec2/3/4 but even there, 2-write-atomicity is insane at the
hardware level.

the only reason microwatt gets away with it is because it's single-core
only [at the moment] - no SMP.  the simplest implementation to prevent
overlaps (and keep the split dual-operation) would be to have a global
"stall" flag (hardware equivalent to a spinlock).
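
a rough sketch of that global "stall" flag idea, as a cycle-by-cycle python simulation. everything here is an assumption for illustration only (the `bus` dictionary, the `stall_owner` field, the fixed schedule, the generator-per-core model); it is not taken from microwatt or from any Libre-SOC code.

```python
def acquire(bus, core, use_stall):
    # Spin (one wasted cycle per yield) while another core holds the flag.
    while use_stall and bus["stall_owner"] is not None:
        yield
    if use_stall:
        bus["stall_owner"] = core


def release(bus, use_stall):
    if use_stall:
        bus["stall_owner"] = None


def quad_load(bus, core, ea, out, use_stall):
    yield from acquire(bus, core, use_stall)
    lo = bus["mem"][ea]        # first 64-bit micro-op
    yield                      # cycle boundary: other cores may run here
    hi = bus["mem"][ea + 8]    # second 64-bit micro-op
    release(bus, use_stall)
    out.append((lo, hi))


def quad_store(bus, core, ea, lo, hi, use_stall):
    yield from acquire(bus, core, use_stall)
    bus["mem"][ea] = lo        # first 64-bit micro-op
    yield
    bus["mem"][ea + 8] = hi    # second 64-bit micro-op
    release(bus, use_stall)


def simulate(use_stall):
    bus = {"mem": {0: 1, 8: 1}, "stall_owner": None}
    out = []
    tasks = {
        0: quad_load(bus, 0, 0, out, use_stall),
        1: quad_store(bus, 1, 0, 2, 2, use_stall),
    }
    # Fixed schedule (core id per cycle), chosen so that core 1 gets two
    # cycles in between core 0's two micro-ops.
    for core in [0, 1, 1, 0, 1, 1, 1]:
        task = tasks.get(core)
        if task is None:
            continue
        try:
            next(task)
        except StopIteration:
            del tasks[core]
    return out[0], bus["mem"]
```

in this toy model `simulate(False)` lets core 1's store land between core 0's two loads, so core 0 sees the torn pair `(1, 2)`, whereas `simulate(True)` stalls core 1 until the pair completes and core 0 sees `(1, 1)`. the price is exactly what you'd expect from a global lock: every other core burns cycles waiting.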

> One example of where vectorized 128-bit atomics are quite useful is if
> you're trying to vectorize lookups in a parallel hash table with 64-bit
> key/value pairs, you'd want the 128-bit loads to be relaxed-atomic
> operations since otherwise the key or value might be modified by some other
> cpu and you'd get an old key and a new value or something like that rather
> than the correct result of the old key and the old value, or the new key
> and the new value.
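
to make the scenario being described concrete, here is a small python sketch of a lookup on one 64-bit-key / 64-bit-value slot. the slot layout, the `between` hook (standing in for another cpu's store) and all function names are hypothetical, chosen only to script the interleaving deterministically.

```python
def split_lookup(slot, wanted_key, between=None):
    """Lookup using two separate 64-bit loads (no 128-bit atomicity)."""
    key = slot[0]                  # first 64-bit load
    if between is not None:
        between(slot)              # another cpu's update lands here
    value = slot[1]                # second 64-bit load
    return value if key == wanted_key else None


def atomic_lookup(slot, wanted_key, between=None):
    """Lookup modelled as one indivisible 128-bit load."""
    key, value = slot[0], slot[1]  # both halves observed together
    if between is not None:
        between(slot)              # the update can only land before or after
    return value if key == wanted_key else None


def writer(slot):
    # Another cpu replaces the whole entry: new key AND new value.
    slot[0], slot[1] = 0xBBBB, 200


torn = split_lookup([0xAAAA, 100], 0xAAAA, between=writer)
clean = atomic_lookup([0xAAAA, 100], 0xAAAA, between=writer)
```

here `torn` is 200, i.e. the old key matched but the value returned belongs to the new key, while `clean` is 100, the old key with the old value.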

remember: the Power ISA hardware engineers are at the *BILLION*
transistor level with over 25 years of history, experience, and funding.
they've done an "incremental" upgrade through 10 revisions of the
IBM POWER {n} hardware which has masked the fact that they no
longer know what is achievable and reasonable for a small team to
implement, from scratch.

we may not have a choice about implementing lq (etc) at the
v3.0 scalar level, although falling back to exceptions may be possible,
because older versions of IBM POWER {n} hardware do not have it.

however putting lq into SVP64 is absolutely out of the question.

l.