[Libre-soc-isa] [Bug 560] big-endian little-endian SV regfile layout idea

Tue Jan 5 04:21:45 GMT 2021

https://bugs.libre-soc.org/show_bug.cgi?id=560

--- Comment #65 from Alexandre Oliva <oliva at libre-soc.org> ---
now let's look at the implications of byte endianness for vectors of bytes,
shall we?

v8qi x = { [0] = 0x21, [1] = 0x22, [2] = 0x23, [3] = 0x24,
           [4] = 0x25, [5] = 0x26, [6] = 0x27, [7] = 0x28 };

regardless of endianness, ((char*)&x)[0] == 0x21 and ((char*)&x)[7] == 0x28,
the same as if we had a char[8]:

char y[8] = { [0] = 0x21, [1] = 0x22, [2] = 0x23, [3] = 0x24,
              [4] = 0x25, [5] = 0x26, [6] = 0x27, [7] = 0x28 };

the expectation is that if you memcpy from an array to a vector, or vice versa,
the elements remain in the same order.  a different memory representation for
vectors would break this.
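a minimal sketch of that expectation, assuming GCC or Clang (whose vector extension provides `vector_size` and element subscripting) and assuming `v8qi` is a vector of 8 chars; the helper name `memcpy_preserves_order` is hypothetical, made up for illustration.

```c
#include <string.h>

/* assumption: v8qi is the GCC/Clang vector-extension type of 8 chars */
typedef char v8qi __attribute__ ((vector_size (8)));

/* hypothetical helper: copy an 8-byte array into a vector and back, and
   report whether vector element i, byte i of the vector's memory image,
   and byte i of the round-tripped array all still equal src[i] */
static int
memcpy_preserves_order (const char src[8])
{
  v8qi v;
  char back[8];

  memcpy (&v, src, sizeof v);    /* array -> vector */
  memcpy (back, &v, sizeof v);   /* vector -> array */

  for (int i = 0; i < 8; i++)
    if (v[i] != src[i] || ((const char *) &v)[i] != src[i]
        || back[i] != src[i])
      return 0;
  return 1;
}
```

with the y above, memcpy_preserves_order (y) returns 1 on both LE and BE hosts, because the compiler indexes vector elements in memory order.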

the normal expectation is that, if data is laid out in memory in accordance
with the selected endianness, it is loaded from memory into registers using
regular loads, rather than with byte-reversing loads, right?

if we load the byte vector element-wise, we get regX[0] = 0x21 and regX[7] =
0x28, *regardless* of how we lay out the elements within the register.

this is good, and is the best you can do if data is not more aligned than
strictly needed, or if vector sizes might be clamped at less than 8 byte-sized
elements.

but from a performance standpoint, loading the bytes separately is quite
inefficient.  it would be far more desirable to be able to load dwords rather
than bytes, since that's 8 bytes per effective instruction, instead of just 1.

so, what does it take for the iteration order within the register to allow
wide memory loads in the CPU/system-selected endianness?

in LE, the x above gets loaded by ld as 0x2827262524232221, as in comment 64.
so, in order for the iteration order to match the declaration, element 0 is at
bits 2^0..2^7, element 1 is at bits 2^8..2^{15}, and so on.
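the LE case can be modelled in portable C, a sketch only: `load_dword_le` and `elem_le` are hypothetical names, the load is built byte by byte so the result does not depend on the host's own endianness, and bit positions count from the least significant bit, as in the 2^n notation above.

```c
#include <stdint.h>

/* hypothetical model of an LE ld: memory byte k lands at weights
   2^(8k)..2^(8k+7) of the 64-bit register */
static uint64_t
load_dword_le (const unsigned char mem[8])
{
  uint64_t r = 0;
  for (int k = 0; k < 8; k++)
    r |= (uint64_t) mem[k] << (8 * k);
  return r;
}

/* element i of a byte vector in an LE-loaded register sits at
   weights 2^(8i)..2^(8i+7) */
static unsigned
elem_le (uint64_t reg, int i)
{
  return (unsigned) (reg >> (8 * i)) & 0xff;
}
```

loading the bytes 0x21..0x28 gives 0x2827262524232221, and elem_le (r, i) returns memory byte i for every i, so one dword load feeds all 8 elements in declaration order.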

in BE, however, the x above gets loaded by ld as 0x2122232425262728, as in
comment 60, so, in order for the iteration order to match the declaration,
element 0 has to be at bits 2^{56}..2^{63}, element 1 at bits 2^{48}..2^{55},
and so on.
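the BE case, sketched the same way: `load_dword_be` and `elem_be` are hypothetical names, the load is built byte by byte so the host's endianness does not matter, and bit positions count from the least significant bit, as in the 2^n notation above.

```c
#include <stdint.h>

/* hypothetical model of a BE ld: memory byte k lands at weights
   2^(56-8k)..2^(63-8k) of the 64-bit register */
static uint64_t
load_dword_be (const unsigned char mem[8])
{
  uint64_t r = 0;
  for (int k = 0; k < 8; k++)
    r = (r << 8) | mem[k];
  return r;
}

/* element i of a byte vector in a BE-loaded register sits at
   weights 2^(56-8i)..2^(63-8i) */
static unsigned
elem_be (uint64_t reg, int i)
{
  return (unsigned) (reg >> (56 - 8 * i)) & 0xff;
}
```

loading the bytes 0x21..0x28 gives 0x2122232425262728, and elem_be (r, i) again returns memory byte i for every i; the element slots are simply counted from the most significant end instead of the least significant one.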

this is not reversing anything, nor introducing any computations anywhere;
it's just making the svp64 loop iterate over multiple elements in the same
register in the same order you'd go over them if they were in memory.
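restating the two layouts as bit weights, for element i of an 8-element byte vector held in one 64-bit register:

```latex
\text{LE: } 2^{8i} \ldots 2^{8i+7}, \qquad
\text{BE: } 2^{56-8i} \ldots 2^{63-8i}, \qquad 0 \le i \le 7
```

for i = 1 this gives 2^8..2^{15} in LE and 2^{48}..2^{55} in BE; in both cases element i holds memory byte i, and only the end of the register at which the count starts differs.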

that's all I am suggesting, not any of the other modifications you're
alluding to and apparently freaking out about.  can you please describe the
change I'm suggesting, in your own words, so that we can make sure we're not
talking way past each other any more?
