[Libre-soc-isa] [Bug 571] svp64 vector loads: sub-dword selection before or after byte-reversal

Thu Jan 7 01:06:46 GMT 2021

https://bugs.libre-soc.org/show_bug.cgi?id=571

--- Comment #2 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
a copy of that pseudocode from the other bugreport.  bear in mind that this is
"unit strided" mode, which increments the (normally fixed, constant) immediate
offset by an additional amount (in bytes), src_elwidth/8.

the pseudocode will therefore be as follows (assume src_elwidth=64 to indicate
64-bit reads):

    function op_ld(rd, rs, brev) # LD not VLD! (ldbrx if brev=True)
      for (int i = 0, int j = 0; i < VL && j < VL;):

        # unit stride mode, compute the address (src_elwidth is in bits,
        # so each element advances by src_elwidth/8 bytes)
        srcbase = ireg[rs] + i * (src_elwidth / 8);

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread = mem[srcbase + imm_offs];

        # optionally performs 8-byte swap (because src_elwidth=64)
        if (bytereverse):
            memread = byteswap(memread, src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(rd, dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

so the first question was: is sub-dword selection before or after bytereversal?
well, the question as asked does not make sense.

the offset selects the *area* of memory containing the element.  there is
absolutely no relation between the element indexing and the order of the bytes
*in* the element.
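
a minimal python sketch of the loop above, to make that independence concrete.
assumptions (mine, not from the spec): memory is a plain `bytes` object, the
result is returned as a list of integers rather than written through
set_polymorphed_reg, and the helper names are illustrative only.

```python
def byteswap(value, width_bits):
    """reverse the byte order of a width_bits-wide unsigned integer."""
    nbytes = width_bits // 8
    return int.from_bytes(value.to_bytes(nbytes, "little"), "big")


def op_ld(mem, base, imm_offs, vl, brev, msr_le, src_elwidth=64):
    """unit-strided load of vl elements; returns the loaded values."""
    nbytes = src_elwidth // 8
    # merges processor LE/BE and ld/ldbrx: XNOR on booleans is equality
    bytereverse = (brev == msr_le)
    dest = []
    for i in range(vl):
        # the element index i ONLY selects the area of memory
        addr = base + imm_offs + i * nbytes
        # raw read, lowest address treated as the LSByte
        memread = int.from_bytes(mem[addr:addr + nbytes], "little")
        # byte order is handled entirely *within* the element
        if bytereverse:
            memread = byteswap(memread, src_elwidth)
        dest.append(memread)
    return dest


mem = bytes(range(16))  # bytes 0x00..0x0f
le = op_ld(mem, 0, 0, vl=2, brev=False, msr_le=True)
be = op_ld(mem, 0, 0, vl=2, brev=False, msr_le=False)
print([hex(v) for v in le])
print([hex(v) for v in be])
```

in both runs element 0 comes from bytes 0-7 and element 1 from bytes 8-15;
only the order of the bytes inside each element differs.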

the only possible interpretation of the question which might make sense is
illustrated by the ARM NEON LDR (Load-Reverse) instruction, where they perform
*total* byte-reversal, bytes 0-15 in memory get placed into register bytes 15-0

the pseudocode as listed above SPECIFICALLY does not do that.  however bear in
mind that the pseudocode is drastically simplified: REMAP has been removed for
example.
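
to illustrate the difference between the two behaviours, here is a small
python sketch contrasting byte-reversal *within* each element (what the
pseudocode does) with *total* byte-reversal of a 16-byte block.  the function
names and the 8-byte element size are assumptions chosen for illustration.

```python
def reverse_per_element(data, elbytes):
    """reverse bytes inside each element; element order is unchanged."""
    out = bytearray()
    for i in range(0, len(data), elbytes):
        out += data[i:i + elbytes][::-1]
    return bytes(out)


def reverse_total(data):
    """reverse the whole block: byte 0 swaps places with the last byte."""
    return data[::-1]


mem = bytes(range(16))  # bytes 0x00..0x0f
per_element = reverse_per_element(mem, 8)
total = reverse_total(mem)
print(per_element.hex())
print(total.hex())
```

with per-element reversal the first element still comes from bytes 0-7; with
total reversal the elements themselves also swap places.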

REMAP ***IS*** capable of performing the same duties as NEON LDR (and then
some)

but let us get clear first about the basics before moving on to that.

* the standard behaviour of SV ld in unit strided mode goes linearly
one-for-one contiguously through memory as it goes contiguously up the register
numbers

* bytereversal as defined and required for v3.0B compliance *REQUIRES* the XNOR
oddness which removes endianness at the memory level and places data into
registers that, internally, become DEFINED as NEON-like in behaviour.  byte 0
contains the LSByte; bit 0 contains the LSB.

* AT NO TIME (when REMAP is inactive) are any other reorderings, remappings, or
definitions in play.

* AT NO TIME (when REMAP is inactive) will elements be anything other than
linear, sequential and contiguous, both for src in computing the unit stride
memory offset and for dest in picking the target register
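
the XNOR in the second point above can be tabulated with a short python
sketch.  it assumes, as the pseudocode does, that the raw memory read puts the
lowest-addressed byte into byte 0; the function name is hypothetical.

```python
def needs_byteswap(brev, msr_le):
    """brev XNOR MSR.LE: true exactly when both inputs are equal."""
    return brev == msr_le


for brev in (False, True):
    for msr_le in (False, True):
        insn = "ldbrx" if brev else "ld"
        mode = "LE" if msr_le else "BE"
        print(f"{insn:5} {mode}: byteswap={needs_byteswap(brev, msr_le)}")
```

a plain ld in LE mode and an ldbrx in BE mode need no swap; the other two
combinations do.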

the next complication is elwidth overrides (which was where the old SV appendix
came in handy)

the dest elwidth part is easy: the registers are defined via the typedef union,
and by the set/get polymorphic pseudocode, and with the ordering of elements
clear (linear, byte 0 given index 0) and the internal element definition also
being clear (linear, LE) i.e. exactly as NEON, the placement of elements is
straightforward.
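
a hedged python sketch of that element placement, modelling the register file
as one flat little-endian byte array.  the class, the 128-register count and
the 8-byte register size are assumptions for illustration; this is not the
actual typedef union or the set/get pseudocode from the appendix.

```python
REG_BYTES = 8  # assumed 64-bit scalar registers


class RegFile:
    """register file viewed as one flat little-endian byte array."""

    def __init__(self, nregs=128):
        self.raw = bytearray(nregs * REG_BYTES)

    def set_polymorphed_reg(self, rd, bitwidth, j, value):
        nbytes = bitwidth // 8
        start = rd * REG_BYTES + j * nbytes  # element j, counted from byte 0
        value &= (1 << bitwidth) - 1         # keep only bitwidth bits
        self.raw[start:start + nbytes] = value.to_bytes(nbytes, "little")

    def get_polymorphed_reg(self, rs, bitwidth, j):
        nbytes = bitwidth // 8
        start = rs * REG_BYTES + j * nbytes
        return int.from_bytes(self.raw[start:start + nbytes], "little")


regs = RegFile()
for j, v in enumerate([0x1111, 0x2222, 0x3333, 0x4444, 0x5555]):
    regs.set_polymorphed_reg(3, 16, j, v)
print(hex(regs.get_polymorphed_reg(3, 64, 0)))  # r3 read as one 64-bit value
print(hex(regs.get_polymorphed_reg(4, 64, 0)))  # fifth element lands in r4
```

element 0 occupies the lowest bytes of the register, and a fifth 16-bit
element simply continues into the next register number.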

the src elwidth, due to the fact that it is memory, is where it gets odd.

bear in mind we have **THREE** widths here (!)

* ld/lw/lh/lb i.e. the original operation width
* src elwidth override
* dest elwidth override (covered already)

we therefore take the ACTUAL width and the ACTUAL LD as an ACTUAL fully
compliant v3.0B LD operation.

this means including the quirky byte-reversal which, as already discussed
above, REMOVES any and all evidence of byte-ordering from the data.

now.

***AFTER*** that data is loaded (which will have been at a nonaligned
location), and LE/BE taken care of, we now have a byte, or a hword, or a dword
etc, that is in its correct Arithmetic Order, with its bit 0 being in bit 0,
and byte 0 being in byte 0.  LSByte is in byte0, LSB is in bit 0.

now - *now* - we have to perform src elwidth adjustment.

* for a lh operation which loaded 2 bytes, if src elwidth=32 then this would
involve zero-extending to 32 bits

* for a ld operation which loaded 8 bytes, if src elwidth=32 then this would
involve *truncation* to 32 bits.

etc. etc.
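
the two bullets above reduce to a single masking operation once the loaded
value is in arithmetic order.  a minimal python sketch (the function name is
hypothetical, and only unsigned values are considered):

```python
def adjust_width(value, to_bits):
    """truncate or zero-extend an unsigned value to to_bits bits."""
    return value & ((1 << to_bits) - 1)


lh_value = 0xBEEF              # 2 bytes loaded by lh
ld_value = 0x1122334455667788  # 8 bytes loaded by ld

print(hex(adjust_width(lh_value, 32)))  # zero-extended: value unchanged
print(hex(adjust_width(ld_value, 32)))  # truncated to the low 32 bits
```

because python integers are unbounded, zero-extension leaves the value
unchanged and only truncation actually drops bits.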

whilst this may seem weird and redundant because, oink, there is going to be
dest elwidth override too, you have to bear in mind that SATURATION Mode can be
applied, and that goes IN BETWEEN src and dest elwidth overrides.
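
a small python sketch of why the ordering matters.  the saturation rule used
here (a simple unsigned clamp to the destination width) and the chosen widths
are assumptions for illustration only, not the defined SV saturation
behaviour.

```python
def truncate(value, bits):
    """keep only the low `bits` bits (src elwidth adjustment)."""
    return value & ((1 << bits) - 1)


def saturate_unsigned(value, bits):
    """clamp to the largest unsigned value that fits in `bits` bits."""
    return min(value, (1 << bits) - 1)


loaded = 0x00010005  # value read by a 32-bit load

# src elwidth=16 truncates first, then saturation to an 8-bit dest
with_src = saturate_unsigned(truncate(loaded, 16), 8)
# skipping the src step saturates the full 32-bit value instead
without_src = saturate_unsigned(loaded, 8)

print(hex(with_src), hex(without_src))
```

the two results differ, so the src adjustment is not redundant with the dest
override once saturation sits between them.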

we can therefore have a case where:

* lw loads 32 bit elements
* src override is 16 which truncates
* dest does not have an override so the data (now 16 bits long) is placed in a
full 64 bit register and the upper 48 bits set to zero.

and many others that make for spectacularly comprehensive combinations.

i leave it at that for now, i will re-read 570 sections on elwidths to see if
those were valid.
