[Libre-soc-isa] [Bug 1056] questions and feedback (v2) on OPF RFC ls010

Wed May 31 14:13:44 BST 2023

https://bugs.libre-soc.org/show_bug.cgi?id=1056

--- Comment #35 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Paul Mackerras from comment #30)

> I think you mean sv.addi/elwidth=16 5,5,0x1122 (not 5,_0_,0x1122).

ah! yes

> I'll assume the 0 for RA is a typo caused by 3.27AM.
> 
> > * then inspect (verilator) GPR(5) and read its contents
> > 
> > is the answer you expect, regardless of LE/BE: 0x2356?
> > or would it be 
> > * 0x2211_0000_0000_1234 (or 0x1122_0000_0000_1234) *or*
> > * 0x0000_0000_0000_3456 due to addi being implicitly
> >   reversed-byte-order from sv.addi under BE?
> 
> I would expect 0x1122_0000_0000_1234 in BE mode, since you have operated on
> element 0 and elements are 16 bits wide.

ahhh now *that* makes it clear.  and is so far left-field of what i
was modelling/expecting from the combinatorial explosion of possibilities
that i couldn't possibly guess it :)

now, here's the thing (walk through the implications).  where the LE
element-access would be this:

     # assume everything LE-ordered and LSB-numbered
     gpr_width = 8 # bytes
     num_gprs = 128 # in "upper" SV Compliancy Levels
     GPR_sram = [0x00] * gpr_width * num_gprs
     src_elbytes = src_elwidth // 8
     for i in range(VL):
         bytenum = i * src_elbytes # element offset in SRAM bytes
         ra_element_start = RA*gpr_width     # vector start position
         ra_element_start += bytenum # element offset
         ra_element_end   = ra_element_start + (src_elbytes-1)
         ra_src_operand = GPR_sram[ra_element_start:ra_element_end+1]

a BE-reversal of the underlying SRAM-access would be:

     # *still* assume everything LE-ordered and LSB-numbered
     gpr_width = 8 # bytes
     num_gprs = 128 # in "upper" SV Compliancy Levels
     GPR_sram = [0x00] * gpr_width * num_gprs
     src_elbytes = src_elwidth // 8
     for i in range(VL):
         offset = i * src_elbytes           # element offset in SRAM bytes
         gpr_num = offset // gpr_width      # relative GPR number  
         bytenum = offset %  gpr_width      # byte-start in GPR
---->    bytenum = ~bytenum & 0b1111_1111   # BE-inversion
         # now finally we know the element-offset start pos
         ra_element_start = (gpr_num * gpr_width) + bytenum
         ra_element_start += RA*gpr_width     # add vector start position
         ra_element_end   = ra_element_start + (src_elbytes-1)
         ra_src_operand = GPR_sram[ra_element_start:ra_element_end+1]

at which point i think you'd agree that trying to explain that to
programmers, that this is the underlying model, would be a bit much :)
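
as a rough, runnable sketch of the two address calculations above (not
the spec's model): `element_byte_range` and `GPR_WIDTH` are hypothetical
names, and the BE case *assumes* element 0 sits at the most-significant
end of each GPR, as in the 0x1122_0000_0000_1234 answer. it uses a
subtraction rather than the bit-inversion shown above, so that a
multi-byte element stays inside its GPR.

```python
GPR_WIDTH = 8  # bytes per 64-bit GPR


def element_byte_range(ra, i, elwidth, big_endian=False):
    """Return (start, end), inclusive, of element i in a flat register-file
    byte array that is LE-ordered and LSB-numbered, for a vector at GPR ra."""
    elbytes = elwidth // 8
    offset = i * elbytes                 # element offset in bytes
    gpr_num = ra + offset // GPR_WIDTH   # GPR that the element lands in
    bytenum = offset % GPR_WIDTH         # LE byte-start within that GPR
    if big_endian:
        # assumption: element 0 occupies the most-significant bytes
        bytenum = GPR_WIDTH - elbytes - bytenum
    start = gpr_num * GPR_WIDTH + bytenum
    return start, start + elbytes - 1


# element 0 of a 16-bit vector starting at GPR(5), LE then BE
print(element_byte_range(5, 0, 16))
print(element_byte_range(5, 0, 16, big_endian=True))
```

even in this reduced form the start position depends on both the
endianness and the element width.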

> > now the same thing with *scalar* instructions:
> > 
> > * let us set (verilator or "addi 5,0,0x1234") the contents of GPR(5) = 0x1234
> > * perform "addi 5,0,0x1122"
> > * then inspect (verilator) GPR(5) and read its contents
> > 
> > is it *still* 0x2356 regardless of LE/BE?
> 
> It's 0x2356 regardless of LE/BE.

and that discrepancy is a violation of (one of the) Orthogonality rule(s).
when MAXVL=VL=1 the behaviour *has* to be the same.

let us imagine that a programmer is converting Scalar Power Assembler
to SVP64.  they are doing so on a BE system.  assume that
GPR(5) starts out with a value 0x0000_1144_5566_7788. they do this:

     # old code
     addi 5,0,0x1122
     addis 5,5,0x3344
     # new code
     setvli MAXVL=VL=1
     sv.addi/elwidth=16 5,0,0x1122
     sv.addis/elwidth=32 5,5,0x3344

and then they inspect the contents of GPR(5) and find that it's not
0x0000_0000_3344_1122 which you'd get from running the two scalar
instructions, it's... this may not be correct...

    after the sv.addi/ew=16    0x1122_1144_5566_7788
    after the sv.addis/ew=32   0x4466_1144_5566_7788

!!!!! :)

they then run that in LE and get this:

     0x0000_1144_5566_7788 +
       0000 0000 0000 1122 +
       0000 0000 3344 0000

=      0000 1144 88aa 88aa

at which point their brains explode.

unpacking what the hell happened there (LE):

* sv.addi/ew=16 sets *two* byte-write-enable lines on GPR(5)
  leaving the entire upper 6 bytes *untouched*
* sv.addis/ew=32 sets the bottom *4* byte-write-enable lines
  leaving the entire upper 4 bytes untouched.

there is mad interaction between BE-offsets because the starting-point
for *elements within a given GPR* are critically dependent on the
operation width, and inversion of those starting-points becomes a
really crucial thing for the programmer to understand.
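
the figures above can be reproduced with a small model of "operate on
element 0, write back only elwidth bits". this is a sketch under
assumptions: `sv_add_element0` is a hypothetical helper, BE is taken to
place element 0 at the most-significant end of the GPR, and both
operations are treated as accumulating into r5 (RA=5), which is what
the numbers in the text correspond to.

```python
MASK64 = (1 << 64) - 1


def sv_add_element0(reg, imm, elwidth, big_endian=False):
    """Add imm to element 0 of a 64-bit register, writing back only
    elwidth bits (a stand-in for the byte-write-enable lines)."""
    # assumption: BE places element 0 at the most-significant end
    shift = 64 - elwidth if big_endian else 0
    mask = (1 << elwidth) - 1
    element = (reg >> shift) & mask
    result = (element + imm) & mask   # carry out of the element is dropped
    return (reg & ~(mask << shift) & MASK64) | (result << shift)


start = 0x0000_1144_5566_7788
for be in (True, False):
    r5 = sv_add_element0(start, 0x1122, 16, be)      # sv.addi/ew=16
    r5 = sv_add_element0(r5, 0x3344 << 16, 32, be)   # sv.addis/ew=32
    print("BE" if be else "LE", f"{r5:#018x}")
```

the bytes outside the selected element are never written, which is why
the BE and LE results diverge so sharply.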

> If you did sv.addi/elwidth=64 5,5,0x1122 then the answer would be 0x2356
> regardless of BE/LE.

which means unfortunately that if you had a vector of elements to
add where you know the result fits in 16 bits (Audio/Video), 3/4
of the regfile is unused.

now, REMAP *can* actually handle this type of element-reordering.
in effect what you are proposing is:

* BE ew=8, element ordering:
       MSB0          MSB63
       LSB63         LSB0
  GPR0 7 6 5 4 3 2 1 0
  GPR1 ........... 9 8

* BE ew=16, element ordering:
       MSB0    MSB63
       LSB63   LSB0
  GPR0   3  2  1  0
  GPR1   7  6  5  4
  GPR2 .....   9  8

* BE ew=32, element ordering:
       MSB0    MSB63
       LSB63   LSB0
  GPR0      1     0
  GPR1      3     2
  GPR2      5     4
  ...

and looking at those sequences, svshape2 can handle them each:

* BE ew=8     svshape2 xdim=8, xinv=yes
* BE ew=16    svshape2 xdim=4, xinv=yes
* BE ew=32    svshape2 xdim=2, xinv=yes

and you can also specify *which operands should be so re-ordered* (!!!)

as in: if you wanted to you could set RA and RB to be BE-reordered,
but leave RT in *LE*-reordered numbering (!!!)

do you have to use different svshape2 instructions to get reordering
depending on different elwidths? yes.  do i see this as a problem?
mmmm... honestly, no.
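
the index sequences in the tables above (each row read left to right)
can be generated by reversing element order within each 64-bit
register. this is a plain-python model of that permutation only, with
xdim assumed to be 64 // elwidth. `be_element_order` is a hypothetical
name and the code does not model the svshape2 instruction itself.

```python
def be_element_order(elwidth, count):
    """Index sequence from reversing the element order inside each group
    of xdim = 64 // elwidth elements (i.e. within each 64-bit register)."""
    xdim = 64 // elwidth
    return [(i // xdim) * xdim + (xdim - 1 - i % xdim) for i in range(count)]


# two registers' worth of elements for each element width
for ew in (8, 16, 32):
    print(ew, be_element_order(ew, 2 * (64 // ew)))
```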

> > checking (2) memory-to-register:
> > 
> > what about the same conditions (MAXVL=VL=1, a half-word load)
> > with lhbrx vs lhx?
> > 
> > * sv.lhbrx vs lhbrx, BE: same value loaded?
> > * sv.lhbrx vs lhbrx, LE: same value loaded?
> 
> What are you assuming the element size is?

sigh, it used to be over-rideable up until about 2 weeks ago.
that's the way it was for 18 months.

but finally sanity asserted itself and the *data* elwidth is now
always the same as the *operation* width.

* lh ew=16
* lb ew=8
* lw ew=32
* ld ew=64

(note that there are still elwidth over-rides on the *Effective Address*
 calculation for LD/ST-Indexed "RB". i am currently wading through a
 really intrusive slightly scary spec update on that).

> I am not clear at this point on how the element size affects loads and
> stores. Does an element size of 16 bits mean that a load does 1/4 of the
> usual number of bits, for instance?

sv.ld/ew=16 64//16=4?

no, i decided for sanity to preserve the relationship "elwidth=opwidth".
loading only 1 bit (sv.lb/ew=8) would be a step too far i feel.

> > if the answer in all cases (m2r&r2r) is "yes", then this is what i mean
> > by "instructions must be Orthogonal regardless of Prefix/Non-prefix"
> 
> I'm not sure what "yes" would mean in the addi case above.

hence i went through the example.

>  In any case, I
> would note that addi will in general give a different result from
> sv.addi/elwidth=16 in LE mode as well as in BE mode. For example, suppose r5
> contains 0xffff initially.
> 
> addi 5,5,1 will give 0x10000 in r5
> sv.addi/elwidth=16 5,5,1 will give 0 in r5 (assuming VL=1 and LE mode).

yes it will! moreover, if r5 contained 0xffff_ffff_ffff_ffff then it
would be 0x0000_0000_0000_0000 in r5 after addi 5,5,1 but after
sv.addi/elwidth=16 5,5,1 it would be 0xffff_ffff_ffff_0000

"sv.addi." (Rc=1) gets interesting, too. another time.

and.. drat there is no "addio" darnit.  "sv.addio/ew=16" would have
dropped the 17th bit into XER.CA

that's slightly annoying but not the end of the world.
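
a quick check of the wrap-around cases discussed above, assuming LE,
VL=1 and a non-negative immediate. `add_elwidth` is a hypothetical
helper. the carry it returns is only the arithmetic "17th bit" for
elwidth=16 and is not tied to any existing instruction or XER field.

```python
MASK64 = (1 << 64) - 1


def add_elwidth(value, imm, elwidth=16):
    """Add imm to the low elwidth bits of value, leaving the upper bits
    untouched. Returns (new_value, carry_out_of_element)."""
    mask = (1 << elwidth) - 1
    total = (value & mask) + imm
    carry = (total >> elwidth) & 1
    new_value = (value & ~mask & MASK64) | (total & mask)
    return new_value, carry


for r5 in (0xFFFF, 0xFFFF_FFFF_FFFF_FFFF):
    scalar = (r5 + 1) & MASK64            # addi 5,5,1
    vector, carry = add_elwidth(r5, 1)    # sv.addi/elwidth=16 5,5,1
    print(f"{scalar:#x} {vector:#x} carry={carry}")
```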

> I don't understand what problem these solutions are trying to solve. None of
> them seem to me to be necessary or even desirable. You keep introducing byte
> reversal, which is not ever required by my proposal.

i didn't understand it fully up to now.  the "0x1122_0000_0000_1234"
finally clinched it.

> In fact, depending on how elwidth affects loads and stores, there may be
> another answer to my original concern about loading an array of values into
> registers. It's possible that doing sv.ld/elwidth=16 r3,0(r4) with VL=4 will
> load four 16-bit elements into r3 in the right order for future operations,
> but I don't know for sure.

yes. Packed Elements. very similar to MMX.   wait... immediate-of-zero,
that is a special meaning (Vector LD/ST is always complex, no matter the
ISA, but retro-fitting on top of Scalar LD/ST made things especially
hairy).

https://libre-soc.org/openpower/sv/ldst/

    The els bit is only relevant when RA.isvec is clear: this
    indicates whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

and the relevant pseudocode:

        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # we want this one
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width)  # we don't want this one
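
the two offset formulas can be tried out directly. a minimal sketch:
`ldst_offsets` is a hypothetical name, `immed` and `op_width` are
*assumed* to be in bytes, and the values passed in are arbitrary
illustrations rather than anything taken from the spec.

```python
def ldst_offsets(mode, vl, immed, op_width):
    """Per-element byte offsets from the base address held in RA,
    following the two pseudocode branches quoted above."""
    if mode == "elementstride":
        return [i * immed for i in range(vl)]
    if mode == "unitstride":
        return [immed + i * op_width for i in range(vl)]
    raise ValueError(f"unsupported ldstmode: {mode}")


# element stride: the immediate is the distance between elements
print(ldst_offsets("elementstride", vl=4, immed=6, op_width=2))
# unit stride: the immediate is a start offset, the step is op_width
print(ldst_offsets("unitstride", vl=4, immed=0, op_width=2))
```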

so, to match the english-language words you use with the assembler,
you wanted:

    sv.lh/ew=16/els r3,16(r4) 

which will load QTY4 16-bit contiguous elements starting at r4,
and drop them (also contiguously) into r3.

the original assembler you used:

    sv.ld/ew=16 r3,0(r4)  

will load *64-bit* quantities, TRUNCATE them to 16-bit, and drop the
TRUNCATED elements contiguously into r3.
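
the truncate-then-pack behaviour can be sketched as below. assumptions:
truncation keeps the low bits of each loaded value, packing is
LE-ordered (element 0 in the least-significant bits), and VL=4 so the
result fits in one 64-bit register. `load_truncate_pack` and the sample
data are made up for illustration.

```python
def load_truncate_pack(memory_words, elwidth=16):
    """Truncate each loaded 64-bit value to elwidth bits and pack the
    results contiguously, element 0 in the least-significant bits."""
    mask = (1 << elwidth) - 1
    packed = 0
    for i, word in enumerate(memory_words):
        packed |= (word & mask) << (i * elwidth)
    return packed


# four consecutive 64-bit memory values (illustrative data only)
words = [0x1111_2222_3333_4444, 0x5555_6666_7777_8888,
         0x9999_AAAA_BBBB_CCCC, 0xDDDD_EEEE_FFFF_0123]
print(f"{load_truncate_pack(words):#018x}")
```

only the low 16 bits of each 64-bit value survive. the upper 48 bits of
every loaded doubleword are discarded.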

(removing saturation which used to be in the LD/ST spec for 18+
 months was last week's major-scary-edit, and that is down to
there being no Scalar "ld."  Rc=1 is the only way you could
activate notification (CRfield.SO) as to whether saturation
occurred)
