[Libre-soc-isa] [Bug 1056] questions and feedback (v2) on OPF RFC ls010
bugzilla-daemon at libre-soc.org
Wed May 31 14:13:44 BST 2023
https://bugs.libre-soc.org/show_bug.cgi?id=1056
--- Comment #35 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Paul Mackerras from comment #30)
> I think you mean sv.addi/elwidth=16 5,5,0x1122 (not 5,_0_,0x1122).
ah! yes
> I'll assume the 0 for RA is a typo caused by 3.27AM.
>
> > * then inspect (verilator) GPR(5) and read its contents
> >
> > is the answer you expect, regardless of LE/BE: 0x2356?
> > or would it be
> > * 0x2211_0000_0000_1234 (or 0x1122_0000_0000_1234) *or*
> > * 0x0000_0000_0000_3456 due to addi being implicitly
> > reversed-byte-order from sv.addi under BE?
>
> I would expect 0x1122_0000_0000_1234 in BE mode, since you have operated on
> element 0 and elements are 16 bits wide.
ahhh now *that* makes it clear. and is so far left-field of what i
was modelling/expecting from the combinatorial explosion of possibilities
that i couldn't possibly guess it :)
now, here's the thing (walk through the implications). where the LE
element-access would be this:
# assume everything LE-ordered and LSB-numbered
gpr_width = 8 # bytes
num_gprs = 128 # in "upper" SV Compliancy Levels
GPR_sram = [0x00] * gpr_width * num_gprs
src_elbytes = src_elwidth // 8
for i in range(VL):
    bytenum = i * src_elbytes # element offset in SRAM bytes
    ra_element_start = RA*gpr_width # vector start position
    ra_element_start += bytenum # element offset
    ra_element_end = ra_element_start + (src_elbytes-1) # inclusive
    ra_src_operand = GPR_sram[ra_element_start : ra_element_end+1]
a BE-reversal of the underlying SRAM-access would be:
# *still* assume everything LE-ordered and LSB-numbered
gpr_width = 8 # bytes
num_gprs = 128 # in "upper" SV Compliancy Levels
GPR_sram = [0x00] * gpr_width * num_gprs
src_elbytes = src_elwidth // 8
for i in range(VL):
    offset = i * src_elbytes # element offset in SRAM bytes
    gpr_num = offset // gpr_width # relative GPR number
    bytenum = offset % gpr_width # byte-start in GPR
----> bytenum = ~bytenum & 0b1111_1111 # BE-inversion
    # now finally we know the element-offset start pos
    ra_element_start = (gpr_num * gpr_width) + bytenum
    ra_element_start += RA*gpr_width # add vector start position
    ra_element_end = ra_element_start + (src_elbytes-1) # inclusive
    ra_src_operand = GPR_sram[ra_element_start : ra_element_end+1]
at which point i think you'd agree that trying to explain that to
programmers, that this is the underlying model, would be a bit much :)
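the LE element-access model above can be made runnable roughly as follows. this is only a sketch under assumptions: the regfile is modelled as a flat little-endian bytearray, and the helper names (element_span, read_element, write_element, read_gpr) are hypothetical, not taken from the spec or the simulator.

```python
GPR_WIDTH = 8      # bytes per 64-bit GPR
NUM_GPRS = 128     # as in the "upper" SV Compliancy Levels


def element_span(ra, i, elwidth):
    """Byte range [start, end) of element i of a vector starting at GPR ra."""
    elbytes = elwidth // 8
    start = ra * GPR_WIDTH + i * elbytes
    return start, start + elbytes


def read_element(sram, ra, i, elwidth):
    start, end = element_span(ra, i, elwidth)
    return int.from_bytes(sram[start:end], "little")


def write_element(sram, ra, i, elwidth, value):
    # only the bytes of this element are touched; the rest of the GPR is kept
    start, end = element_span(ra, i, elwidth)
    mask = (1 << elwidth) - 1
    sram[start:end] = (value & mask).to_bytes(end - start, "little")


def read_gpr(sram, n):
    return int.from_bytes(sram[n * GPR_WIDTH:(n + 1) * GPR_WIDTH], "little")


sram = bytearray(GPR_WIDTH * NUM_GPRS)
write_element(sram, 5, 0, 64, 0x1234)             # GPR(5) = 0x1234
total = read_element(sram, 5, 0, 16) + 0x1122     # sv.addi/elwidth=16 5,5,0x1122
write_element(sram, 5, 0, 16, total)
print(hex(read_gpr(sram, 5)))                     # 0x2356
```

note that with elwidth=16 element 4 of a vector starting at GPR(5) lands in the first two bytes of GPR(6), which is the "contiguous SRAM" view the pseudocode relies on.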
> > now the same thing with *scalar* instructions:
> >
> > * let us set (verilator or "addi 5,0,0x1234") the contents of GPR(5) = 0x1234
> > * perform "addi 5,0,0x1122"
> > * then inspect (verilator) GPR(5) and read its contents
> >
> > is it *still* 0x2356 regardless of LE/BE?
>
> It's 0x2356 regardless of LE/BE.
and that discrepancy is a violation of (one of the) Orthogonality rule(s).
when MAXVL=VL=1 the behaviour *has* to be the same.
let us imagine that a programmer is converting Scalar Power Assembler
to SVP64. they are doing so on a BE system. assume that
GPR(5) starts out with a value 0x0000_1144_5566_7788. they do this:
# old code
addi 5,0,0x1122
addis 5,5,0x3344
# new code
setvli MAXVL=VL=1
sv.addi/elwidth=16 5,0,0x1122
sv.addis/elwidth=32 5,5,0x3344
and then they inspect the contents of GPR(5) and find that it's not
0x0000_0000_3344_1122 which you'd get from running the two scalar
instructions, it's... this may not be correct...
after the sv.addi/ew=16 0x1122_1144_5566_7788
after the sv.addis/ew=32 0x4466_1144_5566_7788
!!!!! :)
they then run that in LE and get this:
0x0000_1144_5566_7788 +
0000 0000 0000 1122 +
0000 0000 3344 0000
= 0000 1144 88aa 88aa
at which point their brains explode.
unpacking what the hell happened there (LE):
* sv.addi/ew=16 sets *two* byte-write-enable lines on GPR(5)
leaving the entire upper 6 bytes *untouched*
* sv.addis/ew=32 sets the bottom *4* byte-write-enable lines
leaving the entire upper 4 bytes untouched.
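the byte-write-enable behaviour in the LE case can be sketched like this. assumptions: one write-enable bit per GPR byte, LSB-numbered, and (following the arithmetic shown above) both adds read the existing low bits of GPR(5). the function names are hypothetical.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def byte_write_enable(elwidth, index=0):
    """8-bit mask, one bit per GPR byte, for element `index` (LE numbering)."""
    elbytes = elwidth // 8
    return ((1 << elbytes) - 1) << (index * elbytes)


def masked_write(old, new, wen):
    """Replace only the bytes of `old` whose write-enable bit is set."""
    out = old
    for byte in range(8):
        if (wen >> byte) & 1:
            lane = 0xFF << (byte * 8)
            out = (out & ~lane) | (new & lane)
    return out & MASK64


r5 = 0x0000_1144_5566_7788
# ew=16: two write-enable lines, upper 6 bytes untouched
r5 = masked_write(r5, (r5 & 0xFFFF) + 0x1122, byte_write_enable(16))
# ew=32: four write-enable lines, upper 4 bytes untouched
r5 = masked_write(r5, (r5 & 0xFFFF_FFFF) + (0x3344 << 16), byte_write_enable(32))
print(hex(r5))   # 0x114488aa88aa
```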
there is mad interaction between BE-offsets because the starting-point
for *elements within a given GPR* are critically dependent on the
operation width, and inversion of those starting-points becomes a
really crucial thing for the programmer to understand.
> If you did sv.addi/elwidth=64 5,5,0x1122 then the answer would be 0x2356
> regardless of BE/LE.
which means unfortunately that if you had a vector of elements to
add where you know the result fits in 16 bits (Audio/Video) 3/4
of the regfile is unused.
now, REMAP *can* actually handle this type of element-reordering.
in effect what you are proposing is:
* BE ew=8, element ordering:
      MSB0               MSB63
      LSB63               LSB0
GPR0    7  6  5  4  3  2  1  0
GPR1   ...........        9  8
* BE ew=16, element ordering:
      MSB0               MSB63
      LSB63               LSB0
GPR0       3     2     1     0
GPR1       7     6     5     4
GPR2   .....           9     8
* BE ew=32, element ordering:
      MSB0               MSB63
      LSB63               LSB0
GPR0             1           0
GPR1             3           2
GPR2             5           4
...
and looking at those sequences, svshape2 can handle them each:
* BE ew=8 svshape2 xdim=8, xinv=yes
* BE ew=16 svshape2 xdim=4, xinv=yes
* BE ew=32 svshape2 xdim=2, xinv=yes
and you can also specify *which operands should be so re-ordered* (!!!)
as in: if you wanted to you could set RA and RB to be BE-reordered,
but leave RT in *LE*-reordered numbering (!!!)
do you have to use different svshape2 instructions to get reordering
depending on different elwidths? yes. do i see this as a problem?
mmmm... honestly, no.
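a rough model of what those three settings have in common: the element index is reversed within each row of xdim elements, where xdim is the number of elements per 64-bit GPR. this is a hypothetical illustration of "reverse within a row", not the svshape2 specification, and both function names are made up for the example.

```python
def xdim_for(elwidth, gpr_bits=64):
    """Elements per GPR: 8 for ew=8, 4 for ew=16, 2 for ew=32."""
    return gpr_bits // elwidth


def reversed_in_row(i, xdim):
    """Map element index i to the index mirrored within its row of xdim."""
    row, col = divmod(i, xdim)
    return row * xdim + (xdim - 1 - col)


for ew in (8, 16, 32):
    xdim = xdim_for(ew)
    order = [reversed_in_row(i, xdim) for i in range(2 * xdim)]
    print(f"ew={ew:2d} xdim={xdim}: {order}")
```

applying the mapping twice gives back the original index, so the same remap serves for both reading and writing a reordered operand.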
> > checking (2) memory-to-register:
> >
> > what about the same conditions (MAXVL=VL=1, a half-word load)
> > with lhbrx vs lhx?
> >
> > * sv.lhbrx vs lhbrx, BE: same value loaded?
> > * sv.lhbrx vs lhbrx, LE: same value loaded?
>
> What are you assuming the element size is?
sigh, it used to be over-rideable up until about 2 weeks ago.
that's the way it was for 18 months.
but finally sanity asserted itself and the *data* elwidth is now
always the same as the *operation* width.
* lh ew=16
* lb ew=8
* lw ew=32
* ld ew=64
(note that there is still elwidth over-rides on the *Effective Address*
calculation for LD/ST-Indexed "RB". i am currently wading through a
really intrusive slightly scary spec update on that).
> I am not clear at this point on how the element size affects loads and
> stores. Does an element size of 16 bits mean that a load does 1/4 of the
> usual number of bits, for instance?
sv.ld/ew=16 64//16=4?
no, i decided for sanity to preserve the relationship "elwidth=opwidth".
loading only 1 bit (sv.lb/ew=8) would be a step too far i feel.
> > if the answer in all cases (m2r&r2r) is "yes", then this is what i mean
> > by "instructions must be Orthogonal regardless of Prefix/Non-prefix"
>
> I'm not sure what "yes" would mean in the addi case above.
hence i went through the example.
> In any case, I
> would note that addi will in general give a different result from
> sv.addi/elwidth=16 in LE mode as well as in BE mode. For example, suppose r5
> contains 0xffff initially.
>
> addi 5,5,1 will give 0x10000 in r5
> sv.addi/elwidth=16 5,5,1 will give 0 in r5 (assuming VL=1 and LE mode).
yes it will! moreover, if r5 contained 0xffff_ffff_ffff_ffff then it
would be 0x0000_0000_0000_0000 in r5 after addi 5,5,1 but after
sv.addi/elwidth=16 5,5,1 it would be 0xffff_ffff_ffff_0000
"sv.addi." (Rc=1) gets interesting, too. another time.
and.. drat there is no "addio" darnit. "sv.addio/ew=16" would have
dropped the 17th bit into XER.CA
that's slightly annoying but not the end of the world.
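the wraparound difference between the scalar and elwidth=16 adds can be checked with a few lines. assumptions: VL=1, LE mode, a small positive immediate (sign-extension is ignored), and the returned carry is just the bit that falls off the top of the element, not a claim about where any real instruction would put it.

```python
MASK64 = (1 << 64) - 1


def scalar_addi(ra_val, imm):
    """Plain 64-bit add, wrapping at 64 bits."""
    return (ra_val + imm) & MASK64


def sv_addi_elwidth(ra_val, imm, elwidth):
    """Add within the low `elwidth` bits only; upper bits are left untouched."""
    elmask = (1 << elwidth) - 1
    total = (ra_val & elmask) + (imm & elmask)
    carry = total >> elwidth              # e.g. the "17th bit" for elwidth=16
    new = (ra_val & ~elmask & MASK64) | (total & elmask)
    return new, carry


print(hex(scalar_addi(0xFFFF, 1)))                # 0x10000
print(sv_addi_elwidth(0xFFFF, 1, 16))             # (0, 1)
new, carry = sv_addi_elwidth(MASK64, 1, 16)
print(hex(new), carry)                            # 0xffffffffffff0000 1
```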
> I don't understand what problem these solutions are trying to solve. None of
> them seem to me to be necessary or even desirable. You keep introducing byte
> reversal, which is not ever required by my proposal.
i didn't understand it fully up to now. the "0x1122_0000_0000_1234"
finally clinched it.
> In fact, depending on how elwidth affects loads and stores, there may be
> another answer to my original concern about loading an array of values into
> registers. It's possible that doing sv.ld/elwidth=16 r3,0(r4) with VL=4 will
> load four 16-bit elements into r3 in the right order for future operations,
> but I don't know for sure.
yes. Packed Elements. very similar to MMX. wait... immediate-of-zero,
that is a special meaning (Vector LD/ST is always complex, no matter the
ISA, but retro-fitting on top of Scalar LD/ST made things especially
hairy).
https://libre-soc.org/openpower/sv/ldst/
The els bit is only relevant when RA.isvec is clear: this
indicates whether stride is unit or element:
if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride
and the relevant pseudocode:
elif svctx.ldstmode == elementstride:
    # element stride mode
    srcbase = ireg[RA]
    offs = i * immed # we want this one
elif svctx.ldstmode == unitstride:
    # unit stride mode
    srcbase = ireg[RA]
    offs = immed + (i * op_width) # we don't want this one
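to see what the two modes do to the per-element offsets, here is a small sketch built only from the two excerpts above. assumptions: op_width is in bytes, the mode names are plain strings, and the case not shown in the excerpt (els set with a zero immediate) is returned as None rather than guessed at.

```python
def ldst_mode(ra_isvec, els, immediate):
    """Mode selection as in the quoted excerpt."""
    if ra_isvec:
        return "indexed"
    if els == 0:
        return "unitstride"
    if immediate != 0:
        return "elementstride"
    return None  # not covered by the excerpt


def element_offsets(mode, immed, op_width, vl):
    """Byte offset from srcbase for each element i in range(vl)."""
    if mode == "elementstride":
        return [i * immed for i in range(vl)]
    if mode == "unitstride":
        return [immed + i * op_width for i in range(vl)]
    raise ValueError(f"mode not modelled here: {mode!r}")


print(element_offsets("unitstride", 0, 2, 4))      # [0, 2, 4, 6]
print(element_offsets("elementstride", 16, 2, 4))  # [0, 16, 32, 48]
```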
so, to match the english-language words you use with the assembler,
you wanted:
sv.lh/ew=16/els r3,16(r4)
which will load QTY4 16-bit contiguous elements starting at r4,
and drop them (also contiguously) into r3.
the original assembler you used:
sv.ld/ew=16 r3,0(r4)
will load *64-bit* quantities, TRUNCATE them to 16-bit, and drop the
TRUNCATED elements contiguously into r3.
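the difference between loading 16-bit quantities and loading 64-bit quantities that are then truncated can be shown on a toy memory. assumptions: little-endian memory, VL=4, contiguous (unit-stride) addresses in both cases, and hypothetical helper names; this only models the data movement, not the instruction encoding.

```python
def load_le(mem, addr, nbytes):
    """Read nbytes from mem at addr as a little-endian integer."""
    return int.from_bytes(mem[addr:addr + nbytes], "little")


def pack_elements(values, elwidth):
    """Truncate each value to elwidth bits and pack contiguously, element 0 lowest."""
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & ((1 << elwidth) - 1)) << (i * elwidth)
    return reg


mem = bytes(range(32))   # 0x00, 0x01, ... 0x1f
VL = 4

halfwords = [load_le(mem, 2 * i, 2) for i in range(VL)]     # 16-bit loads
doublewords = [load_le(mem, 8 * i, 8) for i in range(VL)]   # 64-bit loads

print(hex(pack_elements(halfwords, 16)))     # 0x706050403020100
print(hex(pack_elements(doublewords, 16)))   # 0x1918111009080100
```

the first result holds the first 8 bytes of memory in order; the second holds only the low 16 bits of each of four separate 64-bit words.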
(removing saturation, which had been in the LD/ST spec for 18+
months, was last week's major-scary-edit. that is down to there
being no Scalar "ld." (Rc=1), which is the only way you could
activate notification (CRfield.SO) as to whether saturation
occurred)