[Libre-soc-dev] LD/ST lacking Data-Dependent Fail-First in Indexed

Luke Kenneth Casson Leighton lkcl at lkcl.net
Sun Apr 2 14:34:21 BST 2023


0-1  2    3   4    description
00   SEA  dz  sz   simple mode
01   SEA  dz  sz   Strided (scalar only source)
10   N    dz  sz   sat mode: N=0/1 u/s
11   inv  CR-bit   Rc=1: pred-result CR sel
11   inv  zz  RC1  Rc=0: pred-result z/nonz

for LD/ST i initially thought it would be pointless to have
Data-Dependent Fail-First, because why would you load values
and check them for zero/nonzero?

embarrassingly the answer to that is clearly "because linked list
or other chain-pointer data structure" which did not occur to
me at the time.

priority was therefore given to "element strided" mode as
distinct from "LD-SPLAT" which is what would happen if
EA="Scalar-RA plus Scalar-RB" were repeatedly loaded
instead of EA="Scalar-RA plus (Scalar-RB times i)"
which is element-strided
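
to make the difference concrete, here is a minimal python sketch of the
two Effective Address calculations. the function names and the example
values are made up for illustration, they are not part of any spec or
simulator:

```python
def ea_splat(ra, rb, vl):
    """LD-SPLAT behaviour: every element uses the same EA = RA + RB."""
    return [ra + rb for _ in range(vl)]


def ea_element_strided(ra, rb, vl):
    """Element-strided: EA = RA + RB * i, so RB acts as the stride."""
    return [ra + rb * i for i in range(vl)]


# illustrative values only: base address 0x1000, RB = 8, VL = 4
print([hex(ea) for ea in ea_splat(0x1000, 8, 4)])
print([hex(ea) for ea in ea_element_strided(0x1000, 8, 4)])
```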

the way that vectorised linked-list-walking is supposed to
work is:

   RT=1 # vec - deliberately overlaps by one with RA
   RA=0 # vec - first one is valid, contains ptr
   RB=8 # scalar: how far into data structure ptr->next is
   VL=4
   for i in range(VL):
       EA = GPR(RA+i) + GPR(RB) # ptr + offset(next)
       data = MEM(EA, 8) # 64-bit address of ptr->next
       GPR(RT+i) = data  # happens to be read on next loop!
       CR.field(i) = conditions(data)
       if CR.field(i).EQ == testbit: # check if zero
           VL = i                    # update VL
           break                     # stop looping

the key here is that RT=RA+1 and the idea is simply that
the data read which is ptr->next is written into the
element that *happens* to be read on the next loop.
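
the loop above can be tried out with a small runnable python model.
this is only a sketch under assumptions: memory is a dict of
address to 64-bit value, the "test" is hardwired to "data is zero",
CR fields are not modelled, and the node addresses (0x100, 0x200,
0x300) and the helper name walk_list are invented for the example.

```python
def walk_list(mem, head, next_offset, max_vl, ra=0, rt=1, rb=8):
    """Model of the vectorised walk: returns (gpr, vl) after the loop."""
    gpr = [0] * 32
    gpr[ra] = head           # first element of RA holds the list head
    gpr[rb] = next_offset    # scalar RB holds offsetof(next)
    vl = max_vl
    for i in range(max_vl):
        ea = gpr[ra + i] + gpr[rb]   # ptr + offset(next)
        data = mem[ea]               # 64-bit ptr->next
        gpr[rt + i] = data           # with rt = ra + 1 this is read next loop
        if data == 0:                # assumed test: null pointer
            vl = i                   # truncate VL (failing element excluded)
            break
    return gpr, vl


# three nodes, "next" field at byte offset 8, last node has next = 0
mem = {0x108: 0x200, 0x208: 0x300, 0x308: 0}
gpr, vl = walk_list(mem, head=0x100, next_offset=8, max_vl=4)
print(vl, [hex(r) for r in gpr[0:4]])
```

the walk stops as soon as the null ptr->next is loaded, without ever
dereferencing it.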

unfortunately for LDST-Indexed they are all EXTRA2 and
you *can't* have RT=RA+1, only RT=RA+2, which is fine for
say doubly-linked-lists (walking simultaneously in both
directions) but not singly-linked.

a cheat would be to use LD-ST-immediate as long as the offset
is within range *or* to start with RA pre-offset, er wait no,
you need Vertical-First because the adding of the offset
needs to be done in the loop.

a second cheat... :)

... is to use a predicate mask 0b10101010 and skip every other
element, or use svshape2 with an "offset" of 1.
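
a rough python sketch of the predicate-mask variant, with RT=RA+2.
assumptions (not confirmed by anything above): bit i of the mask, counting
from the LSB, enables element i; masked-out elements are skipped and
leave their destination untouched; the test is "data is zero"; the
head pointer is placed in the first enabled element of RA. names and
addresses are invented for the example.

```python
def walk_list_masked(mem, head, next_offset, max_vl, mask, ra=0, rt=2):
    """Linked-list walk where only elements enabled in `mask` execute."""
    gpr = [0] * 32
    first = (mask & -mask).bit_length() - 1   # index of lowest set mask bit
    gpr[ra + first] = head
    vl = max_vl
    for i in range(max_vl):
        if not (mask >> i) & 1:
            continue                          # element masked out: skipped
        data = mem[gpr[ra + i] + next_offset] # load ptr->next
        gpr[rt + i] = data                    # rt = ra + 2: read at i + 2
        if data == 0:                         # assumed test: null pointer
            vl = i
            break
    return gpr, vl


mem = {0x108: 0x200, 0x208: 0x300, 0x308: 0}
gpr, vl = walk_list_masked(mem, 0x100, 8, max_vl=8, mask=0b10101010)
print(vl, [hex(gpr[r]) for r in (1, 3, 5, 7)])
```

with every other element skipped, the chain runs through registers
1, 3, 5, 7, so the RT=RA+2 restriction no longer gets in the way, at
the cost of using only half of the register range.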


the other really important aspect, for which Pred-result is
worth dropping, is VLI (VL-inclusive) and i think bit 0 can
be used for that:

0-1    2    3   4    description
00     SEA  dz  sz   Strided (scalar only source)
10     N    dz  sz   sat mode: N=0/1 u/s
VLI 1  inv  CR-bit   Rc=1: ffirst CR sel
VLI 1  inv  els RC1  Rc=0: ffirst z/nonz
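
for reference, a tiny python sketch of what VLI changes, assuming
"VL-inclusive" means the element that fails the test is included in
the truncated VL rather than excluded. the function name and the
zero test are illustrative assumptions only.

```python
def ffirst_truncate(values, vl, vli):
    """Return the new VL after a fail-first zero test over `values`."""
    for i in range(vl):
        if values[i] == 0:             # assumed test: element is zero
            return i + 1 if vli else i # VLI=1 keeps the failing element
    return vl                          # no element failed: VL unchanged


print(ffirst_truncate([5, 7, 0, 9], 4, vli=False))
print(ffirst_truncate([5, 7, 0, 9], 4, vli=True))
```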

however for *that* to work a new scheme has to be devised
for activating/detecting element-strided.

the key discerner there is that RA and RB must both
be scalar, which would normally be a VSPLAT encoding.
i think however it may be ok to also use "is source predication
disabled" as an additional detector, whereby if actual
VSPLAT is required on all elements simply set r3 (or r10 or
r30) to zero (1 instruction) and use ~r3 (or ~r10/~r30)
as the predicate for the Effective Address.

sigh no that fails as it cuts off options.

ok the next priority is to lose Saturated Mode, which
is fine other than the cost of the destination registers
needing to be larger. i can live with that.

Indexed:

0-1    2    3   4    description
els 0  SEA  dz  sz   Strided (scalar only source)
VLI 1  inv  CR-bit   Rc=1: ffirst CR sel
VLI 1  inv  els RC1  Rc=0: ffirst z/nonz

if mode = 0b01 and !RA.isvec and !RB.isvec:
   svctx.ldstmode = elementstride
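
the same detection rule as a runnable python sketch. the Reg class and
the function name are hypothetical stand-ins, and "mode" is simply
taken as the 2-bit value tested in the pseudocode above:

```python
from dataclasses import dataclass


@dataclass
class Reg:
    """Hypothetical decoded register operand: only the vector flag."""
    isvec: bool


def is_element_strided(mode, ra, rb):
    """True only when mode is 0b01 and both RA and RB are scalar."""
    return mode == 0b01 and not ra.isvec and not rb.isvec


print(is_element_strided(0b01, Reg(False), Reg(False)))  # both scalar
print(is_element_strided(0b01, Reg(True), Reg(False)))   # RA is a vector
```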

and Immediate can keep saturation, it already has element-strides
but adds VLI to DD-FFirst, losing pred-result.

0-1    2    3   4    description
00     0    zz  els  simple mode
00     1    PI  LF   post-increment and Fault-First
10     N    zz  els  sat mode: N=0/1 u/s
VLI 1  inv  CR-bit   Rc=1: ffirst CR sel
VLI 1  inv  els RC1  Rc=0: ffirst z/nonz


-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

