[Libre-soc-dev] svp64 questions: various

Sat Dec 26 13:28:18 GMT 2020

On Saturday, December 26, 2020, Alexandre Oliva <oliva at gnu.org> wrote:
>
> Tagging as vectors: is that a property of the register, or of the insn?

of the prefix which is attached to the register, which the hardware loop
must unpack.

> Consider a vector-prefixed insn:
>
>    add rD, rA, rB
>
> Can rA and rB be both used as vectors, or both as scalars, or one as
> vector and the other as scalar?

all of the above.  all eight combinations because all 3 get an EXTRA2/3
marker. 2^3=8.
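
a minimal sketch of that counting argument, in Python.  the names
OPERANDS and tagging_combinations are made up for illustration, and a
single scalar/vector flag per operand is a simplification of what the
EXTRA2/3 field actually carries:

```python
from itertools import product

# Operands of "add rD, rA, rB"; each one gets its own scalar/vector tag.
OPERANDS = ("rD", "rA", "rB")

def tagging_combinations():
    """Return every scalar/vector assignment for the three operands."""
    return [dict(zip(OPERANDS, tags))
            for tags in product(("scalar", "vector"), repeat=len(OPERANDS))]

for combo in tagging_combinations():
    print(combo)
```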

> The destination, if not vector, could imply identity, reduce, or
> first-predicate-that-holds, so that appears to be a property of the insn
> rather than of the register.
>
> But if one of the source registers is [r]0, then I presume this zero
> would be used as a scalar rather than as a vector, which would make this
> a property of the register rather than of the vector.
>
> However, I expected to find encodings (modes?) to tell whether each
> operand is scalar or vector, but aside from reduce mode, I haven't found
> any such thing.  So, where does 'isvec' come from?

EXTRA2/3 field.

>
>
> I'm a little concerned about vectors that encompass special-purpose
> registers, particularly r30 (sometimes used as PIC base register) and
> r31 (frame pointer).  These inherited register assignments seem to make
> trouble for us, but trying to avoid them is probably not reasonable nor
> worth the effort.

correct.  this requires extra silicon

plus, when the EXTRA2/3 field adds 32 or 64, it's no longer r30, it's r62
or r94, isn't it?
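
the offset arithmetic can be sketched as below.  this is an assumption-laden
simplification: extended_regnum is a hypothetical helper, and it treats the
extra bits as a plain multiple-of-32 offset on top of the 5-bit insn field.
the svp64 pseudocode is the authority on where the extra bits really go:

```python
def extended_regnum(field5, extra):
    """Hypothetical model: 5-bit insn field plus a 32 * extra offset.

    'extra' stands in for two additional bits, so the result is a 7-bit
    register number in the range 0..127.
    """
    if not 0 <= field5 < 32:
        raise ValueError("insn field must fit in 5 bits")
    if not 0 <= extra < 4:
        raise ValueError("extra must fit in 2 bits")
    return field5 + 32 * extra

# r30 in the insn field, with each possible offset applied
print([extended_regnum(30, extra) for extra in range(4)])
```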

> I haven't checked whether they're mandated by ABIs,

ABIs.  which are only obeyed in some OSes.

> or just GCC conventions, but frame pointers may be specified as an
> unreliable means to build backtraces (debug info provides more reliable
> means for that, that don't require fixed registers), and PIC base
> registers may have to be set up for calls to/from dynamic libraries to
> work, since the procedure linkage table may have to use it.
>
>
>
> Given that SVP64 expands the register files to 128 or, in the future,
> 256 registers, how does this fit in with the goal set for our
> pre-decoder to issue exclusively pre-existing ppc64 insns?  It's not
> like it would be able to address so many registers in insn fields that
> can only address 32 registers.

yes it can.  see EXTRA2/3 field in svp64 and associated pseudocode.

>  Have we given up that goal?

it was one of the very first design decisions satisfied over 2 years ago,
with what is now termed "EXTRA2/3".  in the original version a small CAM
existed which performed an arbitrary lookup.  5 bit to 7 bit lookup.

> This also becomes a concern if the register file wraps around from r127
> (or r255) to r0, since there are various dedicated low-numbered
> registers, including r0 for zero and r1 for the stack pointer.  It's not
> clear to me that it does wrap around, though.

exception thrown for int/reg, wrap allowed for CRs.
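
as a rough sketch of the difference (element_regnum is a hypothetical
helper, and the CR_FIELDS size is an assumed figure used only to make the
example runnable):

```python
INT_REGS = 128   # from the thread: the int regfile is extended to 128
CR_FIELDS = 64   # assumed size, for illustration only

def element_regnum(base, index, nregs, wrap):
    """Register number for vector element 'index' starting at 'base'.

    wrap=False models the int behaviour (running off the end is an error),
    wrap=True models the CR behaviour (numbering wraps back to 0).
    """
    regnum = base + index
    if regnum < nregs:
        return regnum
    if wrap:
        return regnum % nregs
    raise IndexError(
        f"element {index} from base {base} runs past register {nregs - 1}")

print(element_regnum(62, 3, CR_FIELDS, wrap=True))    # wraps around
print(element_regnum(126, 1, INT_REGS, wrap=False))   # still in range
```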

>  As I read about element
> width overrides, there's no mention of it and the complications it could
> bring to the sub-element indexing.

that's because by using the c-style union there isn't any.  Register
Allocation Tables convert to a bit per byte and allocate batches.

this is already done for architectures such as AMDGPU which has 32 bit regs
that can be allocated to 64 bit ops and to texturisation ops that can use
up to 12 32 bit regs.
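
the union idea can be sketched by treating the whole regfile as one flat
byte array, so that element indexing at any width is just a byte offset
and elements spill naturally into the next register.  the RegFile class,
its method names and the little-endian byte order are assumptions made
for this illustration, not taken from the spec:

```python
class RegFile:
    """Flat byte-array view of a file of 64-bit registers."""

    REG_BYTES = 8

    def __init__(self, nregs=128):
        self.mem = bytearray(nregs * self.REG_BYTES)

    def _offset(self, base, index, width):
        off = base * self.REG_BYTES + index * width
        if off + width > len(self.mem):
            raise IndexError("element runs past the end of the regfile")
        return off

    def write(self, base, index, width, value):
        """Store element 'index' of 'width' bytes, counting from reg 'base'."""
        off = self._offset(base, index, width)
        mask = (1 << (8 * width)) - 1
        self.mem[off:off + width] = (value & mask).to_bytes(width, "little")

    def read(self, base, index, width):
        """Load element 'index' of 'width' bytes, counting from reg 'base'."""
        off = self._offset(base, index, width)
        return int.from_bytes(self.mem[off:off + width], "little")

rf = RegFile()
# five 16-bit elements starting at r4: four fill r4, the fifth lands in r5
for i, v in enumerate([0x1111, 0x2222, 0x3333, 0x4444, 0x5555]):
    rf.write(4, i, 2, v)
print(hex(rf.read(4, 0, 8)), hex(rf.read(5, 0, 8)))
```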

>  OTOH, not wrapping around seems to
> make the higher-numbered registers far less likely to be used.

RATs.  you really don't want to overwrite the r0 etc etc used by ABIs.

MAXVL sets the hard limit.  you know *exactly* how many are going to be
used.
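
a small sketch of why that count is known up front.  it assumes 64-bit
registers and elements packed back to back; regs_reserved is a
hypothetical helper name:

```python
def regs_reserved(base, maxvl, elwidth_bits=64):
    """Range of registers a vector of MAXVL elements can ever touch.

    Assumes 64-bit registers with elements packed contiguously, so the
    count is ceil(maxvl * elwidth_bits / 64).
    """
    nregs = -(-(maxvl * elwidth_bits) // 64)   # ceiling division
    return range(base, base + nregs)

print(list(regs_reserved(8, 4)))        # four 64-bit elements from r8
print(list(regs_reserved(8, 5, 16)))    # five 16-bit elements from r8
```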

>
> It seems to me that, when using twin predication, one of the predicate
> vectors may run out before the other, and it would be useful to be able
> to tell how far they got.

popcount.  does the job.
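
for example, under the assumption that each of the two predicates steps
independently over its own set bits and the operation stops when either
runs out, the number of elements actually transferred is the smaller of
the two popcounts.  the helper names here are illustrative only:

```python
def popcount(mask):
    """Number of set bits in a non-negative integer."""
    return bin(mask).count("1")

def elements_transferred(src_mask, dst_mask, vl):
    """Elements moved by a twin-predicated op, under the stated assumption."""
    vl_mask = (1 << vl) - 1          # only the first VL predicate bits count
    return min(popcount(src_mask & vl_mask), popcount(dst_mask & vl_mask))

print(elements_transferred(0b1011, 0b0110, vl=4))
```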

>
> I saw email, but I don't see much about extending the CR register file,

mostly because it is an unfamiliar concept.  it took 18 months to come up
with SV on RV, there hasn't been as much thought gone into CRs yet (and it
has to be done a lot quicker)

> or of prefix bits to reference the extended registers.

EXTRA2/3.

> Anyway, there
> are plenty of opcodes that set specific CRs as side effects, and it's
> not clear whether those CRs are treated as scalar or vector registers.

in the svp64 section on CRs.

> Assuming CR vectors are indeed available, it seems to me that it would
> be useful to be able to "compress" CR vectors into predicate registers,
> i.e., selecting the relevant compare result bit from each CR and placing
> it in the corresponding bit in a scalar register, to eventually be used
> as a predicate.  There doesn't seem to be any way to do this, is there?

https://libre-soc.org/openpower/sv/predication/
https://libre-soc.org/openpower/sv/cr_int_predication/

the mask mode allows selection of an int *or* a CR Vector as predicate.

this stops us from having to add a bunch of predication style operations on
CR Vectors such as "CR popcount" etc.
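
to make the equivalence concrete, here is a sketch of what "compressing"
a CR Vector into an integer mask would compute, i.e. the thing that using
the CR Vector directly as a predicate makes unnecessary.  the CR field
layout (LT, GT, EQ, SO from most to least significant bit of the 4-bit
field) follows the Power ISA; the function name and the choice that
element i maps to bit i of the mask are assumptions:

```python
# Bit positions inside a 4-bit CR field, counted from the LSB.
CR_BIT = {"lt": 3, "gt": 2, "eq": 1, "so": 0}

def cr_vector_to_mask(cr_fields, bit):
    """Pick one bit out of each CR field and pack them into an int mask."""
    shift = CR_BIT[bit]
    mask = 0
    for i, cr in enumerate(cr_fields):
        mask |= ((cr >> shift) & 1) << i   # element i -> mask bit i (assumed)
    return mask

# three compare results: EQ, LT, EQ
print(bin(cr_vector_to_mask([0b0010, 0b1000, 0b0010], "eq")))
```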

> If so, here's an idea: have a mode that modifies the predicate register,

there are two modes allowed, one is scalar int (64 bits) one is CRs.

see the two CR tables in svp64 for which of those is allowed.  r3, r10, r30
i think.  and for CRs, this is still to be evaluated, because as you
noticed, CR0 and CR1 are taken and implicit, but what about Vectorised
implicit Rc=1 operations?

> zeroing those bits that, in pred-result mode, either get the store
> canceled (condition not met), or those that perform the store (operation
> already performed on this element).  With this, cr logical ops can be
> used to transfer select bits in CR vectors to predicate registers, that
> can then be further operated on with bitwise opcodes.

a much simpler way is just to let Vectors of CRs *be* predicates, add
Vectorisation to CR ops (because we can) then you just do CR ops and you're
done.

with pred-result mode applying *also to CR ops* and predication of CR ops
also being permitted there is an explosion of possibilities and flexibility
here which i am not going to go into fully, it will take too long, and i
feel it better to let it "emerge" over time and, if there are limitations,
address them during an evaluation / assessment phase.

now, here i would very much welcome some thought on how that would work,
the implications etc.  it is an area that is completely new, and needs
documenting: what kinds of tricks could a compiler play when there is
effectively a free bitwise AND of the CRs?  (predicate = CR, on top of a CR
op, the predicate bit test is effectively a free AND)
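
a very rough model of that "free AND", with one bit per element held in
a plain integer.  everything here is an assumption for the sake of
illustration: the function name, the zeroing flag, and the choice to
leave masked-out destination bits unchanged when not zeroing:

```python
import operator

def predicated_cr_op(op, pred, a, b, dest_old, zeroing=False):
    """Apply a bitwise CR op only where the predicate bit is set.

    With zeroing, masked-out elements become 0, so the result is exactly
    pred AND (a op b): the predicate test acts as an extra AND.
    Without zeroing, masked-out elements keep their old destination value.
    """
    result = pred & op(a, b)
    if zeroing:
        return result
    return (dest_old & ~pred) | result

# a "crand"-style op under a predicate behaves like a three-input AND
print(bin(predicated_cr_op(operator.and_, 0b1100, 0b1010, 0b1110, 0,
                           zeroing=True)))
```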

that said, my intuition tells me that this is going to be extremely
powerful and result in compact and elegant assembler.  the CR testing is
dead simple logic, so few gates, it's definitely worth it.  hence the mode
i would like to see here is, "proceed with it into actual implementation
and wait for behaviour to emerge" rather than "delay everything until it is
understood 100%" or, worse, "rip its head off because clearly it's stupid
to put something not tested and understood into an ISA".

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68