[Libre-soc-isa] [Bug 560] big-endian little-endian SV regfile layout idea

Thu Dec 31 05:14:49 GMT 2020

https://bugs.libre-soc.org/show_bug.cgi?id=560

--- Comment #28 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #23)
> (In reply to Luke Kenneth Casson Leighton from comment #20)
> > (In reply to Jacob Lifshay from comment #15)
> > 
> > > byte order *is* significant in registers precisely because we can treat them
> > > as an indexed vector of bytes by using vector u8 instructions. having that
> > > vector of bytes match the vector of bytes in memory is important for
> > > performance and consistency, since otherwise we will have to insert tons of
> > > byte-swap instructions for memory-order bitcasting that would otherwise be
> > > totally unneeded.
> > 
> > no, you just use either ld-reverse or not.  the ld and st operation takes
> > care of the bytereversing, that's why it was added to OpenPOWER.
> 
> no, you don't. bitcasting is a register reinterpret operation (register to
> register), using load/store operations to implement bitcasting is slow and
> wasteful (unless you needed to load/store anyway).

ah i was confused by the mention of "memory", i thought you were exclusively
referring to memory-to-register transfers.

> on LLVM, bitcasting usually compiles to no instructions, or rarely a
> register to register move instruction.

here, is there any reason why bitmanip would be insufficient?

also: we need use-cases to justify the time drain spent doing a comprehensive
evaluation.

> > however if you override elwidth=32 then *even when VL=1* the top 32 bits
> > WILL NOT be overwritten.
> > 
> > why?
> > 
> > because elwidth=32 is a SPECIFIC and direct command to the hardware to set
> > the underlying regfile SRAM write-enable lines  to 0b00001111
> 
> Well, I interpret elwidth=32 as a specific command to mean we operate on
> 32-bit values, so scalars are truncated/sign/zero-extended to/from 32-bits
> when reading/writing registers.
> 
> scalar registers are a *totally different kind* of argument, they are *not
> vectors*, 

they are: look at the pseudocode.  they're "degenerate vectors of length equal
to one".

i think what you might be imagining to be the case is, "if VL==1 && SUBVL==1
then SV is disabled, and a different codepath followed that goes exclusively to
a scalar-only OpenPOWER v3.0B compliant codepath"

this categorically and fundamentally is NOT the case.

i comprehensively analysed and rejected this as far back as... 2018? during the
first few weeks/months of developing SV, and documented it, implemented it in
spike-sv and added unit tests that implemented the behaviour described in the
previous comment.

the behaviour is:

    if VL==1 &&
       SUBVL==1 &&
       ELWIDTH==default && 
       predication==all1s && 
       all_other_features()==default
    then
       behaviour is identical to that
       of scalar but only because
       all for-loops are length 1 and
       all augmentations are "off".

there *is* no separate scalar code-path.  not in the hardware, and not in the
simulator.

there *is* only the for-loops, exactly as outlined in the pseudocode.
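
as a rough illustration only (this is not the spike-sv code or the spec
pseudocode): a minimal python sketch of "one loop, no scalar path".  the
names sv_add, read_el and write_el are hypothetical, and the little-endian
byte layout and 64-bit default elwidth are assumptions made purely so that
the example runs.

```python
# hypothetical model: the regfile is a flat array of bytes, 8 bytes per register.
# little-endian element layout is assumed here purely for illustration.
REG_BYTES = 8

def read_el(regfile, reg, index, elbytes):
    """read element `index` (elbytes wide) starting at register `reg`."""
    off = reg * REG_BYTES + index * elbytes
    return int.from_bytes(regfile[off:off + elbytes], "little")

def write_el(regfile, reg, index, elbytes, value):
    """write only the bytes belonging to element `index`; others are untouched."""
    off = reg * REG_BYTES + index * elbytes
    regfile[off:off + elbytes] = value.to_bytes(elbytes, "little")

def sv_add(regfile, rt, ra, rb, vl=1, elwidth=64, pred=None):
    """element-wise add.  there is only the loop: VL=1 just runs it once."""
    elbytes = elwidth // 8
    mask = (1 << elwidth) - 1
    if pred is None:
        pred = (1 << vl) - 1              # default predication: all 1s
    for i in range(vl):
        if not (pred >> i) & 1:
            continue                      # masked-out element: skipped
        a = read_el(regfile, ra, i, elbytes)
        b = read_el(regfile, rb, i, elbytes)
        write_el(regfile, rt, i, elbytes, (a + b) & mask)
```

with all arguments left at their defaults the loop executes once on a full
64-bit element, which is indistinguishable from a scalar add.  with
elwidth=32 and VL still 1, the same loop writes only the low 4 bytes of the
destination.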

this is really fundamental and important to grasp

> so therefore are treated like is usual in the scalar instruction
> set -- registers are treated as a single value.

again: look at the pseudocode, and the riscv-isa-tests (sv branch)

over 18 months ago i implemented SV exactly as in the pseudocode.

it's always been this way: that scalars are "degenerate VL=1".

i also mention it in the notes, and the spec: SV is never really "off", it's
just that the default settings *look* like the original unaugmented scalar ISA.

that's what the "scalar identity behaviour" is.  it's "the settings which make
SV one-for-one equivalent to scalar behaviour".

one change - just one - and that no longer holds (and the result is pretty
disastrous, see below)

to implement what you believe to be the case (which hasn't ever been the case,
not in the entire development of SV) actually requires special violation of
that behaviour! it will need a special exemption that will actually increase
gate count to implement!

more than that, it actually prevents and prohibits desired and desirable
behaviour.

what you are effectively expecting is that if VL=1, elwidth is ignored (treated
as default, regardless of its actual value) because, well, VL=1 and that's
scalar, right?

what if during some algorithm VL happens to be set to 1?   a loop counter
happens to become set to 1?

this will result in elwidth being ignored, and catastrophic data corruption
will occur, because what should have been an 8 bit operation (had VL been 2)
now becomes a 64 bit operation just because VL=1?
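
a small python sketch of that failure mode, under stated assumptions: the
function name and the ignore_elwidth_when_vl1 flag are hypothetical, the
flag models the "VL=1 means scalar" rule being argued against, and elements
are assumed packed from the least significant byte upwards.

```python
MASK64 = (1 << 64) - 1

def add_u8_elements(reg, addend, vl, ignore_elwidth_when_vl1=False):
    """add `addend` to the first `vl` 8-bit elements packed in a 64-bit value."""
    if vl == 1 and ignore_elwidth_when_vl1:
        # hypothetical rule: VL=1 is "scalar", so elwidth=8 is silently dropped
        return (reg + addend) & MASK64
    for i in range(vl):
        shift = 8 * i
        byte = (reg >> shift) & 0xFF
        byte = (byte + addend) & 0xFF     # wraps within the 8-bit element
        reg = (reg & ~(0xFF << shift)) | (byte << shift)
    return reg

packed = 0x01FF                           # element 0 = 0xFF, element 1 = 0x01
ok = add_u8_elements(packed, 1, vl=1)
bad = add_u8_elements(packed, 1, vl=1, ignore_elwidth_when_vl1=True)
```

here ok is 0x0100 (element 0 wraps, element 1 untouched) whereas bad is
0x0200: the carry has leaked into the neighbouring element purely because
the loop count happened to be 1.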

to avoid that case, it would *specifically* require a test, inside the inner
loop (worst possible place), looking for the case where VL=1 and deploying
SIMD-style cleanup of the kind described below.

what if it is specifically desired to modify only the first byte of a 64 bit
register? (including for the loop when RA happens to be 1 on a setvl call)

with the current behaviour, this is dead simple: set VL=1, set elwidth=8

done.
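
to illustrate (a sketch only, not how the hardware is written): the elwidth
override can be thought of as selecting byte-lane write-enable bits, as with
the 0b00001111 example for elwidth=32 quoted above.  write_enable and
masked_write are hypothetical helper names, and lane 0 is assumed to be the
least significant byte.

```python
def write_enable(elwidth, index=0):
    """byte-lane write-enable bits for element `index` of a 64-bit register."""
    elbytes = elwidth // 8
    return ((1 << elbytes) - 1) << (index * elbytes)

def masked_write(old, new, wen):
    """update only the byte lanes of `old` whose write-enable bit is set."""
    result = old
    for lane in range(8):
        if (wen >> lane) & 1:
            lane_mask = 0xFF << (8 * lane)
            result = (result & ~lane_mask) | (new & lane_mask)
    return result

# VL=1, elwidth=8: only byte lane 0 is written
reg = masked_write(0x1122334455667788, 0xAB, write_enable(8))
```

the other seven bytes never need to be read, masked or merged by software:
they are simply not written.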

however with the change that you propose / expect, the following (expensive,
intrusive) tricks must be played:

* read-modify-write with a mask
 - copy the entire 64 bit register
 - use rlwimi to insert the byte
 - write the entire 64 bit register

 compared to "set elwidth=8" this is staggeringly expensive
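
for comparison, the three steps sketched in python.  this is an assumption-
laden illustration: insert_byte_rmw is a hypothetical name, the regfile is
modelled as a plain dict, and the and/or merge is only a loose stand-in for
what a rotate-and-mask-insert instruction would do.

```python
MASK64 = (1 << 64) - 1

def insert_byte_rmw(regfile, reg, value):
    """replace the low byte of regfile[reg] the long way round."""
    old = regfile[reg]                                # 1. read all 64 bits
    merged = (old & ~0xFF & MASK64) | (value & 0xFF)  # 2. mask and insert
    regfile[reg] = merged                             # 3. write all 64 bits
    return merged

regs = {3: 0x1122334455667788}
insert_byte_rmw(regs, 3, 0xAB)
```

the end result is the same as a single elwidth=8 write, but it costs a full
register read, an extra operation and a full register write to get there.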

* use predication
  - push any current predication source
    on the stack or into a temp reg
  - set the predicate to 0b01
  - set VL=2, elwidth=8
  - perform the operation
  - restore the old predicate

 this is even more wasteful, because it is 100% guaranteed that the 2nd
Function Unit will be allocated, even if only for a short duration, whilst the
predicate register is being read.  a few cycles later the 2nd bit is discovered
to be zero and Shadow Cancellation kicks in, but for those few cycles that FU
is out of commission.

* use swizzle

  .... i won't go into this one because again it should be clear by now the
reasons why i rejected the idea of considering "VL!=1" to be the exclusive,
sole guiding factor in enabling SV.

bottom line: elwidth overrides have merit and purpose on their own, regardless
of what the value of VL or SUBVL is.

> vectors (scalar with subvl!=1 counts as a kind of fixed-length vector in my
> mind) treat registers as an array of elements, not as a single value.

yes, as a sub-sub-PC, apart from the difference for predication, setting VL=1,
SUBVL=3 is near-as-damnit the same as VL=3,SUBVL=1.  it's not but you know what
i mean.

consequently yes, VL=1,SUBVL>1 "makes sense" as "being a vector".

however this is very misleading.

fundamentally you need to make an adaptation to "SV is *nevvvverrrr* switched
off".  no feature of SV - not VL, not SUBVL, not ELWIDTH, not predication, is
truly "off".

they are all independent, and they all have default values.

* predication: all 1s
* elwidth: default-for-instruction
* VL: 1
* SUBVL: 1
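
as a sketch of what "independent, with defaults" means (the SVState class
and its field names are hypothetical, not taken from the spec or the
simulator, and the "all other features" term from the earlier pseudocode is
left out for brevity):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SVState:
    """every SV feature is always present; each one just has a default."""
    vl: int = 1
    subvl: int = 1
    elwidth: str = "default"      # i.e. default-for-instruction
    predicate: str = "all1s"

    def is_scalar_identity(self):
        # scalar-equivalent only when *every* field is at its default
        return self == SVState()

defaults = SVState()
narrow = replace(defaults, elwidth="8")   # VL is still 1, elwidth still applies
```

defaults.is_scalar_identity() holds, but narrow.is_scalar_identity() does
not, even though VL and SUBVL are both still 1: changing any one setting is
enough to leave the scalar identity behaviour.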
