[Libre-soc-isa] [Bug 560] big-endian little-endian SV regfile layout idea

Thu Dec 31 16:35:39 GMT 2020

https://bugs.libre-soc.org/show_bug.cgi?id=560

--- Comment #32 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
alexandre i haven't forgotten you: i want to follow this thread first, i'll come
back to what you wrote after.

(In reply to Jacob Lifshay from comment #30)

> bitmanip can be done, it's just more instructions that wouldn't be needed if
> endian was consistent between registers/memory.

i get it: i just don't like it (due to how intrusive a change it is on the
codebase), and my intuition is also ringing some alarm bells when it comes to
the width-extension.

i have a sneaking suspicion that the width-extension would end up jamming the
bytes into completely the wrong end (as if they had been shifted up).

that the "simple looking" perfectly reasonable assumption that memorder equals
regSRAMorder is far from simple: it holds only when the bitwidth is 64, and
screws 32-bit and 16.  or if you declare it as 32 bit, it screws 64 and 16.
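
One possible reading of that concern, as a minimal Python sketch. The helper element0 and the sample value are made up for illustration, and this models only byte ordering, not the actual regfile SRAM. It assumes "element 0 at a given width" means "the first elwidth/8 bytes of the register's byte array":

```python
def element0(reg_bytes: bytes, elwidth_bits: int, byteorder: str) -> int:
    """Interpret the first elwidth_bits of a register's byte array as element 0."""
    nbytes = elwidth_bits // 8
    return int.from_bytes(reg_bytes[:nbytes], byteorder)

value = 0x0000_0000_1122_3344  # a small value held in a 64-bit register

le_bytes = value.to_bytes(8, "little")  # little-endian register layout
be_bytes = value.to_bytes(8, "big")     # layout made to match big-endian memory at 64 bit

# little-endian layout: element 0 is the low part of the value at every width
le32 = element0(le_bytes, 32, "little")  # 0x11223344
le16 = element0(le_bytes, 16, "little")  # 0x3344

# big-endian layout chosen for 64 bit: the first bytes are the *top* of the value,
# so narrower elements land at the wrong end
be32 = element0(be_bytes, 32, "big")     # 0x00000000
be16 = element0(be_bytes, 16, "big")     # 0x0000
```

Under these assumptions the layout that is consistent at 64 bit gives the top word (or top halfword) for narrower widths, which is the "shifted up" effect described above.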

a case could be made that "ok ok you analyse that case and get it right in
hardware" however even just having to consider that and go through it
comprehensively *we do not have time*.

my estimates are that this exercise will add somewhere around 6 to 8 weeks onto
our already pressurised timescales.

i would far rather that this be declared "a future problem solvable with a new
MSR bit" and leave it at that for now.

> > 
> > also: we need use-cases to justify the time drain spent doing a
> > comprehensive evaluation.
> 
> yup, I know there are some, I'll look for concrete examples later when I'm
> not braindead

:)

they will be important to give us an indication of how much a priority this
really is.

> > i think what you might be imagining to be the case is, "if VL==1 && SUBVL==1
> > then SV is disabled, and a different codepath followed that goes exclusively
> > to a scalar-only OpenPOWER v3.0B compliant codepath"
> > 
> > this categorically and fundamentally is NOT the case.
> 
> yup, totally agreed.

wheww, because it would be a bit of a disaster :)

> what I meant is that if you have a SVP64 instruction with scalar arguments:
> 
> add r10.v, r3.s, r20.v, subvl=1, elwidth=32, mask=r30

ah.  right.  ok so i had to do an update to the overview page about this,
because we made the change that both src and dest can have different elwidths.

so ahh unfortunately the example you give is ambiguous because i do not know if
you meant that src *and* dest are 32 bit, or just dest (because you forgot to
add elwidth_src=something)

let me see if i can work it out / deduce it...

> for r3 (but not r10 or r20) it reads the full register,

ah.  "reads r3 at full width" this means "elwidth_src=default" was missing from
the prefix:

   add r10.v, r3.s, r20.v, subvl=1, elwidth=32, mask=r30, elwidth_src=default

r20 on the other hand being 32 bit... deep breath...., ***No***.  that is
precisely and exactly what i described in the previous comment, taking about an
hour to do so, explaining why it is dangerous.

please understand: this is ABSOLUTELY FUNDAMENTALLY CRITICAL to treat scalars
as not being scalars at all but as being "vectors of length 1"
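
A simplified Python model of that rule, offered as a sketch only. The names get_el, set_el and sv_add are hypothetical and this is not the Libre-SOC simulator. It assumes src and dest share one elwidth, that element 0 sits in the low-order bits of a 64-bit register, and that the predicate is a plain bitmask:

```python
def get_el(regs, base, idx, elwidth):
    """Read element idx of the vector starting at register base."""
    per_reg = 64 // elwidth
    reg = regs[base + idx // per_reg]
    return (reg >> ((idx % per_reg) * elwidth)) & ((1 << elwidth) - 1)

def set_el(regs, base, idx, elwidth, value):
    """Write element idx, leaving every other element of the register alone."""
    per_reg = 64 // elwidth
    mask = (1 << elwidth) - 1
    shift = (idx % per_reg) * elwidth
    r = base + idx // per_reg
    regs[r] = (regs[r] & ~(mask << shift)) | ((value & mask) << shift)

def sv_add(regs, rt, ra, rb, vl, elwidth, predicate, ra_scalar=False):
    """Element loop: a scalar source is a vector of length 1, so it is
    always read as element 0 at the same elwidth as everything else."""
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue  # masked-out element: destination untouched
        a = get_el(regs, ra, 0 if ra_scalar else i, elwidth)
        b = get_el(regs, rb, i, elwidth)
        set_el(regs, rt, i, elwidth, a + b)

regs = [0] * 32
regs[3] = 0xFFFFFFFF_00000005   # r3.s: only the low 32 bits are element 0
regs[20] = (2 << 32) | 1        # r20.v elements 0 and 1
regs[21] = 3                    # r20.v element 2
sv_add(regs, rt=10, ra=3, rb=20, vl=3, elwidth=32, predicate=0b111, ra_scalar=True)
# regs[10] == (7 << 32) | 6 and regs[11] == 8
```

There is no width-dependent special case in the loop: the scalar operand goes through exactly the same element access as the vector operands, and the top word of r3 is never looked at.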

i get why it would seem to make sense, because otherwise it would mean that a
64 to 32 bit register copy is needed, with the top bits being zero in the
destination.  or that the top bits of r3.s would need to be zero'd out.

this is just how it has to be, not least because that copy-of-length-1 (into
r3.s) serves a hugely significant purpose of avoiding lane-crossing when the
vector-add is performed.

the "proper" solution is in fact to add src1 elwidth, src2 elwidth, src3 elwidth
etc etc etc. all of which was part of SV-Orig and we simply don't have the
space for it in svp64.

elwidth overrides are already a sub-par performance route.  strictly speaking
we shouldn't be allowing them at all because the lane-crossing that results has
a huge impact.

it's CISC basically.

but, Lauri made a good case for allowing src-dest elwidth overrides, to support
saturation properly.  so... it's in.

> independent of
> whatever values VL and r30 have, and then truncates the read value to
> 32-bits then does the adds.
> 
> add r3.s, r10.v, r20.v, subvl=1, elwidth=32, mask=r30

this is a declaration that the destination is 32 bit.  that means DO NOT touch
the top word of r3.  end of story.  (otherwise it has to go via the
lane-crossing path)
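
As a sketch of what "do not touch the top word" means in practice (Python, with a hypothetical helper write_element, and assuming element 0 occupies the low-order bits of the 64-bit register):

```python
MASK64 = (1 << 64) - 1

def write_element(reg: int, value: int, elwidth: int, index: int = 0) -> int:
    """Return reg with one elwidth-wide element replaced; all other bits kept."""
    mask = (1 << elwidth) - 1
    shift = index * elwidth
    return (reg & ~(mask << shift) & MASK64) | ((value & mask) << shift)

r3 = 0xDEADBEEF_00000000
# the 33-bit sum is truncated to 32 bits; no sign/zero-extension into the top word
r3 = write_element(r3, 0x1_2345_6789, 32)
# r3 == 0xDEADBEEF_23456789
```

The upper 32 bits of r3 survive the write, which is the behaviour described above for a 32-bit destination.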

> for r3 it writes the full register, independent of whatever values VL and
> r30 have (unless r30==0, then r3 is unmodified), sign/zero-extending the
> 32-bit sum into the full 64-bit value that is written to r3.

NO.  i will say it again: this is not a good idea. i will say it again: it
results in lane-crossing and that kills performance.

> this full register read/write is particularly important for f32 operations,
> where the scalar representation is in full f64 format (because OpenPower's
> weird):

right.  here we need to overload fmv and/or add an fcvt-for-mv operation,
specifically to deal with this.  and/or use a mode bit (somehow) to indicate
that the fmv is to perform fcvt.

in RISCV the fcvt operation already exists because of the difference in the
formats.

overloading the elwidths on RV was easy.

VSX *does* actually have such an operation precisely because the FP32 values
are indeed packed.

with both RV and VSX having fcvt operations it is not unreasonable to add them
to SV.
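
To illustrate why a conversion is needed rather than a plain truncating copy, here is a small Python sketch using the standard struct module. The function names are made up, and fcvt_f32_to_f64 only mimics what such an instruction might do. It compares the packed 32-bit encoding of a value with the 64-bit double encoding used for a scalar FP register:

```python
import struct

def f32_bits(x: float) -> int:
    """IEEE754 single-precision bit pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def f64_bits(x: float) -> int:
    """IEEE754 double-precision bit pattern of x."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def fcvt_f32_to_f64(bits32: int) -> int:
    """Hypothetical conversion: packed FP32 element -> 64-bit scalar format."""
    return f64_bits(struct.unpack(">f", struct.pack(">I", bits32))[0])

x = 1.5
packed = f32_bits(x)   # 0x3FC00000
scalar = f64_bits(x)   # 0x3FF8000000000000

# truncating the 64-bit scalar does not give the packed FP32 element
truncated = scalar & 0xFFFFFFFF   # 0x00000000
converted = fcvt_f32_to_f64(packed)
```

The low 32 bits of the scalar form bear no relation to the packed FP32 encoding, so moving between the two layouts is a real format conversion, not a bit-slice.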

yes it is a royal nuisance.

yes keeping the behaviour of "scalar is just a degenerate vector of length 1"
is that critical.

we *do not* want to be special-casing instructions based on what the width
happens to be.

things are already complicated enough and borderline CISC.

> basic summary: VL=1 is not special, mask with only 1 bit set is not special.
> SUBVL=1 *and* reg set to scalar is special. 

categorically, fundamentally and absolutely NO.

the problems this will cause are too great.

scalars are degenerate vectors of length 1, period.

no exceptions.  no special cases.  not ever.

if there are problems caused by that, such as OpenPOWER annoyingly storing
scalar FP32 sprinkled throughout the full 64 bits then it is dealt with by
adding an fcvt instruction.

*not* by violating the rules of SV by adding special cases.

the reason is down to the fact that elwidths is already complex enough, causing
lane-crossing that will dramatically slow down operations.

far from being a "disadvantage" that fcvt operation will actually speed up
execution by aligning the elwidths of all operands.

no special cases.  aside from anything else it will take weeks to go through
all the documentation and update it all.

and you know the answer on that one: we don't have time.

we need to move to implementing in the next few days, maximum of 10-14 days
further delay and even that is too long.
