[Libre-soc-dev] svp64
Jacob Lifshay <programmerjake at gmail.com>
Sun Dec 20 06:43:59 GMT 2020

On Sat, Dec 19, 2020, 06:01 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:
> On Friday, December 18, 2020, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> > On Fri, Dec 18, 2020, 15:35 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >> * still do not know what the best arrangement for CRs is.
> >>
> >
> > I'm for the arrangement that mirrors the register layout I picked for
> > FP/Int registers.
>
>
> CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
> so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.
>
>
> so when vectorisation is enabled CR[2] and onwards are destroyed.
Nope. CR[2] is only destroyed when VL > 8.
CR registers in VL order (element numbers assuming write starts at CR[1]):
SVCR1_000 aka. CR[1] -- used for element 0
SVCR1_001 no corresponding CR[...] -- used for element 1
SVCR1_010 no corresponding CR[...] -- used for element 2
SVCR1_011 no corresponding CR[...] -- used for element 3
SVCR1_100 no corresponding CR[...] -- used for element 4
SVCR1_101 no corresponding CR[...] -- used for element 5
SVCR1_110 no corresponding CR[...] -- used for element 6
SVCR1_111 no corresponding CR[...] -- used for element 7
SVCR2_000 aka. CR[2] -- used for element 8
SVCR2_001 no corresponding CR[...] -- used for element 9
SVCR2_010 no corresponding CR[...] -- used for element 10
SVCR2_011 no corresponding CR[...] -- used for element 11
SVCR2_100 no corresponding CR[...] -- used for element 12
SVCR2_101 no corresponding CR[...] -- used for element 13
SVCR2_110 no corresponding CR[...] -- used for element 14
SVCR2_111 no corresponding CR[...] -- used for element 15
SVCR3_000 aka. CR[3] -- used for element 16
Note that I'm thinking Rc=1 vector instructions should always start at
CR[6] aka. SVCR6_000 instead of CR[0] or CR[1], since that will match where
the mask starts to be read from when using CRs as mask registers.
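
The listing above can be sketched as a small naming function. This is only an illustration, assuming 3 sub-field bits per CR field (as the `_000` to `_111` suffixes suggest); the names `svcr_name` and `cr_alias` are hypothetical and not part of any spec.

```python
def svcr_name(element, start_field=1):
    """Return the SVCR name that holds the result for a vector element.

    Assumption: 8 sub-fields (3 lsb bits) per OpenPower CR field, and
    elements advance through the sub-fields before moving to the next field.
    """
    field = start_field + element // 8
    sub = element % 8
    return f"SVCR{field}_{sub:03b}"


def cr_alias(element, start_field=1):
    """Return the OpenPower CR[i] alias, or None if there is no such alias."""
    field, sub = divmod(element, 8)
    return f"CR[{start_field + field}]" if sub == 0 else None


if __name__ == "__main__":
    # Rebuild a few rows of the listing (write starts at CR[1]).
    for element in (0, 1, 7, 8, 16):
        print(element, svcr_name(element), cr_alias(element))
```

With `start_field=6`, the first 16 elements land in SVCR6_000 through SVCR7_111, so under these assumptions a vector Rc=1 op with VL <= 16 would not touch CR[2].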
> this
> means that every vector operation requires callee saving of CRs.
That's precisely why I'm advocating starting at CR[6] which means functions
using vector ops won't have to save/restore CR in the prologue/epilogue.
>
> it would be much more sensible to start from say CR[8] for INT operations
> and say from CR[32] for FP (debatable).
>
> deliberately in increments of 8 so that the hardware is kept simple for the
> DMs.
>
> the concept of compatibility with a SIMD system designed in 1998 needs to
> be expunged :)
>
> however when a given reg result is marked as scalar we need to have
> compatibility with v.3.0B/1B so that an extra mv is not required plus there
> are no "surprises".
>
> in other words the exact same algorithm for reg naming that you came up
> with 18 months ago.
>
> i'm going to remove the new naming and replace it with the simple concept,
> "regs are extended linearly". CR0.. CR63, r0..r127
The hw implementation of what I proposed is utterly simple, just add 2/3
lsb bits to all reg fields everywhere (including non-SV, they just are
zeros for non-SVP64 instructions).
If you're going to replace the naming scheme with a flat list of integers,
please at least *don't* use the same naming scheme as OpenPower, since the
register numbers don't match due to inserting lsb bits.
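
To make the numbering mismatch concrete, here is a rough sketch of the "append lsb bits" idea. It assumes 2 extra bits for int/FP registers and 3 for CR fields (my reading of "2/3 lsb bits"); `extend_reg_field` is a hypothetical helper, not an existing function.

```python
def extend_reg_field(field, extra=0, extra_bits=2):
    """Append `extra_bits` lsb bits to an OpenPower register field.

    Non-SVP64 instructions are assumed to supply extra == 0, so a scalar
    OpenPower register number N maps to flat index N << extra_bits.
    """
    if not 0 <= extra < (1 << extra_bits):
        raise ValueError("extra does not fit in extra_bits")
    return (field << extra_bits) | extra


if __name__ == "__main__":
    # OpenPower r3 is flat register 12 here, not 3.
    print(extend_reg_field(3))
    # OpenPower CR[1] (SVCR1_000) is flat CR index 8 with 3 extra bits.
    print(extend_reg_field(1, extra_bits=3))
```

Under these assumptions a flat name such as "r3" would be ambiguous between OpenPower r3 (flat index 12) and flat index 3, which is the clash described above.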
>
> this is understandable.
>
> in 5 years when we have time and funding extending to 256 regs can be
> investigated.
>
If we spend a little effort planning ahead, we can avoid a lot of the usual
SIMD trouble, where every future expansion requires a whole new ISA. We're
partially inheriting that problem by having compiler-allocated 64-bit backing
registers for vectors instead of RVV-style expand-as-big-as-you-please giant
registers. The scheme I proposed is designed to handle expanding the
register file to as big as we please (limited to powers of 2, of course) by
interleaving more registers between the existing registers. It can also
handle backward compatibility to both OpenPower v3.1 as well as versions of
itself with a smaller register file by having setvli's extra bits switch
the cpu to a mode where it skips the new registers when vectorizing
instructions (basically changing the register number increment to 2^n
instead of 1).
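
A minimal sketch of that compatibility mode, assuming `skip_bits` is the number n of newly interleaved lsb bits that an older program should step over; the function name and parameter are hypothetical.

```python
def vector_element_regs(base, vl, skip_bits=0):
    """List the flat register numbers touched by a vector op of length vl.

    With skip_bits == 0 the register number increments by 1 per element.
    With skip_bits == n the increment is 2**n, so registers interleaved by
    a later, larger register file are skipped.
    """
    step = 1 << skip_bits
    return [base + i * step for i in range(vl)]


if __name__ == "__main__":
    print(vector_element_regs(8, 4))               # native mode
    print(vector_element_regs(8, 4, skip_bits=1))  # skip one interleaved bit
```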
All we need is a plan and some encoding space in setvl[i] (and sprs), no
extra hardware required for the initial implementation (ignoring
setvl[i]/sprs encoding).
Jacob
    
    
More information about the Libre-soc-dev
mailing list