[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Thu Jan 17 06:12:12 GMT 2019

On Thu, Jan 10, 2019 at 12:34 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Thu, Jan 10, 2019 at 1:04 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > I think that adding 16-bit instruction prefixes will be useful to encode
> > the high bits of the register numbers and extra bits for stuff like
> > selecting vectorization settings since those will change rapidly enough
> > that constantly writing to the rename table csrs may use more instruction
> > bandwidth.
>
>  darn it, i was hoping that wouldn't happen.
>
>  an alternative is that RVV has a way to set multiple settings at once,
>  using a pattern.  however SV is a bit more complicated.
>
>  another alternative is to have not just one set of CSR settings but
>  multiple of them, and allow bank-switching.
>
>
> > The encoding I was envisioning will change depending on the underlying
> > instruction.
> >
> > One of the important parts is that a prefixed 16-bit instruction fits in
> > the 32-bit custom space, a prefixed 32-bit instruction fits in the
> > reserved 48-bit space, and a prefixed 48-bit instruction fits in the
> > 64-bit space.
> > This allows them to not conflict with other standard/custom instructions
> > allowing any instruction to be prefixed.
>
>  yes, this concept was discussed (i think) some time last year.
>  also, it means that Compressed (16-bit) instructions *also* get extended
>  to only 32-bit, whilst still keeping the prefixes.
>
>  however for extending the 16-bit C opcodes, they will need 4 extra
>  bits (per register) to extend to the full 128 regs.  we may end up using
>  the entire 48-bit opcode space, although C opcodes have fewer operands.
>
Since C opcodes are only used for compressing commonly used instructions
and we can use the full opcodes to access everything, we could have the
prefixed C opcodes only specify some of the registers. Since the prefixed
versions are going to always be vectorized, I think multiplying the
register number by 4 or 8 is a good idea, allowing us to use the extra
instruction bits to specify VL multipliers and other misc things.
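
As a rough sketch of the scaling idea (not a settled encoding): assuming a
prefixed C opcode supplies a short register field, the full register number
could be formed by multiplying that field by 4 or 8. The name `scaled_reg`
and its `scale` parameter are hypothetical and used only for illustration.

```python
NUM_REGS = 128  # size of the extended register file discussed in this thread


def scaled_reg(field: int, scale: int) -> int:
    """Map a short register field to a full register number by scaling."""
    if scale not in (4, 8):
        raise ValueError("scale must be 4 or 8")
    reg = field * scale
    if not 0 <= reg < NUM_REGS:
        raise ValueError("register number out of range")
    return reg


# A 5-bit field scaled by 4 spans the 128-entry file in steps of 4.
print([scaled_reg(f, 4) for f in (0, 1, 31)])  # [0, 4, 124]
```

The bits saved by not encoding the low register bits are what would be
left over for VL multipliers and similar settings.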

>
>  with 32-bit instructions, only 2 extra bits per register as a prefix are
>  needed (as you outline below)
>
>  oo, one idea is: on C, still use only 2 bits, and let it be the top 2
> bits.  so it's xx xxx 00 where xx is the 2-bit bank, xxx is the 3-bit
> reg num from the C instruction.
>
>  the same trick could hypothetically be applied to 32-bit, with say a
> single 0 in the bottom of the reg num.  the justification: if using
> this for vectorisation, the group of elements may be aligned on an
> even boundary (LSB=0) and for C on a "modulo 4 = 0" boundary
> (LSBs='b00)
>
> the only issue there is, how do you access the upper registers as scalars?
>
I think that accessing the upper registers as scalars will be uncommon
enough that we can just set VL to 1 and use a vector instruction.
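
To illustrate why VL=1 would be needed, here is a minimal sketch of the
quoted "xx xxx 00" layout, assuming the 3-bit field is used as a raw value
(it does not model the standard RVC mapping of 3-bit fields onto x8-x15).
The function name `c_prefixed_reg` is hypothetical.

```python
def c_prefixed_reg(bank: int, creg: int) -> int:
    """Build a 7-bit register number laid out as bank(2) : creg(3) : 00."""
    if not 0 <= bank < 4:
        raise ValueError("bank is a 2-bit value")
    if not 0 <= creg < 8:
        raise ValueError("creg is a 3-bit value")
    return (bank << 5) | (creg << 2)


# Every register reachable through this layout sits on a multiple-of-4 boundary.
reachable = sorted({c_prefixed_reg(b, r) for b in range(4) for r in range(8)})
print(len(reachable), reachable[:4], reachable[-1])  # 32 [0, 4, 8, 12] 124
```

Only 32 of the 128 registers are directly nameable this way, so anything in
between would have to be reached as element 0 of a vector op with VL set to 1,
or through a full-width prefixed instruction.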

>
> > For 32-bit underlying instructions, we can use the two lsb bits in the
> > underlying instruction that specify that the instruction is 32 bits as
> > extra bits:
> >
> > 0x00b5_0533 add x10, x10, x11
> > becomes
> > 0x00b5_0530_001f add x10, x10, x11 with 12 available bits (some of
> > which we will need to leave constant for other uses of the 48-bit
> > space).
>
>  there's a way round that, called the "isa-mux" scheme.  it's similar
> to the proposed prefix scheme except it's "hidden" ISA
> opcode-extending-bits that apply persistently rather than temporarily.
>
>  the isa-mux scheme may be used to enable / disable the 48/64 prefix
> extension scheme, which would allow us to use the entire encoding
> space.  when this bank-prefixing scheme is disabled, the underlying
> 48/64-bit opcode space becomes "standard" again.
>
We need to ensure that we won't need to use the 48/64-bit "standard"
instructions with SV for that to work. I think it will work better to have
the same encoding represent the same instruction every time, allowing us to
not need a pipeline flush each time we need the other instructions. This
will also make the compiler/debugger much simpler.
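
For reference, the add example quoted above can be reproduced with a short
sketch. The 48-bit marker 'b011111 is from figure 1.1 of the RISC-V spec;
where the 12 spare bits go (the `extra` argument) is purely my assumption
here, and `wrap32_in_48` is a hypothetical name.

```python
PREFIX_48 = 0b011111  # low 6 bits that mark a 48-bit instruction


def wrap32_in_48(insn32: int, extra: int = 0) -> int:
    """Embed a 32-bit instruction in the upper 32 bits of a 48-bit word.

    The two LSBs of insn32 (always 0b11 for a 32-bit op) are cleared, which
    together with the 10 bits above the 48-bit marker gives 12 spare bits.
    Assumed placement: extra[9:0] -> bits 15:6, extra[11:10] -> bits 17:16.
    """
    if not 0 <= insn32 < (1 << 32):
        raise ValueError("insn32 must fit in 32 bits")
    if (insn32 & 0b11) != 0b11 or (insn32 & 0b11100) == 0b11100:
        raise ValueError("not a 32-bit encoding")
    if not 0 <= extra < (1 << 12):
        raise ValueError("extra must fit in 12 bits")
    low = PREFIX_48 | ((extra & 0x3FF) << 6)
    high = (insn32 & ~0b11) | (extra >> 10)
    return (high << 16) | low


# add x10, x10, x11 with all spare bits left at zero
print(hex(wrap32_in_48(0x00B5_0533)))  # 0xb50530001f
```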

>
> > I think we should use them this way:
> > 2 for each of rs1, rs2, and rd to allow addressing 128 registers
> > 2 for specifying a vl multiplier of 1x, 2x, 3x, or 4x
> > 1 for selecting predicated/non-predicated with a fixed predicate register
> > of x9/s1 (in the range of rvc registers and not reserved for something
> > else)
> > 2 for:
> >     for 4 arg instructions like fma, 2 high bits of rs3
> >     for integer, selecting packed modes from 8-bit, 16-bit, 32-bit, and
> > 64-bit
> >     we can pick something for other instruction types
> > 1 as constant to allow other 48-bit instructions
>
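
A quick sanity check of the bit budget listed above, as a sketch only: the
field names and their bit positions are assumptions (the list gives widths,
not an order), but the widths do add up to the 12 spare bits.

```python
# (hypothetical field name, width in bits); positions are assumed, LSB first
FIELDS = [
    ("rd_hi", 2), ("rs1_hi", 2), ("rs2_hi", 2),
    ("vl_mul", 2),      # 0..3, standing for a 1x..4x multiplier
    ("predicated", 1),  # 1 = use the fixed predicate register x9/s1
    ("misc", 2),        # rs3 high bits or packed element width
    ("const", 1),       # held constant to leave room for other 48-bit ops
]


def pack(values: dict) -> int:
    """Pack the named fields into one 12-bit prefix value."""
    word, shift = 0, 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= v << shift
        shift += width
    return word


def unpack(word: int) -> dict:
    """Split a 12-bit prefix value back into its named fields."""
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return out


print(sum(width for _, width in FIELDS))  # 12
```
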
> couple of comments:
>
> * setting VL and keeping it set across a range of instructions, it's
> clear and explicit.  VL is a persistent global setting, basically.
> usually if VL is set, it's definitely going to be used for a loop.
> however... i *can* see the value of a "one-off" VL override (not in
> loops, for example).
>
> * by removing VL it actually becomes possible to consider proposing
> this as a general-purpose RISC-V extension.
>
> * the 2 bits for packed-mode being dependent on the (future) opcode:
> this is a red flag, for me (makes me nervous).  it complicates the
> decoder phase.  everything else proposed may be extracted using a few
> gates, and stored in latches that the *next* part of the instruction
> decoder may use.  i'd only be happy with this if it was a last resort.
>
> * elwidth setting for FP is quite important.  it's the only way to get
> FP16 for example, and it's the only way to have the top 32-bits of a
> 64-bit FP register not be wasted (i.e. pack in 2 FP32 values).
>
You forgot that the standard FP instructions already have a 16/32/64/128
bit selector field that we can use.
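
A small sketch of reading that selector: in the standard OP-FP encodings the
2-bit fmt field sits in bits 26:25, with 00 = S (32-bit), 01 = D (64-bit),
10 = H (16-bit) and 11 = Q (128-bit). The helper names are hypothetical, and
this does not cover FP loads/stores, which select the width differently.

```python
FMT_WIDTH = {0b00: 32, 0b01: 64, 0b10: 16, 0b11: 128}  # S, D, H, Q
OP_FP = 0b1010011  # major opcode for FP arithmetic


def fp_element_width(insn32: int) -> int:
    """Return the element width chosen by the fmt field (bits 26:25)."""
    return FMT_WIDTH[(insn32 >> 25) & 0b11]


def encode_fadd(fmt: int, rd: int, rs1: int, rs2: int, rm: int = 0b111) -> int:
    """Encode FADD; its funct5 is 00000, so funct7 is just the fmt value."""
    return (fmt << 25) | (rs2 << 20) | (rs1 << 15) | (rm << 12) | (rd << 7) | OP_FP


print(fp_element_width(encode_fadd(0b10, 10, 10, 11)))  # 16
```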

>
> i wonder if one of the bits is best used to set the "type" of
> extension.  by that i mean, if a bit is set, it indicates that
> predication is to be set.  this would allow one prefix to specify a
> predicate (in full, rather than only to use one hard-coded register).
> however, the encoding space is so extremely small (see below) that it
> may be better to use the 64-bit opcode space for specifying
> predication.
>
> also... given the extremely limited space, i wonder if it's a good
> idea to have a 2-bit prefix for rd and a 2-bit prefix for *all*
> rs1/2/3 registers?  that would allow a kind-of... bank-swapping.  a
> 2-bit prefix for *all* rd and rs1/2/3 would result in complete
> isolation of registers into any given "bank", whereas 2-bit for src
> and 2-bit for dest would allow a sequence of ops to access multiple
> "banks".
>
> oh: also... dang there's a lot here... :)
>
> 00 means "use the standard 5-bit regs".  that's wasteful of precious
> encoding space.  i'm reeeeasonably confident that we can think of a
> use for that.
>
On the other hand, it's really useful to be able to encode everything else
the prefix can do and use it with the standard 32 regs, allowing the
compiler to treat all the regs the same for vector operations. I would hate
to have to move data out of the lower 32 regs before we can use vector ops.
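
Concretely, a sketch of what I mean, assuming the 2 prefix bits are simply
the top bits of a 7-bit register number (`full_reg` is a hypothetical name):

```python
def full_reg(hi2: int, reg5: int) -> int:
    """Combine 2 prefix bits with the instruction's 5-bit register field."""
    if not 0 <= hi2 < 4:
        raise ValueError("hi2 is a 2-bit value")
    if not 0 <= reg5 < 32:
        raise ValueError("reg5 is a 5-bit value")
    return (hi2 << 5) | reg5


# With hi2 == 0 the standard registers keep their usual numbers.
print(full_reg(0, 10), full_reg(1, 10), full_reg(3, 31))  # 10 42 127
```

Keeping 00 as "standard regs" means a prefixed op on x10 and an unprefixed op
on x10 name the same register, so the compiler does not need a separate class
for the low 32.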

>
>
> > We can come up with something similar for 16 and 48-bit underlying
> > instructions.
> >
> > Note that we won't end up with the problems with SIMD always needing to
> > add more instructions
>
>  [thank goodness... :) ]
>
> > since the list of element types isn't going to expand and
> > all of the instructions are vectorized with predication and variable vl.
> >
> > The prefixed instructions would bypass the SV rename table since the
> > prefix specifies the high register bits and the predication.
>
>  i'd advocate still _allowing_ the SV rename table to apply, in
> instances where it's being used, however that for entries which have
> been prefixed, the prefix takes top precedence.  i haven't thought it
> through, though.
>
>  the reason i like the SV CSR table setup (which is now a "stack") is,
> it applies to multiple registers.  there will be circumstances where
> that's more efficient.  just as there will be circumstances where this
> prefixing idea is more efficient.
>
>
> > Multiple prefixes in a single instruction are reassigned to operations
> > like reduction, packed type conversions, indexed/strided ld/st and
> > others as needed.
>
> it occurs to me that multiple prefixes may be problematic for the
> instruction decode phase.  it's starting to get into CISC territory.
> how many prefixings would be needed (or permitted)?
>
I'm proposing that we only allow a single prefix, and that we reassign the
encoding space that would otherwise mean multiple prefixes in a row to
other operations we will need.

>
> an optimisation of this approach is to use a 64-bit encoding to hold a
> 32-bit instruction.
>
>  or, even a 48-bit encoding to hold, at the end, a 16-bit C.
>
>  ok so looking at figure 1.1 of the RISC-V Spec, it says that the
> 48-bit encoding prefix is 'b011111.  that's 6 bits.  that only leaves
> TEN bits total for use in this scheme, some of which need to be used
> to say whether the opcode is 16-bit, 32-bit, or if the space to be
> used
>
>  so:
>  xxxxxxxx 11 'b011111 = reserved, for standard 48-bit (or future use,
> or something)
>  xxxxxxxx 00 'b011111 = encoding for 16-bit C to follow
>  xxxxxxxx 01 'b011111 = encoding for 32-bit op to follow
>  0xXX xxxxxxxx 10 'b011111 = encoding for 16-bit op to follow however
> there are 8+16 bits of prefix to play with
>
> and for 64-bit, the prefix is 'b0111111 and would probably be best
> used to go straight to a 7+16 bits of prefix plus a "reserved".
>
> so in the 48-bit space that's *only* 8 bits for extension-prefixes!
>
> example for 48-extending-32: 2 for rd, 2 for rs1/2/3, 2 for elwidth, 2
> for... VL-override?
>
> oh!  hang on.... something else just occurred to me: by having the
> above alternative prefix encodings, it's possible to strip off (and
> use) the bits from the standard 16-bit and 32-bit encoding.  that
> means an extra 2 bits for a 16-bit op, and a full 5 bits for a 32-bit
> op.  in the 32-bit case that's actually enough to be able to specify a
> predicate (0 meaning "no predicate").
>
Actually, 16-bit ops use all their bits; there are no constant bits that
we can reassign. 32-bit ops have the 2 LSBs that we can reassign. 48-bit
ops have 6 LSBs.
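
The constant bits come straight from the length encoding in figure 1.1 of the
RISC-V spec. A sketch of that rule for the lengths discussed here (the names
`insn_length_bits` and `CONSTANT_LSBS` are mine, for illustration only):

```python
def insn_length_bits(parcel: int) -> int:
    """Instruction length in bits, from the low bits of the first parcel."""
    if (parcel & 0b11) != 0b11:
        return 16  # all four low-bit patterns except 11 are 16-bit ops
    if (parcel & 0b11100) != 0b11100:
        return 32  # low two bits 11, bits 4:2 anything but 111
    if (parcel & 0b111111) == 0b011111:
        return 48
    if (parcel & 0b1111111) == 0b0111111:
        return 64
    raise ValueError("longer or reserved encoding")


# Low bits that are fixed by the length encoding and so could be reused
# once an outer prefix has already stated the inner instruction's length.
CONSTANT_LSBS = {16: 0, 32: 2, 48: 6}

print(insn_length_bits(0x0533), insn_length_bits(0x001F))  # 32 48
```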

>
> comprehensive! :)
>
>
> l.
>
> _______________________________________________
> libre-riscv-dev mailing list
> libre-riscv-dev at lists.libre-riscv.org
> http://lists.libre-riscv.org/mailman/listinfo/libre-riscv-dev
>