[Libre-soc-dev] sv.mv x: the instruction from hell

lkcl luke.leighton at gmail.com
Sun Jun 5 14:24:06 BST 2022


the s-box idea is a good one.  it reminds me very much of how Aspex's ASP worked, although, there, the 256-entry 8-bit DRAM was also a CAM.

i mentioned this before: my role at Aspex in 2004 was as a Cryptographic FAE, helping sell the ASP in the area of cryptography by porting standard algorithms to it.  given that the ASP was a massively-parallel SIMD Array of 2-bit ALUs i had to literally unpack algorithms down to the underlying fundamental mathematics.

unsurprisingly, GF(2) of Rijndael for example turned out to be a perfect fit for 2-bit ALUs!

but it meant that if attempting to utilise e.g. the "standard" method of S-Box implemented in 32/64-bit ISAs "because it's quicker", this turned out to be s***.

after 3 months i had enough of a handle on the algorithms and the ASP ISA to begin recommending new instructions.  *almost every single time* they answered, "oh we used to have that, nobody used it so we took it out".

sigh.

one of those instructions was a mv.x

the reason why they could add it was because each ALU was under 3,000 gates total. every single instruction without exception including its decode phase completed in a single clock cycle.  the only reason for taking longer was if the "left-right-neighbour" connection was left open, resulting in, in some cases, a massive 4096-long combinatorial ripple that was up to the compiler to resolve by inserting the prerequisite number of NOPs needed to guarantee signal integrity.

the reason i suggested the addition of GPR(GPR(src)) to the ASP was precisely to do S-Box lookups.  the tables could be calculated i think in around 43 instructions by going back to the underlying GF2 2-bit arithmetic *behind* Rijndael S-Boxes, and the substitution could have been done literally in one clock cycle from that point.

4096 *simultaneous* 8-bit lookups, per clock cycle.  this was in 2004 in 130 nm, at 250 mhz, using only 3.5 watts.

imagine doing that in a 5 ghz 3nm CPU with say 512k 2-bit ALUs.  the numbers come out at an astounding 2,560,000,000,000,000 S-Box lookups per second: a number i'm frankly having some conceptual difficulty with. 2.5 peta-ops per second? in a single CPU? moo?

anyway. bottom line is, yes indexing is perfect for s-boxes.  256 bytes subdivided across 64-bit registers is... QTY 32of 64-bit registers.  that's not unreasonable to dedicate 25% of a 128-entry regfile temporarily for the task of doing S-box lookups.

(just not as a mv.x)

i'm currently documenting Indexed.REMAP and including a limited 2D remapping capability *on top* of the GPR indexing.  actually, the order is to algorithmically compute an index first *and then* look it up in the GPR.  pseudocode included here:

    https://libre-soc.org/openpower/sv/shape_table_format/

noteworthy is that there is not just a GPR starting index point (the equivalent of the 3rd argument RB you suggested, jacob) but there is an *elwidth override as well*.

in other words the Index Table starting at the GPR can be 8-bit indices, 16-bit values or even 64-bit values if that turns out to be more convenient for the programmer.

one key reason for including Matrix remapping as a pre-map is to be able to perform transposition of the GPR Indices *without* having to reload or copy them.  i can foresee circumstances where in-place transposition would save instructions.

a mv.x would in no way be a suitable place to add such capability, whereas given the "overwhelm" risk that SV as a whole presents to the ISA WG, hiding this capability behind "advanced features" is a way to avoid freaking them out.

l.









On June 5, 2022 11:20:00 AM GMT+01:00, Jacob Lifshay <programmerjake at gmail.com> wrote:
>On Sun, Jun 5, 2022, 02:58 lkcl <luke.leighton at gmail.com> wrote:
>
>> On Sun, Jun 5, 2022 at 8:39 AM Jacob Lifshay
><programmerjake at gmail.com>
>> wrote:
>>
>> > sorry, having mv.x be 2-operand like that is totally unworkable --
>> sv.mv.x is supposed to be dynamic swizzle where the indexes read from
>one
>> vector are used as the indexes of which elements to copy from a
>second
>> vector to the destination. It inherently needs 3 operands.
>>
>> the rule is: the scalar instruction has to exist, first, and it has
>to
>> exist
>> on its own merit.
>>
>
>well, sorry, that rule needs to have exceptions: there are some very
>necessary/important vector operations that don't really have a scalar
>version with much merit: mv.x is one of those. (though it just occurred
>to
>me it could be justified as an s-box operation for cryptography).
>
>>
>> what *scalar* instruction can be proposed - stand-alone - which does
>> *not* rely *in any way shape or form* on SV - that is not so
>absolutely
>> awful that the ISA WG rejects it for very valid reasons.
>>
>
>then propose it as part of SV or an extension that depends on SV. don't
>try
>and make it independent of SV. that would be like trying to get fsin
>added
>but without making it require the existance of the fp registers it uses
>for
>input/output -- just plain silly (though modifying all fp ops to use
>integer registers could be useful -- risc-v has Zfinx and Zdinx for fp
>in
>integer regs).
>
>>
>> you're laser-focussing on SV only to the detriment of the rest of the
>ISA,
>> and we have to be in a position to have good justification for
>*everything*
>> that is proposed.  mv.x as a stand-alone Scalar instruction is so bad
>that
>> i would *agree* with any objections by the ISA WG to its inclusion,
>and
>> consequently it cannot be relied on to exist as a basis for an SVP64
>> variant.
>>
>> when added as a REMAP we can get away with calling it "advanced
>SVP64"
>> and there will be not only far less objection (because it's part of
>> something
>> that's entirely new that has no effect on the Scalar ISA), it has far
>more
>> uses and makes the entire ISA that much more powerful.
>>
>
>being completely general has downsides...a big one is efficiency. I
>think
>we should limit the completely general parts to the instructions where
>they're actually *the operation* those instructions are designed to do
>--
>mv.x. everything else greatly benefits from not being completely
>general
>which allows it to easily pack the elements into simd alus (the decoder
>can
>do things like issue 8x8-bit microops) without having to do something
>like
>independently route each byte from/to the register file since every
>byte
>could come from any byte in any register.
>
>Jacob


More information about the Libre-soc-dev mailing list