[Libre-soc-bugs] [Bug 1157] Implement poly1305

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Mon Sep 18 20:26:15 BST 2023


https://bugs.libre-soc.org/show_bug.cgi?id=1157

--- Comment #19 from Sadoon Albader <sadoon at albader.co> ---
I'm trying to put as much of my focus on the multiplication right now, here
goes:


The multiplication is literally just long multiplication with a nice trick
(read https://loup-vaillant.fr/tutorials/poly1305-design to understand better.)

With normal multiplication, if we had hundreds times hundreds (because we are
using 3 64-bit integers of course), it'd go like this:


                          h2        h1        h0
                          r2        r1        r0   *
--------------------------------------------------
                        h2r0      h1r0      h0r0  |
         h2r1           h1r1      h0r1            |
h2r2     h1r2           h0r2                      |
--------------------------------------------------
                                                    +

and the result would be:

h2r2      h2r1+h1r2      h2r0+h1r1+h0r2      h1r0+h0r1      h0r0

Note that h2r2 can also be 2 digits, which makes the final number a maximum of
6
digits.
So we limit the 64-bit variables of both h[x] and r[x] to 44 bits for
elements 0 and 1 and 42 bits for element 2, to ensure that they don't overflow
in addition over 2^130, and don't exceed 128-bits in 64*64-bit multiplication.

Modular multiplication does things a little differently, to get the remainder,
we don't divide by (2^130)-5, we instead multiply by 5 * 2^(130-128), and lo
and
behold, we get our remainder *if* the number is less than p (2^130-5, our
prime).

To do this, we divide the two leftmost numbers by 2^128 (essentially moving
them
to the right as you will see below) and then multiply them by 5 * 2 ^ 2

Sounds familiar? It's literally sll and srl, nothing special. Now let's look at
what happens:

                          h2        h1        h0
                          r2        r1        r0   *
-------------------------------------------------------
                        h2r0      h1r0      h0r0      |
                        h1r1      h0r1      h2r1*20   |
                        h0r2      h2r2*20   h1r2*20   |
-------------------------------------------------------
                                                        +

Now if you read the C/Python implementation, it makes perfect sense why:
s1 = r1 * (5 << 2)
s2 = r2 * (5 << 2)

d0 = h0r0 + h1s2 + h2s1
d1 = h0r1 + h1r0 + h2s2
d2 = h0r2 + h1r1 + h2r0

Stitch the d's together and do carry propagation:

h0 = (d0           ) & 0xfffffffffff (44 bits)
h1 = (d1 + d0 >> 44) & 0xfffffffffff
h2 = (d2 + d1 >> 44) & 0x3ffffffffff

h0 = (h0 + (d2 >> 42)) * 5
h1 = h1 + (h0 >> 44)
h0 = h0 & 0xfffffffffff

This way, we can have an array of 3 64-bit integers h[x] where
h[0-2] = h0,h1,h2

And that's our hash for one 16-byte block.

Thoughts re: implementing in SVP64:

I still don't fully understand how SVP64 works but I'll make assumptions to
brainstorm and hopefully be pushed in the right direction:

We can use a maximum of 64-bit registers *unless* we use bigint. However,
64-bit registers are what is needed for implementing a 32-bit version of
poly1305, not 64-bits, because multiplying 44*44 (or 42*42 minimum) exceeds
64-bits but is well below 128-bits.

Given the parallel nature of SVP64, we might as well implement a 32-bit version
which limits itself to 26-bits in all variables (130 / 26 = 5) and we would
need a total of 10 registers for r[] and d[].

Since bigint basically stitches numbers together and uses 64-bit multiplication
anyways (please do correct me if I'm wrong), there's no benefit to using
64-bits as opposed to 32-bits, could be quite the opposite with the bigint
overhead (again correct me if I'm wrong)

Now with that in mind, let's assume this (it'll be more numbers but the pattern
repeats)

d0 = h0r0 + h1s2 + h2s1
d1 = h0r1 + h1r0 + h2s2
d2 = h0r2 + h1r1 + h2r0

If we load h[] and r[] to our SVSHAPEs, we can use the clear pattern of

0,0  1,2  2,1
0,1  1,0  2,2
0,2  1,1  2,0

The leftmost number is n where n = 0..2, the rightmost number is n where n =
0..2 in the first case, n = 2,0,1 in the second case (overflowing 2 to 0) and n
= 1,2,0 (again starting from 1, overflowing from 2 to 0)

We multiply all the numbers first in parallel, then add the first 3 additions
in parallel, then the second addition. Seems doable in 3 instructions after
register preparation at least for this part?

Long post I know but I wanted to get things right :)

-- 
You are receiving this mail because:
You are on the CC list for the bug.


More information about the libre-soc-bugs mailing list