[Libre-soc-dev] next things to work on -- bigint rsa mul algorithm for bigint presentation

Wed Oct 5 18:03:36 BST 2022

On Tuesday, October 4, 2022, Jacob Lifshay via Libre-soc-dev <
libre-soc-dev at lists.libre-soc.org> wrote:
> Now that the nlnet deadline is effectively over (no time left for new
> work), I'd like to work on my bigint presentation:
>
> If for some reason my presentation is rejected, this would just get
> published online somewhere instead.

Toshaan mentioned he may advocate for a Libre-SOC webinar.

> I want to build a practical 2048-bit modular exponentiation function for
> RSA --

oooo love it.  can i make a suggestion? do one that fits first
exclusively in registers? no LD-loops, do those as a 2nd round.

the reason i suggest that it because if you set e.g. a fixed
width at 256-bit you don't need an outer loop, just do
"sv.adde", and Knuth Algorithm-M and D become almost laughably
trivial (which is the point)

also some ECC is really small sizes (255-19) and would
fit in-regs.

>the subtasks should fit under
> https://bugs.libre-soc.org/show_bug.cgi?id=773 and probably other gigabit
> router subtasks too.

yes definitely

> it would use montgomery multiplication and a toom-cook multiplication
> algorithm generator that we can instantiate with different parameters to
> find the most efficient one. toom-cook covers all of (Toom-1) traditional
> O(n^2) long multiplication, (Toom-2) O(n^1.58) karatsuba multiplication,
> and higher algorithms such as Toom-3 O(n^1.46)

i love karatsuuuba :)

> It would be run on a simplistic model cpu with 1 256-bit simd add unit and
> 1 256-bit wide-mul unit (the model would be adjustable) and scalar units,
I
> would create a simulator for that, the simulator would cover ALU pipeline
> timing and dependencies but not much else.

i would greatly prefer that you not do that. i have explained
already that it is better to wait until cavatools is ready
and then use its Hardware-Cycle-Accurate capability.

> the simulator would be designed
> to output timing info such that blender can generate an animation of those
> ALUs executing the algorithm. the simulator would probably be written in
> python and use our existing isacaller simulator to generate execution
> traces for it to run on.

if it is completely independent of hardware microarchitecture
and in absolutely no way gives an implied "performance guarantee"
then i am perfectly fine with it.

> I'm estimating this taking 3-4 weeks of work, it would probably be the
most
> significant demo of svp64 actually providing a significant speed-up so far
> (isacaller doesn't count since it doesn't simulate actual cpu timing,
> soc.git isn't far enough along to simulate multi-issue and operation
> merging for SIMD backends).

i really do not want timing of any kind to be part of any
estimates on performance as they are deeply misleading
and give completely the wrong impression.

SVP64 is *not* tied to one Micro-Architecture.  it was
several weeks before I convinced Lauri to not attempt
to "optimise for performance" as it actually adversely
impacts portability and is far too early.

so please: *do not* spend time giving *any* impression
that there is *any* specific performance guarantees or
estimates of *any kind*.

performance is a direct factor of the underlying microarchitecture
chosen by the HARDWARE implementor, and we are primarily at
this stage the designers of the *ISA*.

about the only way it would be ok would be to cover a massive
range of microarchitectures all at the same time (8 to 10)
which hugely increases the scope and makes it an entire new
project in its own right.

hence why i said: please *do not* put *any* kind of estimates
of performance into *anything* stated publicly, or spend time
on it.

this is really important.

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68