[Libre-soc-bugs] [Bug 782] add galois field bitmanip instructions

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Tue Mar 8 08:02:53 GMT 2022


https://bugs.libre-soc.org/show_bug.cgi?id=782

--- Comment #43 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #42)
> (In reply to Jacob Lifshay from comment #41)
> > though, do we really need gfdiv? 
> 
> i'm sure i've seen modulo (remainder) used somewhere.

yes, in defining gf(2^n) in terms of gf(2) polynomials, the polynomial
remainder is used. that is *not* a gfbrem.
if we were to have a gfbrem instruction, it would be kinda useless, always
returning 0, since gfb is a field, meaning that, like complex numbers, every
non-zero value has a multiplicative inverse. therefore, solving the equation
for a remainder:

n = q * d + r
q = n / d           # division always produces exact results in any field
n = (n / d) * d + r
n = n + r           # because (n / d) * d = n exactly
0 = r
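The derivation above can be checked concretely. Here is a hedged sketch (names and structure are mine, not from gf2.py) using GF(2^8) with the AES reducing polynomial: division is exact, so the "remainder" n - q*d (subtraction is xor) is always 0.

```python
# Minimal GF(2^8) sketch, assuming the AES reducing polynomial.
AES_POLY = 0x11b  # x^8 + x^4 + x^3 + x + 1

def gf_mul(a, b):
    """Multiply two GF(2^8) elements, reducing modulo AES_POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:      # degree reached 8: reduce
            a ^= AES_POLY
    return r

def gf_inv(a):
    """Multiplicative inverse via a**(2^8 - 2) = a**254 (Fermat)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def gf_div(n, d):
    return gf_mul(n, gf_inv(d))

n, d = 0x57, 0x83
q = gf_div(n, d)
r = n ^ gf_mul(q, d)  # remainder: n - q*d (subtraction is xor)
assert r == 0          # always zero for any non-zero d
```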

> if producing that, might as well have div.
> 
> > it's a lot of extra complexity

well, if we're using a pipeline for gfbinv (which we totally want if we want
AES to be usably fast without needing 7 zillion gfbinv FSMs), it's much harder
to tack another several stages onto the end of the pipeline, unless we either
don't share it with gfmul or resort to microcoding.

> how? https://libre-soc.org/openpower/sv/gf2.py/
> that's what... 7 lines?
> 
> > In any case, here's the list of binary gf ops I think we should implement:
> > 
> > * clmul, clmulh (for gf operations that are > 64-bit, and other stuff)
> 
> see wiki, clmulr is missing.

ah, ok. yeah, we can add clmulr.
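For reference, a hedged sketch of the three carry-less multiply variants (bit-slicing follows the RISC-V Zbc definitions; the final SVP64 mnemonics/semantics may differ):

```python
# Sketch of 64-bit carry-less multiply variants, assuming
# RISC-V-Zbc-style semantics (not confirmed for SVP64).
XLEN = 64
MASK = (1 << XLEN) - 1

def clmul_wide(a, b):
    """Full (2*XLEN - 1)-bit carry-less (xor) product."""
    r = 0
    for i in range(XLEN):
        if (b >> i) & 1:
            r ^= a << i
    return r

def clmul(a, b):   # low XLEN bits of the product
    return clmul_wide(a, b) & MASK

def clmulh(a, b):  # high XLEN bits of the product
    return (clmul_wide(a, b) >> XLEN) & MASK

def clmulr(a, b):  # "reversed": bits 2*XLEN-2 .. XLEN-1
    return (clmul_wide(a, b) >> (XLEN - 1)) & MASK
```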
> 
> > * gfbmul RT, RA, RB
> 
> ack. is it worth doing a gfbmulh? looking at
> multGF2 it looks like it could return (res >> deg) & mask2
> just as easily to give the top half.

there is no top half. gfbmul returns the entire answer. if we tried to make a
gfbmulh, it would always return 0 unless the reducing polynomial had a degree
of more than 64 (64 is the max in the current design).
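To illustrate why there is no top half, here is a sketch in the spirit of the multGF2 in https://libre-soc.org/openpower/sv/gf2.py/ (the function name and parameters here are mine): the reduced product always has degree less than deg(poly), so the bits a gfbmulh would return are always zero.

```python
# Hedged sketch: reduced GF(2^n) multiply, not the exact gf2.py code.
def gfbmul(a, b, poly, degree):
    """Multiply polynomials a, b over GF(2), reduce modulo poly."""
    r = 0
    for i in range(degree):
        if (b >> i) & 1:
            r ^= a << i
    # reduce modulo poly, highest bit first
    for i in reversed(range(degree, 2 * degree)):
        if (r >> i) & 1:
            r ^= poly << (i - degree)
    return r

# every reduced product fits in `degree` bits: the "top half" is 0
assert all(gfbmul(a, b, 0x11b, 8) < (1 << 8)
           for a in range(256) for b in range(256))
```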
> 
> > * gfbmuli RT, RA, imm # discard if we don't have enough encoding space
> 
> no chance. snowball, hell etc. typically imm
> takes an entire major opcode. mulli addi they
> have dedicated major opcodes. makes no sense
> except for less than degree 16, with gf jumping
> all over the place.
> 
> if constructing consts then addi and addis can be used.
> on presenting to the OPF ISA WG, if they really want a gfbmuli
> i see no reason why not, but it does take an entire
> major op.
> 
> > * gfbmadd RT, RA, RB, RC
> 
> ack. remember it has to match madd and fmadd for REMAP
> purposes.  this is IMPORTANT.
> 
> > * gfbinv rt, ra
> >     input 0 gives result 0 even though that is division by zero,
> >     that's what's needed for AES.
> 
> ok great.
> 
> > gfbadd is covered by the existing xor instructions.
> 
> ack.
> 
> > gfbmaddsubr is not needed, since sub is identical to add, this is just two
> > gfbmadd instructions.
> 
> ack.
> 
> gftwinmuladd is missing.
> 
> btw see wiki page. please do read and
> refer to it, i get the impression you're not reading it
> regularly.

yeah, because i have no easy way of telling whether it has changed (automatic
notifications would be nice!), and i don't like rereading stuff i've already
read 5 times in the hope that you might have changed it in the meantime... git
log is nearly useless since nearly everything has no useful commit message.
> 
> > If that all sounds good, i can create a new tracking bug for binary gf
> > instructions and start working on implementing them.
> 
> the basics (lowlevel modules) and bugreport, yes.
> 
> btw do not consider pipeline designs to be "the peak epitome
> of absolute performance", it is misleading.
> 
> multiple simpler FSMs with multiple RSes can be more efficient
> and take advantage of early-out and jump-ahead opportunities
> that pipelines cannot.  this makes them *less* efficient rather
> than more, particularly for long-running ops.

that depends... if the pipeline allows us to specialize stages to specific
parts of the algorithm, it can be faster. e.g. an fpmul pipeline might have
one stage for unpacking/denormal-shifting, one stage for partial-product
computation and the first part of the carry-save adders, another stage of
carry-save adders, a stage for the final carry-save adders and the final
carry-propagating add, and a last stage for rounding/packing/denormal
handling. since each stage only needs to implement the parts it handles, the
stages can run faster. if it were instead several totally separate FSMs at the
same throughput, each FSM would need the logic for all stages of the
algorithm, making the data path much more complex and possibly slower due to
that complexity, reducing max clock speed.
> 
> [yes a FSM can be done which processes multiple chained combinatorial
> stages]
> 
> RSes need to farm-in and fan-out to pipelines (which have
> to be BIG muxes), where FSMs can wire directly. each RS has
> its own FSM, each FSM stores its result in the RS, no
> massive fan-in/out.

that's kinda deceptive, since the muxes with the massive fan-out/fan-in are
now in the scheduler logic and in the busses that carry intermediate register
values around (i temporarily forgot the name), rather than in the pipeline
fan-in/fan-out.

for example, let's say the algorithm takes 10 cycles, that every pipe/FSM
needs 2 extra FUs in order to stay 100% utilized, and that there are 20 other
FUs in the whole cpu:
pipeline:
12 FUs -> pipeline

12 -> 1 mux on input
1 -> 12 mux on output
(getting big)
something -> 32 mux to fill all cpu FUs from reg reads/operand forwarding
(pretty big)


10 FSMs (for same throughput):
3 FUs -> 10-step FSM
3 FUs -> 10-step FSM
3 FUs -> 10-step FSM
...
3 FUs -> 10-step FSM

30 FUs

10x 3 -> 1 muxes on input
10x 1 -> 3 muxes on output (meh, tiny, who cares)
something bigger -> 50 mux to fill all cpu FUs from reg reads/operand
forwarding
(woah, huge!! aah, run!)
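The FU and mux counts above can be double-checked with a quick sketch (the formulas are my reading of the example, not an established costing model):

```python
# Hedged arithmetic check of the pipeline-vs-FSM FU counts above.
cycles = 10     # algorithm latency
extra_fus = 2   # extra FUs per pipe/FSM to stay 100% utilized
other_fus = 20  # unrelated FUs elsewhere in the cpu

# pipeline case: one pipe kept fully busy needs cycles + extra FUs
pipe_fus = cycles + extra_fus        # 12 FUs -> pipeline
pipe_reg_mux = pipe_fus + other_fus  # 32-way fill from regs/forwarding

# FSM case: matching throughput needs `cycles` FSMs, each 1 + extra FUs
fsm_count = cycles                       # 10 FSMs
fsm_fus = fsm_count * (1 + extra_fus)    # 30 FUs total
fsm_reg_mux = fsm_fus + other_fus        # 50-way fill

assert (pipe_fus, pipe_reg_mux) == (12, 32)
assert (fsm_fus, fsm_reg_mux) == (30, 50)
```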

> 
> bottom line you need to think of the context in which the
> pipeline is used.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

