[Libre-soc-bugs] [Bug 1044] SVP64 implementation of pow(x,y,z)

Tue Oct 10 08:00:09 BST 2023

https://bugs.libre-soc.org/show_bug.cgi?id=1044

--- Comment #46 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #44)
> (In reply to Luke Kenneth Casson Leighton from comment #42)
> > (In reply to Jacob Lifshay from comment #39)
> > > (In reply to Luke Kenneth Casson Leighton from comment #37)
> > > > but just going straight to something inefficient (such as the
> > > > loop-unrolled mul256 algorithm you wrote, although it gets us
> > > > one incremental step ahead), this is *not* satisfying the conditions
> > > > of the grant.
> > > 
> > > which definition of efficiency are you using?
> > 
> > the one that meets customer requirements which i repeated many times:
> > top priority on code size. number of regs second.
> 
> ok.
> > 
> > it is down to the hardware to merge VF and HF elements into
> >  "issue batches".  which is where, repeatedly, everyone including
> > you keeps assuming VF is incapable of doing that, "therefore it
> > must be inefficient performance-wise".
> 
> I was basing my efficiency claims on both:
> * the complexity I expect will be required to get a vertical-first divmod to
> work at all. I fully expect it to take *more* (and more complex)
> instructions than the horizontal-first version, because afaict it doesn't
> cleanly map to VF mode. this is bad for both code size and power and
> probably performance.
> * it will most likely require lots of dynamic predicates (more than just
> 1<<r3) with *large* amounts of bits that are zeros,

again: please please please, patiently i repeat: please
*stop* making assumptions about what the hardware is capable of.

the assumption that you are making is one that a *SIMD* architecture
has because the ISA directly connects to the back-end hardware.

this is *not true* with Simple-V.

> this inherently is
> rather inefficient from a performance perspective, because I'm assuming
> either:
>   * the predicate will have to be handed to the decode pipe
>     before the predicated operations can be issued. this is
>     bad for performance

which is *not the focus of this research*

>    because you're forced to stall the
>     entire fetch/decode pipe for several cycles while waiting
>     for the predicate to be computed.

*only on naive implementations*

>   * the predicate is not known at decode/issue time, so the
>     full set of element operations are issued, potentially
>     blocking issue queues,

*only on naive implementations*

>     the predicate not being known at issue time also means
>     that propagating results to registers and/or any following
>     instructions is also blocked for any instructions that
>     use twin-predication, since the cpu needs to wait until
>     it knows which registers to write to.

only on under-resourced implementations.

it basically is not your problem "as an assembler writer" to
worry about what hardware does, because there is no direct connection
(unlike in a SIMD ISA).
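
to illustrate the point about "issue batches", here is a toy model in
Python. it is a sketch only and is not taken from the Libre-SOC simulator
or any real hardware design: the function names, the `batch_width`
parameter and the batching strategy are all hypothetical. it shows a
predicated element-wise add defined as a plain per-element loop, next to an
assumed back-end that packs only the active elements into batches. both
produce the same register contents, which is all the program can observe.

```python
MASK64 = (1 << 64) - 1  # model 64-bit registers


def predicated_add_sequential(ra, rb, rt, mask, vl):
    """Architectural view: one element at a time, skipping masked-out ones."""
    out = list(rt)
    for i in range(vl):
        if (mask >> i) & 1:
            out[i] = (ra[i] + rb[i]) & MASK64
    return out


def predicated_add_batched(ra, rb, rt, mask, vl, batch_width=4):
    """Hypothetical back-end: pack active elements into issue batches.

    Returns the resulting registers and the number of batches issued.
    """
    out = list(rt)
    # only elements whose predicate bit is set are ever scheduled
    active = [i for i in range(vl) if (mask >> i) & 1]
    batches = 0
    for start in range(0, len(active), batch_width):
        batch = active[start:start + batch_width]
        # every element in a batch is computed "in parallel"
        results = [(ra[i] + rb[i]) & MASK64 for i in batch]
        for i, value in zip(batch, results):
            out[i] = value
        batches += 1
    return out, batches


if __name__ == "__main__":
    ra = list(range(8))
    rb = [10] * 8
    rt = [0] * 8
    sparse = 1 << 3  # a single-bit predicate, as in the 1<<r3 case
    seq = predicated_add_sequential(ra, rb, rt, sparse, 8)
    bat, n = predicated_add_batched(ra, rb, rt, sparse, 8)
    print(seq == bat, n)
```

in this model a sparse predicate costs the back-end one batch, not one slot
per zero bit. whether a given implementation behaves like that is a
hardware design decision and is not visible in the assembler.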

this is something it seems that literally everyone who has ever used
SIMD has to unlearn, including you.

please completely and utterly forget about hardware performance: stop
talking about it, stop second-guessing it and stop trying to assess it
in any way. just trust the hardware, and focus exclusively on
getting program size down as compact as possible, ok?
