[Libre-soc-bugs] [Bug 236] Atomics Standard writeup needed

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Tue Jul 19 03:50:28 BST 2022


https://bugs.libre-soc.org/show_bug.cgi?id=236

--- Comment #37 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #35)
> (In reply to Jacob Lifshay from comment #34)
> 
> > added initial atomics assembler, along with script that generates it.
> 
> which i've now had to remove and deal with yet another force-master push.

why'd you remove the python script too? it is not autogenerated... I spent a
lot of time writing it. it is in a separate commit from the one adding the
generated output, so it is easy to retain.
> 
> please *do not* break the hard rule of adding auto-generated output to
> repositories.

as explained in the commit message, I added the autogenerated markdown
specifically because it is the wiki and there isn't really an easy way to see
the results on the website otherwise: you'd have to download and run the
script, which is quite inconvenient.

Sorry, I didn't think to ask first.

> 
> *especially* given that it is a massive batch of identical code.

it's not actually identical... it's the assembly for all atomic ops in the C11
standard (except consume memory ordering, which no one uses).
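
For illustration, here is a minimal sketch of the kind of C11 input whose
compiled assembly is useful to look at. The function names are made up for this
example and it covers only a handful of operation/ordering pairs, so it should
not be read as the actual output of the script. Compiling it with optimisation
and `-S` on a Power target should show one short sequence per function.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* One small function per (operation, memory order) pair, so that the
 * compiler emits a separate, easily readable sequence for each.
 * All names here are hypothetical. */

uint64_t load_acquire(_Atomic uint64_t *p) {
    return atomic_load_explicit(p, memory_order_acquire);
}

void store_release(_Atomic uint64_t *p, uint64_t v) {
    atomic_store_explicit(p, v, memory_order_release);
}

uint64_t fetch_add_seq_cst(_Atomic uint64_t *p, uint64_t v) {
    /* Returns the value held before the addition. */
    return atomic_fetch_add_explicit(p, v, memory_order_seq_cst);
}

uint64_t exchange_acq_rel(_Atomic uint64_t *p, uint64_t v) {
    /* Returns the value held before the swap. */
    return atomic_exchange_explicit(p, v, memory_order_acq_rel);
}

bool cmpxchg_strong_acq_rel(_Atomic uint64_t *p, uint64_t *expected,
                            uint64_t desired) {
    /* On failure, *expected is overwritten with the current value. */
    return atomic_compare_exchange_strong_explicit(
        p, expected, desired, memory_order_acq_rel, memory_order_acquire);
}
```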
> 
> 
> allow me to be clearer in the instructions.
> 
> we need to demonstrate that the POWER9 recommended c++ spinlocks and
> atomics are or are not efficient, and to what extent.
> 
> the program therefore that needs to be created must:

I'm assuming by "processes" you really mean threads.

> 1) have an option to specify the number of SMT forked processes to run
> 2) have an option to specify how many lock and unlocks shall be performed
>    per forked process
> 3) have an option to specify the range of memory addresses to be
> lock/unlocked
>    ("1" being "all processes lock and unlock the same address)
> 4) use RECOMMENDED sequences known to be used in c, c++, and the linux
>    kernel.

the sequences generated for the standard c11 atomic operations (as in the
python script I wrote) are the recommended sequences for those standard c11
operations.

>    such as these (or other already present in the linux kernel
>    and other common libraries)
>    http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html
> 5) have an option to use the "eh" hints that Paul mentioned are in
>    Power ISA 3.1 p1077 eh hint
> 6) time the operations ("time ./locktest" would do).

no, it has poor accuracy for shorter times... using clock_gettime (or APIs
wrapping it) inside the program is better because we can use realtime clocks
and avoid measuring program load/terminate time, including the extra time linux
spends mapping the program/data into memory or allocating memory. we can also
loop the measurement multiple times and do statistical analysis on it,
discarding outliers.
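
A rough sketch of what in-program timing could look like. The helper names
(`now_ns`, `trimmed_mean`, `time_trials`) are invented for this example, and
using CLOCK_MONOTONIC plus an interquartile mean is just one reasonable choice
of clock and outlier rule, not something settled in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Current time in nanoseconds. CLOCK_MONOTONIC is an assumption made for
 * this sketch: it is not affected by wall-clock adjustments. */
uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Mean of the samples after dropping the lowest and highest quarter.
 * Sorts `samples` in place; n must be at least 1. */
uint64_t trimmed_mean(uint64_t *samples, size_t n) {
    qsort(samples, n, sizeof samples[0], cmp_u64);
    size_t lo = n / 4, hi = n - n / 4;
    uint64_t sum = 0;
    for (size_t i = lo; i < hi; i++)
        sum += samples[i];
    return sum / (hi - lo);
}

/* Runs work(arg) `trials` times, timing each run from inside the process,
 * so program load and exit are never part of a sample. */
uint64_t time_trials(void (*work)(void *), void *arg,
                     uint64_t *samples, size_t trials) {
    for (size_t i = 0; i < trials; i++) {
        uint64_t start = now_ns();
        work(arg);
        samples[i] = now_ns() - start;
    }
    return trimmed_mean(samples, trials);
}
```

The first trial or two will usually include page faults and cold caches, which
is exactly the kind of sample the trimming is meant to throw away.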

> 
> there is no need to add this program in markdown form.
> 
> it is purely experimental in nature for the purposes of research.

well, for the purposes of research it would be quite handy to see what assembly
is used for each standard atomic operation without having to run the compiler
yourself or write the input c program.
> 
> it is not for the publication of a specification.

yup.
> 
> it is for the purposes of actually being executed to obtain
> information for which a report (manually) will have to be written.
> 
> when executed on the TALOS-II workstation with different numbers of
> processes and different memory ranges, this will tell us whether
> IBM properly designed the ISA or not.  it will not tell us exactly
> *how* they actually implemented them but will give at least some
> black-box hints.
> 
> if the locking remains linear
it won't, due to hyperthreads on the same core conflicting with each
other... expect at least an inflection point at 18 threads, since that's where
it has to start sharing cores between multiple threads.

> for up to all 72 hyperthreads and it
> is of the order of a million locks per second per core regardless
> of the number of memory addresses then we can reasonably conclude
> that they did a damn good job.
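
A sketch of how the locks-per-second figure could be measured, covering only
options 1 to 3 (thread count, lock/unlock count per thread, number of distinct
lock addresses). Everything named here (`run_locktest`, `lock_slot`,
`worker_arg`) is hypothetical, it uses threads rather than forked processes,
and the plain C11 `atomic_flag` spinlock is a stand-in: it is whatever the
compiler emits, not necessarily one of the recommended sequences, and it does
not touch the eh hint.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Each lock gets its own 128-byte line (the POWER9 cache line size) so
 * that distinct lock addresses never share a line. */
struct lock_slot {
    _Alignas(128) atomic_flag flag;
    uint64_t counter; /* protected by flag; lets us check correctness */
};

struct worker_arg {
    struct lock_slot *slots;
    size_t nslots;
    uint64_t iterations;
    unsigned id;
};

static void *worker(void *p) {
    struct worker_arg *a = p;
    for (uint64_t i = 0; i < a->iterations; i++) {
        struct lock_slot *s = &a->slots[(a->id + i) % a->nslots];
        while (atomic_flag_test_and_set_explicit(&s->flag, memory_order_acquire))
            ; /* spin until the lock is free */
        s->counter++;
        atomic_flag_clear_explicit(&s->flag, memory_order_release);
    }
    return NULL;
}

/* Returns lock/unlock pairs per second across all threads (0.0 on
 * allocation failure). nslots must be >= 1; nslots == 1 means every
 * thread contends on the same address. Thread start-up is included in
 * the measured time, so use enough iterations to make it negligible. */
double run_locktest(unsigned nthreads, uint64_t iterations, size_t nslots,
                    uint64_t *total_out) {
    struct lock_slot *slots = aligned_alloc(128, nslots * sizeof *slots);
    pthread_t *tids = malloc(nthreads * sizeof *tids);
    struct worker_arg *args = malloc(nthreads * sizeof *args);
    if (!slots || !tids || !args) {
        free(slots); free(tids); free(args);
        return 0.0;
    }
    for (size_t i = 0; i < nslots; i++) {
        atomic_flag_clear(&slots[i].flag);
        slots[i].counter = 0;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    unsigned started = 0;
    for (; started < nthreads; started++) {
        args[started] = (struct worker_arg){slots, nslots, iterations, started};
        if (pthread_create(&tids[started], NULL, worker, &args[started]) != 0)
            break; /* only join the threads that actually started */
    }
    for (unsigned t = 0; t < started; t++)
        pthread_join(tids[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    uint64_t total = 0;
    for (size_t i = 0; i < nslots; i++)
        total += slots[i].counter;
    if (total_out)
        *total_out = total;
    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    free(slots); free(tids); free(args);
    return secs > 0.0 ? (double)total / secs : 0.0;
}
```

Sweeping `nthreads` from 1 to 72 and `nslots` from 1 upwards, then dividing the
result by the number of cores in use, would give the per-core figure to compare
against the million-locks-per-second mark.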
> 
> if they do *not* work then we are 100% justified in proposing additional
> enhancements to the ISA

even if they do work, we still will want improvements to support atomic fadd,
fmin/fmax, and a few others.
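
To make the fadd case concrete: without a dedicated instruction, the usual
portable fallback is a compare-exchange retry loop over the float's bit
pattern. The sketch below is that fallback written with C11 atomics. The
function names are made up, and relaxed ordering is chosen only to keep the
example short.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Bit-cast helpers; memcpy avoids aliasing problems. */
float bits_to_float(uint32_t b) {
    float f;
    memcpy(&f, &b, sizeof f);
    return f;
}

uint32_t float_to_bits(float f) {
    uint32_t b;
    memcpy(&b, &f, sizeof b);
    return b;
}

/* Atomically performs *p += v on a float stored as raw bits and returns
 * the previous value. Hypothetical helper, not a standard C function. */
float atomic_fadd_relaxed(_Atomic uint32_t *p, float v) {
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);
    uint32_t new_bits;
    do {
        new_bits = float_to_bits(bits_to_float(old) + v);
        /* On failure, `old` is refreshed with the current contents,
         * so the sum is recomputed on the next pass. */
    } while (!atomic_compare_exchange_weak_explicit(
        p, &old, new_bits, memory_order_relaxed, memory_order_relaxed));
    return bits_to_float(old);
}
```

Under contention every failed compare-exchange repeats the load, the add and
the store attempt, which is the cost a native atomic fadd would be expected to
avoid. fmin/fmax can be emulated with the same loop shape.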
> 
> but *not* until the *actual* statistics have *actually* been measured
> and real-world reports obtained.
> 
> we do not have access to an IBM POWER10 system so IBM POWER9 will have to do.
> 
> bottom line is that if we cannot demonstrate good knowledge of IBM's
> atomics then we have absolutely no business whatsoever in proposing
> alternatives or enhancements.

Some other things we should test are some common 3d gpu shaders that use
atomics, as well as parking_lot's performance (we'll need to use Rust for
parking_lot, since it is a Rust library).

parking_lot is used by firefox.
