[Libre-soc-sim] Status update
Peter Hsu
peter.hsu at bsc.es
Tue Aug 24 19:07:49 BST 2021
Major breakthrough: much of the multiprocessing instability I was
experiencing was due to the difficulty of emulating
load-reserve/store-conditional (lr-sc) on x86. I disassembled and
pattern-matched several OpenMP binaries (including Gromacs mdrun;
Gromacs is a million-line production program) and found that the only
use of lr-sc in all cases is to implement compare-and-swap. So I
invented a "cas" instruction to replace the sequence, and things are
much better. It is not perfect yet: I cannot run multithreaded HPCG to
completion, as it still dies with a futex() error. But simple OpenMP
programs seem reliable now.
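As a rough illustration of the substitution idea (a sketch only, not the
actual uspike code), a guest lr/sc retry loop that implements
compare-and-swap can be collapsed into one host atomic operation. The
function name emulate_cas32 and the modelling of a guest memory word as
std::atomic<int32_t> are assumptions made for this example.

```cpp
#include <atomic>
#include <cstdint>

// Typical RISC-V compare-and-swap loop that a pattern matcher might find:
//   retry: lr.w  t0, (a0)        # load-reserve current value
//          bne   t0, a1, done    # mismatch: give up
//          sc.w  t1, a2, (a0)    # try to store the new value
//          bnez  t1, retry       # reservation lost: retry
//   done:
//
// Hypothetical host-side replacement for the whole loop. Returns the
// value observed in memory, which is what lr.w would leave in t0.
inline int32_t emulate_cas32(std::atomic<int32_t>& word,
                             int32_t expected, int32_t desired) {
    // On success `expected` is unchanged (it equals the old value).
    // On failure compare_exchange_strong overwrites `expected` with the
    // value actually found, so either way it holds the observed value.
    word.compare_exchange_strong(expected, desired,
                                 std::memory_order_seq_cst);
    return expected;
}
```

On x86 a strong compare-exchange of this kind typically compiles to a
single lock cmpxchg, so no reservation state has to be tracked between
separate emulated instructions.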
With debug code turned off and RISC-V integer instructions replaced by
"fast code", I now get this performance on a 4-core 1.9GHz Intel
processor:
Specmark gcc: 216.2 MIPS (all integer code)
HPCG single thread: 141.4 MIPS (many FP instructions using Spike
semantics with SoftFloat)
64-core Cannon's Matrix Multiply: 308.3 MIPS (double precision
using SoftFloat, barrier every iteration, 8 iterations)
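The MIPS figures are consistent with instruction count divided by
elapsed seconds, in millions. A minimal sketch under that assumption
(the helper name mips is hypothetical and not taken from uspike):

```cpp
#include <cstdint>

// Millions of simulated instructions per second of elapsed time.
// Assumes `seconds` is the elapsed time reported on the status line.
inline double mips(std::uint64_t insns, double seconds) {
    return static_cast<double>(insns) / seconds / 1.0e6;
}
```

Because the status lines round time to 0.1s, recomputing from the
printed values can differ slightly in the last digit: for the gcc run,
2636627871 instructions over 12.2s gives about 216.1 rather than the
216.2 shown.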
The code is checked in (uspike branch). I shall move on to integrating
the timing simulation while (slowly and painfully) debugging the futex()
problem.
-Peter
------------------------------------
peterhsu@DELL-LAPTOP:~/TRY$ time uspike cc1 cccp.i
96 Load-Reserve found, 0 substitution failed
rt_sigaction called
rt_sigaction called
1300000000 insns 6.0s 217.4 MIPS (100%) do_assert do_unassert
check_assertion compare_token_lists read_token_list free_token_list
assertion_install assertion_look
2100000000 insns 9.8s 215.1 MIPS (100%) error_from_errno warning
error_with_line pedwarn pedwarn_with_file_and_line
print_containing_files line_for_error grow_out
2600000000 insns 12.0s 216.3 MIPS (100%) deps_output fatal
fancy_abort perror_with_name pfatal_with_name memory_full xmalloc
xrealloc xcalloc savestring file_size_and_mode
time in parse: 0.792904
time in integration: 0.047232
time in jump: 0.442716
time in cse: 1.612495
time in loop: 0.375239
time in cse2: 1.578654
time in flow: 0.716339
time in combine: 2.142405
time in sched: 0.703997
time in local-alloc: 0.799159
time in global-alloc: 0.943267
time in sched2: 0.519865
time in dbranch: 0.871179
time in shorten-branch: 0.021809
time in stack-reg: 0.000000
time in final: 0.437276
time in varconst: 0.007196
time in symout: 0.000000
time in dump: 0.000000
2636627871 insns 12.2s 216.2 MIPS (100%)
2636627871 insns 12.2s 216.2 MIPS (100%)
real 0m12,205s
user 0m12,160s
sys 0m0,012s
------------------------------------
peterhsu@DELL-LAPTOP:~/TRY$ time uspike ~/rvbin/xhpcg-ref
108 Load-Reserve found, 0 substitution failed
276861997751 insns 1957.9s 141.4 MIPS (100%)
276861997751 insns 1957.9s 141.4 MIPS (100%)
real 32m37,947s
user 32m37,049s
sys 0m0,600s
------------------------------------
peterhsu@DELL-LAPTOP:~/MatMul$ export OMP_NUM_THREADS=64
peterhsu@DELL-LAPTOP:~/MatMul$ time uspike --stat=10 cannon
163 Load-Reserve found, 0 substitution failed
rt_sigaction called
rt_sigaction called
rt_sigprocmask called
64 threads (8x8), block matrix Q=(128x128)
130000000 insns 0.4s 308.3 MIPS (100%)
Begin parallel multiplication
7188219784 insns 23.3s 308.3 MIPS (64 cores)
7188219784 insns 23.3s 308.3 MIPS (64 cores)
real 0m23,327s
user 2m59,776s
sys 0m0,093s
Matrix multiply 16 threads: 3773305827 insns 16.9s 223.7 MIPS (total on
4-core Intel laptop)
On 24/8/21 12:32, lkcl wrote:
> On Tue, Aug 24, 2021 at 10:03 AM Pete Wilson <peter.wilson at bsc.es> wrote:
>> very impressive. 405 MIPS…..
> dang.
>
> CPU ramps up fast :)