[Libre-soc-sim] Status update

Peter Hsu peter.hsu at bsc.es
Tue Aug 24 19:07:49 BST 2021


Major breakthrough: much of the multiprocessing instability I was
experiencing was due to the difficulty of emulating
load-reserve/store-conditional (lr/sc) on x86. I disassembled and
pattern-matched several OpenMP binaries (including Gromacs mdrun; Gromacs
is a million-line production program) and discovered that in all cases
the only use of lr/sc is to implement compare-and-swap. So I invented a
"cas" instruction to replace the sequence, and things are much better.
It is not perfect yet: I cannot run multithreaded HPCG to completion, as
it still dies with a futex() error. But simple OpenMP programs seem
reliable now.
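
To illustrate the idea, here is a minimal sketch of how a fused guest
"cas" could map onto a single host atomic operation. The function name
emulate_cas32 and the use of std::atomic for guest memory are
assumptions made for this example only; the actual uspike encoding and
implementation are not shown in this message.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// A typical guest compare-and-swap loop built from lr/sc looks like:
//   retry: lr.w  t0, (a0)        # load-reserve current value
//          bne   t0, a1, done    # not the expected value: give up
//          sc.w  t1, a2, (a0)    # try to store the new value
//          bnez  t1, retry       # reservation lost: try again
//   done:
// Hypothetical fused replacement: one host compare-exchange.
// Returns the value observed in memory, as lr.w would have loaded it.
// The swap succeeded if and only if the result equals `expected`.
inline std::int32_t emulate_cas32(std::atomic<std::int32_t>* addr,
                                  std::int32_t expected,
                                  std::int32_t desired) {
    std::int32_t observed = expected;
    // On failure, compare_exchange_strong writes the current memory
    // value into `observed`; on success it leaves it unchanged.
    addr->compare_exchange_strong(observed, desired,
                                  std::memory_order_seq_cst);
    return observed;
}

// Small self-check showing one successful and one failed swap.
inline void cas_demo() {
    std::atomic<std::int32_t> word{5};
    assert(emulate_cas32(&word, 5, 9) == 5);  // matched, so 9 is stored
    assert(word.load() == 9);
    assert(emulate_cas32(&word, 5, 1) == 9);  // mismatch, memory unchanged
    assert(word.load() == 9);
}
```

On x86 a compare-exchange like this compiles to a single locked
instruction, so the host never has to track a reservation across
separate guest instructions.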

With debug code turned off and the RISC-V integer instructions replaced
with "fast code", I now get this performance on a 4-core 1.9 GHz Intel
processor:

     Specmark gcc: 216.2 MIPS (all integer code)

     HPCG single thread: 141.4 MIPS (many FP instructions using Spike
     semantics with SoftFloat)

     64-core Cannon's Matrix Multiply: 308.3 MIPS (double precision
     using SoftFloat, barrier every iteration, 8 iterations)
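
For reference, the MIPS figures are simply simulated instructions
divided by host elapsed seconds, in millions. A small sketch, using
instruction counts and times taken from the logs below; the helper name
mips is made up for this example. Because the logs print elapsed time
to only 0.1 s, recomputed values can differ from the logged ones in the
last digit.

```cpp
#include <cassert>
#include <cstdint>

// Millions of simulated instructions per host wall-clock second.
inline double mips(std::uint64_t insns, double seconds) {
    return static_cast<double>(insns) / seconds / 1e6;
}

// Sanity check against two runs from the logs (times are rounded there).
inline void check_logged_runs() {
    // gcc (cc1 cccp.i): 2636627871 insns in 12.2 s, logged as 216.2 MIPS
    double gcc = mips(2636627871ULL, 12.2);
    assert(gcc > 216.0 && gcc < 216.3);
    // HPCG single thread: 276861997751 insns in 1957.9 s, logged as 141.4
    double hpcg = mips(276861997751ULL, 1957.9);
    assert(hpcg > 141.3 && hpcg < 141.5);
}
```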

The code is checked in (uspike branch).  I shall move on to integrating 
the timing simulation while (slowly and painfully) debugging the futex() 
problem.

-Peter

------------------------------------

peterhsu@DELL-LAPTOP:~/TRY$ time uspike cc1 cccp.i
96 Load-Reserve found, 0 substitution failed
rt_sigaction called
rt_sigaction called
   1300000000 insns 6.0s 217.4 MIPS (100%)
 do_assert do_unassert check_assertion compare_token_lists read_token_list
 free_token_list assertion_install assertion_look
   2100000000 insns 9.8s 215.1 MIPS (100%)
 error_from_errno warning error_with_line pedwarn pedwarn_with_file_and_line
 print_containing_files line_for_error grow_out
   2600000000 insns 12.0s 216.3 MIPS (100%)
 deps_output fatal fancy_abort perror_with_name pfatal_with_name memory_full
 xmalloc xrealloc xcalloc savestring file_size_and_mode
time in parse: 0.792904
time in integration: 0.047232
time in jump: 0.442716
time in cse: 1.612495
time in loop: 0.375239
time in cse2: 1.578654
time in flow: 0.716339
time in combine: 2.142405
time in sched: 0.703997
time in local-alloc: 0.799159
time in global-alloc: 0.943267
time in sched2: 0.519865
time in dbranch: 0.871179
time in shorten-branch: 0.021809
time in stack-reg: 0.000000
time in final: 0.437276
time in varconst: 0.007196
time in symout: 0.000000
time in dump: 0.000000
   2636627871 insns 12.2s 216.2 MIPS (100%)
   2636627871 insns 12.2s 216.2 MIPS (100%)

real    0m12,205s
user    0m12,160s
sys    0m0,012s

------------------------------------

peterhsu@DELL-LAPTOP:~/TRY$ time uspike ~/rvbin/xhpcg-ref
108 Load-Reserve found, 0 substitution failed
276861997751 insns 1957.9s 141.4 MIPS (100%)
276861997751 insns 1957.9s 141.4 MIPS (100%)

real    32m37,947s
user    32m37,049s
sys    0m0,600s

------------------------------------

peterhsu@DELL-LAPTOP:~/MatMul$ export OMP_NUM_THREADS=64
peterhsu@DELL-LAPTOP:~/MatMul$ time uspike --stat=10 cannon
163 Load-Reserve found, 0 substitution failed
rt_sigaction called
rt_sigaction called
rt_sigprocmask called
64 threads (8x8), block matrix Q=(128x128)
    130000000 insns 0.4s 308.3 MIPS (100%)
Begin parallel multiplication

   7188219784 insns 23.3s 308.3 MIPS (64 cores)
   7188219784 insns 23.3s 308.3 MIPS (64 cores)

real    0m23,327s
user    2m59,776s
sys    0m0,093s




Matrix multiply 16 threads:  3773305827 insns 16.9s 223.7 MIPS (total on 
4-core Intel laptop)

On 24/8/21 12:32, lkcl wrote:
> On Tue, Aug 24, 2021 at 10:03 AM Pete Wilson <peter.wilson at bsc.es> wrote:
>> very impressive. 405 MIPS…..
> dang.
>
> CPU ramps up fast :)


