Re: 50% system CPU usage when parallel running NAMD on Rocks cluster

From: ΦάκΗκΐ (malrot13_at_gmail.com)
Date: Wed Dec 18 2013 - 06:43:18 CST

Many thanks, guys. I'll keep trying everything I can.

I compiled an SMP version of NAMD. Charm++ was built with the following
command:
"./build charm++ mpi-linux-x86_64 mpicxx smp -j16 --with-production"
NAMD was then compiled against this SMP build of charm++. I tested it with the
same system (100,000 atoms). The result, as predicted, showed some
improvement. :D
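
(The NAMD binary itself was configured against that charm++ tree the usual
way, i.e. something like
"./config Linux-x86_64-g++ --charm-arch {charm++ SMP build directory}"
followed by make in the generated build directory; exact names may differ
from the install path shown below.)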
I ran NAMD with the following command:
"mpirun -np {number of nodes} /apps/apps/namd/2.9/Linux-x86_64-g++-SMP/namd2
+ppn {number of compute cores per node} {configuration file} > {log file}"
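
For example, the 3-node runs below were launched as
"mpirun -np 3 /apps/apps/namd/2.9/Linux-x86_64-g++-SMP/namd2 +ppn 15
{configuration file} > {log file}"
i.e. one process per node with 15 worker threads, leaving one core of each
16-core node for the communication thread.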
1 node, 1 communication CPU
Info: Benchmark time: 15 CPUs 0.0908161 s/step 0.525556 days/ns 725.562 MB
memory
Info: Benchmark time: 15 CPUs 0.0902649 s/step 0.522366 days/ns 727 MB
memory
Info: Benchmark time: 15 CPUs 0.0901796 s/step 0.521873 days/ns 733.023 MB
memory
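
(For reading these lines: days/ns is just s/step converted with the 2 fs
timestep, e.g. 0.0908161 s/step x 500,000 steps/ns / 86,400 s/day = 0.5256
days/ns, which matches the first line above.)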

2 nodes, 2 communication CPUs, total network speed = 100 Mb/s
Info: Benchmark time: 30 CPUs 0.0875021 s/step 0.506378 days/ns 577.547 MB
memory
Info: Benchmark time: 30 CPUs 0.0884966 s/step 0.512133 days/ns 584.617 MB
memory
Info: Benchmark time: 30 CPUs 0.0880416 s/step 0.5095 days/ns 582.617 MB
memory

3 nodes, 3 communication CPUs, total network speed = 220 Mb/s
Info: Benchmark time: 45 CPUs 0.0661309 s/step 0.382702 days/ns 520.734 MB
memory
Info: Benchmark time: 45 CPUs 0.0663962 s/step 0.384238 days/ns 520.625 MB
memory
Info: Benchmark time: 45 CPUs 0.066104 s/step 0.382547 days/ns 524.559 MB
memory

4 nodes, 4 communication CPUs, total network speed = 280 Mb/s
Info: Benchmark time: 60 CPUs 0.0707821 s/step 0.409619 days/ns 494.953 MB
memory
Info: Benchmark time: 60 CPUs 0.0704617 s/step 0.407765 days/ns 498.125 MB
memory
Info: Benchmark time: 60 CPUs 0.0700849 s/step 0.405584 days/ns 502.969 MB
memory

The 3-node benchmark is the best result I've had so far. System CPU in all the
SMP tests is low (5%) and user CPU is high (95%). I'll paste the MPI, UDP, and
SMP results together for comparison:
1 node:
MPI: 16.8% system

Info: Benchmark time: 16 CPUs 0.125436 s/step 0.725901 days/ns 230.043 MB
memory

Info: Benchmark time: 16 CPUs 0.123779 s/step 0.716314 days/ns 230.316 MB
memory

Info: Benchmark time: 16 CPUs 0.125215 s/step 0.724626 days/ns 230.25 MB
memory

UDP:

Info: Benchmark time: 16 CPUs 0.093186 s/step 0.539271 days/ns 65.9622 MB
memory

Info: Benchmark time: 16 CPUs 0.0918341 s/step 0.531448 days/ns 66.1155 MB
memory

Info: Benchmark time: 16 CPUs 0.0898816 s/step 0.520148 days/ns 66.023 MB
memory

SMP:

Info: Benchmark time: 15 CPUs 0.0908161 s/step 0.525556 days/ns 725.562 MB
memory

Info: Benchmark time: 15 CPUs 0.0902649 s/step 0.522366 days/ns 727 MB
memory

Info: Benchmark time: 15 CPUs 0.0901796 s/step 0.521873 days/ns 733.023 MB
memory

2 nodes:

MPI: 157 Mb/s, 32.8% system

Info: Benchmark time: 32 CPUs 0.0746501 s/step 0.432003 days/ns 232.23 MB
memory

Info: Benchmark time: 32 CPUs 0.0743704 s/step 0.430384 days/ns 232.703 MB
memory

Info: Benchmark time: 32 CPUs 0.0738113 s/step 0.427149 days/ns 232.773 MB
memory

UDP: 200 Mb/s, 40% user, 5% system, 54% idle

Info: Benchmark time: 32 CPUs 0.124091 s/step 0.71812 days/ns 57.4477 MB
memory

Info: Benchmark time: 32 CPUs 0.123746 s/step 0.716121 days/ns 57.4098 MB
memory

Info: Benchmark time: 32 CPUs 0.125931 s/step 0.728767 days/ns 57.6321 MB
memory

SMP: 100 Mb/s

Info: Benchmark time: 30 CPUs 0.0875021 s/step 0.506378 days/ns 577.547 MB
memory

Info: Benchmark time: 30 CPUs 0.0884966 s/step 0.512133 days/ns 584.617 MB
memory

Info: Benchmark time: 30 CPUs 0.0880416 s/step 0.5095 days/ns 582.617 MB
memory

3 nodes:

MPI: 268 Mb/s, 50% system

Info: Benchmark time: 48 CPUs 0.0833897 s/step 0.482579 days/ns 228.672 MB
memory

Info: Benchmark time: 48 CPUs 0.0728247 s/step 0.421439 days/ns 228.672 MB
memory

Info: Benchmark time: 48 CPUs 0.0776507 s/step 0.449367 days/ns 229.07 MB
memory

UDP: 270 Mb/s, 28% user, 5% system, 66% idle

Info: Benchmark time: 48 CPUs 0.133027 s/step 0.769833 days/ns 55.1507 MB
memory

Info: Benchmark time: 48 CPUs 0.135996 s/step 0.787013 days/ns 55.2202 MB
memory

Info: Benchmark time: 48 CPUs 0.135308 s/step 0.783031 days/ns 55.2494 MB
memory
SMP: 220 Mb/s

Info: Benchmark time: 45 CPUs 0.0661309 s/step 0.382702 days/ns 520.734 MB
memory

Info: Benchmark time: 45 CPUs 0.0663962 s/step 0.384238 days/ns 520.625 MB
memory

Info: Benchmark time: 45 CPUs 0.066104 s/step 0.382547 days/ns 524.559 MB
memory

4 nodes:

MPI: 233 Mb/s, 68.1% system

Info: Benchmark time: 64 CPUs 0.1216 s/step 0.703706 days/ns 229.785 MB
memory

Info: Benchmark time: 64 CPUs 0.116776 s/step 0.675788 days/ns 229.785 MB
memory

Info: Benchmark time: 64 CPUs 0.118104 s/step 0.683472 days/ns 229.785 MB
memory

UDP: 340 Mb/s, 24% user, 5% system, 70% idle

Info: Benchmark time: 64 CPUs 0.137098 s/step 0.793394 days/ns 53.4818 MB
memory

Info: Benchmark time: 64 CPUs 0.138207 s/step 0.799812 days/ns 53.4665 MB
memory

Info: Benchmark time: 64 CPUs 0.137856 s/step 0.797777 days/ns 53.4743 MB
memory

SMP: 280 Mb/s

Info: Benchmark time: 60 CPUs 0.0707821 s/step 0.409619 days/ns 494.953 MB
memory

Info: Benchmark time: 60 CPUs 0.0704617 s/step 0.407765 days/ns 498.125 MB
memory

Info: Benchmark time: 60 CPUs 0.0700849 s/step 0.405584 days/ns 502.969 MB
memory

This SMP version is based on MPI. Next, I'll try an SMP build based on plain
charm++; a rough sketch of the run line I have in mind is below.
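
Something along these lines (only a sketch; I still need to check whether +p
should count worker threads or processes when launching an SMP build through
charmrun):
"charmrun ++nodelist ./nodelist +p45 {path to charm++-based SMP namd2}
+ppn 15 +idlepoll {configuration file} > {log file}"
The +idlepoll flag is added per Norman's suggestion below, since
charmrun-launched binaries don't busy-poll the network by default.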
BTW, do all gigabit switches have the same switching latency? If so, that may
be why I saw no performance change after moving to a new gigabit switch.

Neil Zhou

2013/12/18 Norman Geist <norman.geist_at_uni-greifswald.de>

> Hi again,
>
> I will look through the settings you posted when I get some time. A quick
> look didn't show anything obvious. So what I already posted earlier, and
> what Axel also suggests:
>
> > but notice that 16 cores per node is really heavy for 1 Gbit/s Ethernet,
> > and you might want to consider spending some money on an HPC network like
> > InfiniBand or at least 10 Gbit/s Ethernet.
>
> could become the painful truth. You should also really try using an SMP
> binary to reduce the number of network communicators, as Axel posted. You
> might notice slower timings for single-node cases, but maybe an improvement
> for multiple nodes.
>
> What you see regarding system CPU is the default "idlepoll" behavior of
> the MPI you used formerly (I guess Open MPI), which improves latency. You
> can and should reproduce that behavior by adding +idlepoll to namd2 when
> using it with charmrun. This usually makes a great difference.
>
> Norman Geist.
>
>
