50% system CPU usage when parallel running NAMD on Rocks cluster

From: Neil Zhou (malrot13_at_gmail.com)
Date: Tue Dec 03 2013 - 07:43:07 CST

Dear all,
I'm tuning NAMD performance on a seven-compute-node Rocks cluster. The problem
is that when running NAMD (100,000 atoms) on 32 cores (2 nodes), system CPU
usage is about 50%. Adding more cores (48 cores on 3 nodes) increases system
CPU usage further and decreases speed.
The detailed specification of one compute node is shown below:
CPU: 2 * Intel Xeon E5-2670 (8 cores / 2.6 GHz)
Mem: 64 GB (1600 MHz)
Hard drive: 300 GB (15,000 rpm)
Network card: Intel Gigabit Ethernet Network Connection
Switch: 3Com Switch 2824 3C16479 (24-port unmanaged gigabit) (a pretty old
switch :| )
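A quick sanity check of the links themselves (assuming the cluster traffic
goes over eth0; the interface name may differ on other setups) is:

# Negotiated speed and duplex of the compute-node NIC (eth0 assumed)
/sbin/ethtool eth0 | grep -E 'Speed|Duplex'
# Driver-level error/drop counters; field names vary by NIC driver
/sbin/ethtool -S eth0 | grep -iE 'err|drop'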

Compiling & running:
Charm-6.4.0 was built with "./build charm++ mpi-linux-x86_64 mpicxx -j16
--with-production". Some errors were ignored when compiling it. For
example:
“Fatal Error by charmc in directory
/apps/apps/namd/2.9/charm-6.4.0/mpi-linux-x86_64-mpicxx/tmp
   Command mpif90 -auto -fPIC -I../bin/../include -O -c pup_f.f90 -o
pup_f.o returned error code 1
charmc exiting...”.
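As far as I can tell, this particular failure is only in the Fortran bindings
(pup_f.f90), which NAMD does not use. The charm++ build can still be
sanity-checked before compiling NAMD, using the megatest program shipped in
the charm source tree (the path below assumes the build directory quoted in
the error above):

# Build and run the charm++ self-test on 4 MPI ranks
cd /apps/apps/namd/2.9/charm-6.4.0/mpi-linux-x86_64-mpicxx/tests/charm++/megatest
make pgm
mpirun -np 4 ./pgm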
NAMD was compiled with the Linux-x86_64-g++ option. Some warnings were shown
while compiling NAMD.
Openmpi (from the HPC roll of Rocks) was used to run NAMD. The command is:

mpirun -np {number of cores} -machinefile hosts /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}
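Here hosts is a plain machinefile listing one node per line; a minimal sketch
(compute-0-* are the default Rocks node names, ours may differ):

# hosts: slots caps how many ranks Open MPI places on each node
compute-0-0 slots=16
compute-0-1 slots=16
compute-0-2 slots=16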
SGE (Sun Grid Engine) was also used. The job submission command is:

qsub -pe orte {number of cores} {job submission script}

The job submission script contains:
#!/bin/bash
#
# Run from the submission directory, merge stderr into stdout, use bash:
#$ -cwd
#$ -j y
#$ -S /bin/bash
/opt/openmpi/bin/mpirun /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}
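Note that the script passes no -np or -machinefile: the Open MPI shipped in
the Rocks HPC roll is built with SGE support, so under the orte parallel
environment mpirun takes the rank count and host list from the slots SGE
grants. Assuming the script above is saved as submit_namd.sh (a name I am
using for illustration), a 32-core run is submitted as:

# SGE grants 32 slots; mpirun inside the script inherits the allocation
qsub -pe orte 32 submit_namd.sh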

Performance test:
The test system contains about 100,000 atoms. Running (via mpirun) on 1 node
with 16 cores, I got the following benchmark data:
1 node, 16 cores:
Info: Benchmark time: 16 CPUs 0.123755 s/step 0.716176 days/ns 230.922 MB
memory
CPU usage:
Tasks: 344 total, 17 running, 327 sleeping, 0 stopped, 0 zombie
Cpu0 : 85.0%us, 15.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
 0.0%st
..

2 nodes, 32 cores:
Info: Benchmark time: 32 CPUs 0.101423 s/step 0.586941 days/ns 230.512 MB
memory
CPU usage:
Tasks: 344 total, 9 running, 335 sleeping, 0 stopped, 0 zombie
Cpu0 : 56.3%us, 43.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
 0.0%st
..

3 nodes, 48 cores:
Info: Benchmark time: 48 CPUs 0.125787 s/step 0.727932 days/ns 228.543 MB
memory
CPU usage:
Tasks: 344 total, 9 running, 335 sleeping, 0 stopped, 0 zombie
Cpu0 : 39.3%us, 60.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
 0.0%st
..

The problem is obvious: with 48 cores (3 nodes) the run is slower than with
16 cores (1 node). Note that the number of running processes varies while
NAMD runs; some processes are sleeping. :///
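To attribute the system time to specific processes, pidstat from the sysstat
package can break it down per process; a minimal sketch, run on one compute
node during the benchmark:

# Per-process user vs. system CPU, one sample per second for 5 seconds;
# filter for the NAMD ranks
pidstat -u 1 5 | grep namd2

If the high %system sits on the namd2 ranks themselves, that points at
kernel-side TCP work (and busy-wait polling by MPI) rather than at some
other daemon.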

Other information (on 48 cores, 3 nodes):
vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd     free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
17  0      0 64395660 204864 389380    0    0     0     1     7     1  3  2 95  0  0
17  0      0 64399256 204864 389384    0    0     0     0 11367  2175 37 63  0  0  0
17  0      0 64403612 204864 389384    0    0     0     0 11497  2213 38 62  0  0  0
17  0      0 64397588 204864 389384    0    0     0     0 11424  2215 38 62  0  0  0
17  0      0 64396108 204864 389384    0    0     0     0 11475  2262 37 63  0  0  0
17  0      0 64400460 204868 389384    0    0     0   364 11432  2227 37 63  0  0  0
17  0      0 64401452 204868 389384    0    0     0     0 11439  2204 38 62  0  0  0
17  0      0 64405408 204868 389384    0    0     0     0 11400  2230 37 63  0  0  0
17  0      0 64396108 204868 389384    0    0     0     0 11424  2245 39 61  0  0  0
17  0      0 64395276 204868 389384    0    0     0     0 11396  2289 38 62  0  0  0

mpstat -P ALL 1 10
Average: CPU  %user %nice  %sys %iowait  %irq %soft %steal %idle   intr/s
Average: all  37.27  0.00 61.80    0.00  0.03  0.90   0.00  0.00 11131.34
Average:   0  38.32  0.00 61.48    0.00  0.00  0.20   0.00  0.00   999.00
Average:   1  36.60  0.00 63.20    0.00  0.00  0.20   0.00  0.00     0.00
Average:   2  38.26  0.00 61.64    0.00  0.00  0.10   0.00  0.00     0.00
Average:   3  36.03  0.00 63.77    0.00  0.00  0.20   0.00  0.00     0.00
Average:   4  38.16  0.00 61.64    0.00  0.00  0.20   0.00  0.00     0.00
Average:   5  38.00  0.00 61.90    0.00  0.00  0.10   0.00  0.00     0.00
Average:   6  37.06  0.00 62.74    0.00  0.00  0.20   0.00  0.00     0.00
Average:   7  38.26  0.00 61.54    0.00  0.00  0.20   0.00  0.00     0.00
Average:   8  36.36  0.00 63.44    0.00  0.00  0.20   0.00  0.00     8.08
Average:   9  36.26  0.00 63.54    0.00  0.00  0.20   0.00  0.00     0.00
Average:  10  38.36  0.00 61.54    0.00  0.00  0.10   0.00  0.00     0.00
Average:  11  35.56  0.00 61.84    0.00  0.10  2.50   0.00  0.00  1678.64
Average:  12  35.66  0.00 61.34    0.00  0.10  2.90   0.00  0.00  1823.35
Average:  13  37.34  0.00 60.36    0.00  0.00  2.30   0.00  0.00  2115.77
Average:  14  36.90  0.00 60.40    0.00  0.10  2.60   0.00  0.00  2790.02
Average:  15  38.96  0.00 58.44    0.00  0.10  2.50   0.00  0.00  1716.67

iostat 1
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 19.00 0.00 200.00 0 200
sda1 19.00 0.00 200.00 0 200
sda2 0.00 0.00 0.00 0 0
sda3 0.00 0.00 0.00 0 0
sda4 0.00 0.00 0.00 0 0
sda5 0.00 0.00 0.00 0 0

avg-cpu: %user %nice %system %iowait %steal %idle
          39.10 0.00 60.90 0.00 0.00 0.00

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0.00 0.00 0.00 0 0
sda1 0.00 0.00 0.00 0 0
sda2 0.00 0.00 0.00 0 0
sda3 0.00 0.00 0.00 0 0
sda4 0.00 0.00 0.00 0 0
sda5 0.00 0.00 0.00 0 0

The disks are essentially idle, so storage is not the bottleneck. The speed
is better when I use SGE (Sun Grid Engine) to submit the NAMD job.
1 node, 16 cores:
Info: Benchmark time: 16 CPUs 0.125926 s/step 0.728737 days/ns 230.543 MB
memory
CPU usage:
Tasks: 346 total, 11 running, 335 sleeping, 0 stopped, 0 zombie
Cpu0 : 87.5%us, 12.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
 0.0%st
..

2 nodes, 32 cores:
Info: Benchmark time: 32 CPUs 0.0742307 s/step 0.429576 days/ns 228.188 MB
memory
CPU usage:
Tasks: 341 total, 8 running, 333 sleeping, 0 stopped, 0 zombie
Cpu0 : 72.0%us, 27.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si,
 0.0%st
..

3 nodes, 48 cores:
Info: Benchmark time: 48 CPUs 0.0791372 s/step 0.45797 days/ns 174.879 MB
memory
CPU usage:
Tasks: 324 total, 12 running, 312 sleeping, 0 stopped, 0 zombie
Cpu0 : 45.8%us, 53.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si,
 0.0%st
..

In general, the benchmark data is:
mpirun:
1 node,  16 cores: 0.716176 days/ns, 15% system CPU usage
2 nodes, 32 cores: 0.586941 days/ns, 45% system CPU usage
3 nodes, 48 cores: 0.727932 days/ns, 60% system CPU usage
SGE:
1 node,  16 cores: 0.728737 days/ns, 15% system CPU usage
2 nodes, 32 cores: 0.429576 days/ns, 35% system CPU usage
3 nodes, 48 cores: 0.45797 days/ns,  50% system CPU usage
The number of running processes varies under both mpirun and SGE. The maximum
data transfer rate is about 200 MB/s in these benchmarks.
As you can see, the scaling is bad; system CPU usage increases as more cores
are used. I don't know why. Maybe it has something to do with our
switch.
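One way I can think of to test the switch theory: compare a compute node's
TCP retransmission counters before and after a short benchmark; a climbing
retransmit count would point at congestion or a failing switch. A sketch
(counter names vary a little between kernels):

# Snapshot TCP retransmissions before the run...
netstat -s | grep -i retrans
# ...start the 48-core benchmark, wait a minute, then snapshot again
netstat -s | grep -i retrans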
If you know anything about the problem, please tell me. I really appreciate
your help!

Neil Zhou
School of Life Science, Tsinghua University, Beijing
China
