From: Neil Zhou (malrot13_at_gmail.com)
Date: Tue Dec 03 2013 - 07:43:07 CST
Dear all,
I'm tuning NAMD performance on a 7-compute-node Rocks cluster. The problem is
that when running NAMD (100,000 atoms) on 32 cores (2 nodes), system CPU usage
is about 50%. Adding more cores (48 cores on 3 nodes) increases system CPU
usage further and decreases speed.
Details for one compute node are shown below:
CPU:  2 * Intel Xeon E5-2670 (8 cores / 2.6 GHz)
Mem:  64 GB (1600 MHz)
Hard drive: 300 GB (15,000 RPM)
Network card: Intel Gigabit Ethernet Network Connection
Switch: 3Com Switch 2824 3C16479 (24-port unmanaged gigabit) (a pretty old
switch :| )
Compiling & running :
Charm-6.4.0 was build with “./build charm++ mpi-linux-x86_64   mpicxx  -j16
 --with-production” options. Some error was ignored when compiling it. For
example:
“Fatal Error by charmc in directory
/apps/apps/namd/2.9/charm-6.4.0/mpi-linux-x86_64-mpicxx/tmp
   Command mpif90 -auto -fPIC -I../bin/../include -O -c pup_f.f90 -o
pup_f.o returned error code 1
charmc exiting...”.
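That error appears to come from the Fortran PUP bindings (pup_f.f90), which NAMD itself should not need. As a sanity check of the mpif90 wrapper that charmc invokes, a minimal compile test looks something like the sketch below (assuming Open MPI's wrapper compilers are on the PATH; the file names are just examples):

# does the Fortran wrapper exist, and which back-end compiler does it use?
which mpif90 && mpif90 --version
# try compiling and running a trivial Fortran program through the wrapper
cat > hello_mpif90.f90 <<'EOF'
program hello
  print *, 'mpif90 wrapper works'
end program hello
EOF
mpif90 hello_mpif90.f90 -o hello_mpif90 && ./hello_mpif90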
NAMD was compiled with the Linux-x86_64-g++ option. Some warnings were shown
while compiling NAMD.
OpenMPI (from the HPC roll of Rocks) was used to run NAMD. The command is:
mpirun -np {number of cores} -machinefile hosts /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}
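For completeness, the "hosts" machinefile in that command is just a list of node names with a slot count per node, something like the sketch below (the node names are hypothetical Rocks-style names; 16 slots matches the 16 cores per node):

compute-0-0 slots=16
compute-0-1 slots=16
compute-0-2 slots=16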
SGE (Sun Grid Engine) was also used. The job submission command is:
qsub -pe orte {number of cores} {job submission script}
The job submission script contains:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
/opt/openmpi/bin/mpirun /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}
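Put together, a full submission for 48 cores looks roughly like the sketch below; the -pe line and the explicit -np $NSLOTS are additions of mine here (NSLOTS is the slot count that SGE exports to the job), and if Open MPI was built with SGE support, mpirun can also discover the granted slots on its own:

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe orte 48
# NSLOTS is set by SGE to the number of slots granted by the parallel environment
/opt/openmpi/bin/mpirun -np $NSLOTS /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}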
Performance test:
The test system contains about 100,000 atoms. Running (using mpirun) on 1 node
with 16 cores, I got the following benchmark data:
1 node, 16 cores:
Info: Benchmark time: 16 CPUs 0.123755 s/step 0.716176 days/ns 230.922 MB memory
CPU usage:
Tasks: 344 total,  17 running, 327 sleeping,   0 stopped,   0 zombie
Cpu0  : 85.0%us, 15.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
..
2 nodes, 32 cores:
Info: Benchmark time: 32 CPUs 0.101423 s/step 0.586941 days/ns 230.512 MB memory
CPU usage:
Tasks: 344 total,   9 running, 335 sleeping,   0 stopped,   0 zombie
Cpu0  : 56.3%us, 43.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
..
3 nodes, 48 cores:
Info: Benchmark time: 48 CPUs 0.125787 s/step 0.727932 days/ns 228.543 MB memory
CPU usage:
Tasks: 344 total,   9 running, 335 sleeping,   0 stopped,   0 zombie
Cpu0  : 39.3%us, 60.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
..
The problem is obvious: when using 48 cores (on 3 nodes), the speed is slower
than with 16 cores (on 1 node). Note that the number of running processes
varies while NAMD runs; some processes are sleeping. :///
Other information (on 48 cores, 3 nodes):
vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
17  0      0 64395660 204864 389380    0    0     0     1     7     1  3  2 95  0  0
17  0      0 64399256 204864 389384    0    0     0     0 11367  2175 37 63  0  0  0
17  0      0 64403612 204864 389384    0    0     0     0 11497  2213 38 62  0  0  0
17  0      0 64397588 204864 389384    0    0     0     0 11424  2215 38 62  0  0  0
17  0      0 64396108 204864 389384    0    0     0     0 11475  2262 37 63  0  0  0
17  0      0 64400460 204868 389384    0    0     0   364 11432  2227 37 63  0  0  0
17  0      0 64401452 204868 389384    0    0     0     0 11439  2204 38 62  0  0  0
17  0      0 64405408 204868 389384    0    0     0     0 11400  2230 37 63  0  0  0
17  0      0 64396108 204868 389384    0    0     0     0 11424  2245 39 61  0  0  0
17  0      0 64395276 204868 389384    0    0     0     0 11396  2289 38 62  0  0  0
mpstat -P ALL 1 10
Average:     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
Average:     all   37.27    0.00   61.80    0.00    0.03    0.90    0.00    0.00  11131.34
Average:       0   38.32    0.00   61.48    0.00    0.00    0.20    0.00    0.00    999.00
Average:       1   36.60    0.00   63.20    0.00    0.00    0.20    0.00    0.00      0.00
Average:       2   38.26    0.00   61.64    0.00    0.00    0.10    0.00    0.00      0.00
Average:       3   36.03    0.00   63.77    0.00    0.00    0.20    0.00    0.00      0.00
Average:       4   38.16    0.00   61.64    0.00    0.00    0.20    0.00    0.00      0.00
Average:       5   38.00    0.00   61.90    0.00    0.00    0.10    0.00    0.00      0.00
Average:       6   37.06    0.00   62.74    0.00    0.00    0.20    0.00    0.00      0.00
Average:       7   38.26    0.00   61.54    0.00    0.00    0.20    0.00    0.00      0.00
Average:       8   36.36    0.00   63.44    0.00    0.00    0.20    0.00    0.00      8.08
Average:       9   36.26    0.00   63.54    0.00    0.00    0.20    0.00    0.00      0.00
Average:      10   38.36    0.00   61.54    0.00    0.00    0.10    0.00    0.00      0.00
Average:      11   35.56    0.00   61.84    0.00    0.10    2.50    0.00    0.00   1678.64
Average:      12   35.66    0.00   61.34    0.00    0.10    2.90    0.00    0.00   1823.35
Average:      13   37.34    0.00   60.36    0.00    0.00    2.30    0.00    0.00   2115.77
Average:      14   36.90    0.00   60.40    0.00    0.10    2.60    0.00    0.00   2790.02
Average:      15   38.96    0.00   58.44    0.00    0.10    2.50    0.00    0.00   1716.67
iostat 1
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              19.00         0.00       200.00          0        200
sda1             19.00         0.00       200.00          0        200
sda2              0.00         0.00         0.00          0          0
sda3              0.00         0.00         0.00          0          0
sda4              0.00         0.00         0.00          0          0
sda5              0.00         0.00         0.00          0          0
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          39.10    0.00   60.90    0.00    0.00    0.00
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0
sda1              0.00         0.00         0.00          0          0
sda2              0.00         0.00         0.00          0          0
sda3              0.00         0.00         0.00          0          0
sda4              0.00         0.00         0.00          0          0
sda5              0.00         0.00         0.00          0          0
The speed is better if I use SGE (Sun Grid Engine) to submit the NAMD job.
1 node, 16 cores:
Info: Benchmark time: 16 CPUs 0.125926 s/step 0.728737 days/ns 230.543 MB memory
CPU usage:
Tasks: 346 total,  11 running, 335 sleeping,   0 stopped,   0 zombie
Cpu0  : 87.5%us, 12.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
..
2 nodes, 32 cores:
Info: Benchmark time: 32 CPUs 0.0742307 s/step 0.429576 days/ns 228.188 MB memory
CPU usage:
Tasks: 341 total,   8 running, 333 sleeping,   0 stopped,   0 zombie
Cpu0  : 72.0%us, 27.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
..
3 nodes, 48 cores:
Info: Benchmark time: 48 CPUs 0.0791372 s/step 0.45797 days/ns 174.879 MB memory
CPU usage:
Tasks: 324 total,  12 running, 312 sleeping,   0 stopped,   0 zombie
Cpu0  : 45.8%us, 53.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
..
In general, the benchmark data are:
mpirun:
1 node,  16 cores   0.716176 days/ns   15% system CPU usage
2 nodes, 32 cores   0.586941 days/ns   45% system CPU usage
3 nodes, 48 cores   0.727932 days/ns   60% system CPU usage
SGE:
1 node,  16 cores   0.728737 days/ns   15% system CPU usage
2 nodes, 32 cores   0.429576 days/ns   35% system CPU usage
3 nodes, 48 cores   0.45797 days/ns    50% system CPU usage
The number of running processes varies with both mpirun and SGE. The maximum
data transfer rate is about 200 MB/s in these benchmarks.
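(For anyone checking the numbers: days/ns follows from s/step once a timestep is assumed; with a 2 fs timestep, i.e. 500,000 steps per ns, the 16-core mpirun result works out as below. The 2 fs value is my assumption, not something stated above.)

awk 'BEGIN { print 0.123755 * 500000 / 86400, "days/ns" }'   # prints about 0.716, matching the benchmark above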
As you can see, the scaling is bad: system CPU usage increases as more cores
are used. I don't know why; maybe it has something to do with our switch.
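One thing I can still try, to rule the switch and NICs in or out, is a raw TCP bandwidth test between two compute nodes, for example with iperf (assuming it is installed; the node names are hypothetical):

iperf -s                      # on the first compute node, e.g. compute-0-0
iperf -c compute-0-0 -t 30    # on a second node, run the test for 30 seconds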
If you know anything about the problem, please tell me. I really appreciate
your help!
Neil Zhou
School of Life Science, Tsinghua University, Beijing
China