Re: 50% system CPU usage when parallel running NAMD on Rocks cluster

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Dec 03 2013 - 08:47:54 CST

Your switch is too slow at switching. Try something like the Netgear GS748T; it is not that expensive and scales “ok”. You can temporarily improve the situation by trying the TCP congestion control algorithm “highspeed”. Set it via sysctl on all the nodes.
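A minimal sketch of what that setting looks like (assuming the kernel on the nodes ships the tcp_highspeed module; the sysctl key is standard, the rest is illustrative):

  # load the congestion control module and activate it (as root, on every node)
  modprobe tcp_highspeed
  sysctl -w net.ipv4.tcp_congestion_control=highspeed

  # verify what is available and what is currently active
  sysctl net.ipv4.tcp_available_congestion_control
  sysctl net.ipv4.tcp_congestion_control

  # to persist across reboots, add to /etc/sysctl.conf:
  #   net.ipv4.tcp_congestion_control = highspeed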

 

Additionally, are these 16 cores per node physical or logical (HT)? If they are HT, leave them out; there is no speed gain, only more network load.
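A quick way to check whether those 16 cores include HT siblings (a sketch; on older Rocks installs without lscpu, /proc/cpuinfo gives the same answer):

  # "Thread(s) per core: 2" means HT is enabled
  lscpu | grep -E "Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)"

  # or compare logical "siblings" with physical "cpu cores" per package
  grep -E "siblings|cpu cores" /proc/cpuinfo | sort -u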

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of ???
Sent: Tuesday, December 3, 2013 14:43
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: 50% system CPU usage when parallel running NAMD on Rocks cluster

 

Dear all,

I’m tuning NAMD performance on a 7-compute-node Rocks cluster. The problem is that when running NAMD (100,000 atoms) with 32 cores (on 2 nodes), the system CPU usage is about 50%. Increasing the core count (48 cores on 3 nodes) increases system CPU usage and decreases speed.

The detailed specification of one compute node is shown below:

CPU: 2 × Intel Xeon E5-2670 (8 cores / 2.6 GHz)

Mem: 64 GB (1600 MHz)

Hard drive: 300 GB (15,000 RPM)

Network card: Intel Gigabit Ethernet Network Connection

Switch: 3Com Switch 2824 3C16479 (24-port unmanaged gigabit) (a pretty old switch :| )

 

Compiling & running:

Charm-6.4.0 was built with the “./build charm++ mpi-linux-x86_64 mpicxx -j16 --with-production” options. Some errors were ignored when compiling it, for example:

“Fatal Error by charmc in directory /apps/apps/namd/2.9/charm-6.4.0/mpi-linux-x86_64-mpicxx/tmp

   Command mpif90 -auto -fPIC -I../bin/../include -O -c pup_f.f90 -o pup_f.o returned error code 1

charmc exiting...”.
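That failure comes from the Fortran PUP module, which usually means the MPI Fortran wrapper (mpif90) is missing or not configured; NAMD is C++ and does not use Charm++’s Fortran bindings, so it is typically harmless for a NAMD build, but a quick check rules out a broken MPI install (illustrative, assuming Open MPI’s wrapper compilers):

  # is mpif90 on the PATH, and which compiler does it wrap?
  which mpif90
  mpif90 --showme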

NAMD was compiled with the Linux-x86_64-g++ option. Some warnings were shown when compiling NAMD.

Open MPI (from the Rocks HPC roll) was used to run NAMD. The command is:

“mpirun -np {number of cores} -machinefile hosts /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}”
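For a concrete 32-core run, and to keep ranks pinned to cores, a hedged variant would be (flag spellings vary by Open MPI version; --bycore/--bind-to-core are the 1.4/1.6-era names, newer releases use --map-by core/--bind-to core):

  mpirun -np 32 -machinefile hosts --bycore --bind-to-core \
      /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}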

SGE (Sun Grid Engine) was also used. The job submission command is:

“qsub -pe orte {number of cores} {job submission script}”

The job submission script contains:

#!/bin/bash

#

#$ -cwd

#$ -j y

#$ -S /bin/bash

/opt/openmpi/bin/mpirun /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}
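With the orte parallel environment, Open MPI’s Grid Engine integration normally picks up the slot count and host list on its own, which is why the script above gives no -np or -machinefile. If that integration were not active, the slot count SGE exports could be passed explicitly (a sketch using the standard $NSLOTS variable; assuming the HPC-roll Open MPI was built with SGE support):

  /opt/openmpi/bin/mpirun -np $NSLOTS /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}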

 

Performance test:

The test system contains about 100,000 atoms. Running with mpirun, I got the following benchmark data:

1 node, 16 cores:

Info: Benchmark time: 16 CPUs 0.123755 s/step 0.716176 days/ns 230.922 MB memory

CPU usage:

Tasks: 344 total, 17 running, 327 sleeping, 0 stopped, 0 zombie

Cpu0 : 85.0%us, 15.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

....

 

2 nodes, 32 cores:

Info: Benchmark time: 32 CPUs 0.101423 s/step 0.586941 days/ns 230.512 MB memory

CPU usage:

Tasks: 344 total, 9 running, 335 sleeping, 0 stopped, 0 zombie

Cpu0 : 56.3%us, 43.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

....

 

3 nodes, 48 cores:

Info: Benchmark time: 48 CPUs 0.125787 s/step 0.727932 days/ns 228.543 MB memory

CPU usage:

Tasks: 344 total, 9 running, 335 sleeping, 0 stopped, 0 zombie

Cpu0 : 39.3%us, 60.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

....

 

The problem is obvious: using 48 cores (on 3 nodes) is slower than 16 cores (on 1 node). Note that the number of running processes varies while NAMD runs; some processes are sleeping. :///

 

Other information (48 cores, 3 nodes):

vmstat 1 10

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------

 r b swpd free buff cache si so bi bo in cs us sy id wa st

17 0 0 64395660 204864 389380 0 0 0 1 7 1 3 2 95 0 0

17 0 0 64399256 204864 389384 0 0 0 0 11367 2175 37 63 0 0 0

17 0 0 64403612 204864 389384 0 0 0 0 11497 2213 38 62 0 0 0

17 0 0 64397588 204864 389384 0 0 0 0 11424 2215 38 62 0 0 0

17 0 0 64396108 204864 389384 0 0 0 0 11475 2262 37 63 0 0 0

17 0 0 64400460 204868 389384 0 0 0 364 11432 2227 37 63 0 0 0

17 0 0 64401452 204868 389384 0 0 0 0 11439 2204 38 62 0 0 0

17 0 0 64405408 204868 389384 0 0 0 0 11400 2230 37 63 0 0 0

17 0 0 64396108 204868 389384 0 0 0 0 11424 2245 39 61 0 0 0

17 0 0 64395276 204868 389384 0 0 0 0 11396 2289 38 62 0 0 0

 

mpstat -P ALL 1 10

Average: CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s

Average: all 37.27 0.00 61.80 0.00 0.03 0.90 0.00 0.00 11131.34

Average: 0 38.32 0.00 61.48 0.00 0.00 0.20 0.00 0.00 999.00

Average: 1 36.60 0.00 63.20 0.00 0.00 0.20 0.00 0.00 0.00

Average: 2 38.26 0.00 61.64 0.00 0.00 0.10 0.00 0.00 0.00

Average: 3 36.03 0.00 63.77 0.00 0.00 0.20 0.00 0.00 0.00

Average: 4 38.16 0.00 61.64 0.00 0.00 0.20 0.00 0.00 0.00

Average: 5 38.00 0.00 61.90 0.00 0.00 0.10 0.00 0.00 0.00

Average: 6 37.06 0.00 62.74 0.00 0.00 0.20 0.00 0.00 0.00

Average: 7 38.26 0.00 61.54 0.00 0.00 0.20 0.00 0.00 0.00

Average: 8 36.36 0.00 63.44 0.00 0.00 0.20 0.00 0.00 8.08

Average: 9 36.26 0.00 63.54 0.00 0.00 0.20 0.00 0.00 0.00

Average: 10 38.36 0.00 61.54 0.00 0.00 0.10 0.00 0.00 0.00

Average: 11 35.56 0.00 61.84 0.00 0.10 2.50 0.00 0.00 1678.64

Average: 12 35.66 0.00 61.34 0.00 0.10 2.90 0.00 0.00 1823.35

Average: 13 37.34 0.00 60.36 0.00 0.00 2.30 0.00 0.00 2115.77

Average: 14 36.90 0.00 60.40 0.00 0.10 2.60 0.00 0.00 2790.02

Average: 15 38.96 0.00 58.44 0.00 0.10 2.50 0.00 0.00 1716.67

 

iostat 1

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

sda 19.00 0.00 200.00 0 200

sda1 19.00 0.00 200.00 0 200

sda2 0.00 0.00 0.00 0 0

sda3 0.00 0.00 0.00 0 0

sda4 0.00 0.00 0.00 0 0

sda5 0.00 0.00 0.00 0 0

 

avg-cpu: %user %nice %system %iowait %steal %idle

          39.10 0.00 60.90 0.00 0.00 0.00

 

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

sda 0.00 0.00 0.00 0 0

sda1 0.00 0.00 0.00 0 0

sda2 0.00 0.00 0.00 0 0

sda3 0.00 0.00 0.00 0 0

sda4 0.00 0.00 0.00 0 0

sda5 0.00 0.00 0.00 0 0

 

 

The speed is better when I use SGE (Sun Grid Engine) to submit the NAMD job.

1 node, 16 cores:

Info: Benchmark time: 16 CPUs 0.125926 s/step 0.728737 days/ns 230.543 MB memory

CPU usage:

Tasks: 346 total, 11 running, 335 sleeping, 0 stopped, 0 zombie

Cpu0 : 87.5%us, 12.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

....

 

2 nodes, 32 cores:

Info: Benchmark time: 32 CPUs 0.0742307 s/step 0.429576 days/ns 228.188 MB memory

CPU usage:

Tasks: 341 total, 8 running, 333 sleeping, 0 stopped, 0 zombie

Cpu0 : 72.0%us, 27.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st

....

 

3 nodes, 48 cores:

Info: Benchmark time: 48 CPUs 0.0791372 s/step 0.45797 days/ns 174.879 MB memory

CPU usage:

Tasks: 324 total, 12 running, 312 sleeping, 0 stopped, 0 zombie

Cpu0 : 45.8%us, 53.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st

....

 

In summary, the benchmark data are:

mpirun:

1 node, 16 cores: 0.716176 days/ns, 15% system CPU usage

2 nodes, 32 cores: 0.586941 days/ns, 45% system CPU usage

3 nodes, 48 cores: 0.727932 days/ns, 60% system CPU usage

SGE:

1 node, 16 cores: 0.728737 days/ns, 15% system CPU usage

2 nodes, 32 cores: 0.429576 days/ns, 35% system CPU usage

3 nodes, 48 cores: 0.45797 days/ns, 50% system CPU usage

The number of running processes varies with both mpirun and SGE. The maximum data transfer rate is about 200 MB/s in these benchmarks.

As you can see, the scaling is bad: system CPU usage increases as more cores are used. I don't know why; maybe it has something to do with our switch.
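One way to separate a switch/NIC bottleneck from a NAMD or MPI problem is to measure raw node-to-node throughput and latency while no job is running; a sketch with iperf and ping (node names are placeholders for two of the compute nodes):

  # on the first node (server side)
  iperf -s

  # on the second node: one stream, then four parallel streams
  iperf -c compute-0-0 -t 30
  iperf -c compute-0-0 -t 30 -P 4

  # small-packet round-trip latency
  ping -c 100 -s 64 compute-0-0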

If you know anything about the problem, please tell me. I really appreciate your help!

 

Neil Zhou

School of Life Science, Tsinghua University, Beijing

China

 

