Re: Mysterious slow down in parallel

From: Joseph Farran (jfarran_at_uci.edu)
Date: Sun Oct 27 2013 - 00:20:02 CDT

Hi Roy.

If I may make a suggestion: I don't know NAMD, but I run a cluster with a *lot* of users running it.

We have around 50 AMD 64-core nodes (going to 100 nodes soon), and they work great with NAMD-multicore (single job per node), since NAMD runs with threads.

So if you can break your problem into chunks that can run on 64 cores, this will be the *fastest* way you can run. With threads, the communication stays inside the node. The communication part is the killer, which is why the more nodes you add, the slower it can get.
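For example, a single-node multicore launch looks something like this (just a sketch; I am assuming the multicore namd2 binary is on your path and your input file is called your_sim.conf):

    # all 64 threads stay on one node, so there is no network traffic at all
    namd2 +p64 +setcpuaffinity your_sim.conf > your_sim.log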

Joseph


On 10/26/2013 8:06 PM, Roy Fernando wrote:
Hi Axel,

Thanks for the reply.

So, do you mean that, given the size of my system, the speed I get with 2 nodes and 8 processors per node is about the maximum I can achieve? This is about a nanosecond/day.

I did not quite understand what you meant by "when you add more processors than you have 'parallel work units'". This system has 80 patches, and I thought a patch is a single parallel work unit. I only saw a speedup up to 16 processors. (I also observed the same pattern when I did a test run with a system 10 times smaller than this one.)

Do you think there is any process I can follow to improve this situation?

Roy


On Sat, Oct 26, 2013 at 6:04 PM, Axel Kohlmeyer <akohlmey@gmail.com> wrote:



On Sat, Oct 26, 2013 at 5:05 PM, Roy Fernando <roy.nandos@gmail.com> wrote:
Dear NAMD experts,

I recently started running NAMD on a cluster, and I initially experimented with my system to determine the best combination of nodes and processors for my simulation. I only ran for a short time interval.

The cluster contains 30 nodes each containing 8 cores.

I noticed a significant speedup going from a single processor to 8 processors on a single node. Then I chose 2 nodes (16 processors) and observed another speedup. But when I increased the number of nodes to 3 or 4, the simulation slowed down drastically.

Can somebody please suggest why the simulations slow down? I highly appreciate your input.

this kind of slowdown is not mysterious at all. it happens to almost all parallel programs when the overhead from exchanging information between individual processors becomes significant relative to the amount of computational work. also, when you add more processors than you have "parallel work units" you cannot see any speedup, and most of the time you will see a slowdown. how much of a slowdown depends on the problem at hand and the kind of network that you have, in particular its latency and bandwidth.

in this specific case, you are doing classical MD, which has rather low computational complexity, and your system is not very large, so you don't have a lot of work units, and it looks like you are using TCP/IP communication, which has very high latency.

NAMD, through the charm++ library, can hide high communication latency quite well, but only up to a point. the more processors you add, the larger the combined latency becomes, and at the same time there is correspondingly less computational work to hide it behind.
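as a rough check, here is what the wall times from your table below work out to, taking the 1-processor run as the baseline (plain shell arithmetic, nothing NAMD-specific):

    echo "scale=2; 2866/539"  | bc   # 1 node,  8 cores:  roughly 5.3x speedup
    echo "scale=2; 2866/316"  | bc   # 2 nodes, 16 cores: roughly 9x speedup
    echo "scale=2; 2866/4500" | bc   # 3 nodes, 24 cores: roughly 0.6x, i.e. slower than one core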

axel.
 

Roy

Following is the table I made including these details.

Job #     Nodes    ppn    Startup (s)    Wall time (s)
571825      1        1        7.5            2866
569         1        8        9               539
470         2        8        2.4             316
498         2        8        3               323
494         3        8        -              4500
500         4        8       16              4793

I submitted the job using the following command line:
qsub -l nodes=<#nodes>:ppn=<#processors>,walltime=<expected_wall_time> <job_file_name>

and the following is the content of my job file:
---------------------------------------------------------------------------------------------------------------------------------------------------
#!/bin/sh -l
# Change to the directory from which you originally submitted this job.
cd $PBS_O_WORKDIR
CONV_RSH=ssh
export CONV_RSH
# CONV_DAEMON=""
# export CONV_DAEMON
module load namd

NODES=`cat $PBS_NODEFILE`
NODELIST="$RCAC_SCRATCH/namd2-$PBS_JOBID.nodelist"
echo group main > "$NODELIST"

# charmrun "$NAMD_HOME/namd2" ++verbose +p$NUMPROCS ++nodelist "$NODELIST" ubq_wb_eq.conf
charmrun "$NAMD_HOME/namd2" ++verbose +p16 ++nodelist "$NODELIST" SOD_wb_eq0.conf
module unload namd
--------------------------------------------------------------------------------------------------------------------------------------------------------------
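One thing I am not sure about: the NODES variable is read but never written into the nodelist file. From the examples I have seen, the nodelist for charmrun is usually built along these lines (a sketch, assuming one host line per entry in $PBS_NODEFILE):

    NODES=`cat $PBS_NODEFILE`
    echo group main > "$NODELIST"
    # one "host" line per core listed in $PBS_NODEFILE
    for node in $NODES; do
        echo host $node >> "$NODELIST"
    done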

Following is my structure summary:

Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 50198 ATOMS
Info: 35520 BONDS
Info: 25502 ANGLES
Info: 15756 DIHEDRALS
Info: 1042 IMPROPERS
Info: 380 CROSSTERMS
Info: 0 EXCLUSIONS
Info: 47188 RIGID BONDS
Info: 103406 DEGREES OF FREEDOM
Info: 17790 HYDROGEN GROUPS
Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
Info: 17790 MIGRATION GROUPS
Info: 4 ATOMS IN LARGEST MIGRATION GROUP
Info: TOTAL MASS = 308670 amu
Info: TOTAL CHARGE = -8 e
Info: MASS DENSITY = 0.946582 g/cm^3
Info: ATOM DENSITY = 0.0927022 atoms/A^3
Info: *****************************

Info: Entering startup at 7.15922 s, 14.8091 MB of memory in use
Info: Startup phase 0 took 0.0303071 s, 14.8092 MB of memory in use
Info: Startup phase 1 took 0.068871 s, 23.5219 MB of memory in use
Info: Startup phase 2 took 0.0307088 s, 23.9375 MB of memory in use
Info: Startup phase 3 took 0.0302751 s, 23.9374 MB of memory in use
Info: PATCH GRID IS 4 (PERIODIC) BY 4 (PERIODIC) BY 5 (PERIODIC)
Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
Info: REMOVING COM VELOCITY 0.0178943 -0.00579233 -0.00948207
Info: LARGEST PATCH (29) HAS 672 ATOMS
Info: Startup phase 4 took 0.0571079 s, 31.7739 MB of memory in use
Info: PME using 1 and 1 processors for FFT and reciprocal sum.
Info: PME USING 1 GRID NODES AND 1 TRANS NODES
Info: PME GRID LOCATIONS: 0
Info: PME TRANS LOCATIONS: 0
Info: Optimizing 4 FFT steps.  1... 2... 3... 4...   Done.
Info: Startup phase 5 took 0.0330172 s, 34.1889 MB of memory in use
Info: Startup phase 6 took 0.0302858 s, 34.1888 MB of memory in use
LDB: Central LB being created...
Info: Startup phase 7 took 0.030385 s, 34.1902 MB of memory in use
Info: CREATING 1526 COMPUTE OBJECTS
Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
Info: NONBONDED TABLE SIZE: 769 POINTS
Info: Startup phase 8 took 0.0399361 s, 39.2458 MB of memory in use
Info: Startup phase 9 took 0.030345 s, 39.2457 MB of memory in use
Info: Startup phase 10 took 0.000467062 s, 49.472 MB of memory in use
Info: Finished startup at 7.54093 s, 49.472 MB of memory in use




--
Dr. Axel Kohlmeyer  akohlmey@gmail.com  http://goo.gl/1wk0
International Centre for Theoretical Physics, Trieste. Italy.

