Re: Mysterious slow down in parallel

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Sat Oct 26 2013 - 17:04:11 CDT

On Sat, Oct 26, 2013 at 5:05 PM, Roy Fernando <roy.nandos_at_gmail.com> wrote:

> Dear NAMD experts,
>
> I recently started running NAMD on a cluster and initially experimented with
> my system to determine the best combination of nodes and processors for my
> simulation on the cluster. I only ran for a short time interval.
>
> The cluster contains 30 nodes each containing 8 cores.
>
> I noticed a significant speedup going from a single processor to 8 processors
> on a single node. Then I chose 2 nodes (16 processors) and observed another
> speedup. But when I increased the number of nodes to 3 or 4, the simulation
> displayed a drastic slowdown.
>
> Can somebody please suggest why the simulations slow down? I highly
> appreciate your input.
>

this kind of slowdown is not mysterious at all. it happens to almost all
parallel programs when the overhead from exchanging information between
individual processors becomes significant relative to the amount of
computational work. also, when you add more processors than you have
"parallel work units", you cannot see any additional speedup, and most of
the time you will see a slowdown. how much of a slowdown depends on the
problem at hand and the kind of network that you have, in particular its
latency and bandwidth.
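
to make that concrete, here is a crude toy model (the constants are invented
and this is not NAMD's actual cost model): a fixed amount of work is split
across p cores, while a per-step communication overhead grows with p.

# toy scaling model with made-up constants; it only illustrates the trend,
# it does not describe NAMD's real communication pattern.
def step_time(p, work=1.0, comm_cost=0.004):
    return work / p + comm_cost * p   # ideal compute split + growing overhead

for p in (1, 8, 16, 24, 32):
    print(f"{p:3d} cores: speedup {step_time(1) / step_time(p):.1f}x")

the speedup peaks and then drops once the communication term dominates,
which is qualitatively the same behavior as in your table below.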

in this specific case, you are doing classical MD, which has rather low
computational complexity, and your system is not very large, so you don't
have a lot of work units. it also looks like you are using TCP/IP
communication, which has very high latency.
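
as a back-of-the-envelope count of the available work units, here is a small
sketch using the atom count and the 4 x 4 x 5 patch grid reported in your
startup log below (patches are NAMD's unit of spatial decomposition; compute
objects add further parallelism, so this is only a rough estimate):

# numbers taken from the startup log quoted below:
# 50198 atoms, a 4 x 4 x 5 patch grid, 8 cores per node.
atoms, patches, cores_per_node = 50198, 4 * 4 * 5, 8

for nodes in (1, 2, 3, 4):
    p = nodes * cores_per_node
    print(f"{nodes} node(s), {p:2d} cores: "
          f"~{patches / p:.1f} patches and ~{atoms // p} atoms per core")

with 3 or 4 nodes you are down to only a couple of patches per core, so each
core has very little computation between communication steps.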

NAMD, through the charm++ library, can hide high communication latency quite
well, but only up to a point. the more processors you add, the larger the
combined latency becomes, and at the same time there is correspondingly less
computational work per processor to hide it behind.
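
you can see this in your own numbers by converting the wall times from your
table into speedup and parallel efficiency (t1 / (p * tp)). a quick sketch,
leaving out the 3-node row since one of its columns appears to be missing:

# wall times taken from the table in the quoted message (assumed to be
# seconds; the faster of the two 16-core runs is used).
runs = {1: 2866, 8: 539, 16: 316, 32: 4793}
t1 = runs[1]

for p, t in runs.items():
    speedup = t1 / t
    print(f"{p:2d} cores: speedup {speedup:4.1f}x, efficiency {speedup / p:.0%}")

going from 16 to 32 cores the efficiency collapses, which is exactly the
point where the communication overhead takes over.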

axel.

>
> Roy
>
> Following is the table I made including these details.
>
> Job     #Nodes  #Processors (ppn)  Startup (s)  Wall time (s)
> 571825  1       1                  7.5          2866
> 569     1       8                  9            539
> 470     2       8                  2.4          316
> 498     2       8                  3            323
> 494     3       8                  -            4500
> 500     4       8                  16           4793
> I submitted the job using the following command line:
> qsub -l nodes=<#nodes>:ppn=<#processors>,walltime=<expected_wall_time> <job_file_name>
>
> and the following are the contents of my job file:
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------
> #!/bin/sh -l
> # Change to the directory from which you originally submitted this job.
> cd $PBS_O_WORKDIR
> CONV_RSH=ssh
> export CONV_RSH
> # CONV_DAEMON=""
> # export CONV_DAEMON
> module load namd
>
> NODES=`cat $PBS_NODEFILE`
> NODELIST="$RCAC_SCRATCH/namd2-$PBS_JOBID.nodelist"
> echo group main > "$NODELIST"
>
> # charmrun "$NAMD_HOME/namd2" ++verbose +p$NUMPROCS ++nodelist "$NODELIST" ubq_wb_eq.conf
> charmrun "$NAMD_HOME/namd2" ++verbose +p16 ++nodelist "$NODELIST" SOD_wb_eq0.conf
> module unload namd
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Following is my structure summary:
>
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 50198 ATOMS
> Info: 35520 BONDS
> Info: 25502 ANGLES
> Info: 15756 DIHEDRALS
> Info: 1042 IMPROPERS
> Info: 380 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 47188 RIGID BONDS
> Info: 103406 DEGREES OF FREEDOM
> Info: 17790 HYDROGEN GROUPS
> Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
> Info: 17790 MIGRATION GROUPS
> Info: 4 ATOMS IN LARGEST MIGRATION GROUP
> Info: TOTAL MASS = 308670 amu
> Info: TOTAL CHARGE = -8 e
> Info: MASS DENSITY = 0.946582 g/cm^3
> Info: ATOM DENSITY = 0.0927022 atoms/A^3
> Info: *****************************
>
> Info: Entering startup at 7.15922 s, 14.8091 MB of memory in use
> Info: Startup phase 0 took 0.0303071 s, 14.8092 MB of memory in use
> Info: Startup phase 1 took 0.068871 s, 23.5219 MB of memory in use
> Info: Startup phase 2 took 0.0307088 s, 23.9375 MB of memory in use
> Info: Startup phase 3 took 0.0302751 s, 23.9374 MB of memory in use
> Info: PATCH GRID IS 4 (PERIODIC) BY 4 (PERIODIC) BY 5 (PERIODIC)
> Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
> Info: REMOVING COM VELOCITY 0.0178943 -0.00579233 -0.00948207
> Info: LARGEST PATCH (29) HAS 672 ATOMS
> Info: Startup phase 4 took 0.0571079 s, 31.7739 MB of memory in use
> Info: PME using 1 and 1 processors for FFT and reciprocal sum.
> Info: PME USING 1 GRID NODES AND 1 TRANS NODES
> Info: PME GRID LOCATIONS: 0
> Info: PME TRANS LOCATIONS: 0
> Info: Optimizing 4 FFT steps. 1... 2... 3... 4... Done.
> Info: Startup phase 5 took 0.0330172 s, 34.1889 MB of memory in use
> Info: Startup phase 6 took 0.0302858 s, 34.1888 MB of memory in use
> LDB: Central LB being created...
> Info: Startup phase 7 took 0.030385 s, 34.1902 MB of memory in use
> Info: CREATING 1526 COMPUTE OBJECTS
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 769 POINTS
> Info: Startup phase 8 took 0.0399361 s, 39.2458 MB of memory in use
> Info: Startup phase 9 took 0.030345 s, 39.2457 MB of memory in use
> Info: Startup phase 10 took 0.000467062 s, 49.472 MB of memory in use
> Info: Finished startup at 7.54093 s, 49.472 MB of memory in use
>
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
International Centre for Theoretical Physics, Trieste. Italy.
