Re: Mysterious slow down in parallel

From: Gianluca Interlandi (gianluca_at_u.washington.edu)
Date: Sun Oct 27 2013 - 01:12:21 CDT

You should read about domain decomposition. The more tasks you have, the
more frequently the CPUs need to update each other about the positions of
the atoms in each subdomain. Sending a message across the network requires
a "preparation" time, the so-called latency. The smaller a subdomain, the
more time each CPU (or core) spends waiting to send or receive messages
rather than actually computing. This is why adding more processing cores
does not necessarily speed up your computation: past a critical number (16
in your case) the latency becomes larger than the time a core actually
spends computing.

In other words, there is nothing you can do within the limits of the
hardware that you have. What you could do is reduce the latency, for
example by adding a Myrinet or InfiniBand interconnect to the nodes of
your cluster, assuming that you have a budget for that (quite pricey) and
that the cluster is at your site. Otherwise, your only choice is to gain
access to a cluster that already has a low-latency network.

Gianluca

On Sat, 26 Oct 2013, Roy Fernando wrote:

> Hi Axel,
>
> Thanks for the reply.
>
> So, do you imply that, given the size of my system, the speed I get on 2 nodes with 8
> processors each is about the maximum I can achieve? This is about one
> nanosecond/day.
>
> I did not quite understand what you meant by "when you add more processors than you have
> 'parallel work units'". This system has 80 patches. I thought a patch is a single parallel
> work unit? I only saw a speedup up to 16 processors. (Also, I observed the same
> pattern when I test ran with a system 10 times smaller than this one.)
>
> Do you think there is any process I can follow to improve this situation?
>
> Roy
>
>
> On Sat, Oct 26, 2013 at 6:04 PM, Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:
>
>
>
> On Sat, Oct 26, 2013 at 5:05 PM, Roy Fernando <roy.nandos_at_gmail.com> wrote:
> Dear NAMD experts,
>
> I recently started running NAMD on a cluster and initially played with my
> system to determine the best combination of nodes and processors for my
> simulation. I only ran for a short time interval.
>
> The cluster contains 30 nodes, each containing 8 cores.
>
> I noticed a significant speedup from a single processor to 8 processors in a
> single node. Then I chose 2 nodes (16 processors) and observed another
> speedup. But when I increased the number of nodes to 3 or 4, the simulation
> displayed a drastic slowdown.
>
> Can somebody please suggest why the simulations slow down? I highly
> appreciate your input.
>
>
> this kind of slowdown is not mysterious at all. it happens to almost all parallel
> programs when the overhead from exchanging information between individual processors
> becomes significant relative to the amount of computational work. also, when you add
> more processors than you have "parallel work units" you cannot see any speedup, and
> most of the time you will see a slowdown. how much of a slowdown depends on the
> problem at hand and the kind of network that you have, in particular its latency and
> bandwidth.
>
> in this specific case, you are doing classical MD, which has rather low computational
> complexity, and your system is not very large, so you don't have a lot of work units.
> it also looks like you are using TCP/IP communication, which has very high
> latency.
>
> NAMD, through the charm++ library, can hide communication latency quite
> well, but only up to a point. the more processors you add, the larger the combined
> latency becomes, and at the same time there is correspondingly less computational
> work to hide it behind.
>
> axel.
>  
>
> Roy
>
> Following is the table I made including these details.
>
> Job      # Nodes   # processors   start up   wall time
> 571825      1           1            7.5        2866
> 569         1           8            9           539
> 470         2           8            2.4         316
> 498         2           8            3           323
> 494         3           8            -          4500
> 500         4           8           16          4793
>
> I submitted the job using the following command line;
> qsub -l nodes=<#nodes>:ppn=<#processors>,walltime=<expected_wall_time> <job_file_name>
>
> and following is the contents of my job_file;
> ----------------------------------------------------------------------
> #!/bin/sh -l
> # Change to the directory from which you originally submitted this job.
> cd $PBS_O_WORKDIR
> CONV_RSH=ssh
> export CONV_RSH
> # CONV_DAEMON=""
> # export CONV_DAEMON
> module load namd
>
> # Build the charm++ nodelist from the hosts PBS assigned to this job.
> # "group main" alone is not enough: charmrun needs one "host" line per
> # assigned node, so append them from $PBS_NODEFILE.
> NODES=`cat $PBS_NODEFILE`
> NODELIST="$RCAC_SCRATCH/namd2-$PBS_JOBID.nodelist"
> echo group main > "$NODELIST"
> for node in $NODES; do echo host $node >> "$NODELIST"; done
>
> # Note: +p must match nodes * ppn requested from qsub (+p16 here).
> # charmrun "$NAMD_HOME/namd2" ++verbose +p$NUMPROCS ++nodelist "$NODELIST" ubq_wb_eq.conf
> charmrun "$NAMD_HOME/namd2" ++verbose +p16 ++nodelist "$NODELIST" SOD_wb_eq0.conf
> module unload namd
> ----------------------------------------------------------------------
>
> Following is my structure summary;
>
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 50198 ATOMS
> Info: 35520 BONDS
> Info: 25502 ANGLES
> Info: 15756 DIHEDRALS
> Info: 1042 IMPROPERS
> Info: 380 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 47188 RIGID BONDS
> Info: 103406 DEGREES OF FREEDOM
> Info: 17790 HYDROGEN GROUPS
> Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
> Info: 17790 MIGRATION GROUPS
> Info: 4 ATOMS IN LARGEST MIGRATION GROUP
> Info: TOTAL MASS = 308670 amu
> Info: TOTAL CHARGE = -8 e
> Info: MASS DENSITY = 0.946582 g/cm^3
> Info: ATOM DENSITY = 0.0927022 atoms/A^3
> Info: *****************************
>
> Info: Entering startup at 7.15922 s, 14.8091 MB of memory in use
> Info: Startup phase 0 took 0.0303071 s, 14.8092 MB of memory in use
> Info: Startup phase 1 took 0.068871 s, 23.5219 MB of memory in use
> Info: Startup phase 2 took 0.0307088 s, 23.9375 MB of memory in use
> Info: Startup phase 3 took 0.0302751 s, 23.9374 MB of memory in use
> Info: PATCH GRID IS 4 (PERIODIC) BY 4 (PERIODIC) BY 5 (PERIODIC)
> Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
> Info: REMOVING COM VELOCITY 0.0178943 -0.00579233 -0.00948207
> Info: LARGEST PATCH (29) HAS 672 ATOMS
> Info: Startup phase 4 took 0.0571079 s, 31.7739 MB of memory in use
> Info: PME using 1 and 1 processors for FFT and reciprocal sum.
> Info: PME USING 1 GRID NODES AND 1 TRANS NODES
> Info: PME GRID LOCATIONS: 0
> Info: PME TRANS LOCATIONS: 0
> Info: Optimizing 4 FFT steps.  1... 2... 3... 4...   Done.
> Info: Startup phase 5 took 0.0330172 s, 34.1889 MB of memory in use
> Info: Startup phase 6 took 0.0302858 s, 34.1888 MB of memory in use
> LDB: Central LB being created...
> Info: Startup phase 7 took 0.030385 s, 34.1902 MB of memory in use
> Info: CREATING 1526 COMPUTE OBJECTS
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 769 POINTS
> Info: Startup phase 8 took 0.0399361 s, 39.2458 MB of memory in use
> Info: Startup phase 9 took 0.030345 s, 39.2457 MB of memory in use
> Info: Startup phase 10 took 0.000467062 s, 49.472 MB of memory in use
> Info: Finished startup at 7.54093 s, 49.472 MB of memory in use
>
>
>
>
> --
> Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
> International Centre for Theoretical Physics, Trieste. Italy.

-----------------------------------------------------
Gianluca Interlandi, PhD gianluca_at_u.washington.edu
                     +1 (206) 685 4435
                     http://artemide.bioeng.washington.edu/

Research Scientist at the Department of Bioengineering
at the University of Washington, Seattle WA U.S.A.
-----------------------------------------------------

This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:21:49 CST