Re: Mysterious slow down in parallel

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Oct 28 2013 - 01:50:58 CDT

Hi,

one thing you should try first is to add +idlepoll to the namd2 command;
this can bring a significant advantage. Additionally, since your scaling
breaks down right when going from two to three nodes, this suggests that
your network switch has a too high switching latency. While it has hardly
anything to do when keeping up a connection between only two nodes, it has
to switch at high frequency as soon as more than two nodes are involved in
very busy all-to-all communication. Are you sure that there isn't any
high-speed network in this cluster dedicated to computations? If there's
not, and if you have root access to the cluster, you might want to try the
TCP congestion algorithm "highspeed", which can improve the described
behavior by doing the communication in larger chunks rather than
small-packet traffic, which reduces the number of required switching
operations on your network. It can be configured easily with "sysctl" and
needs to be set on all the nodes.
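
For example, a minimal sketch of what that could look like, assuming your
nodes run Linux and the tcp_highspeed module is available in the kernel
(the config file name below is just a placeholder):

# add +idlepoll so namd2 polls the network while idle instead of blocking
charmrun namd2 ++verbose +p16 +idlepoll ++nodelist nodelist your_config.conf
# on every node: load the highspeed congestion control module
modprobe tcp_highspeed
# on every node: switch the algorithm used for new TCP connections
sysctl -w net.ipv4.tcp_congestion_control=highspeed
# verify which algorithm is now active
sysctl net.ipv4.tcp_congestion_control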
 
Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Gianluca Interlandi
> Sent: Sunday, October 27, 2013 07:12
> To: Roy Fernando
> Cc: Axel Kohlmeyer; namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: Mysterious slow down in parallel
>
> You should read about domain decomposition. The more tasks you have, the
> more frequently the CPUs need to update each other about the positions of
> the atoms in each subdomain. Sending a message across the network requires
> a "preparation" time, the so-called latency. The smaller a subdomain, the
> more time each CPU (or core) will spend waiting to send or receive a
> message rather than actually doing a computation. This is why adding more
> processing cores does not necessarily speed up your computation: after a
> critical number (16 in your case) the latency will be larger than the
> time a core actually spends computing.
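>
> As a rough illustration (the numbers are assumptions for the sake of the
> example, not measurements): on gigabit ethernet, sending one message
> costs on the order of 50-100 microseconds of latency. If one timestep
> amounts to, say, 16 ms of computation, each of 16 cores computes for
> only about 1 ms per step, so a dozen messages per step already eat a
> large fraction of that millisecond; with 24 or 32 cores the compute
> share shrinks further while the number of messages grows.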
>
> In other words, there is nothing you can do within the limits of the
> hardware that you have. What you could do is reduce the latency, for
> example by adding a Myrinet or InfiniBand network to the nodes of your
> cluster, assuming that you have a budget for that (quite pricey) and that
> the cluster is at your site. Otherwise, your only choice is to gain
> access to a cluster which already has a low latency network.
>
> Gianluca
>
> On Sat, 26 Oct 2013, Roy Fernando wrote:
>
> > Hi Axel,
> >
> > Thanks for the reply.
> >
> > So, do you imply that, given the size of my system, the maximum speed I
> > get at 2 nodes with 8 processors on each node is about the maximum I can
> > achieve? This is about a nanosecond/day.
> >
> > I did not quite understand what you said as "when you add more
> > processors than you have 'parallel work units'". This system has 80
> > patches, and I thought a patch is a single parallel work unit? I only
> > saw a speed up, up to 16 processors. (Also, I observed the same pattern
> > when I did a test run with a system 10 times smaller than this one.)
> >
> > Do you think there is any process I can follow to improve this
> situation?
> >
> > Roy
> >
> >
> > On Sat, Oct 26, 2013 at 6:04 PM, Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:
> >
> >
> >
> > On Sat, Oct 26, 2013 at 5:05 PM, Roy Fernando <roy.nandos_at_gmail.com> wrote:
> > Dear NAMD experts,
> >
> > I recently started running NAMD on a cluster, and I initially played
> > with my system to determine the best combination of nodes and
> > processors for my simulation. I only ran for a short time interval.
> >
> > The cluster contains 30 nodes each containing 8 cores.
> >
> > I noticed a significant speed up from a single processor to 8
> > processors on a single node. Then I chose 2 nodes (16 processors) and
> > observed another speed up. But when I increased the number of nodes to
> > 3 or 4, the simulation displayed a drastic slowdown.
> >
> > Can somebody please suggest a probable reason why the simulations slow
> > down? I highly appreciate your input.
> >
> >
> > this kind of slowdown is not mysterious at all. it happens to almost
> > all parallel programs when the overhead from exchanging information
> > between individual processors becomes significant relative to the
> > amount of computational work. also, when you add more processors than
> > you have "parallel work units" you cannot see any speedup, and most of
> > the time you will see a slowdown. how much of a slowdown depends on the
> > problem at hand and the kind of network that you have, in particular
> > its latency and bandwidth.
> >
> > in this specific case, you are doing classical MD, which has rather
> > low computational complexity, and your system is not very large, so
> > you don't have a lot of work units, and it looks like you are using
> > TCP/IP communication, which has very high latency.
> >
> > NAMD, through using the charm++ library, can hide high communication
> > latency quite well, but only up to a point. the more processors you
> > add, the larger the combined latency becomes, and at the same time
> > there is correspondingly less computational work to hide behind.
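> >
> > as a rough illustration (an estimate based on your log below, not a
> > measurement): the 4 x 4 x 5 patch grid gives you the 80 patches you
> > mention. at 16 cores the load balancer still has about 5 patches per
> > core to distribute; at 24 cores it is down to roughly 3 per core, so
> > there is very little computation left on each core to overlap with the
> > growing communication.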
> >
> > axel.
> >
> >
> > Roy
> >
> > Following is the table I made including these details.
> >
> > Job     # Nodes  # Processors (ppn)  Startup (s)  Wall time (s)
> > 571825  1        1                   7.5          2866
> > 569     1        8                   9            539
> > 470     2        8                   2.4          316
> > 498     2        8                   3            323
> > 494     3        8                   -            4500
> > 500     4        8                   16           4793
> > I submitted the job using the following command line:
> > qsub -l nodes=<#nodes>:ppn=<#processors>,walltime=<expected_wall_time> <job_file_name>
> >
> > and the following is the content of my job file:
> > ----------------------------------------------------------------------
> > #!/bin/sh -l
> > # Change to the directory from which you originally submitted this job.
> > cd $PBS_O_WORKDIR
> > CONV_RSH=ssh
> > export CONV_RSH
> > # CONV_DAEMON=""
> > # export CONV_DAEMON
> > module load namd
> >
> > NODES=`cat $PBS_NODEFILE`
> > NODELIST="$RCAC_SCRATCH/namd2-$PBS_JOBID.nodelist"
> > echo group main > "$NODELIST"
> > # one "host" line per nodefile entry so charmrun knows where to start processes
> > for node in $NODES; do echo host $node >> "$NODELIST"; done
> >
> > # charmrun "$NAMD_HOME/namd2" ++verbose +p$NUMPROCS ++nodelist "$NODELIST" ubq_wb_eq.conf
> > charmrun "$NAMD_HOME/namd2" ++verbose +p16 ++nodelist "$NODELIST" SOD_wb_eq0.conf
> > module unload namd
> > ----------------------------------------------------------------------
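> >
> > A sketch of how the commented-out +p$NUMPROCS line could be used
> > instead of hardcoding +p16, assuming a standard PBS/Torque setup where
> > $PBS_NODEFILE lists each node once per allocated core:
> >
> > # the nodefile has one line per allocated core
> > NUMPROCS=`wc -l < $PBS_NODEFILE`
> > charmrun "$NAMD_HOME/namd2" ++verbose +p$NUMPROCS ++nodelist "$NODELIST" SOD_wb_eq0.conf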
> >
> > Following is my structure summary;
> >
> > Info: ****************************
> > Info: STRUCTURE SUMMARY:
> > Info: 50198 ATOMS
> > Info: 35520 BONDS
> > Info: 25502 ANGLES
> > Info: 15756 DIHEDRALS
> > Info: 1042 IMPROPERS
> > Info: 380 CROSSTERMS
> > Info: 0 EXCLUSIONS
> > Info: 47188 RIGID BONDS
> > Info: 103406 DEGREES OF FREEDOM
> > Info: 17790 HYDROGEN GROUPS
> > Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
> > Info: 17790 MIGRATION GROUPS
> > Info: 4 ATOMS IN LARGEST MIGRATION GROUP
> > Info: TOTAL MASS = 308670 amu
> > Info: TOTAL CHARGE = -8 e
> > Info: MASS DENSITY = 0.946582 g/cm^3
> > Info: ATOM DENSITY = 0.0927022 atoms/A^3
> > Info: *****************************
> >
> > Info: Entering startup at 7.15922 s, 14.8091 MB of memory in use
> > Info: Startup phase 0 took 0.0303071 s, 14.8092 MB of memory in use
> > Info: Startup phase 1 took 0.068871 s, 23.5219 MB of memory in use
> > Info: Startup phase 2 took 0.0307088 s, 23.9375 MB of memory in use
> > Info: Startup phase 3 took 0.0302751 s, 23.9374 MB of memory in use
> > Info: PATCH GRID IS 4 (PERIODIC) BY 4 (PERIODIC) BY 5 (PERIODIC)
> > Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
> > Info: REMOVING COM VELOCITY 0.0178943 -0.00579233 -0.00948207
> > Info: LARGEST PATCH (29) HAS 672 ATOMS
> > Info: Startup phase 4 took 0.0571079 s, 31.7739 MB of memory in use
> > Info: PME using 1 and 1 processors for FFT and reciprocal sum.
> > Info: PME USING 1 GRID NODES AND 1 TRANS NODES
> > Info: PME GRID LOCATIONS: 0
> > Info: PME TRANS LOCATIONS: 0
> > Info: Optimizing 4 FFT steps.  1... 2... 3... 4...   Done.
> > Info: Startup phase 5 took 0.0330172 s, 34.1889 MB of memory in use
> > Info: Startup phase 6 took 0.0302858 s, 34.1888 MB of memory in use
> > LDB: Central LB being created...
> > Info: Startup phase 7 took 0.030385 s, 34.1902 MB of memory in use
> > Info: CREATING 1526 COMPUTE OBJECTS
> > Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> > Info: NONBONDED TABLE SIZE: 769 POINTS
> > Info: Startup phase 8 took 0.0399361 s, 39.2458 MB of memory in use
> > Info: Startup phase 9 took 0.030345 s, 39.2457 MB of memory in use
> > Info: Startup phase 10 took 0.000467062 s, 49.472 MB of memory in use
> > Info: Finished startup at 7.54093 s, 49.472 MB of memory in use
> >
> >
> >
> >
> > --
> > Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
> > International Centre for Theoretical Physics, Trieste. Italy.
> >
> >
> >
> >
>
> -----------------------------------------------------
> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> +1 (206) 685 4435
> http://artemide.bioeng.washington.edu/
>
> Research Scientist at the Department of Bioengineering
> at the University of Washington, Seattle WA U.S.A.
> -----------------------------------------------------

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:23:53 CST