Re: Mysterious slow down in parallel

From: Roy Fernando (roy.nandos_at_gmail.com)
Date: Sat Oct 26 2013 - 22:06:49 CDT

Hi Axel,

Thanks for the reply.

So, do you imply that, given the size of my system, the speed I get with 2
nodes (8 processors each) is about the maximum speed I can achieve? This is
about a nanosecond/day.

I did not quite understand what you said: "when you add more processors
than you have 'parallel work units'". This system has 80 patches. I thought
a patch is a single parallel work unit? I only saw a speedup up to 16
processors. (I also observed the same pattern when I did a test run with a
system 10 times smaller than this one.)
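A quick arithmetic check of that work-unit count (a sketch; the 4 x 4 x 5 patch grid is quoted from the startup log further below):

```shell
# NAMD decomposes space into patches; the startup log below reports
# "PATCH GRID IS 4 (PERIODIC) BY 4 (PERIODIC) BY 5 (PERIODIC)"
PATCHES=$((4 * 4 * 5))
echo "patches: $PATCHES"
echo "patches per core on 16 cores: $((PATCHES / 16))"
echo "patches per core on 32 cores: $((PATCHES / 32))"
```

Note that patches are not NAMD's only parallel objects: the same log reports 1526 compute objects, which charm++ also distributes, so the practical limit is not exactly one patch per core.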

Do you think there is any process I can follow to improve this situation?

Roy

On Sat, Oct 26, 2013 at 6:04 PM, Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:

>
>
>
> On Sat, Oct 26, 2013 at 5:05 PM, Roy Fernando <roy.nandos_at_gmail.com>wrote:
>
>> Dear NAMD experts,
>>
>> I recently started running NAMD on a cluster, and I initially experimented
>> with my system to determine the best combination of nodes and processors
>> for my simulation. I only ran for a short time interval.
>>
>> The cluster contains 30 nodes each containing 8 cores.
>>
>> I noticed a significant speed up from a single processor to 8 processors
>> in a single node. Then I chose 2 nodes (16 processors) and observed another
>> speed up. But when I increased the number of nodes to 3 or 4 the simulation
>> displayed a drastic slow down.
>>
>> Can somebody please suggest why the simulations slow down? I would
>> highly appreciate your input.
>>
>
> this kind of slowdown is not mysterious at all. it happens to almost all
> parallel programs when the overhead from exchanging information between
> individual processors becomes significant relative to the amount of
> computational work. also, when you add more processors than you have
> "parallel work units" you cannot see any speedup, and most of the time you
> will see a slowdown. how much of a slowdown depends on the problem at hand
> and the kind of network that you have, in particular its latency and
> bandwidth.
>
> in this specific case, you are doing classical MD, which has rather low
> computational complexity, and your system is not very large, so you don't
> have a lot of work units. it also looks like you are using TCP/IP
> communication, which has very high latency.
>
> NAMD, through using the charm++ library, can hide high communication latency
> quite well, but only up to a point. the more processors you add, the larger
> the combined latency becomes, and at the same time there is correspondingly
> less computational work to hide behind.
>
> axel.
>
>
>>
>> Roy
>>
>> Following is the table I made including these details.
>>
>> Job     #Nodes  #processors  start up  wall time
>> 571825    1         1          7.5       2866
>> 569       1         8          9          539
>> 470       2         8          2.4        316
>> 498       2         8          3          323
>> 494       3         8          -         4500
>> 500       4         8         16         4793
>>
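As a side note, the scaling behaviour can be quantified from the wall times in the table above (a sketch in shell/awk, taking the single-core run at 2866 s as the baseline):

```shell
# strong-scaling check: speedup = T(1 core) / T(p cores),
# efficiency = speedup / p, using the wall times reported above
awk 'BEGIN {
  t1 = 2866
  split("1 8 16 16 24 32", procs)
  split("2866 539 316 323 4500 4793", wall)
  for (i = 1; i <= 6; i++) {
    s = t1 / wall[i]
    printf "%2d cores: speedup %.2f, efficiency %.2f\n", procs[i], s, s / procs[i]
  }
}'
```

At 24 and 32 cores the speedup drops below 1, i.e. the runs are slower than a single core, which is the drastic slowdown described above.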
>> I submitted the job using the following command line;
>> qsub -l nodes=<#nodes>:ppn=<#processors>,walltime=<expected_wall_time> <job_file_name>
>>
>> and following is the contents of my job_file;
>>
>> ---------------------------------------------------------------------------------------------------------------------------------------------------
>> #!/bin/sh -l
>> # Change to the directory from which you originally submitted this job.
>> cd $PBS_O_WORKDIR
>> CONV_RSH=ssh
>> export CONV_RSH
>> # CONV_DAEMON=""
>> # export CONV_DAEMON
>> module load namd
>>
>> NODES=`cat $PBS_NODEFILE`
>> NODELIST="$RCAC_SCRATCH/namd2-$PBS_JOBID.nodelist"
>> echo group main > "$NODELIST"
>>
>> # charmrun "$NAMD_HOME/namd2" ++verbose +p$NUMPROCS ++nodelist "$NODELIST" ubq_wb_eq.conf
>> charmrun "$NAMD_HOME/namd2" ++verbose +p16 ++nodelist "$NODELIST" SOD_wb_eq0.conf
>> module unload namd
>>
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------
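One small improvement to the script above: instead of hardcoding +p16, the processor count can be derived from the nodefile so the charmrun line always matches the qsub request. A minimal sketch (the nodefile content and hostnames here are fabricated for illustration; under PBS, $PBS_NODEFILE lists one hostname per allocated core):

```shell
#!/bin/sh
# stand-in for $PBS_NODEFILE: one hostname per allocated core
# (hypothetical hosts; PBS writes the real file for each job)
NODEFILE=$(mktemp)
printf 'node01\nnode01\nnode02\nnode02\n' > "$NODEFILE"

# count the lines to get the total core count (tr strips any padding)
NUMPROCS=$(wc -l < "$NODEFILE" | tr -d ' ')
echo "would run: charmrun namd2 +p$NUMPROCS ++nodelist ..."
rm -f "$NODEFILE"
```

With this, +p$NUMPROCS (as in the commented-out line of the job file) stays in sync with the `-l nodes=X:ppn=Y` request automatically.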
>>
>> Following is my structure summary;
>>
>> Info: ****************************
>> Info: STRUCTURE SUMMARY:
>> Info: 50198 ATOMS
>> Info: 35520 BONDS
>> Info: 25502 ANGLES
>> Info: 15756 DIHEDRALS
>> Info: 1042 IMPROPERS
>> Info: 380 CROSSTERMS
>> Info: 0 EXCLUSIONS
>> Info: 47188 RIGID BONDS
>> Info: 103406 DEGREES OF FREEDOM
>> Info: 17790 HYDROGEN GROUPS
>> Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
>> Info: 17790 MIGRATION GROUPS
>> Info: 4 ATOMS IN LARGEST MIGRATION GROUP
>> Info: TOTAL MASS = 308670 amu
>> Info: TOTAL CHARGE = -8 e
>> Info: MASS DENSITY = 0.946582 g/cm^3
>> Info: ATOM DENSITY = 0.0927022 atoms/A^3
>> Info: *****************************
>>
>> Info: Entering startup at 7.15922 s, 14.8091 MB of memory in use
>> Info: Startup phase 0 took 0.0303071 s, 14.8092 MB of memory in use
>> Info: Startup phase 1 took 0.068871 s, 23.5219 MB of memory in use
>> Info: Startup phase 2 took 0.0307088 s, 23.9375 MB of memory in use
>> Info: Startup phase 3 took 0.0302751 s, 23.9374 MB of memory in use
>> Info: PATCH GRID IS 4 (PERIODIC) BY 4 (PERIODIC) BY 5 (PERIODIC)
>> Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
>> Info: REMOVING COM VELOCITY 0.0178943 -0.00579233 -0.00948207
>> Info: LARGEST PATCH (29) HAS 672 ATOMS
>> Info: Startup phase 4 took 0.0571079 s, 31.7739 MB of memory in use
>> Info: PME using 1 and 1 processors for FFT and reciprocal sum.
>> Info: PME USING 1 GRID NODES AND 1 TRANS NODES
>> Info: PME GRID LOCATIONS: 0
>> Info: PME TRANS LOCATIONS: 0
>> Info: Optimizing 4 FFT steps. 1... 2... 3... 4... Done.
>> Info: Startup phase 5 took 0.0330172 s, 34.1889 MB of memory in use
>> Info: Startup phase 6 took 0.0302858 s, 34.1888 MB of memory in use
>> LDB: Central LB being created...
>> Info: Startup phase 7 took 0.030385 s, 34.1902 MB of memory in use
>> Info: CREATING 1526 COMPUTE OBJECTS
>> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
>> Info: NONBONDED TABLE SIZE: 769 POINTS
>> Info: Startup phase 8 took 0.0399361 s, 39.2458 MB of memory in use
>> Info: Startup phase 9 took 0.030345 s, 39.2457 MB of memory in use
>> Info: Startup phase 10 took 0.000467062 s, 49.472 MB of memory in use
>> Info: Finished startup at 7.54093 s, 49.472 MB of memory in use
>>
>>
>
>
> --
> Dr. Axel Kohlmeyer akohlmey_at_gmail.com http://goo.gl/1wk0
> International Centre for Theoretical Physics, Trieste. Italy.
>

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:23:53 CST