AW: Re: charmrun setup

From: Norman Geist
Date: Mon May 07 2012 - 00:49:22 CDT

Hi Branko,


congratulations, you are now faced with the main problem in high performance
computing. People call it parallel scaling. If a program runs in parallel
on multiple cores or nodes, the individual processes need to communicate to
share the work and gain speedup. There can be many reasons why you don't see
the speedup you expect. The most common causes are the user ^^ and the
network, and, inside the node itself, the memory bandwidth.


So to benchmark the scaling of a node itself, run namd with 1 to 8 cores on
only one node and see what your speedup is. This is something most people
don't do.
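A minimal sketch of such a single-node test (the input file name run.namd is a placeholder, and this assumes a net build of NAMD where charmrun's ++local flag keeps all processes on the launching machine; the sketch only prints the command lines so you can inspect them before running):

```shell
# Single-node scaling sketch: try 1 to 8 cores on one machine.
# run.namd is a placeholder input file -- adjust to your setup.
scaling_test() {
    for p in 1 2 4 8; do
        # ++local avoids needing a nodelist for a one-node test.
        # Printed only; drop the echo to actually launch (and time) each run.
        echo "charmrun namd2 ++local ++p $p run.namd"
    done
}
scaling_test
```

Timing each run (for example with the "time" command) and dividing the 1-core wall time by the p-core wall time gives the per-node speedup curve.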

Most people are interested in scaling over multiple nodes. To test that,
just run jobs of 1 to 3 nodes, each with 8 processes. The nodelist should
contain only the nodes that are actually used for the current test. So for a
1-node test -> one node in the nodelist. Why? Because charmrun (and MPI as
well) distributes processes in a round-robin fashion: it goes through the
nodelist and starts one process for each line there. If not enough processes
have been started by the end of the nodelist, it starts again from the top.
That means, with a nodelist of

group main
host node1
host node2
host node3

and a charmrun of ++p8, it allocates

node1: processes 0, 3, 6
node2: processes 1, 4, 7
node3: processes 2, 5

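The round-robin placement described above can be sketched like this (node1 to node3 are placeholder host names standing in for the real nodelist entries):

```shell
# Sketch of charmrun's round-robin placement: 8 processes (++p8)
# cycled over a 3-host nodelist, one process per line, wrapping around.
round_robin() {
    hosts="node1 node2 node3"
    rank=0
    while [ "$rank" -lt 8 ]; do       # ++p8 -> ranks 0..7
        set -- $hosts                 # reset positional params to the hosts
        shift $(( rank % 3 ))         # step to the host for this rank
        echo "rank $rank -> $1"
        rank=$(( rank + 1 ))
    done
}
round_robin
```

Note the uneven result: the first two hosts end up with 3 processes each and the last with 2, instead of all 8 landing on one node.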
As this is not what you want to see, the nodelist should only contain the
machines for the current test.


To give reliable advice, we should also know more about your hardware
(machines and network).


Let us know


Norman Geist.


From: [] On behalf of Branko
Sent: Saturday, May 5, 2012 8:37 PM
To: namd-l
Subject: Fwd: Re: namd-l: charmrun setup



On 5/5/2012 8:14 PM, Pedro Armando Ojeda May wrote:




             1) Running namd2 (…TCP/namd2) using charmrun



             2) Launch the run as follows:


                    charmrun namd2 ++remote-shell ssh ++verbose +netpoll ++nodelist nodelist2 ++ppn 8 ++p 16 inputfile


                    my nodefile looks like:

                    group main

                    host x087

                    host x089

                    host x093


                    Each node has 8 processors. Running on the command line (not with torque or any other queue system).



It works fine in the sense that my program finishes the number of steps I
assigned. My concern is the simulation time: if I use "++p 8" the run takes
22 min (measured with the "time" command), but if I use "++p 16" it takes
20 min.

I expected the simulation time to be reduced by at least half, but it is
almost the same.

Does anyone have a comment about this issue?
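The quoted timings can be turned into a speedup figure, which makes the problem concrete: going from 8 to 16 processes should ideally halve the wall time (2x), but here it barely changes.

```shell
# Speedup from the timings above: 22 min with ++p 8, 20 min with ++p 16.
speedup() {
    awk 'BEGIN {
        t8 = 22.0; t16 = 20.0
        s = t8 / t16                  # measured speedup going 8 -> 16
        e = s / 2.0                   # fraction of the ideal 2x
        printf "speedup %.2fx, efficiency %.0f%%\n", s, e * 100
    }'
}
speedup
```

So doubling the process count bought only about a 1.1x speedup, i.e. roughly half the extra cores' work is lost to communication overhead, exactly the scaling problem discussed in the reply.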




        - periodic cubic box 70x70x70

        - 33933 atoms








This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:31 CST