Re: Can't start SMP NAMD - Problem clearly is in front of the monitor

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Tue Nov 17 2015 - 12:45:51 CST

Hi,

First, the "Charm++> Running on 5 unique compute nodes (12-way SMP)"
message is only communicating the discovered characteristics of the
machine you are running on (5 physical nodes, 12 processors per node).
This output will not be affected by the ppn options.

The "Info: Running on 55 processors, 5 nodes, 1 physical nodes." is
accurate. Your nodelist file probably has multiple entries for the same
node. The ++verbose option will show you what charmrun is doing.
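For example, if ${TMPDIR}/machines is built from the queue's hostfile with
one line per slot, the first entries will all name the same host, and the
five SMP processes can then all land on that one physical node. For an SMP
run the nodelist should list each physical node once, roughly like this
(hostnames are placeholders for your own):

  group main
  host node01
  host node02
  host node03
  host node04
  host node05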

Also, if you have a working mpiexec integrated with the queueing system on
your cluster, the ++mpiexec option should eliminate the need to set up the
nodelist file yourself.
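Something along these lines should then work (untested here, so treat it
as a sketch and adjust for your queueing system; the counts assume your
5-node, 12-core case):

  charmrun namd2 ++mpiexec +p55 ++ppn 11 +setcpuaffinity <configfile>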

I'm really sorry about the +ppn vs ++ppn confusion in the notes.txt file.
For platforms launched via charmrun, the ++ppn option is parsed by
charmrun, but for platforms launched by something else (aprun or srun on
Cray, for example) the +ppn option is parsed by the namd2 binary. This is
consistent in the sense that ++ parameters are parsed by charmrun and +
parameters are parsed by the launched program.
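Concretely, for your 5-node, 12-core case a charmrun-launched ibverbs-smp
run would look like your second command:

  charmrun namd2 ++nodelist ${TMPDIR}/machines +p55 ++ppn 11 +setcpuaffinity <configfile>

while on a Cray, where aprun launches the binary directly, it would be
something like this (illustrative only, not tuned for any particular
machine):

  aprun -n 5 -N 1 -d 12 namd2 +ppn 11 +setcpuaffinity <configfile>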

Jim

On Tue, 17 Nov 2015, Vogel, Alexander wrote:

> Hello,
>
> I've been running NAMD on my own cluster for quite some time now but have never used the SMP version. The ibverbs versions of NAMD 2.10 and 2.11b1 run just fine without SMP, but I can't get the SMP versions to run properly.
>
> My cluster consists of nodes with 12 cores each (2 hexacore processors per node). Now let's say I want to run a simulation on 5 nodes (= 60 cores). The non-SMP version is then started with:
>
> charmrun namd2 ++nodelist ${TMPDIR}/machines +p60 +setcpuaffinity <configfile>
>
> The instructions for the SMP versions tell me to use the following command (60 cores are requested in the parallel environment; 5 of them will be communication threads, one per node):
>
> charmrun namd2 ++nodelist ${TMPDIR}/machines +p55 +ppn 11 +setcpuaffinity <configfile>
>
> I get the following messages in the output:
>
> Charm++> Running on 5 unique compute nodes (12-way SMP).
> -> It should show 11-way SMP, right?
>
> Charm++> Warning: the number of SMP threads (24) is greater than the number of physical cores (12), so threads will sleep while idling. Use +CmiSpinOnIdle or +CmiSleepOnIdle to control this directly.
> -> Don't know what to make of that.
>
> WARNING: +ppn is a command line argument beginning with a '+' but was not parsed by the RTS.
> If any of the above arguments were intended for the RTS you may need to recompile Charm++ with different options.
> -> This seems to be the main problem...the +ppn option is not recognized.
>
> FATAL ERROR: SMP build launched as multiple single-thread processes. Use ++ppn to set number of worker threads per process to match available cores, reserving one core per process for communication thread.
> -> This is the result of the missing +ppn option.
>
> So what I figured out myself is that I can add a second + in front of +ppn to get the following command, which helps somewhat (the simulation runs, but very slowly):
>
> charmrun namd2 ++nodelist ${TMPDIR}/machines +p55 ++ppn 11 +setcpuaffinity <configfile>
>
> I get the following messages in the output:
>
> Charm++> Running on 1 unique compute nodes (12-way SMP).
> -> Now it is showing only one unique compute node and still 12-way SMP???
>
> Charm++> Warning: the number of SMP threads (60) is greater than the number of physical cores (12), so threads will sleep while idling. Use +CmiSpinOnIdle or +CmiSleepOnIdle to control this directly.
> -> Similar to before, but now the number of SMP threads has increased from 24 to 60.
>
> Info: Running on 55 processors, 5 nodes, 1 physical nodes.
> -> I don't think this is right...it should be 5 physical nodes?
>
> The simulation is running after that...but very slowly.
>
> So I don't really know how to fix this, and I can't find any instructions for dummies. Could you please help me? Thank you very much...
>
> Alexander
>
