RE: Scaling problem: 4 nodes OK, 5 fails to start

From: Stober, Spencer T (spencer.t.stober_at_exxonmobil.com)
Date: Thu Jan 17 2013 - 08:31:53 CST

Norman,

Thanks for the ideas. I ran the same tests in Torque's interactive mode (i.e., qsub -I -l nodes=4:ppn=12). This gives me a direct login to the head node of my job, where I can execute the mpirun_rsh command directly. It also confirms that all the resources are already allocated (I can log in to all nodes), so I can be sure the queuing system is not killing the job. Interestingly, I printed out the actual mpispawn request by executing:

mpirun_rsh -show <all the other stuff>

This shows that the actual command to start the job is identical for 4 and 5 nodes (only the number of mpispawn requests changes), and I still cannot determine what is killing the job. I have tried several different combinations of nodes, so I know it does not depend on which nodes I select. N.b., I can run the job successfully in this way on 4 nodes, but not on 5.

Are there any other tests you can suggest that may help me sort out the problem?

Thanks, Spence

Spencer T. Stober, Ph.D.

From: Norman Geist [mailto:norman.geist_at_uni-greifswald.de]
Sent: Thursday, January 17, 2013 1:15 AM
To: Stober, Spencer T
Cc: Namd Mailing List
Subject: RE: namd-l: Scaling problem: 4 nodes OK, 5 fails to start

Hi Spencer,

I wouldn't call it a "scaling problem". Signal 15 is the usual SIGTERM. The question is why your NAMD processes get this signal, and from where. It's pretty likely that your queuing system kills the job for some reason (maybe you exceed some resource limits). Check the log files of Torque and of the nodes to see if you get some information. Additionally, check whether the 5th node could be the problem; if it is always the same machine, try disabling it temporarily.
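One way to follow up on the resource-limit idea is to print the limits on every node of the job. The following is only a sketch: the function name check_limits and the RSH variable are hypothetical, and it assumes the nodes accept a passwordless remote shell (ssh by default). It does not tell you which limit, if any, is involved here.

```shell
# Print the shell resource limits on each unique host in a nodefile.
# Assumption: "$RSH host command" runs a command remotely (defaults to ssh).
check_limits() {
    nodefile=$1
    rsh_cmd=${RSH:-ssh}
    sort -u "$nodefile" | while read -r host; do
        # Skip blank lines in the nodefile.
        [ -n "$host" ] || continue
        echo "== $host =="
        # Redirect stdin so the remote shell cannot swallow the host list.
        $rsh_cmd "$host" 'ulimit -a' </dev/null
    done
}
```

Inside an interactive Torque job this could be called as check_limits "$PBS_NODEFILE", and the output compared between nodes.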

Norman Geist.

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Stober, Spencer T
Sent: Wednesday, January 16, 2013 20:39
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Scaling problem: 4 nodes OK, 5 fails to start

Hello,

Thanks in advance for the assistance. I have compiled and successfully run NAMD 2.9 on a cluster but cannot run on more than 4 nodes. The cluster has an InfiniBand interconnect, and each node has dual 6-core Xeon processors running CentOS 5.x and MPI mvapich-1.2.0-gcc-x86_64 with a Torque queuing system.

If I run on 4 nodes, 12 ppn, 48 cores, everything works and the simulations are all OK. If I use the EXACT same input files and change only the number of nodes and cores in the Torque submission script, the run fails. I have no idea why this occurs; I am certain that I have access to the resources (I can run other MPI programs on any number of nodes). The problem also occurs in version 2.6 of NAMD and with the CUDA version of NAMD 2.9.

Any ideas are greatly appreciated. Details of the problem follow:

MPI version: mvapich-1.2.0-gcc-x86_64

Compiled NAMD 2.9 and Charm with the following commands:

Charm:
env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --no-build-shared --with-production

NAMD:
./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64

-------- torque script to launch namd---------
#!/bin/bash
#PBS -N namd2
#PBS -l nodes=5:ppn=12
#PBS -q short
#PBS -V

NAMD_CONF="$PBS_O_WORKDIR/namd.conf"
NAMD_EXEC="/home/ststobe/NAMD_exe_NOCUDA/namd2"

HOSTFILE=$PBS_NODEFILE
cd $PBS_O_WORKDIR
export LD_LIBRARY_PATH=/home/ststobe/NAMD_exe_NOCUDA:$LD_LIBRARY_PATH
mpirun_rsh -rsh -np 60 -hostfile $HOSTFILE $NAMD_EXEC $NAMD_CONF > $PBS_O_WORKDIR/namd.$PBS_JOBID
--------------------------------------
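The script above hard-codes -np 60, which has to be kept in step with the nodes and ppn values by hand. As a sketch only (the helper names count_slots and count_nodes are hypothetical), the process count could instead be derived from the nodefile, assuming Torque writes one line per allocated core to $PBS_NODEFILE:

```shell
# Count MPI slots in a Torque-style nodefile (one non-empty line per core).
count_slots() {
    grep -c . "$1"
}

# Count the distinct hosts listed in the same file.
count_nodes() {
    sort -u "$1" | grep -c .
}
```

In the job script, NP=$(count_slots "$PBS_NODEFILE") could then be passed as -np "$NP", and echoing both counts into the job output makes it easy to confirm that a nodes=5:ppn=12 request really yields 60 slots on 5 hosts.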

NAMD output file for run on 4 nodes, 12 ppn, 48 cores:
----------------------------------------------
Charm++> Running on MPI version: 1.2
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 4 unique compute nodes (12-way SMP).
Charm++> cpu topology info is gathered in 0.089 seconds.
Info: NAMD 2.9 for Linux-x86_64-MPI

... then the rest of the startup output and all works fine....
------------------------------------------------

NAMD output file for run on 5 nodes, 12 ppn, 60 cores:
------------------------------------------------
Charm++> Running on MPI version: 1.2
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Signal 15 received.
Signal 15 received.
Signal 15 received.
Signal 15 received.
Signal 15 received.

... and I get this from the stderr output from the torque system

MPI process terminated unexpectedly
Exit code -5 signaled from i18
Killing remote processes...MPI process terminated unexpectedly
MPI process terminated unexpectedly
MPI process terminated unexpectedly
MPI process terminated unexpectedly
DONE
------------------------------------------------
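For reference, the signal number in "Signal 15 received" can be mapped to its name from the shell. A small sketch (the wrapper name signal_name is hypothetical):

```shell
# Translate a signal number into its name using the shell's kill builtin.
signal_name() {
    kill -l "$1"
}
```

Calling signal_name 15 prints TERM, i.e. the processes were asked to terminate by something else rather than crashing on their own.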
