From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Jan 21 2013 - 00:02:25 CST
Hi again,
I would suggest checking the system logs and the torque logs. Also, even in the
interactive mode of the queue you can _NOT_ be sure that torque isn't
killing the job if you exceed some resource limit. And, as already
mentioned, check whether your 5th node is the problem. Remember that torque
provides the list of nodes; that list should differ between the 4- and 5-node
runs, so check it as well. Please also provide the full NAMD output for both
the 4- and 5-node runs.
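The node-list comparison suggested above can be done mechanically. A minimal sketch follows; the nodefile contents are made-up example data standing in for copies of `$PBS_NODEFILE` saved from inside each job (e.g. `cp $PBS_NODEFILE ~/nodes.4`), and the host names are hypothetical:

```shell
#!/bin/bash
# Sketch: compare the unique hosts torque handed to the 4-node and 5-node jobs.
# The host names below are fabricated stand-ins for saved $PBS_NODEFILE copies.
printf 'i11\ni11\ni12\ni12\ni13\ni13\ni14\ni14\n' > /tmp/nodes.4
printf 'i11\ni11\ni12\ni12\ni13\ni13\ni14\ni14\ni18\ni18\n' > /tmp/nodes.5

# With ppn=12 every host is repeated, so reduce to unique hosts first.
sort -u /tmp/nodes.4 > /tmp/hosts.4
sort -u /tmp/nodes.5 > /tmp/hosts.5

echo "4-node job hosts: $(wc -l < /tmp/hosts.4)"
echo "5-node job hosts: $(wc -l < /tmp/hosts.5)"

# Hosts that appear only in the 5-node run are the prime suspects:
comm -13 /tmp/hosts.4 /tmp/hosts.5
```

Whichever hosts show up only in the failing allocation are the ones worth testing in isolation.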
Norman Geist.
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Stober, Spencer T
Sent: Thursday, January 17, 2013 3:32 PM
To: Norman Geist
Cc: Namd Mailing List
Subject: RE: namd-l: Scaling problem: 4 nodes OK, 5 fails to start
Norman,
Thanks for the ideas. I ran the same tests using interactive mode in torque
(i.e., qsub -I -l nodes=4:ppn=12). This gives me a direct login to the head
node for my job, and I can execute the mpirun_rsh command directly. This way
I also know that all the resources are already allocated (I can log in to
all the nodes), so I can be sure the queuing system is not killing the job.
Interestingly, I printed out the actual mpispawn request by executing:
mpirun_rsh -show <all the other stuff>
This shows that the actual command to start the job is identical in the
4- and 5-node cases (only the number of mpispawn requests changes), and I
still cannot determine what is killing the job. I have tried several
different combinations of nodes, so I know the failure does not depend on
which nodes I select. N.b., I can run the job successfully this way on
4 nodes, but not on 5.
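One further way to narrow this down is to reuse the exact same launcher on the same 5-node interactive allocation but with a trivial payload; if even that dies, the problem is below NAMD in the MPI layer. This is a hedged command-line sketch for use inside the `qsub -I` session, not something runnable outside the cluster:

```shell
# Inside `qsub -I -l nodes=5:ppn=12`, launch a trivial command the same way
# as namd2. Getting 60 hostnames back (12 per host) means mpirun_rsh and the
# allocation itself are healthy, and the failure is specific to NAMD.
mpirun_rsh -rsh -np 60 -hostfile $PBS_NODEFILE hostname | sort | uniq -c
```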
Are there any other tests you can think of that may help me sort out the
problem?
Thanks, Spence
Spencer T. Stober, Ph.D.
From: Norman Geist [mailto:norman.geist_at_uni-greifswald.de]
Sent: Thursday, January 17, 2013 1:15 AM
To: Stober, Spencer T
Cc: Namd Mailing List
Subject: RE: namd-l: Scaling problem: 4 nodes OK, 5 fails to start
Hi Spencer,
I wouldn't call it a "scaling problem". Signal 15 is the usual SIGTERM; the
question is why your namd processes receive this signal, and from where.
It's pretty likely that your queuing system kills the job for some reason
(maybe you exceed some resource limit). Check the log files of torque and of
the nodes and see if you get more information. Additionally, check whether
the 5th node could be the problem; if it is always the same machine, try
disabling it temporarily.
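The pbs_mom logs on each node usually record why a session was killed. A minimal grep sketch follows; the log lines here are fabricated stand-in data, and on a real node you would point LOG at something like `/var/spool/torque/mom_logs/<date>` (the default path, which may differ on your install):

```shell
#!/bin/bash
# Sketch: scan a pbs_mom log for kill/limit events around a failed job.
# The log content below is fabricated example data, not real cluster output.
LOG=/tmp/mom_log.example
cat > "$LOG" <<'EOF'
01/16/2013 20:39:02;0008;pbs_mom;Job;1234.head;scan_for_exiting
01/16/2013 20:39:05;0001;pbs_mom;Job;1234.head;kill_task: killing pid 4321 task 1 with sig 15
01/16/2013 20:39:06;0080;pbs_mom;Job;1234.head;job exceeded resource limits
EOF

# Any SIGTERM sent by the mom, or a limit violation, shows up like this:
grep -i -E 'kill_task|exceeded' "$LOG"
```

If the mom logs on the 5th node show such lines with timestamps matching the job, the queuing system (or a limit it enforces) is the killer.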
Norman Geist.
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Stober, Spencer T
Sent: Wednesday, January 16, 2013 8:39 PM
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Scaling problem: 4 nodes OK, 5 fails to start
Hello,
Thanks in advance for the assistance. I have compiled and successfully run
NAMD 2.9 on a cluster, but I cannot run on more than 4 nodes. The cluster
has an InfiniBand interconnect; each node has dual 6-core Xeon processors
running CentOS 5.x, with MPI mvapich-1.2.0-gcc-x86_64 and a torque queuing
system.
If I run on 4 nodes, 12 ppn, 48 cores, everything works and simulations are
all OK. Using the EXACT same input files, with the only change being the
number of nodes and cores in the torque submission script, the run fails. I
have no idea why this occurs; I am certain that I have access to the
resources (I can run other MPI programs on any number of nodes). The problem
occurs with version 2.6 of NAMD and also with the CUDA version of NAMD 2.9.
Any ideas are greatly appreciated. Details of the problem follow:
MPI version: mvapich-1.2.0-gcc-x86_64
Compiled NAMD 2.9 and Charm with the following commands:
Charm:
env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --no-build-shared
--with-production
NAMD:
./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
-------- torque script to launch namd---------
#!/bin/bash
#PBS -N namd2
#PBS -l nodes=5:ppn=12
#PBS -q short
#PBS -V
NAMD_CONF="$PBS_O_WORKDIR/namd.conf"
NAMD_EXEC="/home/ststobe/NAMD_exe_NOCUDA/namd2"
HOSTFILE=$PBS_NODEFILE
cd $PBS_O_WORKDIR
export LD_LIBRARY_PATH=/home/ststobe/NAMD_exe_NOCUDA:$LD_LIBRARY_PATH
mpirun_rsh -rsh -np 60 -hostfile $HOSTFILE $NAMD_EXEC $NAMD_CONF >
$PBS_O_WORKDIR/namd.$PBS_JOBID
--------------------------------------
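One resource limit worth checking with this setup: mvapich over InfiniBand needs to register (lock) memory, and a low `max locked memory` (`ulimit -l`) on a single node is a classic cause of jobs that die only once that node joins the allocation. A hedged sketch that polls each host follows; the hostnames are made-up placeholders for the output of `sort -u $PBS_NODEFILE`:

```shell
#!/bin/bash
# Sketch: report the locked-memory limit on every host in the allocation.
# The hostnames are fabricated placeholders; in a job you would loop over
# `sort -u $PBS_NODEFILE`. Unreachable hosts are reported rather than hung on.
for h in i11 i12 i13 i14 i18; do
    printf '%s: %s\n' "$h" \
        "$(ssh -o BatchMode=yes -o ConnectTimeout=2 "$h" 'ulimit -l' \
           2>/dev/null || echo unreachable)"
done | tee /tmp/memlock.txt
```

A host reporting a much smaller limit than the others (or `unreachable`) would be the one to suspect.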
NAMD output file for run on 4 nodes, 12 ppn, 48 cores:
----------------------------------------------
Charm++> Running on MPI version: 1.2
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired:
MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
Warning> Randomization of stack pointer is turned on in kernel, thread
migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space'
as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 4 unique compute nodes (12-way SMP).
Charm++> cpu topology info is gathered in 0.089 seconds.
Info: NAMD 2.9 for Linux-x86_64-MPI
... then the rest of the startup output and all works fine....
------------------------------------------------
NAMD output file for run on 5 nodes, 12 ppn, 60 cores:
------------------------------------------------
Charm++> Running on MPI version: 1.2
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired:
MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
Warning> Randomization of stack pointer is turned on in kernel, thread
migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space'
as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Signal 15 received.
Signal 15 received.
Signal 15 received.
Signal 15 received.
Signal 15 received.
... and I get this from the stderr output from the torque system
MPI process terminated unexpectedly
Exit code -5 signaled from i18
Killing remote processes...MPI process terminated unexpectedly
MPI process terminated unexpectedly
MPI process terminated unexpectedly
MPI process terminated unexpectedly
DONE
------------------------------------------------
This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:54 CST