Re: Scaling problem: 4 nodes OK, 5 fails to start

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Thu Jan 17 2013 - 00:14:37 CST

Hi Spencer,

 

I wouldn't call it a "scaling problem". Signal 15 is the ordinary SIGTERM.
The question is why your namd processes receive this signal, and from where.
It's quite likely that your queuing system kills the job for some reason
(maybe you exceed a resource limit). Check the log files of Torque and of
the nodes and see whether they contain any information. Additionally, check
whether the 5th node itself could be the problem; if it is always the same
machine, try disabling it temporarily.
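
A sketch of those checks, assuming standard Torque tooling (log locations vary by installation, and the job id here is illustrative; i18 is the node named in the stderr output below):

```shell
# Trace the failed job through the Torque server logs
# (searches the last 3 days; 12345 is an illustrative job id).
tracejob -n 3 12345

# Look for the kill reason in the pbs_mom logs on the suspect node
# (log path varies by installation).
grep -iE "kill|sigterm" /var/log/torque/mom_logs/*

# If node i18 is always the one failing, mark it offline temporarily...
pbsnodes -o i18
# ...and clear the offline flag once it has been tested.
pbsnodes -c i18
```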

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Stober, Spencer T
Sent: Wednesday, January 16, 2013 20:39
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Scaling problem: 4 nodes OK, 5 fails to start

 

Hello,

 

Thanks in advance for the assistance. I have compiled and successfully run
NAMD 2.9 on a cluster but cannot run on more than 4 nodes. The cluster has
an InfiniBand interconnect; each node has dual 6-core Xeon processors
running CentOS 5.x, with mvapich-1.2.0-gcc-x86_64 MPI and a Torque queuing
system.

 

If I run on 4 nodes (12 ppn, 48 cores), everything works and the simulations
are all OK. Using the EXACT same input files, with the only change being the
number of nodes and cores in the Torque submission script, the run fails. I
have no idea why this occurs; I am certain that I have access to the
resources (I can run other MPI programs on any number of nodes). The problem
also occurs with NAMD 2.6 and with the CUDA version of NAMD 2.9.

 

Any ideas are greatly appreciated. Details of the problem follow:

 

MPI version: mvapich-1.2.0-gcc-x86_64

 

Compiled NAMD 2.9 and Charm with the following commands:

 

Charm:

env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --no-build-shared --with-production

 

NAMD:

./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64

 

-------- torque script to launch namd---------

#!/bin/bash

#PBS -N namd2

#PBS -l nodes=5:ppn=12

#PBS -q short

#PBS -V

 

NAMD_CONF="$PBS_O_WORKDIR/namd.conf"

NAMD_EXEC="/home/ststobe/NAMD_exe_NOCUDA/namd2"

 

HOSTFILE=$PBS_NODEFILE

cd $PBS_O_WORKDIR

export LD_LIBRARY_PATH=/home/ststobe/NAMD_exe_NOCUDA:$LD_LIBRARY_PATH

mpirun_rsh -rsh -np 60 -hostfile $HOSTFILE $NAMD_EXEC $NAMD_CONF > $PBS_O_WORKDIR/namd.$PBS_JOBID

--------------------------------------
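
One quick sanity check worth running before blaming NAMD (not from the original script; node names are illustrative): confirm that the nodefile Torque hands the job really contains 60 slots on 5 unique hosts, since a mismatch between the hardcoded -np 60 and the hostfile can make mpirun_rsh fail at startup.

```shell
#!/bin/bash
# Simulate a $PBS_NODEFILE with 5 nodes x 12 ppn (hypothetical node names).
NODEFILE=$(mktemp)
for n in i14 i15 i16 i17 i18; do
  for i in $(seq 12); do echo "$n"; done
done > "$NODEFILE"

# Total slots must match the -np argument passed to mpirun_rsh.
SLOTS=$(wc -l < "$NODEFILE")
# Unique hosts must match the requested node count (nodes=5).
HOSTS=$(sort -u "$NODEFILE" | wc -l)
echo "slots=$SLOTS hosts=$HOSTS"   # prints: slots=60 hosts=5
rm -f "$NODEFILE"
```

In the real job, run the same wc/sort commands against $PBS_NODEFILE inside the submission script and compare against the -np value.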

 

 

NAMD output file for run on 4 nodes, 12 ppn, 48 cores:

----------------------------------------------

Charm++> Running on MPI version: 1.2

Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired:
MPI_THREAD_SINGLE)

Charm++> Running on non-SMP mode

Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21

Warning> Randomization of stack pointer is turned on in kernel, thread
migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space'
as root to disable it, or try run with '+isomalloc_sync'.

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 4 unique compute nodes (12-way SMP).

Charm++> cpu topology info is gathered in 0.089 seconds.

Info: NAMD 2.9 for Linux-x86_64-MPI

 

... then the rest of the startup output and all works fine....

------------------------------------------------

 

NAMD output file for run on 5 nodes, 12 ppn, 60 cores:

------------------------------------------------

Charm++> Running on MPI version: 1.2

Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired:
MPI_THREAD_SINGLE)

Charm++> Running on non-SMP mode

Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21

Warning> Randomization of stack pointer is turned on in kernel, thread
migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space'
as root to disable it, or try run with '+isomalloc_sync'.

CharmLB> Load balancer assumes all CPUs are same.

Signal 15 received.

Signal 15 received.

Signal 15 received.

Signal 15 received.

Signal 15 received.

 

... and I get this in the stderr output from the torque system:

 

MPI process terminated unexpectedly

Exit code -5 signaled from i18

Killing remote processes...MPI process terminated unexpectedly

MPI process terminated unexpectedly

MPI process terminated unexpectedly

MPI process terminated unexpectedly

DONE

------------------------------------------------

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:54 CST