namd 2.9 run instability (segfaults)

From: Michael Galloway
Date: Thu Sep 06 2012 - 09:19:05 CDT

good day all,

we are having some issues running namd 2.9 on our new cluster. we are
using qlogic IB, openmpi 1.6.1, and gcc 4.4.6 on CentOS 6.2.

during testing we are experiencing some random segfaults such as:

[mgx@cmbcluster namd]$ mpirun -machinefile nodes -np 96 /shared/namd-2.9/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
Charm++> Running on MPI version: 2.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21

namd2:18856 terminated with signal 11 at PC=3104a0947f SP=7fff480eb420.
CharmLB> Load balancer assumes all CPUs are same.
mpirun noticed that process rank 44 with PID 18856 on node node011
exited on signal 11 (Segmentation fault).

a given run sometimes completes fine and sometimes segfaults. i've run
scaling tests from 12 to 384 processors and scaling is good, but
segfaults show up occasionally at every processor count.
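
Because the failure is intermittent, it can help to put a number on how often it happens. The following is a minimal sketch, not part of the original test procedure: `count_failures` is a hypothetical helper name, and it assumes only that the launched command returns a non-zero exit status when a run fails (as mpirun did in the log above).

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints how many of those
# runs exited with a non-zero status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Discard program output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example invocation (commented out so the sketch has no side effects):
# count_failures 20 mpirun -machinefile nodes -np 96 \
#     /shared/namd-2.9/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
```

Repeating this at several processor counts would show whether the failure rate depends on the size of the job or on which nodes are in the machinefile.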

charm++ and namd were built as the docs indicate:

Build and test the Charm++/Converse library (MPI version):
   env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production

Download and install TCL and FFTW libraries:
   (cd to NAMD_2.9_Source if you're not already there)
   tar xzf fftw-linux-x86_64.tar.gz
   mv linux-x86_64 fftw
   tar xzf tcl8.5.9-linux-x86_64.tar.gz
   tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
   mv tcl8.5.9-linux-x86_64 tcl
   mv tcl8.5.9-linux-x86_64-threaded tcl-threaded

Set up build directory and compile:
   MPI version: ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
   cd Linux-x86_64-g++
   make (or gmake -j4, which should run faster)

openmpi was built with:

./configure --prefix=/shared/openmpi-1.6.1/gcc --enable-static \
    --without-tm --with-openib=/usr --with-psm=/usr CC=gcc CXX=g++ \
    F77=gfortran FC=gfortran --enable-mpi-thread-multiple

when a run completes, scaling looks like:

Info: Benchmark time: 12 CPUs 0.116051 s/step 1.34318 days/ns 257.895 MB memory
Info: Benchmark time: 24 CPUs 0.0596373 s/step 0.690247 days/ns 247.027 MB memory
Info: Benchmark time: 48 CPUs 0.0303531 s/step 0.351309 days/ns 249.84 MB memory
Info: Benchmark time: 96 CPUs 0.0161126 s/step 0.186489 days/ns 249.98 MB memory
Info: Benchmark time: 192 CPUs 0.00904823 s/step 0.104725 days/ns 267.719 MB memory
Info: Benchmark time: 384 CPUs 0.00490486 s/step 0.0567692 days/ns 313.637 MB memory
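
To read the benchmark lines as scaling numbers, one option is parallel efficiency relative to the 12-CPU run: (12 * t_12) / (N * t_N), where t is the s/step value. The sketch below is only an illustration of that arithmetic; the function name `efficiency` is hypothetical, and the choice of the 12-CPU run as the baseline is an assumption.

```shell
#!/bin/sh
# efficiency BASE_CPUS BASE_TIME CPUS TIME
# Hypothetical helper: prints parallel efficiency relative to a baseline
# run, computed as (BASE_CPUS * BASE_TIME) / (CPUS * TIME).
# Times are the s/step values from NAMD's "Benchmark time" lines.
efficiency() {
    # LC_ALL=C keeps "." as the decimal separator regardless of locale.
    LC_ALL=C awk -v bc="$1" -v bt="$2" -v c="$3" -v t="$4" \
        'BEGIN { printf "%.3f\n", (bc * bt) / (c * t) }'
}

# Example invocations (commented out; values taken from the lines above):
# efficiency 12 0.116051 24 0.0596373
# efficiency 12 0.116051 384 0.00490486
```

With the figures above this gives roughly 0.97 at 24 CPUs and roughly 0.74 at 384 CPUs.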

other applications built with the same system (a simple mpihello and
nwchem 6.1.1) run well in my testing.

any suggestions where the issue might be? thanks!

--- michael

