namd 2.9 run instability (segfaults)

From: Michael Galloway (gallowaymd_at_ornl.gov)
Date: Thu Sep 06 2012 - 09:19:05 CDT

good day all,

we are having some issues running namd 2.9 on our new cluster. we are
using QLogic IB / OpenMPI 1.6.1 / gcc 4.4.6 (CentOS 6.2).

during testing we are experiencing some random segfaults such as:

[mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 96 /shared/namd-2.9/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
Charm++> Running on MPI version: 2.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21

namd2:18856 terminated with signal 11 at PC=3104a0947f SP=7fff480eb420.
Backtrace:
CharmLB> Load balancer assumes all CPUs are same.
--------------------------------------------------------------------------
mpirun noticed that process rank 44 with PID 18856 on node node011 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
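
the Backtrace: line above comes up empty, so to get a usable stack trace
i'm going to try enabling core dumps on the nodes and reading the core
with gdb. a rough sketch (the binary path is from the run above; the core
file name depends on each node's core_pattern/core_uses_pid settings):

   # on each compute node, allow core files before launching the job
   ulimit -c unlimited
   # after a crash, on the failing node (node011 above):
   gdb /shared/namd-2.9/Linux-x86_64-g++/namd2 core.18856
   (gdb) bt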

a given run may complete fine one time and segfault the next. i've run
scaling tests from 12 to 384 processors and scaling is good, but runs at
every size show occasional segfaults.

charm++ and namd were built as the docs indicate:

Build and test the Charm++/Converse library (MPI version):
   env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production
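
(the same notes also suggest verifying the charm++ build with its megatest
suite before building namd; as best i can tell the invocation is roughly
the following, with an arbitrary process count:)
   cd mpi-linux-x86_64/tests/charm++/megatest
   make pgm
   mpirun -np 4 ./pgm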

Download and install TCL and FFTW libraries:
   (cd to NAMD_2.9_Source if you're not already there)
   wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
   tar xzf fftw-linux-x86_64.tar.gz
   mv linux-x86_64 fftw
   wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64.tar.gz
   wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
   tar xzf tcl8.5.9-linux-x86_64.tar.gz
   tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
   mv tcl8.5.9-linux-x86_64 tcl
   mv tcl8.5.9-linux-x86_64-threaded tcl-threaded

Set up build directory and compile:
   MPI version: ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
   cd Linux-x86_64-g++
   make (or gmake -j4, which should run faster)
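
(for anyone reproducing this, the release notes also list quick single-node
smoke tests after the build; a sketch, where src/alanin ships with the namd
source and the path may need adjusting relative to the build directory:)
   mpirun -np 1 ./namd2
   mpirun -np 2 ./namd2 src/alanin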

openmpi was built with:

   ./configure --prefix=/shared/openmpi-1.6.1/gcc --enable-static \
       --without-tm --with-openib=/usr --with-psm=/usr \
       CC=gcc CXX=g++ F77=gfortran FC=gfortran --enable-mpi-thread-multiple
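
one experiment still on my list is pinning a run to a single interconnect
stack, since the build pulls in both PSM and the verbs path, to see which
one the segfaults follow. a sketch using openmpi 1.6 mca parameters
(spellings worth double-checking against ompi_info on this install):

   # run over the QLogic PSM MTL only
   mpirun --mca pml cm --mca mtl psm -machinefile nodes -np 96 \
       /shared/namd-2.9/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
   # or run over the openib BTL instead
   mpirun --mca pml ob1 --mca btl openib,sm,self -machinefile nodes -np 96 \
       /shared/namd-2.9/Linux-x86_64-g++/namd2 apoa1/apoa1.namd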

when a run does complete, scaling looks like this:
Info: Benchmark time: 12 CPUs 0.116051 s/step 1.34318 days/ns 257.895 MB memory
Info: Benchmark time: 24 CPUs 0.0596373 s/step 0.690247 days/ns 247.027 MB memory
Info: Benchmark time: 48 CPUs 0.0303531 s/step 0.351309 days/ns 249.84 MB memory
Info: Benchmark time: 96 CPUs 0.0161126 s/step 0.186489 days/ns 249.98 MB memory
Info: Benchmark time: 192 CPUs 0.00904823 s/step 0.104725 days/ns 267.719 MB memory
Info: Benchmark time: 384 CPUs 0.00490486 s/step 0.0567692 days/ns 313.637 MB memory

my testing with other applications built with the same toolchain (a
simple mpihello and NWChem 6.1.1) seems to run fine.
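
to rule out the fabric itself, another throwaway test is looping the
trivial mpi job at the same 96-rank layout as the failing namd runs
(./mpihello stands in for wherever our hello-world binary actually lives):

   for i in $(seq 1 20); do
       mpirun -machinefile nodes -np 96 ./mpihello || break
   done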

any suggestions as to where the issue might be? thanks!

--- michael
