From: Michael Galloway (gallowaymd_at_ornl.gov)
Date: Thu Sep 06 2012 - 09:19:05 CDT
Good day all,
We are having some issues running NAMD 2.9 on our new cluster. We are
using QLogic IB, OpenMPI 1.6.1, and gcc 4.4.6 on CentOS 6.2.
During testing we are experiencing some random segfaults, such as:
[mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 96
/shared/namd-2.9/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
Charm++> Running on MPI version: 2.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
namd2:18856 terminated with signal 11 at PC=3104a0947f SP=7fff480eb420.
Backtrace:
CharmLB> Load balancer assumes all CPUs are same.
--------------------------------------------------------------------------
mpirun noticed that process rank 44 with PID 18856 on node node011
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The same job may run fine one time and segfault the next. I've run
scaling tests from 12 to 384 processors and scaling is good, but runs at
every processor count show occasional segfaults.
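(The Backtrace line above comes back empty, so as a next step I may try to
capture a core file and inspect it; this is just a sketch I haven't run yet,
and the core file name below is only an example, it depends on core_pattern:)
ulimit -c unlimited    # in the job environment on each node, before mpirun
gdb /shared/namd-2.9/Linux-x86_64-g++/namd2 core.18856    # on the node that crashed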
Charm++ and NAMD were built as the docs indicate:
Build and test the Charm++/Converse library (MPI version):
env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production
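For what it's worth, the notes also describe a quick megatest sanity check of
the Charm++ MPI build; as best I recall the commands are roughly the following
(adjust paths to your tree):
cd mpi-linux-x86_64/tests/charm++/megatest
make pgm
mpirun -np 4 ./pgm    # run like any other MPI program
cd ../../../..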
Download and install TCL and FFTW libraries:
(cd to NAMD_2.9_Source if you're not already there)
wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
tar xzf fftw-linux-x86_64.tar.gz
mv linux-x86_64 fftw
wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64.tar.gz
wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
tar xzf tcl8.5.9-linux-x86_64.tar.gz
tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
mv tcl8.5.9-linux-x86_64 tcl
mv tcl8.5.9-linux-x86_64-threaded tcl-threaded
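(A quick check that ./config will find the libraries under the default names
it expects; this just lists the directories created above:)
ls -d fftw tcl tcl-threaded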
Set up build directory and compile:
MPI version: ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
cd Linux-x86_64-g++
make (or gmake -j4, which should run faster)
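A minimal smoke test I'd run on the freshly built binary, assuming the MPI
build is launched like any other MPI executable (src/alanin is the tiny test
case that ships with the source; the relative path may differ in your build
directory):
mpirun -np 2 ./namd2 src/alanin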
OpenMPI was built with:
./configure --prefix=/shared/openmpi-1.6.1/gcc --enable-static
--without-tm --with-openib=/usr --with-psm=/usr CC=gcc CXX=g++
F77=gfortran FC=gfortran --enable-mpi-thread-multiple
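Since this OpenMPI has both PSM and openib support compiled in, I'm not sure
which transport is actually picked at run time. My understanding is it can be
checked and forced roughly like this (MCA names as I understand them for
OpenMPI 1.6; corrections welcome):
ompi_info | grep -i psm                                     # is the PSM MTL component there?
mpirun --mca mtl psm -machinefile nodes -np 96 ...          # force the QLogic PSM path
mpirun --mca pml ob1 --mca btl openib,sm,self -np 96 ...    # force the verbs/openib path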
When a run completes, the scaling looks like:
Info: Benchmark time: 12 CPUs 0.116051 s/step 1.34318 days/ns 257.895 MB memory
Info: Benchmark time: 24 CPUs 0.0596373 s/step 0.690247 days/ns 247.027 MB memory
Info: Benchmark time: 48 CPUs 0.0303531 s/step 0.351309 days/ns 249.84 MB memory
Info: Benchmark time: 96 CPUs 0.0161126 s/step 0.186489 days/ns 249.98 MB memory
Info: Benchmark time: 192 CPUs 0.00904823 s/step 0.104725 days/ns 267.719 MB memory
Info: Benchmark time: 384 CPUs 0.00490486 s/step 0.0567692 days/ns 313.637 MB memory
My testing with other applications built on the same system (a simple
MPI hello world and NWChem 6.1.1) shows they run fine.
Any suggestions as to where the issue might be? Thanks!
--- michael