Re: namd 2.9 run instability (segfaults)

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Sep 07 2012 - 01:10:23 CDT

Hi Michael,

First of all, you should make sure that it is not a compilation issue, so you
could try a precompiled NAMD binary and see if the problem persists. If the
ibverbs version doesn't work, try the UDP version (it can still use the IB
network if you have IPoIB configured); see the charmrun sketch further below.
A segfault is usually a programming error, but it can also occur if you mix
old and new libraries. Since you say the segfault occurs randomly, it can
depend on the order of nodes. Check whether it is always the same node that
fails:

mpirun noticed that process rank 44 with PID 18856 on node node011 <---

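If you can rerun the apoa1 benchmark a few times, something like the following
would show whether it is always node011 that gets blamed (just a sketch; the
mpirun line is taken from your mail, the log file names are made up):

for i in 1 2 3 4 5; do
    # keep each run's output so the segfault notices can be compared later
    mpirun -machinefile nodes -np 96 \
        /shared/namd-2.9/Linux-x86_64-g++/namd2 apoa1/apoa1.namd > run_$i.log 2>&1
done
# show which node gets blamed in each failing run
grep "exited on signal" run_*.log
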
Also check whether all your nodes use the same shared libraries. You can do
that for example with "ldd namd2" on each node. Especially differing GLIBC
versions can cause a segfault.
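
Only a sketch (it assumes passwordless ssh and that the machinefile lists
plain hostnames), but something like this compares the resolved libraries and
the glibc version across all nodes:

for n in $(awk '{print $1}' nodes | sort -u); do
    echo "== $n =="
    # identical checksums mean every node resolves the same set of libraries
    ssh "$n" "ldd /shared/namd-2.9/Linux-x86_64-g++/namd2 | md5sum; ldd --version | head -1"
done

If the checksums or the glibc versions differ between nodes, that is a good
candidate for the random segfaults.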

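In case you do try the precompiled ibverbs (or UDP) binary, the invocation
would look roughly like this (only a sketch; the unpack directory, node names
and processor count are my assumptions):

cd NAMD_2.9_Linux-x86_64-ibverbs    # wherever you unpacked the release
cat > nodelist <<EOF
group main
host node010
host node011
EOF
./charmrun ++p 96 ++nodelist nodelist ./namd2 apoa1/apoa1.namd
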
Let us know.

Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Michael Galloway
> Sent: Thursday, September 6, 2012 16:19
> To: namd-l
> Subject: namd-l: namd 2.9 run instability (segfaults)
>
> Good day all,
>
> We are having some issues running NAMD 2.9 on our new cluster. We are
> using QLogic IB / OpenMPI 1.6.1 / GCC 4.4.6 (CentOS 6.2).
>
> During testing we are experiencing some random segfaults, such as:
>
> [mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 96 /shared/namd-2.9/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
> Charm++> Running on MPI version: 2.1
> Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
> Charm++> Running on non-SMP mode
> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
>
> namd2:18856 terminated with signal 11 at PC=3104a0947f SP=7fff480eb420.
> Backtrace:
> CharmLB> Load balancer assumes all CPUs are same.
> --------------------------------------------------------------------------
> mpirun noticed that process rank 44 with PID 18856 on node node011
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> The same run may complete fine one time and segfault the next. I've run
> scaling tests from 12 to 384 processors and scaling is good, but runs at
> all sizes exhibit occasional segfaults.
>
> Charm++ and NAMD were built as the docs indicate:
>
> Build and test the Charm++/Converse library (MPI version):
> env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production
>
> Download and install TCL and FFTW libraries:
> (cd to NAMD_2.9_Source if you're not already there)
> wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
> tar xzf fftw-linux-x86_64.tar.gz
> mv linux-x86_64 fftw
> wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64.tar.gz
> wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
> tar xzf tcl8.5.9-linux-x86_64.tar.gz
> tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
> mv tcl8.5.9-linux-x86_64 tcl
> mv tcl8.5.9-linux-x86_64-threaded tcl-threaded
>
> Set up build directory and compile:
> MPI version: ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
> cd Linux-x86_64-g++
> make (or gmake -j4, which should run faster)
>
> openmpi was built with:
>
> ./configure --prefix=/shared/openmpi-1.6.1/gcc --enable-static
> --without-tm -with-openib=/usr --with-psm=/usr CC=gcc CXX=g++
> F77=gfortran FC=gfortran --enable-mpi-thread-multiple
>
> when it completes, scaling looks like:
> Info: Benchmark time: 12 CPUs 0.116051 s/step 1.34318 days/ns 257.895 MB memory
> Info: Benchmark time: 24 CPUs 0.0596373 s/step 0.690247 days/ns 247.027 MB memory
> Info: Benchmark time: 48 CPUs 0.0303531 s/step 0.351309 days/ns 249.84 MB memory
> Info: Benchmark time: 96 CPUs 0.0161126 s/step 0.186489 days/ns 249.98 MB memory
> Info: Benchmark time: 192 CPUs 0.00904823 s/step 0.104725 days/ns 267.719 MB memory
> Info: Benchmark time: 384 CPUs 0.00490486 s/step 0.0567692 days/ns 313.637 MB memory
>
> My testing with other applications built on the same system (a simple
> MPI hello world and NWChem 6.1.1) shows they run well.
>
> Any suggestions where the issue might be? Thanks!
>
> --- michael
>
>
