RE: RE: namd 2.9 run instability (segfaults)

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Sep 10 2012 - 02:09:43 CDT

Hi,

> -----Original Message-----
> From: Michael Galloway [mailto:gallowaymd_at_ornl.gov]
> Sent: Friday, September 7, 2012 16:29
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: RE: namd-l: namd 2.9 run instability (segfaults)
>
> hmmm ...
>
> i don't seem to understand how to run the binaries:
>
> /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/charmrun +p24
> ++nodelist nodes ++mpiexec ++remote-shell mpiexec
> /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2
> /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/apoa1/apoa1.namd
>
> Charmrun> IBVERBS version of charmrun
> Charmrun> started all node programs in 0.142 seconds.
> Charmrun: error on request socket--
> Socket closed before recv.

This error is not your fault; it appears when the InfiniBand runtime that the
precompiled NAMD was built against is incompatible with your InfiniBand setup.
That is why I suggested trying the UDP binary over plain Ethernet or IPoIB.
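
A launch of the plain (non-ibverbs) UDP build with charmrun over Ethernet or
IPoIB could look roughly like the sketch below; the NAMD_2.9_Linux-x86_64
directory name and the host names in the nodelist are only placeholders for
your setup:

# Sketch only: run the precompiled net/UDP build with plain charmrun,
# so neither MPI nor the bundled ibverbs runtime is involved.
# Directory name and hostnames are placeholders.
cat > nodelist <<'EOF'
group main
  host node001
  host node002
EOF

./NAMD_2.9_Linux-x86_64/charmrun +p24 ++nodelist nodelist ++remote-shell ssh \
    ./NAMD_2.9_Linux-x86_64/namd2 apoa1/apoa1.namd

If that runs stably across the same nodes, the crashes are more likely in the
MPI/ibverbs stack than in NAMD itself.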

>
> leaves this file:
>
> [mgx_at_cmbcluster NAMD_2.9_Linux-x86_64-ibverbs-smp]$ more charmrun.5084
> #!/bin/sh
> Echo() {
> echo 'Charmrun remote shell(127.0.0.1.0)>' $*
> }
> Exit() {
> if [ $1 -ne 0 ]
> then
> Echo Exiting with error code $1
> fi
> exit $1
> }
> Find() {
> loc=''
> for dir in `echo $PATH | sed -e 's/:/ /g'`
> do
> test -f "$dir/$1" && loc="$dir/$1"
> done
> if [ "x$loc" = x ]
> then
> Echo $1 not found in your PATH "($PATH)"--
> Echo set your path in your ~/.charmrunrc
> Exit 1
> fi
> }
> test -f "$HOME/.charmrunrc" && . "$HOME/.charmrunrc"
> DISPLAY='localhost:10.0';export DISPLAY
> NETMAGIC="5084";export NETMAGIC
> CmiMyNode=$OMPI_COMM_WORLD_RANK
> test -z "$CmiMyNode" && CmiMyNode=$MPIRUN_RANK
> test -z "$CmiMyNode" && CmiMyNode=$PMI_RANK
> test -z "$CmiMyNode" && CmiMyNode=$PMI_ID
> test -z "$CmiMyNode" && (Echo Could not detect rank from environment ;
> Exit 1)
> export CmiMyNode
> NETSTART="$CmiMyNode 172.16.100.1 34705 5084 0";export NETSTART
> CmiMyNodeSize='1'; export CmiMyNodeSize
> CmiMyForks='0'; export CmiMyForks
> CmiNumNodes=$OMPI_COMM_WORLD_SIZE
> test -z "$CmiNumNodes" && CmiNumNodes=$MPIRUN_NPROCS
> test -z "$CmiNumNodes" && CmiNumNodes=$PMI_SIZE
> test -z "$CmiNumNodes" && (Echo Could not detect node count from
> environment ; Exit 1)
> export CmiNumNodes
> PATH="$PATH:/bin:/usr/bin:/usr/X/bin:/usr/X11/bin:/usr/local/bin:/usr/X
> 11R6/bin:/usr/openwin/bin"
> if test ! -x
> "/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-
> smp/namd2"
> then
> Echo 'Cannot locate this node-program:
> /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2'
> Exit 1
> fi
> cd "/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp"
> if test $? = 1
> then
> Echo 'Cannot propagate this current directory:'
> Echo '/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp'
> [mgx_at_cmbcluster NAMD_2.9_Linux-x86_64-ibverbs-smp]$ ls -l
> /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2
> -rwxr-xr-x 1 mgx mgx 17025259 Apr 30 15:04
> /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2
> [mgx_at_cmbcluster NAMD_2.9_Linux-x86_64-ibverbs-smp]$ vim ~/.bashrc
> [mgx_at_cmbcluster NAMD_2.9_Linux-x86_64-ibverbs-smp]$ which namd2
> /shared/namd-bin/NAMD_2.9_Linux-x86_64-ibverbs/namd2
> [mgx_at_cmbcluster NAMD_2.9_Linux-x86_64-ibverbs-smp]$ more charmrun.5084
> #!/bin/sh
> Echo() {
> echo 'Charmrun remote shell(127.0.0.1.0)>' $*
> }
> Exit() {
> if [ $1 -ne 0 ]
> then
> Echo Exiting with error code $1
> fi
> exit $1
> }
> Find() {
> loc=''
> for dir in `echo $PATH | sed -e 's/:/ /g'`
> do
> test -f "$dir/$1" && loc="$dir/$1"
> done
> if [ "x$loc" = x ]
> then
> Echo $1 not found in your PATH "($PATH)"--
> Echo set your path in your ~/.charmrunrc
> Exit 1
> fi
> }
> test -f "$HOME/.charmrunrc" && . "$HOME/.charmrunrc"
> DISPLAY='localhost:10.0';export DISPLAY
> NETMAGIC="5084";export NETMAGIC
> CmiMyNode=$OMPI_COMM_WORLD_RANK
> test -z "$CmiMyNode" && CmiMyNode=$MPIRUN_RANK
> test -z "$CmiMyNode" && CmiMyNode=$PMI_RANK
> test -z "$CmiMyNode" && CmiMyNode=$PMI_ID
> test -z "$CmiMyNode" && (Echo Could not detect rank from environment ;
> Exit 1)
> export CmiMyNode
> NETSTART="$CmiMyNode 172.16.100.1 34705 5084 0";export NETSTART
> CmiMyNodeSize='1'; export CmiMyNodeSize
> CmiMyForks='0'; export CmiMyForks
> CmiNumNodes=$OMPI_COMM_WORLD_SIZE
> test -z "$CmiNumNodes" && CmiNumNodes=$MPIRUN_NPROCS
> test -z "$CmiNumNodes" && CmiNumNodes=$PMI_SIZE
> test -z "$CmiNumNodes" && (Echo Could not detect node count from
> environment ; Exit 1)
> export CmiNumNodes
> PATH="$PATH:/bin:/usr/bin:/usr/X/bin:/usr/X11/bin:/usr/local/bin:/usr/X
> 11R6/bin:/usr/openwin/bin"
> if test ! -x
> "/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-
> smp/namd2"
> then
> Echo 'Cannot locate this node-program:
> /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2'
> Exit 1
> fi
> cd "/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp"
> if test $? = 1
> then
> Echo 'Cannot propagate this current directory:'
> Echo '/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp'
> Exit 1
> fi
> rm -f /tmp/charmrun_err.$$
> ("/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-
> smp/namd2" /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-
> ibverbs-smp/apoa1/apoa1.namd
> res=$?
> if [ $res -eq 127 ]
> then
> (
> "/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2"
> ldd "/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2"
> ) > /tmp/charmrun_err.$$ 2>&1
> fi
> ) < /dev/null 1> /dev/null 2> /dev/null
> sleep 1
> if [ -r /tmp/charmrun_err.$$ ]
> then
> cat /tmp/charmrun_err.$$
> rm -f /tmp/charmrun_err.$$
> Exit 1
> fi
> Exit 0
>
> segfaults seem to occur on mostly random nodes:

Seems like it. Have you built namd on one of the nodes?
Please, before we go on, try a precompiled namd binary to check that it is not
a compilation issue.
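
To see whether the failing host really varies, a quick tally over the mpirun
output can help; a rough sketch, assuming the job output was captured in files
named run*.log (adjust the names to your runs):

# Sketch: count how often each node shows up in mpirun's segfault notices.
grep -h 'mpirun noticed that process rank' run*.log \
  | sed -n 's/.*on node \([^ ]*\).*/\1/p' \
  | sort | uniq -c | sort -rn

If one or two hosts dominate the list, I would look at those machines first;
if the counts are spread evenly, a software mismatch is the more likely
suspect.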

>
> [mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 384
> /shared/namd-2.9-bart/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
>
> namd2:24474 terminated with signal 11 at PC=7f99ffcaca0b
> SP=7fffbd9e5b50. Backtrace:
> Charm++> Running on MPI version: 2.1
> Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
> MPI_THREAD_SINGLE)
> Charm++> Running on non-SMP mode
> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
>
> namd2:25005 terminated with signal 11 at PC=7f2f000682a0
> SP=7fff2e7fdd78. Backtrace:
> -----------------------------------------------------------------------
> ---
> mpirun noticed that process rank 203 with PID 24474 on node node034
> exited on signal 11 (Segmentation fault).
> -----------------------------------------------------------------------
> ---
> [mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 384
> /shared/namd-2.9-bart/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
> Charm++> Running on MPI version: 2.1
> Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
> MPI_THREAD_SINGLE)
> Charm++> Running on non-SMP mode
> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
>
> namd2:24874 terminated with signal 11 at PC=7f5c7fdf56b6
> SP=7f5c7e617e20. Backtrace:
>
> namd2:24856 terminated with signal 11 at PC=3efd60947f SP=7fffc070b500.
> Backtrace:
>
> namd2:24857 terminated with signal 11 at PC=7f6e0006c0c0
> SP=7fff36be2c28. Backtrace:
> /shared/openmpi-1.6.1/gcc/lib/libmpi.so.1(mca_pml_cm_irecv+0x0)[0x7f6e0006c0c0]
> /shared/openmpi-1.6.1/gcc/lib/libmpi.so.1(ompi_coll_tuned_sendrecv_actual+0x7f)[0x7f6dfffd2bff]
> /shared/openmpi-1.6.1/gcc/lib/libmpi.so.1(ompi_coll_tuned_barrier_intra_bruck+0x9a)[0x7f6dfffdaffa]
> /shared/openmpi-1.6.1/gcc/lib/libmpi.so.1(MPI_Barrier+0x8e)[0x7f6dfff732fe]
> /shared/namd-2.9-bart/Linux-x86_64-g++/namd2(CmiBarrier+0x13)[0xa9a636]
>
> namd2:24933 terminated with signal 11 at PC=3a7600944b SP=7fff817c06b0.
> Backtrace:
> -----------------------------------------------------------------------
> ---
> mpirun noticed that process rank 288 with PID 24874 on node node017
> exited on signal 11 (Segmentation fault).
> -----------------------------------------------------------------------
> ---
> 2 total processes killed (some possibly by mpirun during cleanup)
> [mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 384
> /shared/namd-2.9-bart/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
> Charm++> Running on MPI version: 2.1
> Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
> MPI_THREAD_SINGLE)
> Charm++> Running on non-SMP mode
> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
> CharmLB> Load balancer assumes all CPUs are same.
> Charm++> Running on 34 unique compute nodes (12-way SMP).
> Charm++> cpu topology info is gathered in 0.005 seconds.
> Info: NAMD 2.9 for Linux-x86_64-MPI
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60400 for mpi-linux-x86_64-gfortran-mpicxx
> Info: Built Thu Sep 6 18:24:45 EDT 2012 by root on cmbcluster
> Info: 1 NAMD 2.9 Linux-x86_64-MPI 384 node001 mgx
> Info: Running on 384 processors, 384 nodes, 34 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.0180719 s
> Info: 159.988 MB of memory in use based on /proc/self/stat
> Info: Configuration file is apoa1/apoa1.namd
> Info: Changed directory to apoa1
> TCL: Suspending until startup complete.
> Info: SIMULATION PARAMETERS:
> Info: TIMESTEP 1
> Info: NUMBER OF STEPS 500
> Info: STEPS PER CYCLE 20
> Info: PERIODIC CELL BASIS 1 108.861 0 0
> Info: PERIODIC CELL BASIS 2 0 108.861 0
> Info: PERIODIC CELL BASIS 3 0 0 77.758
> Info: PERIODIC CELL CENTER 0 0 0
> Info: LOAD BALANCER Centralized
> Info: LOAD BALANCING STRATEGY New Load Balancers -- DEFAULT
> Info: LDB PERIOD 4000 steps
> Info: FIRST LDB TIMESTEP 100
> Info: LAST LDB TIMESTEP -1
> Info: LDB BACKGROUND SCALING 1
> Info: HOM BACKGROUND SCALING 1
> Info: PME BACKGROUND SCALING 1
> Info: REMOVING LOAD FROM NODE 0
> Info: REMOVING PATCHES FROM PROCESSOR 0
>
> On 09/07/2012 02:10 AM, Norman Geist wrote:
> > Hi Michael,
> >
> > First of all you should make sure that it is not a compilation issue, so
> > you could try a precompiled namd binary and see if the problem persists;
> > if the ib version doesn't work, try the udp version (it can also use the
> > ib when you have ipoib configured). A segfault usually is a programming
> > error, but it will also occur if you mix old and new libraries. As you say
> > the segfault occurs randomly, it can depend on the order of nodes. Check
> > if it is always the same node that fails:
> >
> > mpirun noticed that process rank 44 with PID 18856 on node node011 <---
> >
> > Also check if all your nodes use the same shared libraries. You can do
> > that for example with "ldd namd2". Especially varying GLIBC versions can
> > cause a segfault.
> >
> > Let us know.
> >
> > Norman Geist.
> >
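
For the "ldd namd2" / GLIBC check quoted above, a per-node comparison could be
scripted roughly as follows; this is only a sketch, assuming passwordless ssh
and that the machinefile "nodes" holds plain hostnames in its first field:

# Sketch: fingerprint the resolved shared libraries and the glibc banner
# on every node; a host whose output differs is the one to look at.
NAMD2=/shared/namd-2.9-bart/Linux-x86_64-g++/namd2
for h in $(awk '{print $1}' nodes | sort -u); do
  echo "== $h =="
  ssh "$h" "ldd $NAMD2 | md5sum; /lib64/libc.so.6 | head -n1"
done

Identical md5 sums and libc banners on all hosts rule out a library mismatch;
a host that differs would explain segfaults that only show up when that node
is part of the job.
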
> >> -----Original Message-----
> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> >> On Behalf Of Michael Galloway
> >> Sent: Thursday, September 6, 2012 16:19
> >> To: namd-l
> >> Subject: namd-l: namd 2.9 run instability (segfaults)
> >>
> >> good day all,
> >>
> >> we are having some issues running namd 2.9 on our new cluster. we are
> >> using QLogic IB / OpenMPI 1.6.1 / GCC 4.4.6 (CentOS 6.2).
> >>
> >> during testing we are experiencing some random segfaults such as:
> >>
> >> [mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 96
> >> /shared/namd-2.9/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
> >> Charm++> Running on MPI version: 2.1
> >> Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
> >> MPI_THREAD_SINGLE)
> >> Charm++> Running on non-SMP mode
> >> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
> >>
> >> namd2:18856 terminated with signal 11 at PC=3104a0947f SP=7fff480eb420.
> >> Backtrace:
> >> CharmLB> Load balancer assumes all CPUs are same.
> >> --------------------------------------------------------------------------
> >> mpirun noticed that process rank 44 with PID 18856 on node node011
> >> exited on signal 11 (Segmentation fault).
> >> --------------------------------------------------------------------------
> >>
> >> the run may run fine sometimes and sometimes segfaults; i've run some
> >> scaling tests from 12 to 384 processors and scaling is good, but all
> >> runs exhibit occasional segfaults.
> >>
> >> charm++ and namd were built as the docs indicate:
> >>
> >> Build and test the Charm++/Converse library (MPI version):
> >> env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production
> >>
> >> Download and install TCL and FFTW libraries:
> >> (cd to NAMD_2.9_Source if you're not already there)
> >> wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
> >> tar xzf fftw-linux-x86_64.tar.gz
> >> mv linux-x86_64 fftw
> >> wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64.tar.gz
> >> wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
> >> tar xzf tcl8.5.9-linux-x86_64.tar.gz
> >> tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
> >> mv tcl8.5.9-linux-x86_64 tcl
> >> mv tcl8.5.9-linux-x86_64-threaded tcl-threaded
> >>
> >> Set up build directory and compile:
> >> MPI version: ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
> >> cd Linux-x86_64-g++
> >> make (or gmake -j4, which should run faster)
> >>
> >> openmpi was built with:
> >>
> >> ./configure --prefix=/shared/openmpi-1.6.1/gcc --enable-static
> >> --without-tm -with-openib=/usr --with-psm=/usr CC=gcc CXX=g++
> >> F77=gfortran FC=gfortran --enable-mpi-thread-multiple
> >>
> >> when it completes scaling looks like
> >> Info: Benchmark time: 12 CPUs 0.116051 s/step 1.34318 days/ns 257.895 MB memory
> >> Info: Benchmark time: 24 CPUs 0.0596373 s/step 0.690247 days/ns 247.027 MB memory
> >> Info: Benchmark time: 48 CPUs 0.0303531 s/step 0.351309 days/ns 249.84 MB memory
> >> Info: Benchmark time: 96 CPUs 0.0161126 s/step 0.186489 days/ns 249.98 MB memory
> >> Info: Benchmark time: 192 CPUs 0.00904823 s/step 0.104725 days/ns 267.719 MB memory
> >> Info: Benchmark time: 384 CPUs 0.00490486 s/step 0.0567692 days/ns 313.637 MB memory
> >>
> >> my testing with other applications built with the same system (a simple
> >> mpihello and NWChem 6.1.1) seems to run well.
> >>
> >> any suggestions where the issue might be? thanks!
> >>
> >> --- michael
> >>
> >>
> >
