Re: AW: namd 2.9 run instability (segfaults)

From: Michael Galloway (gallowaymd_at_ornl.gov)
Date: Fri Sep 07 2012 - 09:28:36 CDT

hmmm ...

I don't seem to understand how to run the binaries:

/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/charmrun +p24
++nodelist nodes ++mpiexec ++remote-shell mpiexec
/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2
/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/apoa1/apoa1.namd

Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 0.142 seconds.
Charmrun: error on request socket--
Socket closed before recv.

and it leaves behind this charmrun.5084 file (dumped below, after checking the binary and my PATH):

[mgx_at_cmbcluster NAMD_2.9_Linux-x86_64-ibverbs-smp]$ ls -l
/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2
-rwxr-xr-x 1 mgx mgx 17025259 Apr 30 15:04
/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2
[mgx_at_cmbcluster NAMD_2.9_Linux-x86_64-ibverbs-smp]$ vim ~/.bashrc
[mgx_at_cmbcluster NAMD_2.9_Linux-x86_64-ibverbs-smp]$ which namd2
/shared/namd-bin/NAMD_2.9_Linux-x86_64-ibverbs/namd2
[mgx_at_cmbcluster NAMD_2.9_Linux-x86_64-ibverbs-smp]$ more charmrun.5084
#!/bin/sh
Echo() {
   echo 'Charmrun remote shell(127.0.0.1.0)>' $*
}
Exit() {
   if [ $1 -ne 0 ]
   then
     Echo Exiting with error code $1
   fi
   exit $1
}
Find() {
   loc=''
   for dir in `echo $PATH | sed -e 's/:/ /g'`
   do
     test -f "$dir/$1" && loc="$dir/$1"
   done
   if [ "x$loc" = x ]
   then
     Echo $1 not found in your PATH "($PATH)"--
     Echo set your path in your ~/.charmrunrc
     Exit 1
   fi
}
test -f "$HOME/.charmrunrc" && . "$HOME/.charmrunrc"
DISPLAY='localhost:10.0';export DISPLAY
NETMAGIC="5084";export NETMAGIC
CmiMyNode=$OMPI_COMM_WORLD_RANK
test -z "$CmiMyNode" && CmiMyNode=$MPIRUN_RANK
test -z "$CmiMyNode" && CmiMyNode=$PMI_RANK
test -z "$CmiMyNode" && CmiMyNode=$PMI_ID
test -z "$CmiMyNode" && (Echo Could not detect rank from environment ;
Exit 1)
export CmiMyNode
NETSTART="$CmiMyNode 172.16.100.1 34705 5084 0";export NETSTART
CmiMyNodeSize='1'; export CmiMyNodeSize
CmiMyForks='0'; export CmiMyForks
CmiNumNodes=$OMPI_COMM_WORLD_SIZE
test -z "$CmiNumNodes" && CmiNumNodes=$MPIRUN_NPROCS
test -z "$CmiNumNodes" && CmiNumNodes=$PMI_SIZE
test -z "$CmiNumNodes" && (Echo Could not detect node count from
environment ; Exit 1)
export CmiNumNodes
PATH="$PATH:/bin:/usr/bin:/usr/X/bin:/usr/X11/bin:/usr/local/bin:/usr/X11R6/bin:/usr/openwin/bin"
if test ! -x "/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2"
then
   Echo 'Cannot locate this node-program: /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2'
   Exit 1
fi
cd "/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp"
if test $? = 1
then
   Echo 'Cannot propagate this current directory:'
   Echo '/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp'
   Exit 1
fi
rm -f /tmp/charmrun_err.$$
("/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2" /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/apoa1/apoa1.namd
res=$?
if [ $res -eq 127 ]
then
   (
"/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2"
     ldd
"/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2"
   ) > /tmp/charmrun_err.$$ 2>&1
fi
) < /dev/null 1> /dev/null 2> /dev/null
sleep 1
if [ -r /tmp/charmrun_err.$$ ]
then
   cat /tmp/charmrun_err.$$
   rm -f /tmp/charmrun_err.$$
   Exit 1
fi
Exit 0
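
In case the launch failure above is just my syntax: as far as I can tell, when charmrun launches over plain ssh (i.e. without ++mpiexec), the ++nodelist file has to be in charm's own format rather than an mpirun machinefile. A minimal sketch of what I'll try next (the host names and the nodelist.charm file name are just placeholders):

group main ++shell ssh
host node001
host node002

and then:

/home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/charmrun +p24 \
   ++nodelist ./nodelist.charm \
   /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/namd2 \
   /home/mgx/testing/namd-ibverbs/NAMD_2.9_Linux-x86_64-ibverbs-smp/apoa1/apoa1.namd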

The segfaults seem to occur on mostly random nodes:

[mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 384
/shared/namd-2.9-bart/Linux-x86_64-g++/namd2 apoa1/apoa1.namd

namd2:24474 terminated with signal 11 at PC=7f99ffcaca0b
SP=7fffbd9e5b50. Backtrace:
Charm++> Running on MPI version: 2.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21

namd2:25005 terminated with signal 11 at PC=7f2f000682a0
SP=7fff2e7fdd78. Backtrace:
--------------------------------------------------------------------------
mpirun noticed that process rank 203 with PID 24474 on node node034
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 384
/shared/namd-2.9-bart/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
Charm++> Running on MPI version: 2.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21

namd2:24874 terminated with signal 11 at PC=7f5c7fdf56b6
SP=7f5c7e617e20. Backtrace:

namd2:24856 terminated with signal 11 at PC=3efd60947f SP=7fffc070b500.
Backtrace:

namd2:24857 terminated with signal 11 at PC=7f6e0006c0c0
SP=7fff36be2c28. Backtrace:
/shared/openmpi-1.6.1/gcc/lib/libmpi.so.1(mca_pml_cm_irecv+0x0)[0x7f6e0006c0c0]
/shared/openmpi-1.6.1/gcc/lib/libmpi.so.1(ompi_coll_tuned_sendrecv_actual+0x7f)[0x7f6dfffd2bff]
/shared/openmpi-1.6.1/gcc/lib/libmpi.so.1(ompi_coll_tuned_barrier_intra_bruck+0x9a)[0x7f6dfffdaffa]
/shared/openmpi-1.6.1/gcc/lib/libmpi.so.1(MPI_Barrier+0x8e)[0x7f6dfff732fe]
/shared/namd-2.9-bart/Linux-x86_64-g++/namd2(CmiBarrier+0x13)[0xa9a636]

namd2:24933 terminated with signal 11 at PC=3a7600944b SP=7fff817c06b0.
Backtrace:
--------------------------------------------------------------------------
mpirun noticed that process rank 288 with PID 24874 on node node017
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)
[mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 384
/shared/namd-2.9-bart/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
Charm++> Running on MPI version: 2.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 34 unique compute nodes (12-way SMP).
Charm++> cpu topology info is gathered in 0.005 seconds.
Info: NAMD 2.9 for Linux-x86_64-MPI
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60400 for mpi-linux-x86_64-gfortran-mpicxx
Info: Built Thu Sep 6 18:24:45 EDT 2012 by root on cmbcluster
Info: 1 NAMD 2.9 Linux-x86_64-MPI 384 node001 mgx
Info: Running on 384 processors, 384 nodes, 34 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.0180719 s
Info: 159.988 MB of memory in use based on /proc/self/stat
Info: Configuration file is apoa1/apoa1.namd
Info: Changed directory to apoa1
TCL: Suspending until startup complete.
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 1
Info: NUMBER OF STEPS 500
Info: STEPS PER CYCLE 20
Info: PERIODIC CELL BASIS 1 108.861 0 0
Info: PERIODIC CELL BASIS 2 0 108.861 0
Info: PERIODIC CELL BASIS 3 0 0 77.758
Info: PERIODIC CELL CENTER 0 0 0
Info: LOAD BALANCER Centralized
Info: LOAD BALANCING STRATEGY New Load Balancers -- DEFAULT
Info: LDB PERIOD 4000 steps
Info: FIRST LDB TIMESTEP 100
Info: LAST LDB TIMESTEP -1
Info: LDB BACKGROUND SCALING 1
Info: HOM BACKGROUND SCALING 1
Info: PME BACKGROUND SCALING 1
Info: REMOVING LOAD FROM NODE 0
Info: REMOVING PATCHES FROM PROCESSOR 0
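
To follow up on the ldd/GLIBC suggestion below, I'm going to compare libraries across the nodes with something along these lines (only a sketch: it assumes passwordless ssh and that the machinefile "nodes" has one hostname in the first column of each line):

for h in $(awk '{print $1}' nodes | sort -u); do
   echo "== $h =="
   # glibc version banner (libc.so.6 prints it when executed directly)
   ssh "$h" '/lib64/libc.so.6 | head -1'
   # flag any shared libraries that fail to resolve for the MPI namd2 build
   ssh "$h" 'ldd /shared/namd-2.9-bart/Linux-x86_64-g++/namd2 | grep "not found" || echo "all libraries resolved"'
done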

On 09/07/2012 02:10 AM, Norman Geist wrote:
> Hi Michael,
>
> First of all, you should make sure that it is not a compilation issue, so you
> could try a precompiled NAMD binary and see if the problem persists; if the
> ibverbs version doesn't work, try the UDP version (you could also use the IB
> network if you have IPoIB configured). A segfault is usually a programming
> error, but it will also occur if you mix old and new libraries. As you say the
> segfault occurs randomly, it can depend on the order of nodes. Check whether
> it is always the same node that fails:
>
> mpirun noticed that process rank 44 with PID 18856 on node node011 <---
>
> Also check whether all your nodes use the same shared libraries. You can do that
> for example with "ldd namd2". Especially varying GLIBC versions can cause a
> segfault.
>
> Let us know.
>
> Norman Geist.
>
>> -----Original Message-----
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> behalf of Michael Galloway
>> Sent: Thursday, September 6, 2012 16:19
>> To: namd-l
>> Subject: namd-l: namd 2.9 run instability (segfaults)
>>
>> good day all,
>>
>> We are having some issues running NAMD 2.9 on our new cluster. We are
>> using QLogic IB / OpenMPI 1.6.1 / gcc 4.4.6 (CentOS 6.2).
>>
>> During testing we are experiencing some random segfaults, such as:
>>
>> [mgx_at_cmbcluster namd]$ mpirun -machinefile nodes -np 96
>> /shared/namd-2.9/Linux-x86_64-g++/namd2 apoa1/apoa1.namd
>> Charm++> Running on MPI version: 2.1
>> Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
>> MPI_THREAD_SINGLE)
>> Charm++> Running on non-SMP mode
>> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
>>
>> namd2:18856 terminated with signal 11 at PC=3104a0947f SP=7fff480eb420.
>> Backtrace:
>> CharmLB> Load balancer assumes all CPUs are same.
>> -----------------------------------------------------------------------
>> ---
>> mpirun noticed that process rank 44 with PID 18856 on node node011
>> exited on signal 11 (Segmentation fault).
>> -----------------------------------------------------------------------
>> ---
>>
>> A run may finish fine one time and segfault the next. I've run some
>> scaling tests from 12 to 384 processors and scaling is good, but all runs
>> exhibit occasional segfaults.
>>
>> charm++ and namd were built as the docs indicate:
>>
>> Build and test the Charm++/Converse library (MPI version):
>> env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production
>>
>> Download and install TCL and FFTW libraries:
>> (cd to NAMD_2.9_Source if you're not already there)
>> wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
>> tar xzf fftw-linux-x86_64.tar.gz
>> mv linux-x86_64 fftw
>> wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64.tar.gz
>> wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
>> tar xzf tcl8.5.9-linux-x86_64.tar.gz
>> tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
>> mv tcl8.5.9-linux-x86_64 tcl
>> mv tcl8.5.9-linux-x86_64-threaded tcl-threaded
>>
>> Set up build directory and compile:
>> MPI version: ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
>> cd Linux-x86_64-g++
>> make (or gmake -j4, which should run faster)
>>
>> openmpi was built with:
>>
>> ./configure --prefix=/shared/openmpi-1.6.1/gcc --enable-static
>> --without-tm -with-openib=/usr --with-psm=/usr CC=gcc CXX=g++
>> F77=gfortran FC=gfortran --enable-mpi-thread-multiple
>>
>> When it completes, scaling looks like:
>> Info: Benchmark time: 12 CPUs 0.116051 s/step 1.34318 days/ns 257.895 MB memory
>> Info: Benchmark time: 24 CPUs 0.0596373 s/step 0.690247 days/ns 247.027 MB memory
>> Info: Benchmark time: 48 CPUs 0.0303531 s/step 0.351309 days/ns 249.84 MB memory
>> Info: Benchmark time: 96 CPUs 0.0161126 s/step 0.186489 days/ns 249.98 MB memory
>> Info: Benchmark time: 192 CPUs 0.00904823 s/step 0.104725 days/ns 267.719 MB memory
>> Info: Benchmark time: 384 CPUs 0.00490486 s/step 0.0567692 days/ns 313.637 MB memory
>>
>> My tests with other applications built with the same toolchain (a simple
>> MPI hello world and NWChem 6.1.1) seem to run fine.
>>
>> Any suggestions on where the issue might be? Thanks!
>>
>> --- michael
>>
>>
>
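
One more thing I plan to try, to separate the Charm++/MPI layer from NAMD itself: the megatest programs in the charm build tree. Roughly like this, using the same machinefile as above (the paths are from memory of the charm-6.4.0 source layout that ships with NAMD 2.9, so adjust as needed):

cd NAMD_2.9_Source/charm-6.4.0/mpi-linux-x86_64/tests/charm++/megatest
make pgm
mpirun -machinefile nodes -np 96 ./pgm

If megatest runs cleanly in repeated runs, that would point at NAMD; if it also segfaults, the problem is more likely in the charm++/OpenMPI/PSM stack.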
