Re: Megatest fails on Linux cluster

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Wed May 25 2005 - 22:54:23 CDT

Hi,

Please check out the NAMD wiki webpage I wrote earlier:

http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnMyrinet

Gengbin

Sterling Paramore wrote:

> Hi, I'm trying to compile NAMD on the new ARL Linux cluster, JVN
> (http://www.arl.hpc.mil/userservices/lnxi_jvn.html). I was able to
> get NAMD compiled with MPI and GM, but when I run it, I get the
> following error (only the last portion of the output file is shown):
>
>
> Info: Entering startup phase 8 with 37340 kB of memory in use.
> Info: Finished startup with 47845 kB of memory in use.
>
> TID  HOST_NAME  COMMAND_LINE     STATUS                  TERMINATION_TIME
> ==== ========== ================ ======================= ===================
> 0001 jvn-n0505  gmmpirun_wrapper Signaled (SIGSEGV)      05/11/2005 16:40:37
> 0002 jvn-n0505  gmmpirun_wrapper Killed by PAM (SIGTERM) 05/11/2005 16:40:37
>
>
> I believe that this is a problem with linking to the MPI libraries. I
> compiled charm++ with the following command:
>
> ./build charm++ mpi-linux icc gm --no-shared -DCMK_OPTIMIZE=1 \
>     -I/usr/i386-intel-7.1/mpich-gm-1.2.6..14/include \
>     -L/usr/i386-intel-7.1/mpich-gm-1.2.6..14/lib \
>     -lgm -lmpich -lpmpich -lpthread
>
> I also tried modifying src/arch/mpi-linux/conv_mach.sh to use the
> mpicc and mpiCC compilers.
>
> I was unable to run megatest as described in the release notes:
>
> [- megatest -] ./charmrun ++local +p2 ./pgm
> Running on 2 processors: ++local ./pgm
> Unrecognized argument ++local ignored.
> Use of uninitialized value in subroutine entry at /usr/i386-intel-7.1/mpich-gm-1.2.6..14/bin/mpirun.ch_gm.pl line 867.
> Bad arg length for Socket::inet_ntoa, length is 0, should be 4 at /usr/i386-intel-7.1/mpich-gm-1.2.6..14/bin/mpirun.ch_gm.pl line 867.
>
> When I try running the above command from an interactive job (bsub -m
> jvn -a "mpich_gm" -q debug -Ip -n 2 -W 0:30 bash), the program hangs.
> When I run pgm using mpirun.lsf instead of charmrun in an interactive
> job, I get the following errors:
>
> [- megatest -] mpirun.lsf pgm
> Warning: Permanently added 'n0269,10.0.205.213' (RSA) to the list of known hosts.
> Warning: Permanently added 'n0269,10.0.205.213' (RSA) to the list of known hosts.
> test 0: initiated [groupring (milind)]
> test 0: completed (0.00 sec)
> test 1: initiated [nodering (milind)]
> test 1: completed (0.00 sec)
> test 2: initiated [varsize (mjlang)]
> test 2: completed (0.00 sec)
> test 3: initiated [varraystest (milind)]
> test 3: completed (0.00 sec)
> test 4: initiated [groupcast (mjlang)]
> test 4: completed (0.00 sec)
> test 5: initiated [nodecast (milind)]
> test 5: completed (0.00 sec)
> test 6: initiated [synctest (mjlang)]
> Killed by signal 15.
> Killed by signal 15.
> May 13 13:26:09 2005 15974 3 6.0 Rtasks fail:
> Rtask(s) 1 : exited with signal <11>
> Rtask(s) 2 : exited with signal <15>
>
> TID  HOST_NAME  COMMAND_LINE     STATUS                  TERMINATION_TIME
> ==== ========== ================ ======================= ===================
> 0001 jvn-n0269  gmmpirun_wrapper Signaled (SIGSEGV)      05/13/2005 13:26:09
> 0002 jvn-n0269  gmmpirun_wrapper Killed by PAM (SIGTERM) 05/13/2005 13:26:09
>
> Terminated
>
>
> Any ideas?
>
> Thanks in advance for any help,
> Sterling Paramore
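
A note on reading the quoted LSF output: the numeric signals in the "Rtask(s)" lines match the names in the task table. Signal 11 is SIGSEGV and signal 15 is SIGTERM, which suggests that task 1 crashed and task 2 was then terminated by PAM. The following is a minimal Python sketch of that mapping, assuming the "Rtask(s)" lines have exactly the format shown above. The helper name decode_rtask_signals is hypothetical and is not part of LSF, Charm++, or NAMD.

```python
import re
import signal

# Sample lines copied from the LSF output quoted in the message above.
LOG = """\
Rtask(s) 1 : exited with signal <11>
Rtask(s) 2 : exited with signal <15>
"""


def decode_rtask_signals(text):
    """Map each task id to a signal name (hypothetical helper).

    Assumes lines of the form 'Rtask(s) N : exited with signal <S>'.
    """
    pattern = re.compile(r"Rtask\(s\)\s+(\d+)\s*:\s*exited with signal <(\d+)>")
    result = {}
    for task, signum in pattern.findall(text):
        # signal.Signals converts a signal number to its symbolic name.
        result[int(task)] = signal.Signals(int(signum)).name
    return result


if __name__ == "__main__":
    for task, name in sorted(decode_rtask_signals(LOG).items()):
        print(f"task {task}: {name}")
```

If this reading is right, the task that received SIGSEGV is the one to investigate first, since the SIGTERM on the other task appears to be a consequence rather than a cause.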
