Megatest fails on Linux cluster

From: Sterling Paramore (paramore_at_hec.utah.edu)
Date: Fri May 13 2005 - 14:40:11 CDT

Hi, I'm trying to compile NAMD on the new ARL Linux cluster, JVN
(http://www.arl.hpc.mil/userservices/lnxi_jvn.html). I was able to get
NAMD compiled with MPI and GM, but when I run it, I get the following
error (only the last portion of the output file is shown):

Info: Entering startup phase 8 with 37340 kB of memory in use.
Info: Finished startup with 47845 kB of memory in use.

TID  HOST_NAME  COMMAND_LINE     STATUS                  TERMINATION_TIME
==== ========== ================ ======================= ===================
0001 jvn-n0505  gmmpirun_wrapper Signaled (SIGSEGV)      05/11/2005 16:40:37
0002 jvn-n0505  gmmpirun_wrapper Killed by PAM (SIGTERM) 05/11/2005 16:40:37

I believe that this is a problem with linking to the MPI libraries. I
compiled charm++ with the following command:

./build charm++ mpi-linux icc gm --no-shared -DCMK_OPTIMIZE=1 \
    -I/usr/i386-intel-7.1/mpich-gm-1.2.6..14/include \
    -L/usr/i386-intel-7.1/mpich-gm-1.2.6..14/lib \
    -lgm -lmpich -lpmpich -lpthread

I also tried modifying src/arch/mpi-linux/conv_mach.sh to use the
mpicc and mpiCC compilers.
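
One way to test the linking hypothesis is to list the shared libraries
the binary actually resolves at run time. The sketch below is only an
illustration and was not part of the original build: filter_mpi_libs is
a hypothetical helper name, and because ldd reports shared libraries
only, a statically linked libmpich.a will not appear in its output.

```shell
# Hypothetical helper: keep only the MPI/GM entries from ldd-style output.
# Reads lines on stdin and prints those naming libmpich, libpmpich, or libgm.
# The pattern requires ".so" right after the name, so libgmp is not matched.
filter_mpi_libs() {
    grep -E 'lib(p?mpich|gm)\.so' || true   # "|| true": no match is not an error
}

# Usage (assumes the megatest binary ./pgm in the current directory):
#   ldd ./pgm | filter_mpi_libs
```

An unexpected path for libgm or libmpich, or a "not found" entry, would
suggest that the binary is picking up a different installation than the
one named in the build command.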

I was also unable to run megatest the way the release notes describe:

[- megatest -] ./charmrun ++local +p2 ./pgm
Running on 2 processors: ++local ./pgm
Unrecognized argument ++local ignored.
Use of uninitialized value in subroutine entry at /usr/i386-intel-7.1/mpich-gm-1.2.6..14/bin/mpirun.ch_gm.pl line 867.
Bad arg length for Socket::inet_ntoa, length is 0, should be 4 at /usr/i386-intel-7.1/mpich-gm-1.2.6..14/bin/mpirun.ch_gm.pl line 867.

When I try running the above command from an interactive job (bsub -m
jvn -a "mpich_gm" -q debug -Ip -n 2 -W 0:30 bash), the program hangs.
When I run pgm using mpirun.lsf instead of charmrun in an interactive
job, I get the following errors:

[- megatest -] mpirun.lsf pgm
Warning: Permanently added 'n0269,10.0.205.213' (RSA) to the list of known hosts.
Warning: Permanently added 'n0269,10.0.205.213' (RSA) to the list of known hosts.
test 0: initiated [groupring (milind)]
test 0: completed (0.00 sec)
test 1: initiated [nodering (milind)]
test 1: completed (0.00 sec)
test 2: initiated [varsize (mjlang)]
test 2: completed (0.00 sec)
test 3: initiated [varraystest (milind)]
test 3: completed (0.00 sec)
test 4: initiated [groupcast (mjlang)]
test 4: completed (0.00 sec)
test 5: initiated [nodecast (milind)]
test 5: completed (0.00 sec)
test 6: initiated [synctest (mjlang)]
Killed by signal 15.
Killed by signal 15.
May 13 13:26:09 2005 15974 3 6.0 Rtasks fail:
Rtask(s) 1 : exited with signal <11>
Rtask(s) 2 : exited with signal <15>

TID  HOST_NAME  COMMAND_LINE     STATUS                  TERMINATION_TIME
==== ========== ================ ======================= ===================
0001 jvn-n0269  gmmpirun_wrapper Signaled (SIGSEGV)      05/13/2005 13:26:09
0002 jvn-n0269  gmmpirun_wrapper Killed by PAM (SIGTERM) 05/13/2005 13:26:09

Terminated

Any ideas?

Thanks in advance for any help,
Sterling Paramore
