Re: charm++ over MPI

From: Bogdan Costescu (Bogdan.Costescu_at_iwr.uni-heidelberg.de)
Date: Thu Feb 16 2006 - 08:48:43 CST

On Wed, 15 Feb 2006, Gengbin Zheng wrote:

> I just tested LAM 7.1.1-3 on Fedora 4, and it worked for me.

Thanks for taking the time!

When you say that it worked, do you mean several of the tests? I have
made some progress following your advice to not use the "gcc" option
and to have everything in the PATH; I also no longer used
"-DCMK_OPTIMIZE" when building. The "hello" test now finishes
successfully every time (about 10 test runs), but "pingpong" and
"megatest" fail repeatably at the same place where they were failing
before.
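
For reference, the sequence I am now using looks roughly like this
(the "mpi-linux" build name and the test paths are from my tree, and
LAM needs its daemons started with lamboot first; adjust as needed):

   ./build charm++ mpi-linux
   cd mpi-linux/tests/charm++/pingpong
   make
   lamboot
   mpirun -np 2 ./pgm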

"pingpong":
Roundtrip time for 1D Arrays is 95.071100 us
Roundtrip time for 2D Arrays is 92.137200 us
Roundtrip time for 3D Arrays is 93.644200 us
Roundtrip time for Fancy Arrays is 92.644300 us
Roundtrip time for Chares (reuse msgs) is 85.452800 us
Roundtrip time for Chares (new/del msgs) is 91.239800 us
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 9152 failed on node n1 (192.168.107.102) due to signal 11.
-----------------------------------------------------------------------------

(node "n1" is the second MPI rank). The backtrace shows no LAM library
being involved:

#0 0x42ea8c86 in __pthread_cleanup_upto () from /lib/libpthread.so.0
#1 0x42d8dad1 in _longjmp_unwind () from /lib/libc.so.6
#2 0x42d8da3c in siglongjmp () from /lib/libc.so.6
#3 0x42ea8d76 in longjmp () from /lib/libpthread.so.0
#4 0x080d3beb in qt_block ()
#5 0x0808f2ea in CthResume ()
#6 0x080c4de5 in CthResumeNormalThread ()
#7 0x080c497d in CmiHandleMessage ()
#8 0x080c4b06 in CsdScheduleForever ()
#9 0x080c4a8a in CsdScheduler ()
#10 0x080c3ae4 in ConverseRunPE ()
#11 0x080c3d89 in ConverseInit ()
#12 0x08097974 in main ()
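
(If it helps to reproduce: a gdb backtrace of the core file, taken on
the node where the crash happened, gives this kind of output; roughly
like below, with the core file name depending on the local setup.)

   ulimit -c unlimited
   mpirun -np 2 ./pgm
   gdb ./pgm core
   (gdb) bt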

So the previous backtrace involving LAM library calls was probably
just an artifact of memory corruption or a side effect of
CMK_OPTIMIZE. An identical backtrace is produced by "megatest" while
outputting:

Megatest is running on 2 processors.
test 0: initiated [bitvector (jbooth)]
test 0: completed (0.00 sec)
test 1: initiated [immediatering (gengbin)]
test 1: completed (0.08 sec)
test 2: initiated [callback (olawlor)]
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 457 failed on node n0 (192.168.107.101) due to signal 11.
-----------------------------------------------------------------------------

So I still can't say what is causing this: the OS (kernel + glibc) via
the pthread lib, the specific version of LAM (which also uses the
pthread lib), or something else? The crash does happen inside
pthread's longjmp cleanup, invoked from Charm++'s user-level thread
switch (qt_block/CthResume), so the pthread lib at least looks
involved.

> (you still need to change -lmpich to -lmpi though)

Actually, deleting it and not specifying -lmpi at all works just as
well, since -lmpi is automatically added by the LAM mpiCC wrapper.
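
(If in doubt, the wrapper can show what it would run; on my LAM
install something like

   mpiCC -showme

prints the underlying compiler command line, including -lmpi.)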

> For the MPICH, it again looks like you have a wrong mpirun in your
> path. It seemed to start two SEPARATE jobs of hello program.

This was indeed the case. I had wrongly assumed that the mpirun coming
from MPICH-GM would be able to start an MPICH job; at least with the
way the various MPI environments are set up here, this doesn't seem to
be the case... Sorry for the false alarm!
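
A quick check that would have caught this from the start (the path is
just an example from a setup like mine):

   which mpirun
   /opt/mpich-gm/bin/mpirun   <- the MPICH-GM one, wrong for a plain MPICH binary

Starting the job with the full path to the matching mpirun avoids the
mixup.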

-- 
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_IWR.Uni-Heidelberg.De
