Re: charm++ over MPI

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Thu Feb 16 2006 - 15:40:18 CST

It is hitting the Charm++ user-level threads (note the qt_block and
CthResume frames in your backtrace, which are QuickThreads' context
switch). I vaguely remember there being some weird issue between
Charm++ QuickThreads and LAM MPI. However, I ran LAM 7.1.1 on my FC4
laptop and these tests pass there as a parallel job. So try switching
to the "pthread" version of the Charm++ user-level threads by adding
"-thread pthreads" to the charmc command line at link time.
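
For example, the link step would then look something like this (a
sketch only; "pingpong.o" is just a placeholder for whatever objects
your Makefile already passes to charmc):

   charmc -language charm++ -thread pthreads -o pingpong pingpong.o

The test Makefiles link through charmc anyway, so appending "-thread
pthreads" to the charmc link line there should be enough.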

Gengbin

Bogdan Costescu wrote:

>On Wed, 15 Feb 2006, Gengbin Zheng wrote:
>
>>I just tested LAM 7.1.1-3 on Fedora 4, and it worked for me.
>
>Thanks for taking the time!
>
>When you say that it worked, do you mean several of the tests? I have
>made some progress following your advice to not use the "gcc" option
>and to have everything in the PATH; I also stopped using
>"-DCMK_OPTIMIZE" when building. The "hello" test finishes successfully
>every time (in about 10 test runs), but "pingpong" and
>"megatest" fail reproducibly at the same place where they were failing
>before.
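>
>(In build-command terms: something like "./build charm++ mpi-linux"
>instead of the earlier "./build charm++ mpi-linux gcc -DCMK_OPTIMIZE";
>I am citing the generic mpi-linux target only as an example.)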
>
>"pingpong":
>Roundtrip time for 1D Arrays is 95.071100 us
>Roundtrip time for 2D Arrays is 92.137200 us
>Roundtrip time for 3D Arrays is 93.644200 us
>Roundtrip time for Fancy Arrays is 92.644300 us
>Roundtrip time for Chares (reuse msgs) is 85.452800 us
>Roundtrip time for Chares (new/del msgs) is 91.239800 us
>-----------------------------------------------------------------------------
>One of the processes started by mpirun has exited with a nonzero exit
>code. This typically indicates that the process finished in error.
>If your process did not finish in error, be sure to include a "return
>0" or "exit(0)" in your C code before exiting the application.
>
>PID 9152 failed on node n1 (192.168.107.102) due to signal 11.
>-----------------------------------------------------------------------------
>
>(Node "n1" is the second MPI rank.) The backtrace shows no LAM library
>calls involved:
>
>#0 0x42ea8c86 in __pthread_cleanup_upto () from /lib/libpthread.so.0
>#1 0x42d8dad1 in _longjmp_unwind () from /lib/libc.so.6
>#2 0x42d8da3c in siglongjmp () from /lib/libc.so.6
>#3 0x42ea8d76 in longjmp () from /lib/libpthread.so.0
>#4 0x080d3beb in qt_block ()
>#5 0x0808f2ea in CthResume ()
>#6 0x080c4de5 in CthResumeNormalThread ()
>#7 0x080c497d in CmiHandleMessage ()
>#8 0x080c4b06 in CsdScheduleForever ()
>#9 0x080c4a8a in CsdScheduler ()
>#10 0x080c3ae4 in ConverseRunPE ()
>#11 0x080c3d89 in ConverseInit ()
>#12 0x08097974 in main ()
>
>so the previous backtrace involving LAM library calls was probably
>just an artifact of memory corruption or a side effect of CMK_OPTIMIZE.
>An identical backtrace is produced by "megatest" after outputting:
>
>Megatest is running on 2 processors.
>test 0: initiated [bitvector (jbooth)]
>test 0: completed (0.00 sec)
>test 1: initiated [immediatering (gengbin)]
>test 1: completed (0.08 sec)
>test 2: initiated [callback (olawlor)]
>-----------------------------------------------------------------------------
>One of the processes started by mpirun has exited with a nonzero exit
>code. This typically indicates that the process finished in error.
>If your process did not finish in error, be sure to include a "return
>0" or "exit(0)" in your C code before exiting the application.
>
>PID 457 failed on node n0 (192.168.107.101) due to signal 11.
>-----------------------------------------------------------------------------
>
>So I still can't say what is causing this: the OS (kernel + glibc) via
>the pthread library, the specific version of LAM (which also uses the
>pthread library), or something else?
>
>>(you still need to change -lmpich to -lmpi though)
>
>Actually, deleting it and not specifying -lmpi at all works too, since
>-lmpi is added automatically by the LAM mpiCC wrapper.
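>
>(One way to see this: the LAM wrapper compilers take a "-showme"
>option, e.g. "mpiCC -showme foo.C" with any source file, which prints
>the underlying compiler command line, -lmpi included.)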
>
>>For MPICH, it again looks like you have the wrong mpirun in your
>>path. It seems to have started two SEPARATE jobs of the hello program.
>
>This was indeed the case. I had wrongly assumed that the mpirun coming
>from MPICH-GM would be able to start an MPICH job; at least with the
>way the various MPI environments are set up here, this does not seem
>to be the case... Sorry for the false alarm!
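>
>(In other words, one has to start the job with the mpirun that matches
>the MPI library the binary was linked against, e.g. something like
>"/path/to/mpich/bin/mpirun -np 2 ./hello" for plain MPICH; the path is
>only illustrative.)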
>
