charm++ over MPI

From: Bogdan Costescu (Bogdan.Costescu_at_iwr.uni-heidelberg.de)
Date: Tue Feb 14 2006 - 11:07:19 CST

Dear CHARM++ and NAMD developers,

While trying to install NAMD 2.6b1 (with CHARM++ 5.9) on a Linux cluster
using MPI as the underlying communication layer, I have run into some
problems.

1. When trying to use LAM/MPI 7.0.3

Compilation of CHARM++ finishes fine, but I had to remove a reference
to -lmpich from src/arch/mpi-linux/conv-mach.sh, which suggests to me
that building with LAM/MPI has not been tested recently - if it has,
a mention in the docs of this MPICH-specific setting would have been
helpful...
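
For reference, the edit was essentially the following (a sketch only -
the exact line in conv-mach.sh may look different in your tree, and
whether LAM's own libraries then need to be listed explicitly depends
on whether LAM's compiler wrappers are used for linking):

# src/arch/mpi-linux/conv-mach.sh, before:
CMK_LIBS='-lckqt -lmpich'
# after - let LAM's mpicc/mpiCC supply the MPI libraries at link time
# instead of hard-wiring MPICH's:
CMK_LIBS='-lckqt'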

The test programs fail, usually with a segmentation fault (SIGSEGV).
Many other MPI programs (including CHARMM and Gromacs) run fine with
the same LAM/MPI installation on this cluster, so it is unlikely that
LAM/MPI itself is at fault. One possibility (related to the paragraph
above) is that some MPICH-specific behaviour is assumed somewhere and
that assumption breaks when MPICH is not used.

The CHARM++ compilation was performed with gcc 3.2.3 (from RHEL3):

./build charm++ mpi-linux gcc --basedir /usr/local/lam-7.0.3-g77 --no-shared -O -DCMK_OPTIMIZE=1

and the backtrace of one of the core files from the hello example is:

#0 0x00499376 in ?? ()
#1 0x09539938 in ?? ()
#2 0xbfffcd08 in ?? ()
#3 0x08086937 in mm_free ()
#4 0x00499330 in ?? ()
#5 0x0000000b in ?? ()
#6 0x094c3408 in ?? ()
#7 0xfffffff5 in ?? ()
#8 0x40000000 in ?? ()
#9 0x00000258 in ?? ()
#10 0x00c86643 in ?? ()
#11 0x0000000b in ?? ()
#12 0x00000018 in ?? ()
#13 0x080f47ed in sread ()
#14 0x080f5896 in lam_ssi_rpi_tcp_proc_read_env ()
#15 0x080f5877 in lam_ssi_rpi_tcp_adv1 ()
#16 0x080f2f55 in lam_ssi_rpi_tcp_advance ()
#17 0x080e38a1 in _mpi_req_advance ()
#18 0x080f39b2 in lam_ssi_rpi_tcp_iprobe ()
#19 0x080d9f39 in MPI_Iprobe ()
#20 0x080c5d77 in PumpMsgs ()
#21 0x080c5fec in CmiGetNonLocal ()
#22 0x080c7a65 in CsdNextMessage ()
#23 0x080c7b57 in CsdScheduleForever ()
#24 0x080c7ae7 in CsdScheduler ()
#25 0x080c6804 in ConverseRunPE ()
#26 0x080c6ae7 in ConverseInit ()
#27 0x08093036 in main ()

(This was started directly with LAM/MPI's mpirun, not with charmrun,
but as far as I understand this should not make any difference.)

Is there any known problem with this combination (CHARM++ 5.9,
LAM/MPI 7.0.3, gcc 3.2.3, RHEL3)?

2. When trying to use MPICH 1.2.5.2

Compilation works fine. The examples run, except that they don't seem
to use more than one CPU at a time; the output looks like N independent
jobs instead of one job running on N CPUs.

The hello output when run on 2 CPUs, started with 'mpirun -machinefile
hosts -np 2 ./hello', is:

Running Hello on 1 processors for 5 elements
Hello 0 created
Hello 1 created
Hello 2 created
Hello 3 created
Hello 4 created
Hi[17] from element 0
Hi[18] from element 1
Hi[19] from element 2
Hi[20] from element 3
Hi[21] from element 4
All done
End of program
Running Hello on 1 processors for 5 elements
Hello 0 created
Hello 1 created
Hello 2 created
Hello 3 created
Hello 4 created
Hi[17] from element 0
Hi[18] from element 1
Hi[19] from element 2
Hi[20] from element 3
Hi[21] from element 4
All done
End of program
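
To separate a launcher problem from a CHARM++ problem, a minimal MPI
program (standard MPI calls only, nothing CHARM++-specific) would show
whether mpirun really starts one job spanning both CPUs:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* One job on 2 CPUs prints "rank 0 of 2" and "rank 1 of 2";
       two independent jobs each print "rank 0 of 1", matching
       the duplicated hello output above. */
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Compiled with MPICH's mpicc and started with the same 'mpirun
-machinefile hosts -np 2' line, this would tell whether mpirun or the
CHARM++ build is at fault.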

NAMD was also compiled successfully, and when run it also reports a
similar situation (well, at least to my untrained eye):

...
Info: Sending usage information to NAMD developers via UDP. Sent data is:
Info: 1 NAMD 2.6b1 Linux-i686-MPI 1 node101.biocomp bogdan
Info: Running on 1 processors.
...
Info: Sending usage information to NAMD developers via UDP. Sent data is:
Info: 1 NAMD 2.6b1 Linux-i686-MPI 1 node102.biocomp bogdan
Info: Running on 1 processors.
...

Note the 2 different host names (node101 and node102), yet each
instance reports "Running on 1 processors".

BTW, the docs say that the usage info is sent only when using the
"net-" builds, which seems to be wrong based on the lines above.

So it seems that I'm doing something wrong, but I don't quite know
what... Any ideas?

Thanks in advance.

-- 
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_IWR.Uni-Heidelberg.De
