Re: Rif: Re: Rif: Re: Rif: Re: linux cluster trouble

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Thu Nov 18 2004 - 13:20:46 CST

FYI, I have fixed the charmrun bug (it could not handle a DOS-format nodelist
file under UNIX) in the Charm++ development CVS. You need to check out the
*latest* charm from CVS (not charm-5.8) to get this fix.
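
For anyone who stays on charm-5.8, the nodelist can also be cleaned by hand. The following is only a minimal sketch, assuming a POSIX shell and that keeping a backup copy next to the file is acceptable; the function name strip_cr is made up for illustration and is not part of Charm++ or NAMD.

```shell
#!/bin/sh
# strip_cr FILE: remove DOS carriage returns ("^M") from FILE in place,
# keeping the original as FILE.bak.
# NOTE: strip_cr is a hypothetical helper name, not a Charm++ tool.
strip_cr() {
    file="$1"
    # A literal carriage-return character to search for.
    cr=$(printf '\r')
    # Only rewrite the file if it actually contains a carriage return.
    if grep -q "$cr" "$file"; then
        cp "$file" "$file.bak"
        tr -d '\r' < "$file.bak" > "$file"
        echo "converted $file to UNIX line endings"
    else
        echo "$file already has UNIX line endings"
    fi
}
```

It would be called as `strip_cr ./nodelist` from the directory where charmrun is started.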

Gengbin

Gengbin Zheng wrote:

>
> Hi Matteo,
>
> I have logged in to your system and checked it. There seem to be a
> few problems:
>
> 1. Home directories are not cross-mounted, so you may have to make
> sure the binaries are the same on all machines and that the system
> libraries are installed identically on each of them.
>
> 2. At least 192.168.0.67 has no Intel libraries installed. If you run
> ldd ./namd2
> under NAMD_2.5_Linux-i686-TCP
> you will see that libimf.so is not found.
> This prevents namd2 from launching on that node.
> You should be able to link the Intel libs statically to get around this.
>
> 3. charmrun does not accept a DOS-format nodelist file; that is, "^M"
> (carriage return) characters are not allowed in the nodelist file, and
> yours contains them.
> You can run the command dos2unix <file> to convert the file to
> UNIX format.
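
The library check in item 2 can be repeated on each node. This is a rough sketch only, assuming a POSIX shell with awk available; the helper name missing_libs is invented for illustration and does not come from NAMD or Charm++.

```shell
#!/bin/sh
# missing_libs: read `ldd` output on stdin and print the name of every
# shared library that the dynamic linker reported as "not found".
# NOTE: missing_libs is a hypothetical helper name.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example: check the local binary, if it is present and executable.
if [ -x ./namd2 ]; then
    ldd ./namd2 | missing_libs
fi
```

With passwordless ssh in place, the same filter can be fed from a remote node, for example `ssh 192.168.0.67 ldd /path/to/namd2 | missing_libs`, where /path/to/namd2 is a placeholder for the real location on that node.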
>
> Anyway, I ran the namd2 APOA1 benchmark (at apoa1) using
> NAMD_2.5_Linux-i686-TCP on 192.168.0.66 and 192.168.0.64 (which have the
> Intel libraries installed) with 4 processors, and it runs fine for me.
>
> Gengbin
>
> Brian Bennion wrote:
>
>>Hello Matteo
>>
>>I am out of ideas here. It might be something really simple that I am
>>missing.
>>
>>Jim, Gengbin, Sameer any ideas?
>>
>>Brian
>>
>>On Thu, 18 Nov 2004, max wrote:
>>
>>
>>
>>>Hello Brian,
>>>
>>>Yes, I see pgm running on the other three machines.
>>>Enclosed you will find the output of namd;
>>>the bash output is:
>>>
>>>/Linux-i686-TCP-icc/namd2 /home/matteo/NAMD_2
>>>5_Source/Linux-i686-TCP-icc/PrP.namd > namd.out
>>>Charmrun> charmrun started...
>>>Charmrun> using ./nodelist as nodesfile
>>>Charmrun> rsh (ctcfgr6:0d) started
>>>Charmrun> rsh (ctcfgr10:1d) started
>>>Charmrun> rsh (ctcfgr11:2d) started
>>>Charmrun> rsh (ctcfgr9:3d) started
>>>Charmrun> node programs all started
>>>Charmrun> node programs all connected
>>>Charmrun: error on request socket--
>>>Socket closed before recv.
>>>[matteo_at_ctcfgr6 megatest]$
>>>
>>>I cannot use rsh because Red Hat 9.0 disables it by default, so I use ssh
>>>instead; I can ssh to each node without a password.
>>>As suggested in notes.txt, I added "setenv CONV_RSH ssh" to my .bashrc.
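
A note on the .bashrc line above: `setenv` is csh/tcsh syntax, so in a bash startup file it would normally fail with "command not found" and leave CONV_RSH unset. Assuming the login shell really is bash, the equivalent setting would look like the sketch below. Whether this is related to the socket error is not established in this thread.

```shell
# ~/.bashrc (bash syntax). `setenv` is a csh/tcsh builtin and does not
# exist in bash, so the variable is exported instead.
# Assumption: the login shell on every node is bash.
export CONV_RSH=ssh
```

Running `echo $CONV_RSH` in a fresh login shell should then print `ssh`.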
>>>
>>>
>>>
>>>I also tried running strace on charmrun, etc. It shows a lack of data being
>>>sent and received before namd stops.
>>>
>>>matteo
>>>
>>>-------Original Message-------
>>>
>>>From: Brian Bennion
>>>Date: 11/18/04 08:31:54
>>>To: max
>>>Subject: Re: Rif: Re: Rif: namd-l: Re: linux cluster trouble
>>>
>>>Hi Matteo
>>>
>>>Okay, things seem okay here.
>>>try this.
>>>../charmrun +p4 ++verbose /pathtonamd2/ namd.configfile
>>>replace the pathtonamd and namdconfigfile with real names and paths
>>>and tell me what happens.
>>>
>>>Can you rsh into each node without a password?
>>>
>>>Can you see pgm running on the other three machines during your pgm
>>>tests?
>>>
>>>
>>>On Thu, 18 Nov 2004, max wrote:
>>>
>>>
>>>
>>>>hi brian,
>>>>
>>>>I used exactly the commands given in notes.txt, which you can find on the
>>>>namd site:
>>>>
>>>>./charmrun ++local +p1 ./pgm
>>>>first, and then
>>>>
>>>>./charmrun ++p 4 ++verbose ./pgm
>>>>with the nodelist file in the ./ directory.
>>>>
>>>>These two tests show a strange result:
>>>>the 1-processor test finished in about 0.23 s;
>>>>the 4-processor test finished in about 3.2 min.
>>>>Is this useful?
>>>>
>>>>My cluster is connected through a 10/100/1000 Ethernet switch (HP ProCurve
>>>>2724, J4897A); each PC is equipped with 3Com gigabit Ethernet.
>>>>
>>>>Yes, the linux-tcp download should be the net-linux-tcp version; in any
>>>>case, I compiled net-linux-tcp-icc.
>>>>
>>>>Thanks for the help; I am going crazy with this problem.
>>>>
>>>>
>>>>matteo
>>>>
>>>>
>>>>-------Original Message-------
>>>>
>>>>From: Brian Bennion
>>>>Date: 11/17/04 19:56:22
>>>>To: max
>>>>Subject: Re: Rif: namd-l: Re: linux cluster trouble
>>>>
>>>>
>>>>Hi Matteo,
>>>>
>>>>Thanks for the info. Could you also share the exact commands used to
>>>>launch the megatest and your namd jobs?
>>>>Is your cluster connected by a special switch or vanilla ethernet 100MB
>>>>etc...
>>>>
>>>>Finally you state below that you downloaded the linux-tcp, is this the
>>>>net-linux-tcp version?
>>>>
>>>>Regards
>>>>Brian
>>>>
>>>>On Wed, 17 Nov 2004, max wrote:
>>>>
>>>>
>>>>
>>>>>Hello,
>>>>>I downloaded namd 2.5 from the namd home page, both the linux-tcp version
>>>>>and the source code.
>>>>>My cluster runs Red Hat Linux 9.0.
>>>>>I launched a simulation on two single PC processors, and the simulation is
>>>>>OK;
>>>>>conversely, when I launched the same simulation on three, four or more
>>>>>processors, namd stopped.
>>>>>I tried the charmrun pgm test on four machines and it is OK.
>>>>>
>>>>>I look forward to any further suggestions.
>>>>>Bye
>>>>>
>>>>>matteo
>>>>>
>>>>>-------Original Message-------
>>>>>
>>>>>From: Brian Bennion
>>>>>Date: 11/16/04 19:00:37
>>>>>To: max
>>>>>Cc: namd-l_at_ks.uiuc.edu
>>>>>Subject: namd-l: Re: linux cluster trouble
>>>>>
>>>>>Hello Matteo
>>>>>
>>>>>Can you give more details about your setup?
>>>>>Are you running on more than one machine?
>>>>>What operating system do you have?
>>>>>Which version of charm++ and NAMD are you using?
>>>>>
>>>>>The only help I can suggest based on the info below is that you are trying
>>>>>to run on more cpus than you have available....
>>>>>
>>>>>
>>>>>Regards
>>>>>Brian
>>>>>
>>>>>On Tue, 16 Nov 2004, max wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Hi,
>>>>>>I tried to compile namd on my PC but the result is the same:
>>>>>>Charmrun: error on request socket--
>>>>>>Socket closed before recv.
>>>>>>
>>>>>>
>>>>>>Any suggestions?
>>>>>>
>>>>>>matteo pappalardo
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>*****************************************************************
>>>>>**Brian Bennion, Ph.D. **
>>>>>**Computational and Systems Biology Division **
>>>>>**Biology and Biotechnology Research Program **
>>>>>**Lawrence Livermore National Laboratory **
>>>>>**P.O. Box 808, L-448 bennion1_at_llnl.gov **
>>>>>**7000 East Avenue phone: (925) 422-5722 **
>>>>>**Livermore, CA 94550 fax: (925) 424-6605 **
>>>>>*****************************************************************
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:38:00 CST