Re: Rif: Re: Rif: Re: Rif: Re: linux cluster trouble

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Thu Nov 18 2004 - 13:01:25 CST

Hi Matteo,

  I have logged in to your system and checked it. There seem to be a
few problems:

1. Home directories are not cross-mounted, so you have to make sure that
the binaries are the same on all machines and that the system libraries
are installed identically on each of them.

2. At least 192.168.0.67 has no Intel libraries installed. If you run:
    ldd ./namd2
   under NAMD_2.5_Linux-i686-TCP
   you will see that libimf.so is not found.
   This prevents namd2 from launching on that node.
   You should be able to link the Intel libs statically to get around this.

3. charmrun does not like the DOS format of the nodelist file; that is,
"^M" (carriage return) characters are not allowed in the nodelist file,
and yours happens to contain them.
    You can run the command dos2unix <file> to convert the file to Unix
format.
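
If dos2unix is not installed on a node, the same check and cleanup can be
sketched with standard tools. This is only an illustration: the function
names has_cr and strip_cr are made up here, and the file name ./nodelist is
taken from the charmrun output quoted below.

```shell
#!/bin/sh
# has_cr FILE: succeed if FILE contains at least one carriage return,
# i.e. the "^M" that DOS-format files carry at the end of each line.
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_cr FILE: rewrite FILE in place with every carriage return removed.
# (Hypothetical helper; dos2unix <file> does the same job.)
strip_cr() {
    tmp=$(mktemp) || return 1
    if tr -d '\r' < "$1" > "$tmp"; then
        cat "$tmp" > "$1"
    fi
    rm -f "$tmp"
}
```

Usage would be something like: has_cr ./nodelist && strip_cr ./nodelist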

Anyway, I ran the namd2 APOA1 benchmark (at apoa1) using
NAMD_2.5_Linux-i686-TCP on 192.168.0.66 and 192.168.0.64 (which have the
Intel libraries installed) with 4 processors, and it runs fine for me.
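
Before starting a multi-node run, the library check from item 2 can be
repeated on every node. A minimal sketch, assuming ldd prints lines of the
form "libimf.so => not found"; the function names missing_libs and
check_binary are invented for this example.

```shell
#!/bin/sh
# missing_libs: read ldd output on stdin and print the name of every
# shared library that is reported as "not found".
missing_libs() {
    awk '/not found/ { print $1 }'
}

# check_binary PATH: succeed only if ldd reports no missing libraries
# for the binary at PATH; otherwise list them on stderr and fail.
check_binary() {
    missing=$(ldd "$1" | missing_libs)
    if [ -n "$missing" ]; then
        echo "missing libraries for $1:" >&2
        echo "$missing" >&2
        return 1
    fi
}
```

Running check_binary ./namd2 on each node in the nodelist (for example
through ssh) should show which machines still lack libimf.so.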

Gengbin

Brian Bennion wrote:

>Hello Matteo
>
>I am out of ideas here. It might be something really simple that I am
>missing.
>
>Jim, Gengbin, Sameer any ideas?
>
>Brian
>
>On Thu, 18 Nov 2004, max wrote:
>
>
>
>>Hello Brian,
>>
>>Yes, I see pgm running on the other three machines.
>>Enclosed you will find the output of namd;
>>the bash output is:
>>
>>/Linux-i686-TCP-icc/namd2 /home/matteo/NAMD_2
>>5_Source/Linux-i686-TCP-icc/PrP.namd > namd.out
>>Charmrun> charmrun started...
>>Charmrun> using ./nodelist as nodesfile
>>Charmrun> rsh (ctcfgr6:0d) started
>>Charmrun> rsh (ctcfgr10:1d) started
>>Charmrun> rsh (ctcfgr11:2d) started
>>Charmrun> rsh (ctcfgr9:3d) started
>>Charmrun> node programs all started
>>Charmrun> node programs all connected
>>Charmrun: error on request socket--
>>Socket closed before recv.
>>[matteo_at_ctcfgr6 megatest]$
>>
>>I cannot use rsh because Red Hat 9.0 disables it by default, so I use ssh
>>instead of rsh; I can ssh to each node without a password.
>>As suggested in notes.txt, I inserted "setenv CONV_RSH ssh" in my .bashrc.
>>
>>
>>
>>I tried running the command strace charmrun ... etc. It shows that no data
>>is sent or received before namd stops.
>>
>>matteo
>>
>>-------Original Message-------
>>
>>From: Brian Bennion
>>Date: 11/18/04 08:31:54
>>To: max
>>Subject: Re: Rif: Re: Rif: namd-l: Re: linux cluster trouble
>>
>>Hi Matteo,
>>
>>Okay, things seem okay here.
>>Try this:
>>../charmrun +p4 ++verbose /pathtonamd2/ namd.configfile
>>Replace pathtonamd2 and namd.configfile with the real names and paths,
>>and tell me what happens.
>>
>>Can you rsh into each node without a password?
>>
>>Can you see the pgm tests running on the other three machines when you run
>>your pgm tests?
>>
>>
>>On Thu, 18 Nov 2004, max wrote:
>>
>>
>>
>>>Hi Brian,
>>>
>>>I used exactly the commands given in notes.txt, which you can find on the
>>>NAMD site:
>>>
>>>./charmrun ++local +p1 ./pgm
>>>first, and then
>>>
>>>./charmrun ++p 4 ++verbose ./pgm
>>>with the nodelist file in the ./ directory.
>>>
>>>These two tests show a strange result:
>>>the 1-processor test finished in about 0.23 s;
>>>the 4-processor test finished in about 3.2 min.
>>>Is this useful?
>>>
>>>My cluster is connected through an HP ProCurve 2724 (J4897A) 10/100/1000
>>>Ethernet switch; each PC is equipped with 3Com gigabit Ethernet.
>>>
>>>Yes, the linux-tcp version should be the net-linux-tcp one; anyway, I
>>>compiled net-linux-tcp-icc.
>>>
>>>Thanks for the help; this problem is driving me crazy.
>>>
>>>
>>>matteo
>>>
>>>
>>>-------Original Message-------
>>>
>>>From: Brian Bennion
>>>Date: 11/17/04 19:56:22
>>>To: max
>>>Subject: Re: Rif: namd-l: Re: linux cluster trouble
>>>
>>>
>>>Hi Matteo,
>>>
>>>Thanks for the info. Could you also share the exact commands used to
>>>launch the megatest and your namd jobs?
>>>Is your cluster connected by a special switch or vanilla ethernet 100MB
>>>etc...
>>>
>>>Finally you state below that you downloaded the linux-tcp, is this the
>>>net-linux-tcp version?
>>>
>>>Regards
>>>Brian
>>>
>>>On Wed, 17 Nov 2004, max wrote:
>>>
>>>
>>>
>>>>Hello,
>>>>I downloaded NAMD 2.5 from the NAMD home page, both the linux-tcp version
>>>>and the source code.
>>>>My cluster runs Red Hat Linux 9.0.
>>>>I launched a simulation on two processors of a single PC, and the
>>>>simulation is OK;
>>>>conversely, when I launched the same simulation on three, four or more
>>>>processors, NAMD stopped.
>>>>I tried the charmrun pgm test on four machines and it is OK.
>>>>
>>>>I look forward to any further suggestions.
>>>>Bye
>>>>
>>>>matteo
>>>>
>>>>-------Original Message-------
>>>>
>>>>From: Brian Bennion
>>>>Date: 11/16/04 19:00:37
>>>>To: max
>>>>Cc: namd-l_at_ks.uiuc.edu
>>>>Subject: namd-l: Re: linux cluster trouble
>>>>
>>>>Hello Matteo
>>>>
>>>>Can you give more details about your setup?
>>>>Are you running on more than one machine?
>>>>What operating system do you have?
>>>>Which version of charm++ and NAMD are you using?
>>>>
>>>>The only help I can suggest based on the info below is that you are
>>>>trying to run on more cpus than you have available....
>>>>
>>>>
>>>>Regards
>>>>Brian
>>>>
>>>>On Tue, 16 Nov 2004, max wrote:
>>>>
>>>>
>>>>
>>>>>Hi,
>>>>>I tried to compile NAMD on my PC, but the result is the same:
>>>>>Charmrun: error on request socket--
>>>>>Socket closed before recv.
>>>>>
>>>>>
>>>>>Any suggestions?
>>>>>
>>>>>matteo pappalardo
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>*****************************************************************
>>>>**Brian Bennion, Ph.D. **
>>>>**Computational and Systems Biology Division **
>>>>**Biology and Biotechnology Research Program **
>>>>**Lawrence Livermore National Laboratory **
>>>>**P.O. Box 808, L-448 bennion1_at_llnl.gov **
>>>>**7000 East Avenue phone: (925) 422-5722 **
>>>>**Livermore, CA 94550 fax: (925) 424-6605 **
>>>>*****************************************************************
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:38:00 CST