Rif: Re: Rif: Re: Rif: Re: Rif: Re: linux cluster trouble

From: max (mpappala_at_dipchi.unict.it)
Date: Fri Nov 19 2004 - 03:47:01 CST

Hello Gengbin,
thanks for the help.
I have some questions:
1) I do not understand "cross-mounted". Can you explain it in more detail?
2) I launched a simulation on 4 procs; they stopped after 100, 200 or 300 steps
(the number of steps is random).
     Your test (apoa1) is only 20 steps. Is that right?
3) I corrected the nodelist file with dos2unix, with no result. For the sake of
clarity I will attach the output of my job (namd.out).
4) I created a new nodelist file in apoa1 using vi and increased the number of
steps from 20 to 2000, and now it works!

thanks, matteo
 
-------Original message-------

From: Gengbin Zheng
Date: 11/18/04 21:14:12
To: max
Cc: Brian Bennion; namd-l_at_ks.uiuc.edu
Subject: Re: Rif: Re: Rif: Re: Rif: namd-l: Re: linux cluster trouble
 

Hi Matteo,

  I have logged in to your system and checked it. There seem to be a few
problems:

1. Home directories are not cross-mounted (shared across the nodes). So you
may have to make sure that the binaries on all machines are the same and that
all system libraries are installed identically.

2. At least 192.168.0.67 has no Intel libraries installed. If you run:
    ldd ./namd2
   under NAMD_2.5_Linux-i686-TCP
   you will see that libimf.so is not found.
   This prevents namd2 from launching on that node.
   You should be able to link the Intel libs statically to get around this.

3. charmrun does not like the DOS format for the nodelist file, that is, "^M"
is not allowed in the nodelist file, which happens to be the case for yours.
    You can run the command dos2unix <file> to convert the file into unix
format.
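
The checks in points 2 and 3 can be sketched in shell as below. This is only
an illustration, not part of the original advice: the function names
(has_cr, strip_cr, missing_libs) and the output file name nodelist.unix are
made up for the example, and tr is used here as a stand-in for dos2unix.

```shell
# Succeeds if the file contains at least one carriage return ("^M"),
# i.e. it was probably saved with DOS line endings.
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# Writes a copy of file $1 to $2 with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Prints the shared libraries that the dynamic linker cannot resolve
# for the binary $1 (for example libimf.so on a node without Intel libs).
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}

# Example use, guarded so nothing happens if the files are absent.
if [ -f nodelist ] && has_cr nodelist; then
    strip_cr nodelist nodelist.unix   # nodelist.unix is a hypothetical name
fi
if [ -x ./namd2 ]; then
    missing_libs ./namd2 || echo "all shared libraries resolved"
fi
```

Since home directories are not shared, the ldd check would need to be run on
every node, not only on the one where charmrun is started.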

Anyway, I ran the namd2 APOA1 benchmark (at apoa1) using
NAMD_2.5_Linux-i686-TCP on 192.168.0.66 and 192.168.0.64 (with intel libraries
installed) with 4 processors and it runs fine for me.

Gengbin

Brian Bennion wrote:
Hello Matteo

I am out of ideas here. It might be something really simple that I am
missing.

Jim, Gengbin, Sameer any ideas?

Brian

On Thu, 18 Nov 2004, max wrote:

  
Hello Brian,

yes, I see pgm running on the other three machines.
Enclosed you will find the output of namd;
the bash output is:

/Linux-i686-TCP-icc/namd2 /home/matteo/NAMD_2.5_Source/Linux-i686-TCP-icc/PrP.namd > namd.out
Charmrun> charmrun started...
Charmrun> using ./nodelist as nodesfile
Charmrun> rsh (ctcfgr6:0d) started
Charmrun> rsh (ctcfgr10:1d) started
Charmrun> rsh (ctcfgr11:2d) started
Charmrun> rsh (ctcfgr9:3d) started
Charmrun> node programs all started
Charmrun> node programs all connected
Charmrun: error on request socket--
Socket closed before recv.
[matteo_at_ctcfgr6 megatest]$

I cannot use rsh, because Red Hat 9.0 disables it by default, so I use ssh
instead; I can ssh to each node without a password.
As suggested in notes.txt, I inserted "setenv CONV_RSH ssh" in my .bashrc.
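
A note on the line above: setenv is csh/tcsh syntax and bash has no setenv
builtin, so in a .bashrc that line would most likely fail without setting the
variable. A minimal sketch of the bash equivalent follows, together with a
quick check of password-less login. The check_hosts function is a
hypothetical helper written for this example; the host names in the comment
are the ones shown in the charmrun output above.

```shell
# bash equivalent of the csh line "setenv CONV_RSH ssh"
export CONV_RSH=ssh

# Reports, for each host given, whether a login works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
check_hosts() {
    for host in "$@"; do
        if "${CONV_RSH:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 \
               "$host" true < /dev/null 2> /dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Example:
# check_hosts ctcfgr6 ctcfgr10 ctcfgr11 ctcfgr9
```

Any host reported as FAILED would need its ssh keys fixed before charmrun can
start a process there.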



I tried running the command strace charmrun ... etc. It shows a lack of data
being sent and received before namd stops.

matteo

-------Original message-------

From: Brian Bennion
Date: 11/18/04 08:31:54
To: max
Subject: Re: Rif: Re: Rif: namd-l: Re: linux cluster trouble

Hi Matteo

Okay, things seem okay here.
try this.
.../charmrun +p4 ++verbose /pathtonamd2/ namd.configfile
replace the pathtonamd and namdconfigfile with real names and paths
and tell me what happens.
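
As an illustration of the substitution described above, here is a minimal
sketch. The values of CHARMRUN, NAMD2 and CONFIG are placeholders, not the
real locations on this cluster, and charmrun_cmd is a hypothetical helper
that only prints the command so it can be inspected before it is run.

```shell
# Placeholder paths (assumptions); replace them with the real ones.
CHARMRUN=${CHARMRUN:-./charmrun}
NAMD2=${NAMD2:-./namd2}
CONFIG=${CONFIG:-PrP.namd}

# Prints the charmrun command line for the given number of processors.
charmrun_cmd() {
    printf '%s +p%s ++verbose %s %s\n' "$CHARMRUN" "$1" "$NAMD2" "$CONFIG"
}

charmrun_cmd 4
```

Once the printed line looks right, it can be pasted into the shell and its
output redirected to a file such as namd.out.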

Can you rsh into each node without a password?

Can you see pgm running on the other three machines during your pgm
tests?


On Thu, 18 Nov 2004, max wrote:

    
Hi Brian,

I used exactly the commands reported in notes.txt, which you can find on the
namd site:

../charmrun ++local +p1 ./pgm
first, and then

../charmrun ++p 4 ++verbose ./pgm
with the nodelistfile in ./ directory

These two tests show strange results:
the 1-processor test finished in about 0.23 s;
the 4-processor test finished in about 3.2 min.
Is this useful?

My cluster is connected through an HP ProCurve 2724 (J4897A) 10/100/1000
Ethernet switch; each PC is equipped with 3Com gigabit Ethernet.

Yes, the linux-tcp version should be net-linux-tcp; anyway, I compiled
net-linux-tcp-icc.

Thanks for the help, I am going crazy with this problem.


matteo


-------Original message-------

From: Brian Bennion
Date: 11/17/04 19:56:22
To: max
Subject: Re: Rif: namd-l: Re: linux cluster trouble


Hi Matteo,

Thanks for the info. Could you also share the exact commands used to
launch the megatest and your namd jobs?
Is your cluster connected by a special switch or vanilla ethernet 100MB
etc...

Finally you state below that you downloaded the linux-tcp, is this the
net-linux-tcp version?

Regards
Brian

On Wed, 17 Nov 2004, max wrote:

      
hello,
I downloaded namd 2.5 from the namd home page, both the linux-tcp version and
the source code.
My cluster runs under Linux Red Hat 9.0.
I launched a simulation on two single pc processors, and the simulation is ok;
conversely, when I launched the same simulation on three, four or more
processors, namd stopped.
I tried the charmrun pgm test on four machines and it is ok.

I look forward to any further suggestions.
Bye

matteo

-------Original message-------

From: Brian Bennion
Date: 11/16/04 19:00:37
To: max
Cc: namd-l_at_ks.uiuc.edu
Subject: namd-l: Re: linux cluster trouble

Hello Matteo

Can you give more details about your setup?
Are you running on more than one machine?
What operating system do you have?
Which version of charm++ and NAMD are you using?

The only help I can suggest based on the info below is that you are trying
to run on more cpus than you have available....


Regards
Brian

On Tue, 16 Nov 2004, max wrote:

        
Hi,
I tried compiling namd on my pc but the result is the same:
Charmrun: error on request socket--
Socket closed before recv.


Any suggestions?

matteo pappalardo




          
*****************************************************************
**Brian Bennion, Ph.D. **
**Computational and Systems Biology Division **
**Biology and Biotechnology Research Program **
**Lawrence Livermore National Laboratory **
**P.O. Box 808, L-448 bennion1_at_llnl.gov **
**7000 East Avenue phone: (925) 422-5722 **
**Livermore, CA 94550 fax: (925) 424-6605 **
*****************************************************************

        
  
 




This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:38:01 CST