Re: Rif: Re: Rif: Re: Rif: Re: Rif: Re: linux cluster trouble

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Fri Nov 19 2004 - 09:18:28 CST

max wrote:

> Hello Gengbin,
> thanks for the help.
> I have some questions.
> 1) I do not understand "cross-mounted". Can you explain it in more detail?
>

Cross-mounted means that every machine mounts the same home directory,
for example via NFS (Network File System). Then no matter which machine
you log in to, you see the same home directory, and you don't have to
install NAMD in your home directory on every machine.
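
One way to check this on each node is to look at which mount holds the
home directory. The following is only a minimal sketch, assuming a Linux
system where /proc/mounts is readable; the function names are made up for
illustration and are not part of NAMD or charmrun.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount that holds `path`.

    `mounts_text` is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mountpoint, fstype = fields[1], fields[2]
        prefix = mountpoint.rstrip("/") + "/"
        inside = path == mountpoint or (path + "/").startswith(prefix)
        # Keep the longest matching mount point, i.e. the most specific one.
        if inside and len(mountpoint) >= len(best_mount):
            best_mount, best_type = mountpoint, fstype
    return best_type


def is_nfs(path, mounts_text):
    """True if `path` lives on an NFS mount (nfs, nfs4, ...)."""
    fstype = filesystem_type(path, mounts_text)
    return fstype is not None and fstype.startswith("nfs")
```

On a node you could call is_nfs(os.path.expanduser("~"), open("/proc/mounts").read()).
If it returns False on some machines, the home directory there is local
and NAMD has to be installed separately on those machines.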

> 2) I launched a simulation on 4 processors; it stopped after 100, 200,
> or 300 steps (the number of steps is random).
> Your test (apoa1) is only 20 steps. Is that true?
>

Yes. It can be made longer, as you may have already done.
Your PrP.namd test, which I ran yesterday, does not seem stable: it gave
some weird energy results (99999.99). That is why I used the ApoA1
benchmark, which is supposed to work.

> 3) I corrected the nodelist file with dos2unix, with no result. For the
> sake of clarity I will attach the output of my job (namd.out).
>

Make sure your nodelist does not give the same node twice on one line, as
you did in your nodelist file:

host machine1 192.168.0.56

Remove either the machine name or the IP address.
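
Both nodelist problems mentioned in this thread (DOS line endings and a
host line that names the node twice) can be caught before launching
charmrun. Here is a rough sketch of such a check. The function name is
hypothetical, and the rule that everything between the host name and the
first "++" qualifier is an unwanted extra name is my own heuristic, not
charmrun's actual parser.

```python
def check_nodelist(text):
    """Return a list of human-readable problems found in nodelist text."""
    problems = []
    for lineno, raw in enumerate(text.split("\n"), start=1):
        # A leftover carriage return is what shows up as "^M" in vi.
        if "\r" in raw:
            problems.append(f"line {lineno}: carriage return (DOS line ending)")
        tokens = raw.split()
        if not tokens or tokens[0] != "host":
            continue
        # Collect the tokens between "host" and the first "++" qualifier.
        names = []
        for tok in tokens[1:]:
            if tok.startswith("++"):
                break
            names.append(tok)
        if not names:
            problems.append(f"line {lineno}: host line without a name")
        elif len(names) > 1:
            problems.append(
                f"line {lineno}: more than one name or address: {' '.join(names)}"
            )
    return problems
```

To use it, read the file in binary mode and decode it, for example
check_nodelist(open("nodelist", "rb").read().decode()), so that Python
does not silently translate the line endings. An empty list means neither
problem was found.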

> 4) I created a new nodelist file for apoa1 using vi and increased the
> number of steps from 20 to 2000, and now it works!
>

Good. If you have problems running your own atom system, chances are
either your atom input is not good or there is a bug in the NAMD
calculation itself. Let us know, and other people can help you with
questions regarding the NAMD physics calculations :-).

Gengbin

>
> Thanks, Matteo
>
> -------Original Message-------
>
> From: Gengbin Zheng <mailto:gzheng_at_ks.uiuc.edu>
> Date: 11/18/04 21:14:12
> To: max <mailto:mpappala_at_dipchi.unict.it>
> Cc: Brian Bennion <mailto:brian_at_youkai.llnl.gov>; namd-l_at_ks.uiuc.edu
> <mailto:namd-l_at_ks.uiuc.edu>
> Subject: Re: Rif: Re: Rif: Re: Rif: namd-l: Re: linux cluster trouble
>
>
> Hi Matteo,
>
> I have logged in to your system and checked it. There seem to be a
> few problems:
>
> 1. Home directories are not cross-mounted. So you may have to make
> sure the binaries on all machines are the same, with all system
> libraries installed identically.
>
> 2. At least 192.168.0.67 has no Intel libraries installed. If you run:
> ldd ./namd2
> under NAMD_2.5_Linux-i686-TCP
> you will see that libimf.so is not found.
> This prevents namd2 from launching on that node.
> You should be able to link the Intel libs statically to get around this.
>
> 3. charmrun does not like the DOS format for the nodelist file; that is,
> "^M" is not allowed in the nodelist file, which happens to be your case.
> You can run the command dos2unix <file> to convert the file into
> Unix format.
>
> Anyway, I ran the namd2 APOA1 benchmark (at apoa1) using
> NAMD_2.5_Linux-i686-TCP on 192.168.0.66 and 192.168.0.64 (with Intel
> libraries installed) with 4 processors and it runs fine for me.
>
> Gengbin
>
> Brian Bennion wrote:
>
>Hello Matteo
>
>I am out of ideas here. It might be something really simple that I am
>missing.
>
>Jim, Gengbin, Sameer any ideas?
>
>Brian
>
>On Thu, 18 Nov 2004, max wrote:
>
>
>
>Hello Brian,
>
>Yes, I see pgm running on the other three machines.
>Enclosed you will find the output of namd;
>the bash output is:
>
>/Linux-i686-TCP-icc/namd2 /home/matteo/NAMD_2
>5_Source/Linux-i686-TCP-icc/PrP.namd > namd.out
>Charmrun> charmrun started...
>Charmrun> using ./nodelist as nodesfile
>Charmrun> rsh (ctcfgr6:0d) started
>Charmrun> rsh (ctcfgr10:1d) started
>Charmrun> rsh (ctcfgr11:2d) started
>Charmrun> rsh (ctcfgr9:3d) started
>Charmrun> node programs all started
>Charmrun> node programs all connected
>Charmrun: error on request socket--
>Socket closed before recv.
>[matteo_at_ctcfgr6 megatest]$
>
>I cannot use rsh, because Red Hat 9.0 disables it by default, so instead
>of rsh I use ssh; I can ssh to each node without a password.
>As suggested in notes.txt, I inserted "setenv CONV_RSH ssh" in my .bashrc.
>
>
>
>I tried the command strace charmrun ..... etc. It shows a lack of send
>and receive data before namd stops.
>
>matteo
>
>-------Original Message-------
>
>From: Brian Bennion
>Date: 11/18/04 08:31:54
>To: max
>Subject: Re: Rif: Re: Rif: namd-l: Re: linux cluster trouble
>
>Hi Matteo
>
>Okay, things seem okay here.
>Try this:
>.../charmrun +p4 ++verbose /pathtonamd2/ namd.configfile
>replace the pathtonamd and namdconfigfile with real names and paths
>and tell me what happens.
>
>can you rsh into each node without a password?
>
>can you see the pgm tests running on the other three machines in your pgm
>tests?
>
>
>On Thu, 18 Nov 2004, max wrote:
>
>
>
>Hi Brian,
>
>I used exactly the commands reported in notes.txt, which you can find on
>the NAMD site:
>
>../charmrun ++local +p1 ./pgm
>first, and then
>
>../charmrun ++p 4 ++verbose ./pgm
>with the nodelist file in the ./ directory.
>
>These two tests show strange results:
>the 1-processor test finished in about 0.23 s;
>the 4-processor test finished in about 3.2 min.
>Is this useful?
>
>My cluster is connected with an HP ProCurve 2724 (J4897A) 10/100/1000
>Ethernet switch; each PC is equipped with a 3Com gigabit card.
>
>Yes, the linux-tcp should be the net-linux-tcp; anyway, I compiled the
>net-linux-tcp-icc.
>
>Thanks for the help, I am going crazy with this problem.
>
>
>matteo
>
>
>-------Original Message-------
>
>From: Brian Bennion
>Date: 11/17/04 19:56:22
>To: max
>Subject: Re: Rif: namd-l: Re: linux cluster trouble
>
>
>Hi Matteo,
>
>Thanks for the info. Could you also share the exact commands used to
>launch the megatest and your namd jobs?
>Is your cluster connected by a special switch or vanilla ethernet 100MB
>etc...
>
>Finally you state below that you downloaded the linux-tcp, is this the
>net-linux-tcp version?
>
>Regards
>Brian
>
>On Wed, 17 Nov 2004, max wrote:
>
>
>
>Hello,
>I downloaded NAMD 2.5 from the NAMD home page, both the linux-tcp version
>and the source code.
>My cluster runs Red Hat Linux 9.0.
>I launched the simulation on two single PC processors, and the simulation
>is OK;
>conversely, when I launched the same simulation on three, four, or more
>processors, NAMD stopped.
>I tried the charmrun pgm test on four machines and it is OK.
>
>I look forward to any further suggestions.
>Bye
>
>matteo
>
>-------Original Message-------
>
>From: Brian Bennion
>Date: 11/16/04 19:00:37
>To: max
>Cc: namd-l_at_ks.uiuc.edu
>Subject: namd-l: Re: linux cluster trouble
>
>Hello Matteo
>
>Can you give more details about your setup?
>Are you running on more than one machine?
>What operating system do you have?
>Which version of charm++ and NAMD are you using?
>
>The only help I can suggest based on the info below is that you are
>trying to run on more CPUs than you have available....
>
>
>Regards
>Brian
>
>On Tue, 16 Nov 2004, max wrote:
>
>
>
>Hi,
>I tried to compile NAMD on my PC but the result is the same:
>Charmrun: error on request socket--
>Socket closed before recv.
>
>
>Any suggestions?
>
>Matteo Pappalardo
>
>
>
>
>
>
>*****************************************************************
>**Brian Bennion, Ph.D. **
>**Computational and Systems Biology Division **
>**Biology and Biotechnology Research Program **
>**Lawrence Livermore National Laboratory **
>**P.O. Box 808, L-448 bennion1_at_llnl.gov **
>**7000 East Avenue phone: (925) 422-5722 **
>**Livermore, CA 94550 fax: (925) 424-6605 **
>*****************************************************************

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:39:00 CST