From: max (mpappala_at_dipchi.unict.it)
Date: Fri Nov 19 2004 - 03:47:01 CST
Hello Gengbin,
Thanks for the help.
I have some questions:
1) I do not understand "cross mounted". Can you explain it in more detail?
2) I launched a 4-processor simulation; it stopped after 100, 200, or 300 steps
(the number of steps is random).
Your test (apoa1) is only 20 steps. Is that right?
3) I corrected the nodelist file with dos2unix, with no result. For the sake of
clarity I will attach the output of my job (namd.out).
4) I created a new nodelist file in apoa1 using vi, and I increased the number of
steps from 20 to 2000, and now it works!
Thanks, matteo
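For reference, a nodelist can also be written directly from the shell, which guarantees Unix line endings. This is only a sketch: it assumes the Charm++ nodelist layout of a `group main` line followed by `host <name>` lines, and it uses the four host names that appear in the charmrun output quoted further down this thread; substitute your own.

```shell
# Write a Unix-format (LF-only) nodelist in the current directory.
# Host names are taken from the charmrun log in this thread; adjust as needed.
cat > nodelist <<'EOF'
group main
host ctcfgr6
host ctcfgr10
host ctcfgr11
host ctcfgr9
EOF
```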
-------Original Message-------
From: Gengbin Zheng
Date: 11/18/04 21:14:12
To: max
Cc: Brian Bennion; namd-l_at_ks.uiuc.edu
Subject: Re: Rif: Re: Rif: Re: Rif: namd-l: Re: linux cluster trouble
Hi Matteo,
I have logged in to your system and checked it. There seem to be a few
problems:
1. Home directories are not cross-mounted. So you have to make sure that the
binaries on all machines are the same, with all system libraries installed
identically.
2. At least 192.168.0.67 has no Intel libraries installed. If you run:
ldd ./namd2
under NAMD_2.5_Linux-i686-TCP
you will see that libimf.so is not found.
This prevents namd2 from launching on that node.
You should be able to link the Intel libraries statically to get around this.
3. charmrun does not accept a nodelist file in DOS format; that is, "^M"
(carriage return) characters are not allowed in the nodelist file, and yours
contains them.
You can run the command dos2unix <file> to convert the file into Unix
format.
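The checks in points 2 and 3 can be scripted before launching charmrun. The following is a minimal sketch, not a NAMD or Charm++ tool: `missing_libs` and `has_cr` are hypothetical helper names, and it assumes a POSIX shell with `ldd`, `awk`, and `grep` available on the node being checked.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Read ldd output on stdin and print the names of libraries
# reported as "not found" (prints nothing if all resolve).
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Succeed (exit 0) if the given file contains a carriage return,
# i.e. it still has DOS line endings.
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Example use:
#   ldd ./namd2 | missing_libs
#   has_cr ./nodelist && dos2unix ./nodelist
```

Because home directories are not cross-mounted, the `ldd` check would need to be run on every node, not only on the one where charmrun is started.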
Anyway, I ran the namd2 APOA1 benchmark (at apoa1) using
NAMD_2.5_Linux-i686-TCP on 192.168.0.66 and 192.168.0.64 (with Intel libraries
installed) with 4 processors and it runs fine for me.
Gengbin
Brian Bennion wrote:
Hello Matteo
I am out of ideas here. It might be something really simple that I am
missing.
Jim, Gengbin, Sameer any ideas?
Brian
On Thu, 18 Nov 2004, max wrote:
Hello Brian,
Yes, I see pgm running on the other three machines.
Enclosed you will find the output of namd.
The bash output is:
/Linux-i686-TCP-icc/namd2 /home/matteo/NAMD_2.5_Source/Linux-i686-TCP-icc/PrP.namd > namd.out
Charmrun> charmrun started...
Charmrun> using ./nodelist as nodesfile
Charmrun> rsh (ctcfgr6:0d) started
Charmrun> rsh (ctcfgr10:1d) started
Charmrun> rsh (ctcfgr11:2d) started
Charmrun> rsh (ctcfgr9:3d) started
Charmrun> node programs all started
Charmrun> node programs all connected
Charmrun: error on request socket--
Socket closed before recv.
[matteo_at_ctcfgr6 megatest]$
I cannot use rsh, because Red Hat 9.0 disables it by default, so I use ssh
instead; I can ssh to each node without a password.
As suggested in notes.txt, I inserted "setenv CONV_RSH ssh" in my .bashrc.
I also tried running strace charmrun ... etc. It shows that no data is sent
or received before namd stops.
matteo
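One detail worth checking here: `setenv` is csh syntax, so in a `.bashrc` the equivalent line would be `export CONV_RSH=ssh`. Below is a minimal sketch of that setting together with a passwordless-login check for every node. It assumes the nodelist uses `host <name>` lines and that the nodes run OpenSSH (for the `BatchMode` option); `nodelist_hosts` and `check_ssh` are hypothetical helper names, not part of NAMD or Charm++.

```shell
#!/bin/sh
# sh/bash equivalent of the csh line "setenv CONV_RSH ssh"
export CONV_RSH=ssh

# Hypothetical helper: print the host names listed in a nodelist file.
# Carriage returns are stripped so a DOS-format file still parses.
nodelist_hosts() {
    tr -d '\r' < "$1" | awk '$1 == "host" { print $2 }'
}

# Hypothetical helper: report nodes that refuse a non-interactive login.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh() {
    for h in $(nodelist_hosts "$1"); do
        ssh -n -o BatchMode=yes "$h" true 2>/dev/null \
            || echo "no passwordless ssh: $h"
    done
}

# Example use:
#   check_ssh ./nodelist
```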
-------Original Message-------
From: Brian Bennion
Date: 11/18/04 08:31:54
To: max
Subject: Re: Rif: Re: Rif: namd-l: Re: linux cluster trouble
Hi Matteo,
Okay, things seem okay here.
Try this:
.../charmrun +p4 ++verbose /pathtonamd2/ namd.configfile
Replace pathtonamd2 and namd.configfile with the real names and paths
and tell me what happens.
Can you rsh into each node without a password?
Can you see pgm running on the other three machines during your pgm
tests?
On Thu, 18 Nov 2004, max wrote:
Hi Brian,
I used exactly the commands given in notes.txt, which you can find on the
NAMD site:
../charmrun ++local +p1 ./pgm
first, and then
../charmrun ++p 4 ++verbose ./pgm
with the nodelist file in the ./ directory.
These two tests show a strange result:
The 1-processor test finished in about 0.23 s.
The 4-processor test finished in about 3.2 min.
Is that useful?
My cluster is connected through an HP ProCurve 2724 (J4897A) 10/100/1000
Ethernet switch; each PC is equipped with 3Com gigabit Ethernet.
Yes, the linux-tcp should be the net-linux-tcp; anyway, I compiled the
net-linux-tcp-icc.
Thanks for the help, I am going crazy with this problem.
matteo
-------Original Message-------
From: Brian Bennion
Date: 11/17/04 19:56:22
To: max
Subject: Re: Rif: namd-l: Re: linux cluster trouble
Hi Matteo,
Thanks for the info. Could you also share the exact commands used to
launch the megatest and your namd jobs?
Is your cluster connected by a special switch or vanilla ethernet 100MB
etc...
Finally you state below that you downloaded the linux-tcp, is this the
net-linux-tcp version?
Regards
Brian
On Wed, 17 Nov 2004, max wrote:
Hello,
I downloaded NAMD 2.5 from the NAMD home page, both the linux-tcp version and
the source code.
My cluster runs Red Hat Linux 9.0.
I launched the simulation on two processors of a single PC, and the simulation
is OK;
conversely, when I launched the same simulation on three, four, or more
processors, NAMD stopped.
I tried the charmrun pgm test on four machines and it is OK.
I look forward to any further suggestions.
Bye
matteo
-------Original Message-------
From: Brian Bennion
Date: 11/16/04 19:00:37
To: max
Cc: namd-l_at_ks.uiuc.edu
Subject: namd-l: Re: linux cluster trouble
Hello Matteo
Can you give more details about your setup?
Are you running on more than one machine?
What operating system do you have?
Which version of charm++ and NAMD are you using?
The only thing I can suggest based on the info below is that you may be
trying to run on more CPUs than you have available.
Regards
Brian
On Tue, 16 Nov 2004, max wrote:
Hi,
I tried to compile NAMD on my PC but the result is the same:
Charmrun: error on request socket--
Socket closed before recv.
Any suggestions?
matteo pappalardo
*****************************************************************
**Brian Bennion, Ph.D. **
**Computational and Systems Biology Division **
**Biology and Biotechnology Research Program **
**Lawrence Livermore National Laboratory **
**P.O. Box 808, L-448 bennion1_at_llnl.gov **
**7000 East Avenue phone: (925) 422-5722 **
**Livermore, CA 94550 fax: (925) 424-6605 **
*****************************************************************
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:38:01 CST