Re: problems with parallelism

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Tue Nov 28 2006 - 13:44:45 CST

Hi Leandro,

 What version of Charm++ did you build for your cluster?
 If your cluster is managed with Clustermatic, you want to install the
clustermatic-specific Charm++ and NAMD (described on the NAMD wiki). This
is needed because the regular net version of Charm++ uses rsh/ssh to
launch the node programs, which is not going to work on Clustermatic
(which uses bpsh). This may explain your timeout problem when using the
nodes.

Gengbin

Leandro Martínez wrote:
>
> Dear Charm++ and NAMD developers,
> We have been fighting for some months with a new cluster of
> dual-core AMD64 processors running Fedora Core 5. Our cluster
> is composed of 9 machines: 8 diskless nodes and 1 master.
> We already have an Opteron cluster with a similar architecture
> which works fine and runs NAMD with Charm++ very
> efficiently.
>
> However, we have been unable to run NAMD in parallel on our new
> cluster.
>
> Our observations are:
>
> 1. If I try to run in parallel starting the simulation from the master
> and using the nodes, the simulation hangs before it starts and returns
> the error message "Timeout waiting for node-program to connect"
> (the full output appears at the end of this email).
>
> 2. If I try to run in parallel starting from one node, even including
> the master CPU, the simulation eventually hangs: a process keeps
> running at 100% CPU on the first machine of the node list, but the
> simulation does not advance.
>
> 3. If I try to run in parallel without the master node, the simulation
> runs for a day or two, but eventually hangs with the same symptoms
> as in 2.
>
> 4. One time we ran a simulation starting from one node with the
> master node at the end of the nodelist file; the simulation hung and
> we got: "Warning: 1 processors are overloaded due to high
> background load."
>
> 5. We have tried different versions of Charm++ and NAMD 2, and have
> recompiled Charm++ with the options suggested by Jim Phillips in
> http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnAMD64
> but we observe the same results.
>
> We no longer have any clue about what the problem could be.
> Apparently we have some problem with load balancing. We have also
> updated to the latest kernels and tested all connections and network
> interfaces, and we cannot find any hardware problem in the machines.
>
> We would strongly appreciate any insight into what the problem might
> be. We would also appreciate it if someone with a similar cluster
> configuration shared his/her experience, so we can rule out (or not)
> the possibility of some hardware incompatibility with Charm++ or
> NAMD. If this is a problem you have some interest in for any reason,
> we can certainly give you access to the machines.
>
> Thank you very much,
> Leandro Martinez
> State University of Campinas
> Brazil
>
>
> Error message when starting from the master and trying to use the nodes:
>
> Charmrun> charmrun started...
> Charmrun> using ./nodelist2 as nodesfile
> Charmrun> adding client 0: "192.168.0.101", IP: 192.168.0.101
> Charmrun> adding client 1: "192.168.0.101", IP: 192.168.0.101
> Charmrun> Charmrun = alehpo.iqm.unicamp.br, port = 42645
> Charmrun> Sending "0 alehpo.iqm.unicamp.br 42645 17029 0" to client 0.
> Charmrun> find the node program
> "/home/lmartinez/./NAMD_2.6b2_Linux-amd64/namd2" at "/home/lmartinez"
> for 0.
> Charmrun> Starting rsh 192.168.0.101 -l lmartinez /bin/sh -f
> Charmrun> rsh (192.168.0.101:0) started
> Charmrun> Sending "1 alehpo.iqm.unicamp.br 42645 17029 0" to client 1.
> Charmrun> find the node program
> "/home/lmartinez/./NAMD_2.6b2_Linux-amd64/namd2" at "/home/lmartinez"
> for 1.
> Charmrun> Starting rsh 192.168.0.101 -l lmartinez /bin/sh -f
> Charmrun> rsh (192.168.0.101:1) started
> Charmrun> node programs all started
> Charmrun> waiting for rsh (192.168.0.101:0), pid 17030
> Charmrun rsh(192.168.0.101.0)> remote responding...
> Charmrun rsh(192.168.0.101.1)> remote responding...
> Charmrun rsh(192.168.0.101.0)> starting node-program...
> Charmrun rsh(192.168.0.101.0)> rsh phase successful.
> Charmrun rsh(192.168.0.101.1)> starting node-program...
> Charmrun rsh(192.168.0.101.1)> rsh phase successful.
> Charmrun> waiting for rsh (192.168.0.101:1), pid 17031
> Charmrun> Waiting for 0-th client to connect.
> Timeout waiting for node-program to connect
