Re: Re: Is clock skew a problem for charm++

From: Jan Saam (saam_at_charite.de)
Date: Wed Jun 21 2006 - 09:24:10 CDT

Next message: Peter Freddolino: "Re: Tcl, minimization and MD (Warning: I am a lazy ignorant)"
Previous message: Jan Saam: "Re: Re: Is clock skew a problem for charm++"
In reply to: hrh: "Re: Re: Is clock skew a problem for charm++"

Hermann,
Thanks a lot for your comments!

As you could just read in the other email, correcting the clock skew helped
to some extent, but not sufficiently.
Our cluster runs RedHat 9 and the nodes are connected with Gigabit Ethernet.
The network architecture is similar to yours:
The nodes have local addresses:

/etc/hosts:
192.168.1.1 BPU1
192.168.1.2 BPU2
...

They are not visible from the internet but they are connected to a
frontend machine which is connected to the internet (100 Mbit).

So, you think the private addresses 192.168... are the problem?
By official addresses, I guess you mean the ones from the nameserver? I
can't set these since the nodes are not visible from outside. :-(
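
Whether an address falls in a private range can be checked mechanically. Below is a minimal Python sketch, assuming only the two /etc/hosts entries shown above (the elided entries are not guessed). The helper name `private_hosts` is made up for illustration, and the sketch only classifies addresses; it says nothing about why one interface would be slower than another.

```python
import ipaddress

# The two /etc/hosts entries quoted above; the elided entries are not guessed.
HOSTS_TEXT = """\
192.168.1.1 BPU1
192.168.1.2 BPU2
"""

def private_hosts(hosts_text):
    """Map each hostname to True if its address is in a private range."""
    result = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        is_private = ipaddress.ip_address(address).is_private
        for name in names:
            result[name] = is_private
    return result

if __name__ == "__main__":
    for name, is_private in private_hosts(HOSTS_TEXT).items():
        print(name, "private" if is_private else "public")
```

Both BPU1 and BPU2 come out as private here, which matches the description of nodes that are not visible from outside.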

Jan

hrh wrote:
>
> Hi Jan and all who can help!
> I have the same or a similar problem not only with NAMD, but also with
> GAMESS, a quantum chemistry program. At least in my case the
> tremendous increase in execution time using 2 or more nodes is not
> connected with clock skew. We use a cluster of 6 PCs and the
> performance does _not_ depend on clock synchronization (exact
> synchronization of one slave with the master and a time skew of 2
> hours for another slave). However the execution time seems to depend
> in a puzzling manner on network configuration: The nodes are connected
> with a Gigabit switch and Gigabit LAN and they have private
> IP-addresses and private names, but with a second 100 Mbit network
> card in each PC they can also be connected with the internet (official
> IP-addresses, official hostnames). Using private names in the nodes
> file, the wallclock time increases dramatically for two nodes compared
> to one node. With official hostnames however, the wall clock time
> decreases slightly with two nodes (too little I suppose).
>
> Timing for a 1000-step simulation:
>   One node:                                   wall clock 125 s, CPU time 121 s
>   Two nodes (one or both private addresses):  wall clock 250 s, CPU time  94 s
>   Two nodes (both official addresses):        wall clock 104 s, CPU time  73 s
>
> Jan, which Linux distribution do you have? We have installed SUSE 9.3
> on the cluster.
>
> Hermann
>
>
>
> Jan Saam wrote:
>> I forgot to say that I checked already that the problem is not ssh
>> taking forever to make a connection.
>> This is at least proven by this simple test:
>> time ssh BPU5 pwd
>> /home/jan
>>
>> real 0m0.236s
>> user 0m0.050s
>> sys 0m0.000s
>>
>> Jan
>>
>>
>> Jan Saam wrote:
>>
>>> Hi all,
>>>
>>> I'm experiencing some weird performance problems with NAMD or the
>>> charm++ library on a linux cluster:
>>> When I'm using NAMD or a simple charm++ demo program on one node
>>> everything is fine, but when I use more than one node each step takes
>>> _very_ much longer!
>>>
>>> Example:
>>> 2s for the program queens on 1 node, 445s on 2 nodes!!!
>>>
>>> running
>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>> on 1 LINUX ch_p4 processors
>>> Created
>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28357
>>> There are 14200 Solutions to 12 queens. Finish time=1.947209
>>> End of program
>>> [jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
>>> running
>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>> on 2 LINUX ch_p4 processors
>>> Created
>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28413
>>> There are 14200 Solutions to 12 queens. Finish time=445.547998
>>> End of program
>>>
>>> The same is true when I'm building the net-linux versions instead of
>>> mpi-linux, thus the problem is probably independent of MPI.
>>>
>>> One thing I noticed is that there is a several minute clock skew between
>>> the nodes. Could that be part of my problem (unfortunately I don't have
>>> rights to simply synchronize the clocks)?
>>>
>>> Does anyone have an idea what the problem could be?
>>>
>>> Many thanks,
>>> Jan
>>>
>>>
>>>
>>
>>
>
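
The timings quoted in this thread can be expressed as speedup and parallel efficiency, which makes the slowdown easier to compare across runs. Below is a small Python sketch using only the numbers quoted above; the function names and run labels are illustrative. The queens figures are the "Finish time" values the program printed, which is assumed here to be comparable to wall-clock time.

```python
def speedup(t_one_node, t_n_nodes):
    """Speedup relative to the single-node run (values below 1 are slowdowns)."""
    return t_one_node / t_n_nodes

def efficiency(t_one_node, t_n_nodes, n_nodes):
    """Speedup divided by node count; 1.0 would be ideal scaling."""
    return speedup(t_one_node, t_n_nodes) / n_nodes

# (label, one-node time [s], two-node time [s]) as quoted in this thread
RUNS = [
    ("hrh, private addresses", 125.0, 250.0),
    ("hrh, official addresses", 125.0, 104.0),
    ("queens, reported finish time", 1.947209, 445.547998),
]

if __name__ == "__main__":
    for label, t1, t2 in RUNS:
        print(f"{label}: speedup {speedup(t1, t2):.3f}, "
              f"efficiency {efficiency(t1, t2, 2):.3f}")
```

With these numbers the private-address run has a speedup of 0.5 (twice as slow as one node), and the official-address run has a speedup of roughly 1.2 on two nodes.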

-- 
---------------------------
Jan Saam
Institute of Biochemistry
Charite Berlin
Monbijoustr. 2
10117 Berlin
Germany
+49 30 450-528-446
saam_at_charite.de

