Re: Re: Is clock skew a problem for charm++

From: hrh (hrh_at_biophysik.biologie.uni-mainz.de)
Date: Wed Jun 21 2006 - 06:33:05 CDT

Hi Jan and all who can help!
I have the same or a similar problem, not only with NAMD but also with
GAMESS, a quantum chemistry program. At least in my case the tremendous
increase in execution time on 2 or more nodes is not connected with
clock skew. We use a cluster of 6 PCs, and the performance does _not_
depend on clock synchronization (we tested with one slave exactly
synchronized to the master and another slave skewed by 2 hours).
However, the execution time seems to depend in a puzzling manner on the
network configuration: the nodes are connected through a Gigabit switch
and Gigabit LAN with private IP addresses and private hostnames, but a
second 100 Mbit network card in each PC also connects them to the
internet (official IP addresses, official hostnames). With the private
names in the nodes file, the wall clock time increases dramatically for
two nodes compared to one node. With the official hostnames, however,
the wall clock time decreases slightly with two nodes (though by less
than I would expect).
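
For concreteness, the two variants of the nodes file look roughly like
this (the hostnames below are placeholders, not our real ones):

    # nodes file with private names (slow case)
    node1
    node2

    # nodes file with official names (faster case)
    node1.example.org
    node2.example.org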

Timing for a 1000 step simulation:
One node:                                   wall clock 125 s, CPU time 121 s
Two nodes (one or both private addresses):  wall clock 250 s, CPU time  94 s
Two nodes (both official addresses):        wall clock 104 s, CPU time  73 s
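
In case it helps with debugging: a quick way to compare the two paths is
to ping each node once under its private name and once under its official
name and check the round-trip times (again with placeholder hostnames):

    ping -c 5 node2                  # private name, Gigabit interface
    ping -c 5 node2.example.org      # official name, 100 Mbit interface

That at least shows which address each name resolves to and whether the
latency over the Gigabit link is as low as one would expect.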

Jan, which Linux distribution do you have? We have installed SUSE 9.3 on
the cluster.

Hermann

Jan Saam wrote:

>I forgot to mention that I already checked that the problem is not ssh
>taking forever to establish a connection. At least this simple test
>shows that it is not:
>time ssh BPU5 pwd
>/home/jan
>
>real 0m0.236s
>user 0m0.050s
>sys 0m0.000s
>
>Jan
>
>
>Jan Saam wrote:
>
>
>>Hi all,
>>
>>I'm experiencing some weird performance problems with NAMD or the
>>charm++ library on a Linux cluster:
>>When I'm using NAMD or a simple charm++ demo program on one node
>>everything is fine, but when I use more than one node each step takes
>>_very_ much longer!
>>
>>Example:
>>2s for the program queens on 1 node, 445s on 2 nodes!!!
>>
>>running
>>/home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>on 1 LINUX ch_p4 processors
>>Created
>>/home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28357
>>There are 14200 Solutions to 12 queens. Finish time=1.947209
>>End of program
>>[jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
>>running
>>/home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>on 2 LINUX ch_p4 processors
>>Created
>>/home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28413
>>There are 14200 Solutions to 12 queens. Finish time=445.547998
>>End of program
>>
>>The same is true when I build the net-linux version instead of
>>mpi-linux, so the problem is probably independent of MPI.
>>
>>One thing I noticed is that there is a several-minute clock skew between
>>the nodes. Could that be part of my problem (unfortunately I don't have
>>the rights to simply synchronize the clocks)?
>>
>>Does anyone have an idea what the problem could be?
>>
>>Many thanks,
>>Jan
>>
>>
>>
>>
>
>
>
