Re: Re: Is clock skew a problem for charm++

From: Jan Saam (saam_at_charite.de)
Date: Wed Jun 21 2006 - 14:38:09 CDT

Hi Gengbin and other listeners,

the weird story goes on. From the documentation and older threads on
namd-l I conclude that the MPI and net versions of NAMD are equivalent
from the user's point of view, apart from minor performance differences
depending on the architecture.
I tried both versions and found that both suffer from the same
performance problems mentioned earlier. On top of that, the MPI version
is 3 times slower.
One thing I noticed about the MPI version of NAMD is that it spawns two
processes on each node (though only one appears to consume significant
resources according to "top"):

jan 22216 22214 0 21:02 ? 00:00:00 /usr/sbin/sshd
jan 22217 22216 81 21:02 ? 00:08:53 /home/jan/bin/namd2 BPU1
32887 4amslave -p4yourname BPU5 -p4rmrank 3
jan 22246 22217 0 21:02 ? 00:00:00 /home/jan/bin/namd2 BPU1
32887 4amslave -p4yourname BPU5 -p4rmrank 3

Should it be like that?
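For what it's worth, the -p4amslave/-p4rmrank arguments in the listing
suggest MPICH's ch_p4 device, which as far as I know forks a helper
process next to each worker, so a second mostly-idle namd2 per node may
be normal. A minimal sketch to count namd2 processes per node; the ssh
loop is an untested assumption (passwordless ssh, POSIX ps), only the
counting helper itself is exercised:

```shell
#!/bin/sh
# Sketch: count how many processes with a given command name a node
# runs. "ps -e -o comm=" prints one command name per line (POSIX ps).
count_name() {
  # $1 = newline-separated process list, $2 = command name to count
  printf '%s\n' "$1" | grep -c "^$2"
}

# Real usage against the cluster (assumption, untested here):
# for h in $(cat ~/machines); do
#   echo "$h: $(count_name "$(ssh "$h" ps -e -o comm=)" namd2)"
# done

count_name "namd2
namd2
sshd" namd2   # prints 2
```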

I started NAMD on 12 processors:
mpirun -v -np 12 -leave_pg -machinefile ~/machines ~/bin/namd2 eq2.namd > eq2.out &
(My machinefile contains 12 valid machines.)
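On the clock-skew question below: the skew can at least be measured
without root, since reading clocks needs no privileges (ssh round-trip
time adds noise of well under a second, which is fine for detecting a
several-minute skew). A rough sketch; the ssh loop is an assumption and
only the arithmetic helper is exercised:

```shell
#!/bin/sh
# Sketch: report a remote node's clock offset relative to this one,
# in seconds. No clock is changed, so no special rights are needed.
clock_skew() {
  # $1 = local epoch seconds, $2 = remote epoch seconds
  echo $(( $2 - $1 ))
}

# Real usage against the cluster (assumption, untested here):
# for h in $(cat ~/machines); do
#   echo "$h: $(clock_skew "$(date +%s)" "$(ssh "$h" date +%s)")s"
# done

clock_skew 1000 1180   # prints 180 (remote clock 3 minutes ahead)
```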

Jan

Gengbin Zheng wrote:
>
> Hi Jan,
>
> Clock skew may cause misleading time output, but I doubt it is the
> case here (queens program) because the time was printed from the same
> processor (0).
> When you run the program, did it really take 7 minutes wallclock time?
> Also, have you tried pingpong test from charm/tests/charm++/pingpong
> to test network latency?
>
> Gengbin
>
> Jan Saam wrote:
>
>> I forgot to say that I have already checked that the problem is not
>> ssh taking forever to establish a connection.
>> At least this simple test proves it:
>> time ssh BPU5 pwd
>> /home/jan
>>
>> real 0m0.236s
>> user 0m0.050s
>> sys 0m0.000s
>>
>> Jan
>>
>>
>> Jan Saam wrote:
>>
>>
>>> Hi all,
>>>
>>> I'm experiencing some weird performance problems with NAMD or the
>>> charm++ library on a Linux cluster:
>>> when I use NAMD or a simple charm++ demo program on one node
>>> everything is fine, but when I use more than one node each step takes
>>> _very_ much longer!
>>>
>>> Example:
>>> 2s for the program queens on 1 node, 445s on 2 nodes!!!
>>>
>>> running
>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>>
>>> on 1 LINUX ch_p4 processors
>>> Created
>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28357
>>>
>>> There are 14200 Solutions to 12 queens. Finish time=1.947209
>>> End of program
>>> [jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
>>> running
>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>>
>>> on 2 LINUX ch_p4 processors
>>> Created
>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28413
>>>
>>> There are 14200 Solutions to 12 queens. Finish time=445.547998
>>> End of program
>>>
>>> The same is true when I build the net-linux version instead of
>>> mpi-linux, so the problem is probably independent of MPI.
>>>
>>> One thing I noticed is that there is a several-minute clock skew
>>> between
>>> the nodes. Could that be part of my problem (unfortunately I don't
>>> have the rights to simply synchronize the clocks)?
>>>
>>> Does anyone have an idea what the problem could be?
>>>
>>> Many thanks,
>>> Jan
>>>
>>>
>>>
>>
>>
>>
>

-- 
---------------------------
Jan Saam
Institute of Biochemistry
Charite Berlin
Monbijoustr. 2
10117 Berlin
Germany
+49 30 450-528-446
saam_at_charite.de

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:43:45 CST