Re: Re: Is clock skew a problem for charm++

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Wed Jun 21 2006 - 23:17:07 CDT

Jan Saam wrote:

>Hi Gengbin and other listeners,
>
>the weird story goes on. From the documentation and older threads in
>namd-l I conclude that the MPI versions and the net versions of NAMD are
>equivalent from the user's point of view, while there might be some minor
>performance differences depending on the architecture.
>I tried both versions and found that both suffer from the same
>performance problems mentioned earlier. However, MPI is still 3 times slower.
>
>

The MPI version may be slower than the net version, but I don't think a
factor of 3 is typical. There may be a problem with your MPI installation.
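
If you want to rule charm++ out entirely, you could time a bare MPI round
trip between two of the nodes. Below is a rough sketch of such a test
(plain MPI, not the charm++ pingpong test; the file name and iteration
count are arbitrary). The elapsed time is measured on rank 0 alone, so
clock skew between the nodes cannot distort it. On commodity Ethernet the
average round trip should be on the order of a hundred microseconds; if it
comes out in the milliseconds or worse, the MPI installation or the network
itself is the suspect, not NAMD. Run it with a machinefile that puts the
two ranks on different nodes.

/* pingpong.c -- minimal MPI round-trip latency check (sketch).
 * Build:  mpicc -O2 pingpong.c -o pingpong
 * Run:    mpirun -np 2 -machinefile ~/machines ./pingpong
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 1000;
    char buf[1] = {0};
    double t0, t1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {          /* rank 0 sends, then waits for the echo */
            MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {   /* rank 1 echoes every message back */
            MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round-trip time: %g microseconds\n",
               (t1 - t0) / iters * 1e6);

    MPI_Finalize();
    return 0;
}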

>One thing I noticed about the MPI version of NAMD is that it spawns two
>processes on each node (while only one appears to consume significant
>resources according to "top"):
>
>jan 22216 22214 0 21:02 ? 00:00:00 /usr/sbin/sshd
>jan 22217 22216 81 21:02 ? 00:08:53 /home/jan/bin/namd2 BPU1
>32887 -p4amslave -p4yourname BPU5 -p4rmrank 3
>jan 22246 22217 0 21:02 ? 00:00:00 /home/jan/bin/namd2 BPU1
>32887 -p4amslave -p4yourname BPU5 -p4rmrank 3
>
>Should it be like that?
>
>
>

Some MPI implementations, such as LAM, do spawn pthreads; that is probably
why you see several instances in top.
If it is LAM, try MPICH. We have had bad luck with LAM in the past.
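
For what it's worth, a quick way to see which MPI your mpirun actually
comes from (this assumes the stock utilities of those distributions are
installed; mpichversion ships with MPICH and laminfo with LAM/MPI):

which mpirun
mpichversion
laminfo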

Gengbin

>I started namd on 12 processors:
>mpirun -v -np 12 -leave_pg -machinefile ~/machines ~/bin/namd2 eq2.namd >eq2.out&
>
>(My machinefile contains 12 valid machines)
>
>Jan
>
>
>Gengbin Zheng wrote:
>
>
>>Hi Jan,
>>
>>Clock skew may cause misleading time output, but I doubt it is the
>>case here (queens program) because the time was printed from the same
>>processor (0).
>>When you run the program, did it really take 7 minutes wallclock time?
>>Also, have you tried the pingpong test from charm/tests/charm++/pingpong
>>to test network latency?
>>
>>Gengbin
>>
>>Jan Saam wrote:
>>
>>
>>
>>>I forgot to say that I already checked that the problem is not ssh
>>>taking forever to make a connection.
>>>This is at least proven by this simple test:
>>>time ssh BPU5 pwd
>>>/home/jan
>>>
>>>real 0m0.236s
>>>user 0m0.050s
>>>sys 0m0.000s
>>>
>>>Jan
>>>
>>>
>>>Jan Saam wrote:
>>>
>>>
>>>
>>>
>>>>Hi all,
>>>>
>>>>I'm experiencing some weird performance problems with NAMD or the
>>>>charm++ library on a Linux cluster:
>>>>When I'm using NAMD or a simple charm++ demo program on one node
>>>>everything is fine, but when I use more than one node each step takes
>>>>_very_ much longer!
>>>>
>>>>Example:
>>>>2s for the program queens on 1 node, 445s on 2 nodes!!!
>>>>
>>>>running
>>>>/home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>>>
>>>>on 1 LINUX ch_p4 processors
>>>>Created
>>>>/home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28357
>>>>
>>>>There are 14200 Solutions to 12 queens. Finish time=1.947209
>>>>End of program
>>>>[jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
>>>>running
>>>>/home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>>>
>>>>on 2 LINUX ch_p4 processors
>>>>Created
>>>>/home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28413
>>>>
>>>>There are 14200 Solutions to 12 queens. Finish time=445.547998
>>>>End of program
>>>>
>>>>The same is true when I build the net-linux version instead of
>>>>mpi-linux, so the problem is probably independent of MPI.
>>>>
>>>>One thing I noticed is that there is a clock skew of several minutes
>>>>between the nodes. Could that be part of my problem (unfortunately I
>>>>don't have the rights to simply synchronize the clocks)?
>>>>
>>>>Does anyone have an idea what the problem could be?
>>>>
>>>>Many thanks,
>>>>Jan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>
>
>
