Re: Re: Is clock skew a problem for charm++

From: Jan Saam (saam_at_charite.de)
Date: Thu Jun 22 2006 - 12:54:47 CDT

Hi Sameer,
megatest didn't show any obvious problems, at least I can't see any (see below).

I'll send you the files setting up my system in a separate mail.
And maybe tomorrow I'll try ApoA1...

Thanks,
Jan

[jan_at_BPU9 megatest]$ mpirun -v -machinefile ~/machines -np 12 ./pgm
running /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/tests/charm++/megatest/./pgm on 12 LINUX ch_p4 processors
Created /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/tests/charm++/megatest/PI31426
Megatest is running on 12 processors.
test 0: initiated [bitvector (jbooth)]
test 0: completed (0.00 sec)
test 1: initiated [immediatering (gengbin)]
test 1: completed (4.14 sec)
test 2: initiated [callback (olawlor)]
test 2: completed (0.07 sec)
test 3: initiated [reduction (olawlor)]
test 3: completed (0.04 sec)
test 4: initiated [inherit (olawlor)]
test 4: completed (0.04 sec)
test 5: initiated [templates (milind)]
test 5: completed (0.00 sec)
test 6: initiated [statistics (olawlor)]
test 6: completed (0.00 sec)
test 7: initiated [rotest (milind)]
test 7: completed (0.00 sec)
test 8: initiated [priotest (mlind)]
test 8: completed (0.00 sec)
test 9: initiated [priomsg (fang)]
test 9: completed (0.00 sec)
test 10: initiated [marshall (olawlor)]
test 10: completed (0.50 sec)
test 11: initiated [migration (jackie)]
test 11: completed (0.51 sec)
test 12: initiated [queens (jackie)]
test 12: completed (0.05 sec)
test 13: initiated [packtest (fang)]
test 13: completed (0.00 sec)
test 14: initiated [tempotest (fang)]
test 14: completed (0.03 sec)
test 15: initiated [arrayring (fang)]
test 15: completed (0.21 sec)
test 16: initiated [fib (jackie)]
test 16: completed (0.04 sec)
test 17: initiated [synctest (mjlang)]
test 17: completed (0.30 sec)
test 18: initiated [nodecast (milind)]
test 18: completed (0.00 sec)
test 19: initiated [groupcast (mjlang)]
test 19: completed (0.00 sec)
test 20: initiated [varraystest (milind)]
test 20: completed (0.00 sec)
test 21: initiated [varsizetest (mjlang)]
test 21: completed (0.00 sec)
test 22: initiated [nodering (milind)]
test 22: completed (0.71 sec)
test 23: initiated [groupring (milind)]
test 23: completed (0.61 sec)
test 24: initiated [multi immediatering (gengbin)]
test 24: completed (5.48 sec)
test 25: initiated [multi callback (olawlor)]
test 25: completed (0.11 sec)
test 26: initiated [multi reduction (olawlor)]
test 26: completed (0.30 sec)
test 27: initiated [multi statistics (olawlor)]
test 27: completed (0.01 sec)
test 28: initiated [multi priotest (mlind)]
test 28: completed (0.03 sec)
test 29: initiated [multi priomsg (fang)]
test 29: completed (0.00 sec)
test 30: initiated [multi marshall (olawlor)]
test 30: completed (7.97 sec)
test 31: initiated [multi migration (jackie)]
test 31: completed (0.27 sec)
test 32: initiated [multi packtest (fang)]
test 32: completed (0.04 sec)
test 33: initiated [multi tempotest (fang)]
test 33: completed (0.08 sec)
test 34: initiated [multi arrayring (fang)]
test 34: completed (0.61 sec)
test 35: initiated [multi fib (jackie)]
test 35: completed (0.05 sec)
test 36: initiated [multi synctest (mjlang)]
test 36: completed (0.99 sec)
test 37: initiated [multi nodecast (milind)]
test 37: completed (0.00 sec)
test 38: initiated [multi groupcast (mjlang)]
test 38: completed (0.05 sec)
test 39: initiated [multi varraystest (milind)]
test 39: completed (0.21 sec)
test 40: initiated [multi varsizetest (mjlang)]
test 40: completed (0.25 sec)
test 41: initiated [multi nodering (milind)]
test 41: completed (1.31 sec)
test 42: initiated [multi groupring (milind)]
test 42: completed (1.04 sec)
test 43: initiated [all-at-once]
test 43: completed (3.48 sec)
All tests completed, exiting
End of program

Sameer Kumar wrote:
> Hi Jan,
>
> If you try the Charm++ test suite megatest it should be able to
> track down data corruption on the network.
>
> cd charm/tests/charm++/megatest/
> # and then run megatest on the same number of PEs as NAMD
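> A minimal sequence, assuming the same mpi-linux-gcc build (the 'make pgm'
> target is an assumption; the binary it produces is ./pgm), would be:
>
> cd charm/tests/charm++/megatest/
> make pgm
> mpirun -machinefile ~/machines -np 12 ./pgm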
>
> Also what system (molecule) are you running? Have you tried running the
> standard ApoA1 benchmark? We could eliminate a few more possibilities
> here.
>
> sameer.
>
>
>
> On Thu, 22 Jun 2006, Jan Saam wrote:
>
>
>> Hi Gengbin,
>>
>> thanks for your comments; I used MPICH. As mentioned in one of the other
>> mails, the problem is not specific to the MPI version and also occurs in
>> the prebuilt versions.
>>
>> I think we need to find out whether some data corruption is occurring.
>> My assumption is that the master simply doesn't get what it expects.
>> Since it's not segfaulting, the data format might be correct, but maybe
>> the results from different nodes get mixed up?
>>
>> Is there some kind of test for that?
>>
>> Jan
>>
>>
>> Gengbin Zheng wrote:
>>
>>> Jan Saam wrote:
>>>
>>>
>>>> Hi Gengbin and other listeners,
>>>>
>>>> the weird story goes on. From the documentation and older threads in
>>>> namd-l I conclude that the MPI and the net versions of NAMD are
>>>> equivalent from the user's point of view, with possibly some minor
>>>> performance differences depending on the architecture.
>>>> I tried both versions and found that both suffer from the same
>>>> performance problems mentioned earlier; on top of that, the MPI version
>>>> is about 3 times slower than the net version.
>>>>
>>>>
>>>>
>>> The MPI version may be slower than the net version, but I don't think 3
>>> times slower is typical. There may be a problem with your MPI installation.
>>>
>>>
>>>> One thing I noticed about the mpi version of NAMD is that it spawns two
>>>> processes on each node (while only one appears to consume significant
>>>> resources according to "top"):
>>>>
>>>> jan 22216 22214 0 21:02 ? 00:00:00 /usr/sbin/sshd
>>>> jan 22217 22216 81 21:02 ? 00:08:53 /home/jan/bin/namd2 BPU1 32887 -p4amslave -p4yourname BPU5 -p4rmrank 3
>>>> jan 22246 22217 0 21:02 ? 00:00:00 /home/jan/bin/namd2 BPU1 32887 -p4amslave -p4yourname BPU5 -p4rmrank 3
>>>>
>>>> Should it be like that?
>>>>
>>>>
>>>>
>>>>
>>> Some MPI implementations such as LAM do spawn pthreads; that is probably
>>> the reason you saw several instances in top.
>>> If it is LAM, try MPICH. We have had bad luck with LAM in the past.
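>>> If you are not sure which MPI is installed, a quick check (assuming the
>>> standard helper tools are on your path) would be:
>>>
>>> mpichversion   # MPICH
>>> laminfo        # LAM/MPI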
>>>
>>> Gengbin
>>>
>>>
>>>> I started namd on 12 processors:
>>>> mpirun -v -np 12 -leave_pg -machinefile ~/machines ~/bin/namd2 eq2.namd > eq2.out &
>>>>
>>>> (My machinefile contains 12 valid machines)
>>>>
>>>> Jan
>>>>
>>>>
>>>> Gengbin Zheng wrote:
>>>>
>>>>
>>>>
>>>>> Hi Jan,
>>>>>
>>>>> Clock skew may cause misleading time output, but I doubt it is the
>>>>> case here (queens program) because the time was printed from the same
>>>>> processor (0).
>>>>> When you run the program, did it really take 7 minutes wallclock time?
>>>>> Also, have you tried the pingpong test from charm/tests/charm++/pingpong
>>>>> to measure network latency?
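>>>>> A minimal pingpong run, again assuming an MPI build of charm++ (the
>>>>> 'make pgm' target is an assumption; the binary is ./pgm), might be:
>>>>>
>>>>> cd charm/tests/charm++/pingpong
>>>>> make pgm
>>>>> mpirun -machinefile ~/machines -np 2 ./pgm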
>>>>>
>>>>> Gengbin
>>>>>
>>>>> Jan Saam wrote:
>>>>>
>>>>>
>>>>>
>>>>>> I forgot to say that I already checked that the problem is not ssh
>>>>>> taking forever to make a connection.
>>>>>> That much is at least shown by this simple test:
>>>>>> time ssh BPU5 pwd
>>>>>> /home/jan
>>>>>>
>>>>>> real 0m0.236s
>>>>>> user 0m0.050s
>>>>>> sys 0m0.000s
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>>
>>>>>> Jan Saam wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm experiencing some weird performance problems with NAMD or the
>>>>>>> charm++ library on a linux cluster:
>>>>>>> When I'm using NAMD or a simple charm++ demo program on one node,
>>>>>>> everything is fine, but when I use more than one node, each step takes
>>>>>>> _very_ much longer!
>>>>>>>
>>>>>>> Example:
>>>>>>> 2s for the program queens on 1 node, 445s on 2 nodes!!!
>>>>>>>
>>>>>>> running /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm on 1 LINUX ch_p4 processors
>>>>>>> Created /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28357
>>>>>>> There are 14200 Solutions to 12 queens. Finish time=1.947209
>>>>>>> End of program
>>>>>>> [jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
>>>>>>> running /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm on 2 LINUX ch_p4 processors
>>>>>>> Created /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28413
>>>>>>> There are 14200 Solutions to 12 queens. Finish time=445.547998
>>>>>>> End of program
>>>>>>>
>>>>>>> The same is true when I build the net-linux version instead of
>>>>>>> mpi-linux, so the problem is probably independent of MPI.
>>>>>>>
>>>>>>> One thing I noticed is that there is a clock skew of several minutes
>>>>>>> between the nodes. Could that be part of my problem (unfortunately I
>>>>>>> don't have the rights to simply synchronize the clocks)?
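>>>>>>> A quick, coarse way to compare the clocks, assuming password-less ssh
>>>>>>> to the hosts listed in ~/machines, would be something like:
>>>>>>>
>>>>>>> for h in $(cat ~/machines); do ssh $h 'hostname; date'; done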
>>>>>>>
>>>>>>> Does anyone have an idea what the problem could be?
>>>>>>>
>>>>>>> Many thanks,
>>>>>>> Jan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>> --
>> ---------------------------
>> Jan Saam
>> Institute of Biochemistry
>> Charite Berlin
>> Monbijoustr. 2
>> 10117 Berlin
>> Germany
>>
>> +49 30 450-528-446
>> saam_at_charite.de
>>
>>
>
>

-- 
---------------------------
Jan Saam
Institute of Biochemistry
Charite Berlin
Monbijoustr. 2
10117 Berlin
Germany
+49 30 450-528-446
saam_at_charite.de
