Re: NAMD results depend on # of processors!!?? (was Re: Is clock skew a problem for charm++)

From: Jan Saam (saam_at_charite.de)
Date: Thu Jun 22 2006 - 06:36:50 CDT

Hi Brian,

the problem seems to be compiler-independent. I used gcc for my own
builds, but the problem is the same with the precompiled versions
downloaded from UIUC. :-(

Thanks anyway,
Jan

Brian Bennion wrote:
>
> Hello,
> I have been playing with Linux clusters and compiling namd/charm for
> them for a couple of weeks.
> After several strange errors running in multinode mode, I traced my
> problems to the compiler. Once I switched to gcc, everything worked on
> multiple nodes. I had almost concluded the hardware was buggy and had
> played with different ports on the switch and such.
>
> I don't remember you stating which compiler you have been using.
>
> Brian
>
> On Wed, 21 Jun 2006, Jan Saam wrote:
>
>> Ouch! This is really bad:
>>
>> I just found out that NAMD yields different results depending on the
>> number of processors used in the calculation!
>>
>> I ran the exact same simulation once in standalone mode, by just
>> invoking namd2, and once on 12 processors through charmrun. (I used the
>> precompiled version NAMD_2.6b1_Linux-i686, but it's the same story with
>> NAMD_2.5_Linux-i686.)
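>>
>> For reference, the two invocations looked roughly like this (the config
>> file name and the nodelist path are placeholders, not my actual files):
>>
>>   ./namd2 sim.conf > serial.log
>>   ./charmrun ./namd2 +p12 ++nodelist ~/nodelist sim.conf > parallel.log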
>>
>> In the first case the simulation runs fine for as long as I wish. But
>> on multiple nodes the simulation dies after a couple of hundred steps,
>> either with
>> FATAL ERROR: Periodic cell has become too small for original patch grid!
>> or with
>> Atoms moving too fast
>>
>> I want to stress the point that this is not due to a bad setup, since
>> 1) this is a restart of a simulation that had been running stably for
>> 1 ns on a different machine, and
>> 2) on a completely different machine (IBM p690) the exact same
>> simulation runs fine (just like the standalone version).
>>
>> I tried this several times independently with net and mpi versions,
>> precompiled and self-compiled NAMD, and with NAMD 2.5 and 2.6b1.
>> Whenever the network comes into play, the results are corrupted.
>>
>> How on earth, in heaven or in hell can that be???
>>
>> Does anyone have an explanation?
>>
>> Thanks for your suggestions,
>>
>> Jan
>>
>>
>>
>> Gengbin Zheng wrote:
>>>
>>> Hi Jan,
>>>
>>> Clock skew may cause misleading time output, but I doubt that is the
>>> case here (the queens program), because the time was printed from the
>>> same processor (0).
>>> When you ran the program, did it really take 7 minutes of wallclock time?
>>> Also, have you tried the pingpong test from charm/tests/charm++/pingpong
>>> to test the network latency?
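>>>
>>> Building and running it would look roughly like this (the path assumes
>>> your mpi-linux-gcc build tree; for a net build, launch with charmrun
>>> instead of mpirun):
>>>
>>>   cd ~/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/tests/charm++/pingpong
>>>   make
>>>   mpirun -np 2 -machinefile ~/machines ./pgm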
>>>
>>> Gengbin
>>>
>>> Jan Saam wrote:
>>>
>>>> I forgot to say that I already checked that the problem is not ssh
>>>> taking forever to make a connection.
>>>> This is at least shown by this simple test:
>>>> time ssh BPU5 pwd
>>>> /home/jan
>>>>
>>>> real 0m0.236s
>>>> user 0m0.050s
>>>> sys 0m0.000s
>>>>
>>>> Jan
>>>>
>>>>
>>>> Jan Saam wrote:
>>>>
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm experiencing some weird performance problems with NAMD or the
>>>>> charm++ library on a Linux cluster:
>>>>> When I use NAMD or a simple charm++ demo program on one node
>>>>> everything is fine, but when I use more than one node each step takes
>>>>> _very_ much longer!
>>>>>
>>>>> Example:
>>>>> 2s for the program queens on 1 node, 445s on 2 nodes!!!
>>>>>
>>>>> running /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>>>> on 1 LINUX ch_p4 processors
>>>>> Created /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28357
>>>>> There are 14200 Solutions to 12 queens. Finish time=1.947209
>>>>> End of program
>>>>> [jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
>>>>> running /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>>>> on 2 LINUX ch_p4 processors
>>>>> Created /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28413
>>>>> There are 14200 Solutions to 12 queens. Finish time=445.547998
>>>>> End of program
>>>>>
>>>>> The same is true when I build the net-linux version instead of
>>>>> mpi-linux, so the problem is probably independent of MPI.
>>>>>
>>>>> One thing I noticed is that there is a clock skew of several minutes
>>>>> between the nodes. Could that be part of my problem (unfortunately I
>>>>> don't have the rights to simply synchronize the clocks)?
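>>>>>
>>>>> A quick way to see the skew without root rights would be something
>>>>> like this (using the node names from my earlier tests):
>>>>>
>>>>>   for h in BPU1 BPU5; do echo -n "$h: "; ssh $h date; done; date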
>>>>>
>>>>> Does anyone have an idea what the problem could be?
>>>>>
>>>>> Many thanks,
>>>>> Jan
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>> --
>> ---------------------------
>> Jan Saam
>> Institute of Biochemistry
>> Charite Berlin
>> Monbijoustr. 2
>> 10117 Berlin
>> Germany
>>
>> +49 30 450-528-446
>> saam_at_charite.de
>>
>
> ************************************************
> Brian Bennion, Ph.D.
> Biosciences Directorate
> Lawrence Livermore National Laboratory
> P.O. Box 808, L-448
> 7000 East Avenue, Livermore, CA 94550
> bennion1_at_llnl.gov
> phone: (925) 422-5722
> fax: (925) 424-5513
> ************************************************
>
>

-- 
---------------------------
Jan Saam
Institute of Biochemistry
Charite Berlin
Monbijoustr. 2
10117 Berlin
Germany
+49 30 450-528-446
saam_at_charite.de
