Re: NAMD results depend on # of processors!!?? (was Re: Is clock skew a problem for charm++)

From: Brian Bennion (brian_at_youkai.llnl.gov)
Date: Wed Jun 21 2006 - 17:24:11 CDT

Hello,
I have been building NAMD/Charm++ on Linux clusters for a couple of weeks.
After several strange errors in multinode runs I traced my problems to the
compiler: once I switched to gcc, everything worked on multiple nodes.
Before that I almost thought the hardware was buggy and even tried
different ports on the switch and such.
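A sketch of the rebuild, from memory; the exact build target and architecture names are assumptions and may differ with your charm/NAMD versions:

```shell
# Rebuild charm++ with gcc instead of the vendor compiler
# (net-linux here; use mpi-linux for the MPI layer).
cd charm-5.9
./build charm++ net-linux gcc

# Then point the NAMD config at a g++ architecture,
# e.g. Linux-i686-g++ (the name may differ in your tree).
cd ../NAMD_2.6b1_Source
./config Linux-i686-g++
```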

I don't remember you stating which compiler you have been using.

Brian

On Wed, 21 Jun 2006, Jan Saam wrote:

> Ouch! This is really bad:
>
> I just found out that NAMD yields different results depending on the
> number of processors used in the calculation!
>
> I ran the exact same simulation once using the standalone mode by just
> invoking namd2 and once on 12 processors through charmrun. (I used the
> precompiled version NAMD_2.6b1_Linux-i686 but it's the same story with
> NAMD_2.5_Linux-i686).
>
> In the first case the simulation runs fine for as long as I wish. But
> on multiple nodes the simulation dies after a couple of hundred steps,
> either with:
> FATAL ERROR: Periodic cell has become too small for original patch grid!
> or with
> Atoms moving too fast
>
> I want to stress that this is not due to a bad setup, since
> 1) this is the restart of a simulation that had been running stably for
> 1 ns on a different machine, and
> 2) on a completely different machine (IBM p690) the exact same simulation
> runs fine (just like the standalone version).
>
> I tried this several times independently with net and mpi versions,
> precompiled and self compiled NAMD, and with NAMD 2.5 and 2.6b1.
> Whenever the network comes into play the results are corrupted.
>
> How on earth, in heaven or in hell can that be???
>
> Does anyone have an explanation?
>
> Thanks for your suggestions,
>
> Jan
>
>
>
> Gengbin Zheng wrote:
>>
>> Hi Jan,
>>
>> Clock skew may cause misleading time output, but I doubt it is the
>> case here (queens program) because the time was printed from the same
>> processor (0).
>> When you run the program, did it really take 7 minutes wallclock time?
>> Also, have you tried the pingpong test from charm/tests/charm++/pingpong
>> to measure network latency?
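For reference, the pingpong test is run like the other charm++ examples; this is a sketch, and the exact make target and charmrun invocation may differ between the net and mpi builds:

```shell
# Build and run the latency test on 2 processors
# (for an mpi-linux build, launch ./pgm via mpirun instead).
cd charm-5.9/tests/charm++/pingpong
make
./charmrun ./pgm +p2
```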
>>
>> Gengbin
>>
>> Jan Saam wrote:
>>
>>> I forgot to say that I have already checked that the problem is not ssh
>>> taking forever to establish a connection, as this simple test shows:
>>> time ssh BPU5 pwd
>>> /home/jan
>>>
>>> real 0m0.236s
>>> user 0m0.050s
>>> sys 0m0.000s
>>>
>>> Jan
>>>
>>>
>>> Jan Saam wrote:
>>>
>>>
>>>> Hi all,
>>>>
>>>> I'm experiencing some weird performance problems with NAMD or the
>>>> charm++ library on a Linux cluster:
>>>> When I use NAMD or a simple charm++ demo program on one node,
>>>> everything is fine, but with more than one node each step takes
>>>> _very_ much longer!
>>>>
>>>> Example:
>>>> 2s for the program queens on 1 node, 445s on 2 nodes!!!
>>>>
>>>> running
>>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>>>
>>>> on 1 LINUX ch_p4 processors
>>>> Created
>>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28357
>>>>
>>>> There are 14200 Solutions to 12 queens. Finish time=1.947209
>>>> End of program
>>>> [jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
>>>> running
>>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>>>>
>>>> on 2 LINUX ch_p4 processors
>>>> Created
>>>> /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28413
>>>>
>>>> There are 14200 Solutions to 12 queens. Finish time=445.547998
>>>> End of program
>>>>
>>>> The same is true when I build the net-linux version instead of
>>>> mpi-linux, so the problem is probably independent of MPI.
>>>>
>>>> One thing I noticed is that there is a clock skew of several minutes
>>>> between
>>>> the nodes. Could that be part of my problem (unfortunately I don't have
>>>> the rights to simply synchronize the clocks)?
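To quantify the skew, one quick sketch: collect an epoch timestamp from each node and take the largest pairwise offset. `max_skew` is a hypothetical helper, and the BPU hostnames are just the ones from this thread:

```shell
# Hypothetical helper: largest pairwise offset (in seconds) among
# epoch timestamps collected from the nodes, e.g. via
#   for h in BPU1 BPU5; do ssh "$h" date +%s; done
max_skew() {
  min=$1; max=$1
  for t in "$@"; do
    [ "$t" -lt "$min" ] && min=$t
    [ "$t" -gt "$max" ] && max=$t
  done
  echo $((max - min))
}

max_skew 1150927451 1150927631 1150927449   # prints 182 (about 3 min of skew)
```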
>>>>
>>>> Does anyone have an idea what the problem could be?
>>>>
>>>> Many thanks,
>>>> Jan
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
> --
> ---------------------------
> Jan Saam
> Institute of Biochemistry
> Charite Berlin
> Monbijoustr. 2
> 10117 Berlin
> Germany
>
> +49 30 450-528-446
> saam_at_charite.de
>

************************************************
   Brian Bennion, Ph.D.
   Biosciences Directorate
   Lawrence Livermore National Laboratory
   P.O. Box 808, L-448 bennion1_at_llnl.gov
   7000 East Avenue phone: (925) 422-5722
   Livermore, CA 94550 fax: (925) 424-5513
************************************************

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:42:14 CST