Re: NAMD results depend on # of processors!!??

From: Jan Saam (saam_at_charite.de)
Date: Wed Jun 21 2006 - 16:25:21 CDT

Thanks for your note, Nitin.

The problems cannot be due to different initial velocities, because I
used the exact same input files for all simulations. I double-checked
this. :-(
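
For the record, in a restart setup the velocities come straight from the
binary restart file, so NAMD never generates new ones. A minimal sketch of
the relevant config lines (file names are placeholders, not my actual setup):

   structure       mysystem.psf
   coordinates     mysystem.pdb
   bincoordinates  run1.restart.coor  ;# positions from the previous run
   binvelocities   run1.restart.vel   ;# velocities from the previous run
   # note: no 'temperature' line -- velocities are read, not regenerated

With binvelocities set, neither 'temperature' nor the random seed has any
influence on the initial conditions.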

Jan

Nitin Bhardwaj wrote:
> Jan,
> I suspect this may be due to a different set of initial
> velocities for the atoms.
>
> Rgds,
> Nitin
> On 21/06/06, Jan Saam <saam_at_charite.de> wrote:
>> Ouch! This is really bad:
>>
>> I just found out that NAMD yields different results depending on the
>> number of processors used in the calculation!
>>
>> I ran the exact same simulation once in standalone mode, by just
>> invoking namd2, and once on 12 processors through charmrun. (I used the
>> precompiled version NAMD_2.6b1_Linux-i686, but it's the same story with
>> NAMD_2.5_Linux-i686.)
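>>
>> Concretely, the two invocations were of the form (config file and
>> nodelist names are placeholders):
>>
>>    ./namd2 myrun.conf > serial.log
>>    ./charmrun ./namd2 +p12 ++nodelist ~/nodelist myrun.conf > parallel.log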
>>
>> In the first case the simulation runs fine for as long as I wish. But
>> on multiple nodes the simulation dies after a couple of hundred steps,
>> either with:
>> FATAL ERROR: Periodic cell has become too small for original patch grid!
>> or with:
>> Atoms moving too fast
>>
>> I want to stress that this is not due to a bad setup, since
>> 1) this is the restart of a simulation that ran stably for 1 ns on a
>> different machine, and
>> 2) on a completely different machine (an IBM p690) the exact same
>> simulation runs fine (as does the standalone version).
>>
>> I tried this several times independently with net and mpi versions,
>> with precompiled and self-compiled NAMD, and with NAMD 2.5 and 2.6b1.
>> Whenever the network comes into play, the results are corrupted.
>>
>> How on earth, in heaven or in hell can that be???
>>
>> Does anyone have an explanation?
>>
>> Thanks for your suggestions,
>>
>> Jan
>>
>>
>>
>> Gengbin Zheng wrote:
>> >
>> > Hi Jan,
>> >
>> > Clock skew may cause misleading time output, but I doubt that is the
>> > case here (the queens program), because the time was printed from the
>> > same processor (0).
>> > When you ran the program, did it really take about 7 minutes of
>> > wallclock time?
>> > Also, have you tried the pingpong test from charm/tests/charm++/pingpong
>> > to measure network latency?
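>> >
>> > For a net build that would be something like (adjust the paths and the
>> > nodelist to your setup):
>> >
>> >    cd charm-5.9/tests/charm++/pingpong
>> >    make
>> >    ./charmrun ./pgm +p2 ++nodelist ~/nodelist
>> >
>> > and for an mpi build simply mpirun -np 2 ./pgm.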
>> >
>> > Gengbin
>> >
>> > Jan Saam wrote:
>> >
>> >> I forgot to say that I have already checked that the problem is not
>> >> ssh taking forever to establish a connection.
>> >> At least this simple test rules that out:
>> >> time ssh BPU5 pwd
>> >> /home/jan
>> >>
>> >> real 0m0.236s
>> >> user 0m0.050s
>> >> sys 0m0.000s
>> >>
>> >> Jan
>> >>
>> >>
>> >> Jan Saam wrote:
>> >>
>> >>
>> >>> Hi all,
>> >>>
>> >>> I'm experiencing some weird performance problems with NAMD or the
>> >>> charm++ library on a Linux cluster:
>> >>> When I use NAMD or a simple charm++ demo program on one node,
>> >>> everything is fine, but when I use more than one node, each step
>> >>> takes _very_ much longer!
>> >>>
>> >>> Example:
>> >>> 2s for the program queens on 1 node, 445s on 2 nodes!!!
>> >>>
>> >>> running /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>> >>> on 1 LINUX ch_p4 processors
>> >>> Created /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28357
>> >>> There are 14200 Solutions to 12 queens. Finish time=1.947209
>> >>> End of program
>> >>> [jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
>> >>> running /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm
>> >>> on 2 LINUX ch_p4 processors
>> >>> Created /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28413
>> >>> There are 14200 Solutions to 12 queens. Finish time=445.547998
>> >>> End of program
>> >>>
>> >>> The same happens when I build the net-linux version instead of
>> >>> mpi-linux, so the problem is probably independent of MPI.
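>> >>>
>> >>> For the net-linux build the equivalent run is started through
>> >>> charmrun instead of mpirun, roughly (nodelist path is a placeholder):
>> >>>
>> >>>    ./charmrun +p2 ++nodelist ~/nodelist ./pgm 12 6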
>> >>>
>> >>> One thing I noticed is that there is a clock skew of several minutes
>> >>> between the nodes. Could that be part of my problem? (Unfortunately I
>> >>> don't have the rights to simply synchronize the clocks.)
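>> >>>
>> >>> The skew can at least be measured without root rights, e.g. by
>> >>> comparing seconds since the epoch on each node against this one:
>> >>>
>> >>>    for h in BPU1 BPU5; do echo -n "$h: "; ssh $h date +%s; done
>> >>>    date +%s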
>> >>>
>> >>> Does anyone have an idea what the problem could be?
>> >>>
>> >>> Many thanks,
>> >>> Jan
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >
>>
>>
>
>

-- 
---------------------------
Jan Saam
Institute of Biochemistry
Charite Berlin
Monbijoustr. 2
10117 Berlin
Germany
+49 30 450-528-446
saam_at_charite.de
