NAMD results depend on # of processors!!?? (was Re: Is clock skew a problem for charm++)

From: Jan Saam (saam_at_charite.de)
Date: Wed Jun 21 2006 - 15:55:13 CDT

Ouch! This is really bad:

I just found out that NAMD yields different results depending on the
number of processors used in the calculation!

I ran the exact same simulation once using the standalone mode by just
invoking namd2 and once on 12 processors through charmrun. (I used the
precompiled version NAMD_2.6b1_Linux-i686 but it's the same story with
NAMD_2.5_Linux-i686).

In the first case the simulation just runs fine as long as I wish. But
on multiple nodes the simulation dies after a couple hundred steps
either with:
FATAL ERROR: Periodic cell has become too small for original patch grid!
or with
Atoms moving too fast

I want to stress the point that this is not due to a bad setup since
1) This is a restart of a simulation that has been running stably for
1 ns on a different machine
2) On a completely different machine (IBM p690) the exact same simulation
runs fine (just like the standalone version).

I tried this several times independently with net and mpi versions,
precompiled and self compiled NAMD, and with NAMD 2.5 and 2.6b1.
Whenever the network comes into play the results are corrupted.

How on earth, in heaven or in hell can that be???

Does anyone have an explanation?
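
One way to pin down where the standalone and the 12-processor runs part ways would be to compare the ENERGY: lines of the two NAMD logs step by step. The following is only a minimal sketch: the function names, the choice of the TOTAL column, and the relative tolerance of 1e-6 are illustrative assumptions, not part of NAMD. A tolerance is used because parallel runs sum forces in a different order, so tiny floating-point differences are to be expected even when nothing is wrong; it is a sudden large jump that would point to corrupted data.

```python
def read_energies(log_text):
    """Parse the ETITLE:/ENERGY: lines of a NAMD log.

    Returns {timestep: {column_name: value}}.
    """
    columns = None
    energies = {}
    for line in log_text.splitlines():
        if line.startswith("ETITLE:"):
            # Column names, e.g. TS BOND ANGLE ... TOTAL ...
            columns = line.split()[1:]
        elif line.startswith("ENERGY:") and columns:
            fields = line.split()[1:]
            if len(fields) != len(columns):
                continue  # skip truncated or malformed lines
            step = int(fields[0])
            energies[step] = dict(zip(columns[1:], map(float, fields[1:])))
    return energies


def first_divergence(run_a, run_b, column="TOTAL", rel_tol=1e-6):
    """Return (step, value_a, value_b) for the first common timestep at
    which `column` differs by more than rel_tol (an assumed threshold),
    or None if the two runs agree on every common step."""
    for step in sorted(set(run_a) & set(run_b)):
        a = run_a[step][column]
        b = run_b[step][column]
        if abs(a - b) > rel_tol * max(abs(a), abs(b), 1.0):
            return step, a, b
    return None


# Hypothetical usage (file names are placeholders):
#   with open("standalone.log") as f:
#       serial = read_energies(f.read())
#   with open("charmrun12.log") as f:
#       parallel = read_energies(f.read())
#   print(first_divergence(serial, parallel))
```

If the two logs already disagree at step 0, the problem would lie in the initial data distribution or the first force evaluation; if they agree for a while and then jump apart, that would fit data being corrupted in transit between nodes.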

Thanks for your suggestions,

Jan

Gengbin Zheng wrote:
>
> Hi Jan,
>
> Clock skew may cause misleading time output, but I doubt it is the
> case here (queens program) because the time was printed from the same
> processor (0).
> When you run the program, does it really take 7 minutes of wallclock time?
> Also, have you tried the pingpong test from charm/tests/charm++/pingpong
> to test network latency?
>
> Gengbin
>
> Jan Saam wrote:
>
>> I forgot to say that I checked already that the problem is not ssh
>> taking forever to make a connection.
>> This is at least proven by this simple test:
>> time ssh BPU5 pwd
>> /home/jan
>>
>> real 0m0.236s
>> user 0m0.050s
>> sys 0m0.000s
>>
>> Jan
>>
>>
>> Jan Saam wrote:
>>
>>
>>> Hi all,
>>>
>>> I'm experiencing some weird performance problems with NAMD or the
>>> charm++ library on a linux cluster:
>>> When I'm using NAMD or a simple charm++ demo program on one node
>>> everything is fine, but when I use more than one node each step takes
>>> _very_ much longer!
>>>
>>> Example:
>>> 2s for the program queens on 1 node, 445s on 2 nodes!!!
>>>
>>> running /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm on 1 LINUX ch_p4 processors
>>> Created /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28357
>>>
>>> There are 14200 Solutions to 12 queens. Finish time=1.947209
>>> End of program
>>> [jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
>>> running /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/./pgm on 2 LINUX ch_p4 processors
>>> Created /home/jan/NAMD_2.6b1_Source/charm-5.9/mpi-linux-gcc/examples/charm++/queens/PI28413
>>>
>>> There are 14200 Solutions to 12 queens. Finish time=445.547998
>>> End of program
>>>
>>> The same is true when I'm building the net-linux versions instead of
>>> mpi-linux, thus the problem is probably independent of MPI.
>>>
>>> One thing I noticed is that there is a clock skew of several minutes
>>> between the nodes. Could that be part of my problem (unfortunately I
>>> don't have the rights to simply synchronize the clocks)?
>>>
>>> Does anyone have an idea what the problem could be?
>>>
>>> Many thanks,
>>> Jan
>>>

-- 
---------------------------
Jan Saam
Institute of Biochemistry
Charite Berlin
Monbijoustr. 2
10117 Berlin
Germany
+49 30 450-528-446
saam_at_charite.de
