From: Peter Freddolino (petefred_at_ks.uiuc.edu)
Date: Thu Nov 29 2007 - 13:25:35 CST
It sounds like there are several different issues here. Let's work our
way down the hierarchy; at the end I'll discuss how this actually
affects simulation reliability and accuracy.
If you run constant temperature in parallel, you should not expect
identical results even from identical runs with the same seed. This was
mentioned earlier in this discussion (see my first reply) and in the
manual (http://www.ks.uiuc.edu/Research/namd/2.6/ug/node26.html), and
occurs because of nondeterminism in the order in which different
processes communicate with the head node.
If you run constant temperature on one processor, and you run the SAME
simulation twice, you should expect the same results; for example,
running B1 twice.
If you run constant temperature on one processor but you restart, you
should not expect identical results to a non-restarted run because the
same seed is being applied at a different point in the trajectory. So,
for example, running A with a seed and B1 with the same seed *should*
give the same result (over the steps they share), but because you don't
know the internal state of the simulation's random number generator at
the end of B1, you should *not* expect the second half of A to
correspond to B2. Note that this is the one place where NAMD differs
from CHARMM, since CHARMM's restart files carry a seed. It sounds like
this is what you're running into here.
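To make the restart point concrete, here is a toy sketch in plain Python. It is not NAMD code and says nothing about how NAMD's generator is implemented; the seed value, the run length, and the helper `draw` are all made up for illustration. It only shows the general behaviour: re-applying a seed at a restart replays the random stream from its beginning, while restoring the saved generator state continues it.

```python
import random

SEED = 12345        # hypothetical seed, standing in for the config-file seed
N_STEPS = 10        # toy "trajectory" length (assumption)
HALF = N_STEPS // 2

def draw(rng, n):
    """Draw n Gaussian 'random kicks' from the generator rng."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# Run A: one continuous run.
rng_a = random.Random(SEED)
run_a = draw(rng_a, N_STEPS)

# Run B1: first half with the same seed, so it matches the first half of A.
rng_b1 = random.Random(SEED)
run_b1 = draw(rng_b1, HALF)
saved_state = rng_b1.getstate()  # what a restart would need to carry

# Run B2, restarted by re-applying the seed: the generator starts over.
rng_b2_reseeded = random.Random(SEED)
run_b2_reseeded = draw(rng_b2_reseeded, HALF)

# Run B2, restarted from the saved generator state: the stream continues.
rng_b2_restored = random.Random()
rng_b2_restored.setstate(saved_state)
run_b2_restored = draw(rng_b2_restored, HALF)

print(run_b1 == run_a[:HALF])            # True
print(run_b2_reseeded == run_a[HALF:])   # False
print(run_b2_restored == run_a[HALF:])   # True
```

The middle case is the A versus B2 situation described above: same seed, but applied at a different point, so the random numbers no longer line up.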
If you run NVE simulations, you should always get the same results
(limited by floating point imprecision, if you're running in parallel)
from the same input coordinates and velocities.
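A small illustration of the floating point caveat, in plain Python. I am assuming here that the imprecision in question comes from partial sums being combined in a different order from run to run, which is a common mechanism in parallel codes; the three values are arbitrary and are not taken from any simulation.

```python
# Three partial contributions, as if computed on three different processes.
# The values are arbitrary and chosen only for illustration.
contributions = [0.1, 0.2, 0.3]

# The same three numbers combined in two different arrival orders.
order_1 = (contributions[0] + contributions[1]) + contributions[2]
order_2 = contributions[0] + (contributions[1] + contributions[2])

print(order_1 == order_2)        # False
print(abs(order_1 - order_2))    # tiny, on the order of machine epsilon
```

Floating point addition is not associative, so a change in summation order alone can change the last bits of a result, and in a chaotic system such differences grow over time.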
Because of the third point above, what you did isn't really a fair test
of the input precision; the proper test of the input precision would be
to run A, B1, and B2 all as *NVE* simulations; if they're NVT, then even
on one processor, I believe you'd expect different effects from the RNG.
The differences between the second half of A and B2 should then be
compared to two separate B2 runs.
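The comparison could be scripted roughly as below, assuming the total energies have already been extracted into equal-length lists. The function name `rms_difference` and all of the numbers are placeholders invented for this sketch; they are not output from any run.

```python
import math

def rms_difference(series_a, series_b):
    """Root-mean-square difference between two equal-length series."""
    if len(series_a) != len(series_b):
        raise ValueError("series must have the same length")
    squared = sum((a - b) ** 2 for a, b in zip(series_a, series_b))
    return math.sqrt(squared / len(series_a))

# Placeholder energies (made-up values, NOT simulation output).
second_half_of_a = [-100.0, -100.5, -101.0, -100.2]
b2_first_run = [-100.0, -100.4, -101.3, -100.6]
b2_second_run = list(b2_first_run)  # a repeated B2 run, here assumed identical

# Baseline: how much two nominally identical B2 runs differ from each other.
baseline = rms_difference(b2_first_run, b2_second_run)

# Restart effect: how much B2 differs from the second half of A.
restart_effect = rms_difference(second_half_of_a, b2_first_run)

print(baseline, restart_effect)
```

If `restart_effect` is no larger than `baseline`, the restart itself is not introducing any extra error.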
Perhaps the most important question is: does this matter? For NVT
simulations, the nondeterminism of Langevin dynamics between
serial/parallel runs and across restarts should not matter if you do
sufficient sampling, since either way you're sampling from the same
ensemble. As long as you do enough sampling to get meaningful results,
all of your observables should agree to within statistical error. The precision of
restarts themselves *does* matter, since imprecision here actually
changes the physics of what you're doing (this is particularly important
So, by my best understanding, barring any input/output imprecision
(which will only be apparent to tests in NVE), B1 and B2 should be
considered as good as A because they're both sampling from the same
thermodynamic ensemble, and the only differences are in things that are
supposed to be random (i.e., the Langevin random forces); there's nothing
that makes the particular random force in a given timestep of A more or
less correct than that in B2. I just spoke with Jim Phillips, who
confirmed that the old imprecision-on-restart issues were fixed
immediately after the old discussion thread you linked to, so there
should be no problems with the restart files themselves.
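As a toy illustration of the same-ensemble argument, here is a sketch in plain Python. It is not NAMD's Langevin integrator: it is a one-dimensional overdamped Langevin particle in a harmonic well, and the function name, parameters, and seeds are all assumptions made for this example. Two runs with different seeds follow different trajectories, yet the averaged observable <x^2> should come out near kT/k in both.

```python
import random

def mean_square_position(seed, n_steps=200_000, dt=0.01,
                         k=1.0, gamma=1.0, kT=1.0):
    """Overdamped Langevin dynamics in U(x) = k*x^2/2 (Euler-Maruyama).

    Returns the time average of x^2 and the first few positions.
    """
    rng = random.Random(seed)
    noise = (2.0 * kT * dt / gamma) ** 0.5  # random-kick amplitude per step
    x = 0.0
    total = 0.0
    first_steps = []
    for step in range(n_steps):
        # Deterministic drift toward the minimum plus a random kick.
        x += -(k / gamma) * x * dt + noise * rng.gauss(0.0, 1.0)
        total += x * x
        if step < 5:
            first_steps.append(x)
    return total / n_steps, first_steps

avg_1, traj_1 = mean_square_position(seed=1)
avg_2, traj_2 = mean_square_position(seed=2)

print(traj_1 == traj_2)   # False: the individual trajectories differ
print(avg_1, avg_2)       # both expected to be close to kT / k = 1.0
```

Neither trajectory is more correct than the other; only the sampled averages carry physical meaning, and those agree up to sampling noise.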
Please let me know if any of this is unclear.
Alok Juneja wrote:
> Dear Peter,
> Yes, I am specifying the identical seed value in A (complete run), B1
> (1st half) and B2 (2nd half). A is one complete run, whereas the B1 & B2
> simulations are serial, meaning that I am using the restart file of
> B1 for the B2 run.
> Peter, I am not clear on what you mean by serial or parallel. As
> I mentioned earlier, my runs B1 and B2 are serial. I am running this
> simulation on the same single processor. Kindly mention the link where
> the non-determinism of the Langevin thermostat in parallel has been
> talked about.
> So, coming back to square one: after reading all the comments in this
> discussion, I believe there exists NO solution to this problem, which
> occurs because of either numerical inaccuracy or non-determinism.
> Could the B1 and B2 MD runs be considered as good as the single A MD run?
> Peter Freddolino wrote:
>> Hi Alok,
>> just to verify, since you're running NVT, did you specify a seed value
>> in your config file for the A-B1-B2 simulations? And were your
>> production runs serial or parallel? If your production runs are done in
>> parallel then the differences you observe in the first part of your
>> email are really unremarkable, and have nothing to do with precision and
>> everything to do with the nondeterminism of the Langevin thermostat in
>> parallel that has been mentioned earlier.
>> Alok Juneja wrote:
>>> Dear Peter, Dave, Himanshu & other list members,
>>> Sorry for not answering earlier, though I was regularly following the
>>> discussion on this issue. As requested by Peter, I am providing my
>>> findings about this issue.
>>> I am running constant temperature 50 ns dynamics, a total of 25000000
>>> steps with a time step of 0.002 ps, a dcdfreq of 100 and a restartfreq
>>> of 100000. Somehow my MD crashed at step 5459300, but my last restart
>>> was at step 5400000. I restarted from this. I am doing this MD to see
>>> the protein behaviour and am calculating the N- and C-terminal distance
>>> (Ang.). Following is the N-C terminal distance before crash and after
>>> crash. I am running this simulation in parallel.
>>> # TIME(PS) Before-Crash After-Crash
>>> 10800 10.833
>>> 10800.2 11.3259 11.0924
>>> 10800.4 11.2417 11.1039
>>> 10800.6 10.985 10.9962
>>> 10800.8 10.7715 11.1593
>>> 10801 11.3783 11.4828
>>> 10801.2 11.1862 10.9861
>>> 10801.4 11.3925 10.9671
>>> 10801.6 10.8473 10.9287
>>> (*) 10801.8 10.5789 11.013
>>> 10802 10.8792 10.4324
>>> 10802.2 10.6182 10.4422
>>> 10802.4 10.8918 10.6541
>>> 10802.6 10.9267 10.7829
>>> 10802.8 10.6352 10.8386
>>> 10803 10.8069 10.4295
>>> (*) 10803.2 11.3242 10.5952
>>> (*) 10803.4 11.3397 10.4784
>>> (*) 10803.6 11.5822 10.4696
>>> (*) 10803.8 11.023 10.8231
>>> 10804 10.9887 10.4586
>>> 10804.2 10.5118 10.3266
>>> (*) 10804.4 10.4329 9.95989
>>> 10804.6 10.6863 10.2366
>>> (*) 10804.8 11.3551 10.2149
>>> (*) 10805 11.3445 9.88589
>>> 10805.2 10.7702 10.1757
>>> 10805.4 10.4436 10.3636
>>> 10805.6 10.3206 10.2086
>>> 10805.8 10.8214 10.5937
>>> 10806 11.2742 10.3849
>>> 10806.2 11.44 10.2721
>>> (*) 10806.4 11.2566 10.1909
>>> 10806.6 10.9381 10.7606
>>> 10806.8 11.5617 10.8286
>>> 10807 11.7283 11.246
>>> 10807.2 11.4038 11.2901
>>> 10807.4 10.5862 10.708
>>> 10807.6 10.61 10.6308
>>> 10807.8 11.1818 10.2391
>>> 10808 11.3433 10.5278
>>> 10808.2 11.1947 11.0142
>>> 10808.4 10.9988 11.2578
>>> (*) 10808.6 10.447 11.334
>>> 10808.8 10.3205 10.9368
>>> 10809 10.7634 10.9165
>>> 10809.2 10.7874 11.1041
>>> 10809.4 11.011 11.15
>>> 10809.6 10.8222 10.9214
>>> 10809.8 10.8731 10.2806
>>> 10810 11.0003 10.908
>>> You will find many time steps where the difference is remarkable
>>> (indicated by *). I believe that these differences are too large for me.
>>> I checked this and found that this is not the case with CHARMM where
>>> you get the identical results even after restart.
>>> For your ready reference, I am attaching the total energy graph for
>>> comparison (comparision.pdf
>>> As requested by Dave, I am attaching file A-B1-B2.pdf
>>> [http://www.geocities.com/junejaalok/A-B1-B2.pdf], for the job run on
>>> the same single processor.
>>> Test A energy profile on
>>> TestB1 energy profile on
>>> TestB2 energy profile on
>>> Since I am restricted in the number of characters that one can
>>> write in the NAMD forum and in the size of attachments, I am adding
>>> extra links for you to see the files and results. I hope you understand.
>>> I appreciate your efforts to get into the depth of this. But I believe
>>> the NAMD developers should really think over this issue. However, any
>>> solution and suggestions in this regard would be of great help for
>>> others as well.
>>> Best Wishes,