Re: Is there solution to numerical inaccuracy

From: Alok Juneja
Date: Mon Dec 03 2007 - 12:49:50 CST

Dear Peter,

Thank you for making things clear. It would be nice if you
could shed more light on a few things.
Please read my comments interleaved with yours below.

Peter Freddolino wrote:

>Hi Alok,
>it sounds like there are several different issues here. Let's work our
>way down the hierarchy... at the end I'll discuss how this actually
>affects simulation reliability/accuracy.
>If you run constant temperature in parallel, you should not expect
>identical results even using identical runs with the same seed. This was
>mentioned earlier in this discussion (see my first reply) and in the
>manual ( and
>occurs because of nondeterminism in the order of different processes
>communicating with the head node.
>If you run constant temperature on one processor, and you run the SAME
>simulation twice, you should expect the same results; for example,
>running B1 twice.
>If you run constant temperature on one processor but you restart, you
>should not expect identical results to a non-restarted run because the
>same seed is being applied to a different point. So, for example,
>running A with a seed and B1 with the same seed *should* give the same
>result, but because you don't know the internal state of the
>simulation's random number generator at the end of B1, you should *not*
>expect the second half of A to correspond to B2. Note that this is the
>one place where namd differs from charmm, since charmm's restart files
>carry a seed. It sounds like this is what you're running into here.
What do you mean by

"because the same seed is being applied to a different point."

Here is what I make of this comment; let me explain with an
example. Suppose we take 'x1' as the seed for run A (and the same seed
for run B1), and the internal state of the random number generator
halfway through run A, i.e. at the end of run B1, is 'y1'. Since we
do not know the internal state of the random number generator, we are
not aware of 'y1'. The second half of run A proceeds from 'y1', whereas
to start run B2 we use the restart files from the end of B1 but
provide 'x1' as the seed again. This, I believe, is the root cause.
Peter, am I right with this example? Please correct me if I made a mistake.
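The x1/y1 scenario above can be illustrated with a minimal sketch; here plain Python's generator merely stands in for NAMD's internal one, and the seed value is of course hypothetical:

```python
import random

SEED_X1 = 12345  # hypothetical seed 'x1', shared by runs A and B1

# Run A: one continuous run of 10 draws, seeded once with x1.
rng_a = random.Random(SEED_X1)
run_a = [rng_a.random() for _ in range(10)]

# Run B1: the first 5 draws, seeded with the same x1.
rng_b = random.Random(SEED_X1)
run_b1 = [rng_b.random() for _ in range(5)]

# Run B2: a "restart" that re-applies x1 instead of resuming from state 'y1'.
rng_b = random.Random(SEED_X1)
run_b2 = [rng_b.random() for _ in range(5)]

assert run_b1 == run_a[:5]   # first halves agree
assert run_b2 == run_b1      # B2 just replays B1's random numbers...
assert run_b2 != run_a[5:]   # ...so it diverges from A's second half
```

Re-seeding on restart rewinds the generator to its initial state, so B2 cannot reproduce the stream A draws in its second half.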

Is there no way in NAMD to also keep track of the random numbers
generated, so that after a crash we could open some kind of log file,
see what the random number generator's state was at the time of the
crash, and use that state with the restart file? I know all this holds
true only if we are running constant temperature on a single processor;
the random number generator is no longer deterministic when the
simulation is run in parallel.
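The kind of bookkeeping I am asking about would look something like this in plain Python, where getstate/setstate play the role of the RNG-state log; this is only a sketch of the idea, not anything NAMD currently does:

```python
import pickle
import random

rng = random.Random(42)

# First half of the run: draw some numbers, then checkpoint the RNG state.
first_half = [rng.random() for _ in range(5)]
saved_state = pickle.dumps(rng.getstate())  # what a restart file would need to carry

# Continue without interruption (run A's second half).
continued = [rng.random() for _ in range(5)]

# Restart: a fresh generator restored from the checkpoint, not re-seeded.
rng2 = random.Random()
rng2.setstate(pickle.loads(saved_state))
restarted = [rng2.random() for _ in range(5)]

assert restarted == continued  # restoring the state reproduces run A exactly
```

With the generator state carried across the restart, the restarted run continues the identical random stream, which is what CHARMM's seed-carrying restart files achieve.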

>If you run NVE simulations, you should always get the same results
>(limited by floating point imprecision, if you're running in parallel)
>from the same input coordinates and velocities.
Is this imprecision due to the nondeterminism of different processes
talking to the master node, and the non-associativity of
floating-point addition?
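The non-associativity half of that question is easy to demonstrate: summing the same values in a different order, as can happen when partial results arrive from worker processes in a different order, need not give bit-identical totals. A generic sketch, not NAMD-specific:

```python
# Floating-point addition is not associative: the grouping matters.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # 0.0 + 1.0 == 1.0
right = a + (b + c)   # the 1.0 is absorbed into -1e16, so this is 0.0

assert left == 1.0
assert right == 0.0
assert left != right  # same operands, different order, different result
```

In a parallel reduction the effective grouping depends on message arrival order, so even bitwise-identical per-process forces can yield slightly different totals from run to run.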

>Because of the third point above, what you did isn't really a fair test
>of the input precision; the proper test of the input precision would be
>to run A, B1, and B2 all as *NVE* simulations; if they're NVT, then even
>on one processor, I believe you'd expect different effects from the RNG.
>The differences between the second half of A and B2 should then be
>compared to two separate B2 runs.
>Perhaps the most important question is, does this matter. For NVT
>simulations, the nondeterminism of langevin dynamics between
>serial/parallel runs and across restarts should not matter if you do
>sufficient sampling, since either way you're sampling from the same
>ensemble. As long as you do enough sampling to get meaningful results,
>all of your observables should come out identical. The precision of
>restarts themselves *does* matter, since imprecision here actually
>changes the physics of what you're doing (this is particularly important
>in NVE).
>So, by my best understanding, barring any input/output imprecision
>(which will only be apparent to tests in NVE), B1 and B2 should be
>considered as good as A because they're both sampling from the same
>thermodynamic ensemble, and the only differences are in things that are
>supposed to be random (ie, the Langevin random forces); there's nothing
>that makes the particular random force in a given timestep of A more or
>less correct than that in B2. I just spoke with Jim Phillips, who
>confirmed that the old imprecision-on-restart issues were fixed
>immediately after the old discussion thread you linked to, so there
>should be no problems with the restart files themselves.
>Please let me know if any of this is unclear.
>Alok Juneja wrote:
>>Dear Peter,
>>Yes, I am specifying the identical seed value in A (the complete run),
>>B1 (1st half) and B2 (2nd half). A is one complete run, whereas B1 & B2
>>are run serially, meaning that I am using the restart file of
>>B1 for the B2 run.
>>Peter, I am not clear on what you mean by serial or parallel. As
>>I mentioned earlier, my runs B1 and B2 are serial; this simulation I am
>>running on the same single processor. Kindly mention the link where the
>>non-determinism of the Langevin thermostat in parallel has been discussed.
>>So, coming back to square one, after reading all the comments in this
>>discussion, I believe there exists NO solution to this problem, which is
>>occurring either because of numerical inaccuracy or non-determinism.
>>Could the B1 and B2 MD runs be considered as good as the single A MD run?
>>Peter Freddolino wrote:
>>>Hi Alok,
>>>just to verify, since you're running NVT, did you specify a seed value
>>>in your config file for the A-B1-B2 simulations? And were your
>>>production runs serial or parallel? If your production runs are done in
>>>parallel then the differences you observe in the first part of your
>>>email are really unremarkable, and have nothing to do with precision and
>>>everything to do with the nondeterminism of the langevin thermostat in
>>>parallel that has been mentioned earlier.
>>>Alok Juneja wrote:
>>>>Dear Peter, Dave, Himanshu & other list member,
>>>>Sorry for not answering earlier, though I was regularly following the
>>>>discussion on this issue. As requested by Peter, I am providing my
>>>>findings about this issue.
>>>>I am running constant-temperature 50 ns dynamics, a total of 25000000
>>>>steps with a time step of 0.002 ps, a dcdfreq of 100, and a restartfreq
>>>>of 100000. Somehow my MD crashed at step 5459300, but my last restart
>>>>was at step 5400000, so I restarted from there. I am doing this MD to
>>>>see the protein behaviour and am calculating the N- and C-terminal
>>>>distance (Ang.). Following is the N-C terminal distance before and
>>>>after the crash. I am running this simulation in parallel.
>>>># TIME(PS) Before-Crash After-Crash
>>>>10800 10.833
>>>>10800.2 11.3259 11.0924
>>>>10800.4 11.2417 11.1039
>>>>10800.6 10.985 10.9962
>>>>10800.8 10.7715 11.1593
>>>>10801 11.3783 11.4828
>>>>10801.2 11.1862 10.9861
>>>>10801.4 11.3925 10.9671
>>>>10801.6 10.8473 10.9287
>>>>(*) 10801.8 10.5789 11.013
>>>>10802 10.8792 10.4324
>>>>10802.2 10.6182 10.4422
>>>>10802.4 10.8918 10.6541
>>>>10802.6 10.9267 10.7829
>>>>10802.8 10.6352 10.8386
>>>>10803 10.8069 10.4295
>>>>(*) 10803.2 11.3242 10.5952
>>>>(*) 10803.4 11.3397 10.4784
>>>>(*) 10803.6 11.5822 10.4696
>>>>(*) 10803.8 11.023 10.8231
>>>>10804 10.9887 10.4586
>>>>10804.2 10.5118 10.3266
>>>>(*) 10804.4 10.4329 9.95989
>>>>10804.6 10.6863 10.2366
>>>>(*) 10804.8 11.3551 10.2149
>>>>(*) 10805 11.3445 9.88589
>>>>10805.2 10.7702 10.1757
>>>>10805.4 10.4436 10.3636
>>>>10805.6 10.3206 10.2086
>>>>10805.8 10.8214 10.5937
>>>>10806 11.2742 10.3849
>>>>10806.2 11.44 10.2721
>>>>(*) 10806.4 11.2566 10.1909
>>>>10806.6 10.9381 10.7606
>>>>10806.8 11.5617 10.8286
>>>>10807 11.7283 11.246
>>>>10807.2 11.4038 11.2901
>>>>10807.4 10.5862 10.708
>>>>10807.6 10.61 10.6308
>>>>10807.8 11.1818 10.2391
>>>>10808 11.3433 10.5278
>>>>10808.2 11.1947 11.0142
>>>>10808.4 10.9988 11.2578
>>>>(*) 10808.6 10.447 11.334
>>>>10808.8 10.3205 10.9368
>>>>10809 10.7634 10.9165
>>>>10809.2 10.7874 11.1041
>>>>10809.4 11.011 11.15
>>>>10809.6 10.8222 10.9214
>>>>10809.8 10.8731 10.2806
>>>>10810 11.0003 10.908
>>>>You will find many time steps where the difference is remarkable
>>>>(indicated by *). I believe these differences are too large.
>>>>I checked and found that this is not the case with CHARMM, where
>>>>you get identical results even after a restart.
>>>>For your ready reference, I am attaching the total energy graph for
>>>>comparison (comparision.pdf
>>>>As requested by Dave, I am attaching the file A-B1-B2.pdf
>>>>[]; the job was run on the
>>>>same single processor.
>>>>Test A energy profile on
>>>>TestB1 energy profile on
>>>>TestB2 energy profile on
>>>>Since I am restricted in the number of characters that one can
>>>>write in the NAMD forum and in the size of attachments, I am providing
>>>>extra links for you to see the files and results. Hope you understand.
>>>>I appreciate your efforts to get to the depth of this. But I believe the
>>>>NAMD developers should really think over this issue; meanwhile, any
>>>>solution or suggestions in this regard would be of great help for
>>>>others as well.
>>>>Best Wishes,

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:45:39 CST