Re: Re: numerical inaccuracy upon restart

From: Harald Tepper (h.l.tepper_at_amolf.nl)
Date: Tue Aug 31 2004 - 05:21:21 CDT

Dear Erik, Blake, and others,

Since I originally posted this issue, I'd like to add some more details
on my observations. My apologies for not replying earlier; I just came
back from a long trip and have only now found time to look at this again.

I still feel this is not just a random-number issue.
In principle, any numerical integrator should be able to dump all of its
state into binary restart files and then carry on exactly where it left
off. In my understanding, binary files store all numbers to machine
precision, so the accuracy of the restart should match that of the
original run down to the very last digit.
This could even hold for integrators that involve random numbers, because
the intermediate 'seed' in the sequence could be stored in the restart
files as well.
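
To make this concrete, here is a minimal sketch in plain Python (not NAMD
code; the function names are just made up for illustration) of a checkpoint
that stores everything needed for a bit-for-bit exact restart, including
the state of the random number generator:

    import pickle
    import random

    # Illustrative sketch only: dump the full integrator state, including
    # the pseudo-random generator, to a binary file so that a restart can
    # continue bit-for-bit where the previous run left off.
    def write_checkpoint(filename, positions, velocities, rng):
        with open(filename, "wb") as f:
            pickle.dump({"x": positions,                # full double precision
                         "v": velocities,
                         "rng_state": rng.getstate()},  # the 'seed' mid-sequence
                        f)

    def read_checkpoint(filename):
        with open(filename, "rb") as f:
            state = pickle.load(f)
        rng = random.Random()
        rng.setstate(state["rng_state"])  # resume exactly the same sequence
        return state["x"], state["v"], rng

As long as nothing else about the run changes, continuing from such a
checkpoint should reproduce the uninterrupted trajectory exactly.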

I agree that when a user chooses an integrator with a random component,
he is probably not interested in such absolute accuracy in the first
place, so there is no problem if it was decided in NAMD not to preserve
that accuracy.

That is why I abandoned Langevin dynamics in the reported test and
switched to Temperature Coupling. According to the NAMD manual, this
seems to be equivalent to plain Berendsen temperature scaling. The NAMD
manual refers to XPLOR, which is explained here:
http://atb.slac.stanford.edu/xplor/manual/node212.html#SECTION001113300000000000000

The Berendsen thermostat contains no random components and is thus
completely deterministic: the trajectory should depend solely on the
supplied coordinates and velocities, so a binary restart from the coor
and vel files should preserve accuracy.
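
For reference, the textbook Berendsen weak-coupling step amounts to a
purely deterministic velocity rescaling each timestep; the X-PLOR
temperature coupling may differ in detail, so take this only as a sketch
of why no random numbers should be involved:

    import math

    def berendsen_scale(T_inst, T_bath, dt, tau):
        # Standard Berendsen weak-coupling factor: every velocity is
        # multiplied by this value once per step. No random number is
        # drawn anywhere.
        return math.sqrt(1.0 + (dt / tau) * (T_bath / T_inst - 1.0))

Whether one writes it as this rescaling or as the equivalent friction
term, two runs started from identical binary coordinates and velocities
should follow identical trajectories.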

Here are my results for steps 49 and 50 of a 100-timestep run (Berendsen
thermostat, stepsperCycle 1, COMmotion yes, binaryrestart yes):

ENERGY: 49 18676.6336 12527.0398 1730.0350
158.8233 -224772.7641 21294.8143 0.0000
0.0000 47194.0352 -123191.3829 300.1412
-122751.8230 -122757.8443 300.1412 -979.0217
-707.0165 562855.7684 -979.0217 -707.0165

ENERGY: 50 18765.1163 12469.2288 1734.0654
163.2593 -224813.4093 21293.9241 0.0000
0.0000 47198.4790 -123189.3364 300.1694
-122748.8909 -122757.8569 300.1694 -1080.1996
-712.9883 562855.7684 -1080.1996 -712.9883

******
The above lines are exactly identical for a 50-step run with the same
conditions.

******
Now the first two steps of the restart (firstTimestep 50):

ENERGY: 50 18765.1266 12469.2309 1734.0654
163.2592 -224813.4146 21293.9224 0.0000
0.0000 47198.4790 -123189.3310 300.1694
-122748.8854 -122748.8854 300.1694 -1080.2132
-712.9895 562855.7684 -1080.2132 -712.9895

ENERGY: 51 18668.0012 12425.7646 1737.4346
168.1110 -224782.1857 21291.2156 0.0000
0.0000 47293.2773 -123198.3814 300.7723
-122760.8973 -122749.7911 300.7723 -1024.9337
-710.1710 562855.7684 -1024.9337 -710.1710

As far as I understand, at least step number 50 should be exactly
identical. Is it not supposed to be the 'zeroth' step of the restart,
i.e. shouldn't the energies reflect the exact same configuration, before
any integration step has been taken? What puzzles me is that even the
bond energy, angle energy, etc. are already different.
In all runs I allowed for COM motion, just to be sure that no velocities
would be touched upon restart.
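
Just to rule out the file I/O itself, a quick generic check (plain numpy,
not NAMD's actual .coor/.vel format, and the array below is only dummy
data) confirms that double-precision numbers written to and read back from
a binary file survive bit-for-bit, so the zeroth step of a binary restart
really does start from literally the same numbers:

    import os
    import tempfile
    import numpy as np

    # Dummy 'coordinates', just to exercise the binary round trip.
    coords = np.random.default_rng(7).normal(size=(10000, 3))

    path = os.path.join(tempfile.mkdtemp(), "test.coor.bin")
    coords.astype(np.float64).tofile(path)                        # binary write
    reread = np.fromfile(path, dtype=np.float64).reshape(10000, 3)

    print("bitwise identical:", np.array_equal(coords, reread))   # prints True

If the same holds for NAMD's binary files, whatever causes the differences
must arise after the coordinates and velocities have been read back in.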

As a cross-check, I also did an NVE run, i.e. without tcouple; this gave
the following results.
100-step run:
ENERGY: 49 18676.2060 12526.6298 1730.0089
158.8146 -224774.3459 21294.2251 0.0000
0.0000 47190.7551 -123197.7064 300.1203
-122758.1653 -122764.2241 300.1203 -979.7469
-707.6870 562855.7684 -979.7469 -707.6870

ENERGY: 50 18764.6472 12468.7223 1734.0395
163.2508 -224814.9560 21293.3453 0.0000
0.0000 47195.3315 -123195.6194 300.1494
-122755.1941 -122764.1901 300.1494 -1080.8269
-713.6474 562855.7684 -1080.8269 -713.6474

*****
Again, these lines are exactly the same for the 50-timestep run.

*****
The first two steps of the restart:
ENERGY: 50 18764.6533 12468.7215 1734.0396
163.2509 -224814.9590 21293.3450 0.0000
0.0000 47195.3315 -123195.6171 300.1494
-122755.1916 -122755.1916 300.1494 -1080.8295
-713.6468 562855.7684 -1080.8295 -713.6468

ENERGY: 51 18667.0796 12425.1978 1737.4089
168.1029 -224783.6911 21290.6477 0.0000
0.0000 47290.5966 -123204.6577 300.7553
-122767.2084 -122756.0478 300.7553 -1025.4238
-710.8155 562855.7684 -1025.4238 -710.8155

********
Clearly, the differences are not negligible.
Of course, for most applications such a discrepancy will not be a
problem, since the trajectories are chaotic anyway, but it does mean
that an important check for debugging is lost if the source of this is
not understood.
For all other MD programs I have experience with, restarts were always
exact.

It would be nice if someone had another idea about what could be causing
this. Maybe I am just misunderstanding something that the code does
between the two separate runs.

Thanks in advance for any help / suggestions,

Harald

nordgren_at_sas.upenn.edu wrote:

>Dear Blake (and list):
>
>I believe there is a simple and logical answer here. Anytime a NAMD run
>begins, the random number generator is seeded (either by the user-supplied
>seed or by the current system time). From then on, during a single run,
>the (pseudo)random numbers needed by the program will be taken from a
>deterministic list.
>
>Thus, if you split one run into two separate ones, you are effectively
>choosing a different random seed for the start of the second run (compared
>to where you would be halfway through the single run), and any random
>numbers after that will be divergent in the two cases. This is why the
>two cases are identical at first, but differ slightly as soon as the second
>shorter run begins.
>
>This would be the case using *any* of the constant-temperature algorithms
>(except, I believe, velocity rescaling) and not just the Langevin method,
>since they all involve using random numbers. If you really aren't using
>any CT algorithm, then I'm not sure where the random numbers are being used,
>but I strongly suspect that this is still the cause of your observations.
>
>In any case, seeing discrepancies at the 10^-7 level isn't exactly
>catastrophic for most applications.
>
>- Erik
>
>C. Erik Nordgren, Ph.D.
>Department of Chemistry
>University of Pennsylvania
>
>
>
>
>>Hello everyone,
>>
>>This is further to a post by Harald Tepper
>>(http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0972.html) in
>>which he noted that one run of length t does not produce the same results
>>as two consecutive runs of length t/2 each, which suggests that it is
>>best to avoid splitting a run into several smaller runs. The details of
>>my own similar test, in which I found that restarting introduces small
>>differences in the total system energy at the 8th significant digit, are
>>at the end of this message.
>>
>>This interests me because I am using a computational facility at which I
>>cannot run jobs for more than 8 hours.
>>
>>From the discussion on shadow orbits, etc., by Frenkel & Smit
>>(Understanding Molecular Simulation: From Algorithms to Applications,
>>2002, pp. 71-74) and from the discussion of single and double precision
>>in Appendix A of the GROMACS manual (pp. 169-170), I would expect that
>>neither slight imprecision nor slight inaccuracy leads to trajectories
>>that are more inaccurate than would result from high numerical precision
>>and accuracy. Therefore, repeatedly terminating and restarting
>>simulations will not decrease accuracy.
>>
>>Am I correct in assuming that repeatedly terminating and restarting
>>simulations is ok for free dynamics runs and for steered MD runs?
>>
>>Thank you,
>>Blake Charlebois
>>
>>DETAILS:
>>
>>System: a protein solvated in water
>>Atoms in system: 12696
>>Number of processors: 4 (on one machine)
>>Timestep: 1.0 fs
>>Temperature control: none
>>
>>I tried several runs using binary coordinate and velocity files:
>>A) a 200-fs run (initial step 0) starting from an initial set of
>>coordinates and velocities
>>B1) a 100-fs run (initial step 0) starting from the same initial state as A
>>B2) a 100-fs run (initial step 100) starting from the end of B1
>>
>>I used the same random number seed for all runs.
>>
>>I specified an output every timestep:
>>outputEnergies 1
>>outputPressure 1
>>stepspercycle 1
>>
>>I monitored total system energy.
>>
>>The first 100 fs of run A seems to be identical to run B1. However, the
>>second 100 fs of run A is different from run B2. For instance, energies
>>in A and B1 are identical, while energies in A and B2 agree to at least
>>7 significant figures. The discrepancy does not seem to increase over
>>100 fs. I would have expected them to be the same to a few more digits.
>>I suppose errors propagate quickly in such a large system.
>>
>>The backbone RMSD between A and B1 seems to be zero, while the backbone
>>RMSD between A and B2 seems to be (very roughly) 4e-7 Angstroms. It is
>>6e-7 for the entire protein and 1e-6 Angstroms for all atoms including
>>solvent.
>>
>>Loosely related:
>>http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0585.html
>>http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0780.html
>>http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0677.html
>>http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0937.html
>>
>>
>>
>
>
>
>

-- 
--------------------------------------------------------------------
      'Wat baten kaars en bril als de uil niet zien wil?'
      ('What good are a candle and glasses if the owl will not see?')
Dr. Harald Tepper
FOM Institute for Atomic and Molecular Physics [AMOLF]
P.O.Box 41883
1009 DB Amsterdam
The Netherlands
Tel:   +31 (20) 6081389
Fax:   +31 (20) 6684106
mailto:H.L.Tepper_at_amolf.nl
WWW:   http://www.hec.utah.edu/~harald/
--------------------------------------------------------------------
