Re: Re: numerical inaccuracy upon restart

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Fri Sep 03 2004 - 17:21:25 CDT

Hi all,

I can explain this now, and it's really stupid. Binary files are written
out in blessed double precision. There's some minor twiddling that
happens with periodic boundary conditions, but that's only the last two
significant digits, so it shouldn't give a noticeable error. In any case,
when that nice double precision coordinate file is read back in, it is
immediately stored in a PDB class. This PDB class, it so happens, stores
coordinates as single-precision floats, dropping about nine significant
digits, which nicely explains the observed deviations. :-(
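
Here is a minimal sketch of the effect; not the actual PDB class code, just
the double-to-float truncation it amounts to:

    #include <cstdio>

    int main() {
        // A coordinate as written to the binary restart file, in double precision.
        double x_written = 23.456789012345678;

        // Reading it back through a class that stores single-precision floats
        // keeps only about 7 significant decimal digits.
        float x_stored = static_cast<float>(x_written);

        printf("written: %.15f\n", x_written);
        printf("stored : %.15f\n", static_cast<double>(x_stored));
        printf("error  : %.1e\n", x_written - static_cast<double>(x_stored));
        return 0;
    }

The error is on the order of 1e-6 for coordinates of this size, which matches
the ~1e-6 Angstrom RMSDs reported further down the thread.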

This shall be confirmed and fixed.

For the record, another factor in non-repeatability is that since NAMD is
execution-driven, in a parallel simulation forces and energies are added
in a possibly different order each time a job is run. The small
differences usually won't show up in the energies on the first step, but
since MD is a chaotic algorithm, they will accumulate over time and be
noticeable after a couple of hundred steps.
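
A toy illustration of the order dependence; the real reductions run over far
more terms, but the principle is the same:

    #include <cstdio>

    int main() {
        // Summing the same three numbers in two different orders need not
        // give bit-identical results in floating-point arithmetic.
        double a = 1.0e16, b = -1.0e16, c = 1.0;
        printf("%.1f\n", (a + b) + c);   // prints 1.0
        printf("%.1f\n", a + (b + c));   // prints 0.0
        return 0;
    }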

-Jim

On Tue, 31 Aug 2004, Harald Tepper wrote:

> Dear Erik, Blake, and others,
>
> Since I originally posted this issue, I'd like to add some more details on my
> observations. My apologies for not replying earlier; I just came back
> from a long trip and only now found time to look at it again.
>
> I still feel this is not just a random number issue.
> In principle, any numerical integrator should be able to dump all its
> information in binary restart files and then carry on exactly where it left
> off. In my understanding, binary files are supposed to contain all numbers up
> to machine precision, so the accuracy of the restart should then match the
> original run down to the very last digit.
> This could even hold for integrators that involve random numbers because the
> intermediate 'seed' in the sequence could be stored in the restart files as
> well.
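>
> (To make the point concrete, a minimal sketch assuming the coordinates are
> simply dumped and read back as raw doubles; this is not NAMD's actual I/O
> code, just the principle:)
>
>     #include <cstdio>
>
>     int main() {
>         // Write three coordinates as raw doubles, read them back, and check
>         // that every bit survives (error handling omitted for brevity).
>         double out[3] = {1.234567890123456, -9.876543210987654, 0.5};
>         double in[3] = {0.0, 0.0, 0.0};
>
>         FILE *f = fopen("coords.bin", "wb");
>         fwrite(out, sizeof(double), 3, f);
>         fclose(f);
>
>         f = fopen("coords.bin", "rb");
>         fread(in, sizeof(double), 3, f);
>         fclose(f);
>
>         for (int i = 0; i < 3; i++)
>             printf("%d ", in[i] == out[i]);   // prints: 1 1 1
>         printf("\n");
>         return 0;
>     }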
>
> I agree that when a user chooses an integrator with a random component, he is
> probably not interested in such absolute accuracy in the first place, so
> there is no problem if it was decided in NAMD not to preserve the accuracy.
>
> That is why I abandoned Langevin dynamics in the reported test and went to
> Temperature Coupling. From the NAMD manual, this seems to be equivalent to
> just Berendsen temperature scaling. The NAMD manual refers to XPLOR, which is
> explained here:
> http://atb.slac.stanford.edu/xplor/manual/node212.html#SECTION001113300000000000000
>
> The Berendsen thermostat does not contain random components and is thus
> completely deterministic. That is, the trajectory should depend solely on the
> supplied coordinates and velocities, so a binary restart from the coor and vel
> files should preserve accuracy.
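>
> (For reference, as I understand it the Berendsen scheme boils down to a
> deterministic velocity scaling factor each step; a minimal sketch, with
> berendsen_lambda just an illustrative name, not anything from NAMD:)
>
>     #include <cmath>
>
>     // Berendsen velocity rescaling factor: a pure function of the current
>     // temperature T, the target T0, the timestep dt and the coupling time tau,
>     // with no random component anywhere.
>     double berendsen_lambda(double T, double T0, double dt, double tau) {
>         return std::sqrt(1.0 + (dt / tau) * (T0 / T - 1.0));
>     }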
>
> Here is my result for steps 49 and 50 of a 100 timestep run, Berendsen
> thermostat, stepsperCycle 1, COMmotion yes, binaryrestart yes:
>
> ENERGY: 49 18676.6336 12527.0398 1730.0350 158.8233
> -224772.7641 21294.8143 0.0000 0.0000 47194.0352
> -123191.3829 300.1412 -122751.8230 -122757.8443 300.1412
> -979.0217 -707.0165 562855.7684 -979.0217 -707.0165
>
> ENERGY: 50 18765.1163 12469.2288 1734.0654 163.2593
> -224813.4093 21293.9241 0.0000 0.0000 47198.4790
> -123189.3364 300.1694 -122748.8909 -122757.8569 300.1694
> -1080.1996 -712.9883 562855.7684 -1080.1996 -712.9883
>
>
> ******
> The above lines are exactly identical for a 50 step run with the same
> conditions.
>
> ******
> Now the first two steps of the restart (firstTimestep 50):
>
> ENERGY: 50 18765.1266 12469.2309 1734.0654 163.2592
> -224813.4146 21293.9224 0.0000 0.0000 47198.4790
> -123189.3310 300.1694 -122748.8854 -122748.8854 300.1694
> -1080.2132 -712.9895 562855.7684 -1080.2132 -712.9895
>
> ENERGY: 51 18668.0012 12425.7646 1737.4346 168.1110
> -224782.1857 21291.2156 0.0000 0.0000 47293.2773
> -123198.3814 300.7723 -122760.8973 -122749.7911 300.7723
> -1024.9337 -710.1710 562855.7684 -1024.9337 -710.1710
>
>
>
> As far as I understand, at least step number 50 should be exactly identical.
> Is it not supposed to be the 'zeroth' step of the restart, i.e. should not
> the energies reflect the exact same configuration, without any integration
> step? What is puzzling me is that even the bond energy, angle energy, etc.
> are already different.
> In all runs I allowed for COM motion, just to be sure that no velocities
> would be touched upon restart.
>
> As a cross-check, I now also did an NVE run, i.e. without tcouple; this gave
> the following results.
> 100 step run:
> ENERGY: 49 18676.2060 12526.6298 1730.0089 158.8146
> -224774.3459 21294.2251 0.0000 0.0000 47190.7551
> -123197.7064 300.1203 -122758.1653 -122764.2241 300.1203
> -979.7469 -707.6870 562855.7684 -979.7469 -707.6870
>
> ENERGY: 50 18764.6472 12468.7223 1734.0395 163.2508
> -224814.9560 21293.3453 0.0000 0.0000 47195.3315
> -123195.6194 300.1494 -122755.1941 -122764.1901 300.1494
> -1080.8269 -713.6474 562855.7684 -1080.8269 -713.6474
>
> *****
> again exactly the same for the 50 timestep run
>
> *****
> first two steps of the restart:
> ENERGY: 50 18764.6533 12468.7215 1734.0396 163.2509
> -224814.9590 21293.3450 0.0000 0.0000 47195.3315
> -123195.6171 300.1494 -122755.1916 -122755.1916 300.1494
> -1080.8295 -713.6468 562855.7684 -1080.8295 -713.6468
>
> ENERGY: 51 18667.0796 12425.1978 1737.4089 168.1029
> -224783.6911 21290.6477 0.0000 0.0000 47290.5966
> -123204.6577 300.7553 -122767.2084 -122756.0478 300.7553
> -1025.4238 -710.8155 562855.7684 -1025.4238 -710.8155
>
>
> ********
> Clearly, the differences are not negligible.
> Of course, for most applications such a discrepancy will not be a problem,
> since the trajectories are chaotic anyway, but it does mean that an important
> check for debugging is lost if the source of this is not understood.
> For all other MD programs I have experience with, restarts were always
> exact.
>
> It would be nice if someone had another idea about what could be causing this.
> Maybe I am just misunderstanding something that the code does in between the
> two separate runs.
>
> Thanks in advance for any help / suggestions,
>
> Harald
>
>
>
>
> nordgren_at_sas.upenn.edu wrote:
>
>> Dear Blake (and list):
>>
>> I believe there is a simple and logical answer here. Anytime a NAMD run
>> begins, the random number generator is seeded (either by the user-supplied
>> seed or by the current system time). From then on, during a single run,
>> the (pseudo)random numbers needed by the program will be taken from a
>> deterministic list.
>>
>> Thus, if you split one run into two separate ones, you are effectively
>> choosing a different random seed for the start of the second run (compared
>> to where you would be halfway through the single run), and any random
>> numbers after that will be divergent in the two cases. This is why the
>> two cases are identical at first, but differ slightly as soon as the second
>> shorter run begins.
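>>
>> (A toy illustration, assuming a Mersenne-Twister-style generator; the point
>> is only that a restarted run replays its sequence from the seed instead of
>> continuing where the longer run left off:)
>>
>>     #include <iostream>
>>     #include <random>
>>
>>     int main() {
>>         std::mt19937 full(12345);      // the single long run
>>         std::mt19937 restart(12345);   // the second short run, reseeded at the restart
>>
>>         // First half of the long run; a separate first short run with the
>>         // same seed would draw exactly these numbers too.
>>         for (int step = 0; step < 2; step++) full();
>>
>>         // From the restart point on, the long run continues its sequence
>>         // while the restarted run replays it from the seed, so they differ.
>>         std::cout << full() << " vs " << restart() << "\n";
>>         return 0;
>>     }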
>>
>> This would be the case using *any* of the constant-temperature algorithms
>> (except, I believe, velocity rescaling) and not just the Langevin method,
>> since they all involve random numbers. If you really aren't using any CT
>> algorithm, then I'm not sure where the random numbers are being used, but I
>> strongly suspect that this is still the cause of your observations.
>>
>> In any case, seeing discrepancies at the 10^-7 level isn't exactly
>> catastrophic for most applications.
>>
>> - Erik
>>
>> C. Erik Nordgren, Ph.D.
>> Department of Chemistry
>> University of Pennsylvania
>>
>>
>>> Hello everyone,
>>>
>>> This is further to a post by Harald Tepper
>>> (http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0972.html) in
>>> which he noted that one run of length t does not produce the same results
>>> as two consecutive runs of length t/2 each, which suggests that it is best
>>> to avoid splitting a run into several smaller runs. The details of my own
>>> similar test, in which I found that restarting introduces small differences
>>> in the total system energy at the 8th significant digit, are at the end of
>>> this message.
>>>
>>> This interests me because I am using a computational facility at which I
>>> cannot run jobs for more than 8 hours.
>>>
>>> From the discussion on shadow orbits, etc., by Frenkel & Smit
>>> (Understanding Molecular Simulation: From Algorithms to Applications,
>>> 2002, pp. 71-74) and from the discussion of single and double precision in
>>> Appendix A of the GROMACS manual (pp. 169-170), I would expect that neither
>>> slight imprecision nor slight inaccuracy leads to trajectories that are any
>>> less accurate than those obtained at high numerical precision and accuracy.
>>> Therefore, repeatedly terminating and restarting simulations should not
>>> decrease accuracy.
>>>
>>> Am I correct in assuming that repeatedly terminating and restarting
>>> simulations is ok for free dynamics runs and for steered MD runs?
>>>
>>> Thank you,
>>> Blake Charlebois
>>>
>>> DETAILS:
>>>
>>> System: a protein solvated in water
>>> Atoms in system: 12696
>>> Number of processors: 4 (on one machine)
>>> Timestep: 1.0 fs
>>> Temperature control: none
>>>
>>> I tried several runs using binary coordinate and velocity files:
>>> A) a 200-fs run (initial step 0) starting from an initial set of
>>> coordinates and velocities
>>> B1) a 100-fs run (initial step 0) starting from the same initial state as A
>>> B2) a 100-fs run (initial step 100) starting from the end of B1
>>>
>>> I used the same random number seed for all runs.
>>>
>>> I specified an output every timestep:
>>> outputEnergies 1
>>> outputPressure 1
>>> stepspercycle 1
>>>
>>> I monitored total system energy.
>>>
>>> The first 100 fs of run A seems to be identical to run B1. However, the
>>> second 100 fs of run A is different from run B2. For instance, energies in
>>> A and B1 are identical, while energies in A and B2 agree to at least 7
>>> significant figures. The discrepancy does not seem to increase over 100 fs.
>>> I would have expected them to be the same to a few more digits. I suppose
>>> errors propagate quickly in such a large system.
>>>
>>> The backbone RMSD between A and B1 seems to be zero, while the backbone
>>> RMSD between A and B2 seems to be (very roughly) 4e-7 Angstroms. It is
>>> 6e-7 for the entire protein and 1e-6 Angstroms for all atoms including
>>> solvent.
>>>
>>> Loosely related:
>>> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0585.html
>>> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0780.html
>>> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0677.html
>>> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0937.html
>>>
>>>
>>
>>
>>
>
>
> --
> --------------------------------------------------------------------
>
> 'Wat baten kaars en bril als de uil niet zien wil?'
> ('What good are a candle and glasses if the owl will not see?')
>
> Dr. Harald Tepper
> FOM Institute for Atomic and Molecular Physics [AMOLF]
> P.O.Box 41883
> 1009 DB Amsterdam
> The Netherlands
> Tel: +31 (20) 6081389
> Fax: +31 (20) 6684106
> mailto:H.L.Tepper_at_amolf.nl
> WWW: http://www.hec.utah.edu/~harald/
>
> --------------------------------------------------------------------
>
>
>
>
