From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Fri Sep 03 2004 - 17:21:25 CDT
Hi all,
I can explain this now, and it's really stupid.  Binary files are written 
out in blessed double precision.  There's some minor twiddling that 
happens with periodic boundary conditions, but that only affects the last two 
significant digits, so it shouldn't give a noticeable error.  In any case, 
when that nice double precision coordinate file is read back in, it is 
immediately stored in a PDB class.  This PDB class, it so happens, stores 
coordinates as single-precision floats, dropping about nine significant 
digits, which nicely explains the observed deviations.  :-(
This shall be confirmed and fixed.
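
Just to make the magnitude concrete, here is a minimal standalone sketch 
(plain C++, not NAMD's actual PDB class) of what narrowing a double-precision 
coordinate to a single-precision float throws away:

    #include <cstdio>

    int main() {
        // Coordinate as it sits in the binary restart file (double precision).
        double x_double = 23.4567891234567;
        // What a float-based PDB class would actually keep.
        float x_float = (float)x_double;

        std::printf("double: %.15f\n", x_double);
        std::printf("float : %.15f\n", (double)x_float);
        std::printf("lost  : %.1e\n", x_double - (double)x_float);
        return 0;
    }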
For the record, another factor in non-repeatability is that since NAMD is 
execution-driven, in a parallel simulation forces and energies are added 
in a possibly different order each time a job is run.  The small 
differences usually won't show up in the energies on the first step, but 
since MD is a chaotic algorithm, they will accumulate over time and be
noticeable after a couple of hundred steps.
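
As a tiny illustration of the order dependence (plain C++, not NAMD code): 
floating-point addition is not associative, so reducing the same force terms 
in a different order can change the last bits of the sum.

    #include <cstdio>

    int main() {
        double a = 1.0e16, b = -1.0e16, c = 1.0;

        double left_to_right = (a + b) + c;  // evaluates to 1.0
        double right_to_left = a + (b + c);  // evaluates to 0.0: the 1.0 is absorbed into b

        std::printf("(a + b) + c = %.1f\n", left_to_right);
        std::printf("a + (b + c) = %.1f\n", right_to_left);
        return 0;
    }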
-Jim
On Tue, 31 Aug 2004, Harald Tepper wrote:
> Dear Erik, Blake, and others,
>
> Since I originally posted this issue, I'd like to add some more details on my 
> observations. My apologies that I did not reply earlier; I just came back 
> from a long trip and only just now found time to look at it again.
>
> I still feel this is not just a random number issue.
> In principle, any numerical integrator should be able to dump all its 
> information in binary restart files and then carry on exactly where it left 
> off. In my understanding, binary files are supposed to contain all numbers to 
> machine precision, so a restarted run should match the original one down to 
> the very last digit.
> This could even hold for integrators that involve random numbers because the 
> intermediate 'seed' in the sequence could be stored in the restart files as 
> well.
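>
> For what it is worth, a raw binary dump of doubles really is lossless. A 
> minimal sketch of the round trip (plain C++, not NAMD's actual restart code; 
> the file name is just an example):
>
>     #include <cstddef>
>     #include <cstdio>
>     #include <cstring>
>
>     int main() {
>         double out[3] = {1.0 / 3.0, 2.0 / 7.0, 1.0e16 + 2.0};
>         double in[3];
>
>         // Write the doubles as raw bytes, as a binary restart file would.
>         FILE* f = std::fopen("coords.bin", "wb");
>         std::fwrite(out, sizeof(double), 3, f);
>         std::fclose(f);
>
>         // Read them back and compare the raw bytes.
>         f = std::fopen("coords.bin", "rb");
>         std::size_t nread = std::fread(in, sizeof(double), 3, f);
>         std::fclose(f);
>
>         std::printf("exact round trip: %s\n",
>                     (nread == 3 && std::memcmp(out, in, sizeof(out)) == 0)
>                         ? "yes" : "no");
>         return 0;
>     }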
>
> I agree that when a user chooses an integrator with a random component, he is 
> probably not interested in such absolute accuracy in the first place, so 
> there is no problem if it was decided in NAMD not to preserve the accuracy.
>
> That is why I abandoned Langevin dynamics in the reported test and went to 
> Temperature Coupling. From the NAMD manual, this seems to be equivalent to 
> just Berendsen temperature scaling. The NAMD manual refers to XPLOR, which is 
> explained here:
> http://atb.slac.stanford.edu/xplor/manual/node212.html#SECTION001113300000000000000
>
> The Berendsen thermostat does not contain random components and is thus 
> completely deterministic. That is, the trajectory should depend solely on 
> supplied coordinates and velocities so a binary restart from coor and vel 
> files should preserve accuracy.
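>
> To make the determinism claim concrete: the standard Berendsen weak-coupling 
> factor is just lambda = sqrt(1 + (dt/tau)*(T0/T - 1)), a function of the 
> current kinetic temperature only. A minimal sketch (plain C++, not NAMD's 
> tCouple implementation; the example values are mine):
>
>     #include <cmath>
>     #include <cstdio>
>
>     // Berendsen velocity-rescaling factor: every quantity here is a
>     // deterministic function of the current state, so no randomness enters.
>     double berendsen_lambda(double T, double T0, double dt, double tau) {
>         return std::sqrt(1.0 + (dt / tau) * (T0 / T - 1.0));
>     }
>
>     int main() {
>         double T   = 300.1694;  // instantaneous temperature in K (from the log below)
>         double T0  = 300.0;     // target temperature in K
>         double dt  = 1.0;       // timestep in fs
>         double tau = 100.0;     // coupling time in fs, an arbitrary example
>
>         std::printf("lambda = %.10f\n", berendsen_lambda(T, T0, dt, tau));
>         return 0;
>     }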
>
> Here are my results for steps 49 and 50 of a 100-timestep run (Berendsen 
> thermostat, stepsperCycle 1, COMmotion yes, binaryrestart yes):
>
> ENERGY:      49     18676.6336     12527.0398      1730.0350       158.8233 
> -224772.7641     21294.8143         0.0000         0.0000     47194.0352 
> -123191.3829       300.1412   -122751.8230   -122757.8443       300.1412 
> -979.0217      -707.0165    562855.7684      -979.0217      -707.0165
>
> ENERGY:      50     18765.1163     12469.2288      1734.0654       163.2593 
> -224813.4093     21293.9241         0.0000         0.0000     47198.4790 
> -123189.3364       300.1694   -122748.8909   -122757.8569       300.1694 
> -1080.1996      -712.9883    562855.7684     -1080.1996      -712.9883
>
>
> ******
> The above lines are exactly identical for a 50 step run with the same 
> conditions.
>
> ******
> Now the first two steps of the restart (firstTimestep   50):
>
> ENERGY:      50     18765.1266     12469.2309      1734.0654       163.2592 
> -224813.4146     21293.9224         0.0000         0.0000     47198.4790 
> -123189.3310       300.1694   -122748.8854   -122748.8854       300.1694 
> -1080.2132      -712.9895    562855.7684     -1080.2132      -712.9895
>
> ENERGY:      51     18668.0012     12425.7646      1737.4346       168.1110 
> -224782.1857     21291.2156         0.0000         0.0000     47293.2773 
> -123198.3814       300.7723   -122760.8973   -122749.7911       300.7723 
> -1024.9337      -710.1710    562855.7684     -1024.9337      -710.1710
>
>
>
> As far as I understand, at least step number 50 should be exactly identical. 
> Is it not supposed to be the 'zeroth' step of the restart, i.e. should not 
> the energies reflect the exact same configuration, without any integration 
> step? What is puzzling me is that even the bond energy, angle energy, etc. 
> are already different.
> In all runs I allowed for COM motion, just to be sure that no velocities 
> would be touched upon restart.
>
> As a cross-check, I now also did an NVE run, i.e. without tcouple; this gave 
> the following results.
> 100 step run:
> ENERGY:      49     18676.2060     12526.6298      1730.0089       158.8146 
> -224774.3459     21294.2251         0.0000         0.0000     47190.7551 
> -123197.7064       300.1203   -122758.1653   -122764.2241       300.1203 
> -979.7469      -707.6870    562855.7684      -979.7469      -707.6870
>
> ENERGY:      50     18764.6472     12468.7223      1734.0395       163.2508 
> -224814.9560     21293.3453         0.0000         0.0000     47195.3315 
> -123195.6194       300.1494   -122755.1941   -122764.1901       300.1494 
> -1080.8269      -713.6474    562855.7684     -1080.8269      -713.6474
>
> *****
> again exactly the same for the 50 timestep run
>
> *****
> first two steps of the restart:
> ENERGY:      50     18764.6533     12468.7215      1734.0396       163.2509 
> -224814.9590     21293.3450         0.0000         0.0000     47195.3315 
> -123195.6171       300.1494   -122755.1916   -122755.1916       300.1494 
> -1080.8295      -713.6468    562855.7684     -1080.8295      -713.6468
>
> ENERGY:      51     18667.0796     12425.1978      1737.4089       168.1029 
> -224783.6911     21290.6477         0.0000         0.0000     47290.5966 
> -123204.6577       300.7553   -122767.2084   -122756.0478       300.7553 
> -1025.4238      -710.8155    562855.7684     -1025.4238      -710.8155
>
>
> ********
> Clearly, the differences are not negligible.
> Of course, for most applications such a discrepancy will not be a problem, 
> since the trajectories are chaotic anyway, but it does mean that an important 
> check for debugging is lost if the source of this is not understood.
> For all other MD programs I have experience with, restarts were always 
> exact.
>
> It would be nice if someone has another idea about what could be causing this. 
> Maybe I am just misunderstanding something that the code does in between the 
> two separate runs.
>
> Thanks in advance for any help / suggestions,
>
> Harald
>
>
>
>
> nordgren_at_sas.upenn.edu wrote:
>
>> Dear Blake (and list):
>> 
>> I believe there is a simple and logical answer here.  Anytime a NAMD run
>> begins, the random number generator is seeded (either by the user-supplied
>> seed or by the current system time).  From then on, during a single run,
>> the (pseudo)random numbers needed by the program will be taken from a
>> deterministic list.
>> 
>> Thus, if you split one run into two separate ones, you are effectively
>> choosing a different random seed for the start of the second run (compared
>> to where you would be halfway through the single run), and any random
>> numbers after that will be divergent in the two cases.  This is why the
>> two cases are identical at first, but differ slightly as soon as the second
>> shorter run begins.
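>> 
>> As a toy illustration (std::mt19937 here, not NAMD's actual generator): 
>> drawing 100 numbers from one seeded stream, versus drawing 50 and then 
>> re-seeding for the second half, gives sequences that agree for the first 
>> 50 draws and diverge from draw 51 onward.
>> 
>>     #include <cstdio>
>>     #include <random>
>> 
>>     int main() {
>>         std::mt19937 single_run(12345);   // one long run, seeded once
>>         std::mt19937 first_half(12345);   // run B1: same seed as the long run
>>         std::mt19937 second_half(67890);  // run B2: re-seeded at the restart
>> 
>>         unsigned long a = 0, b = 0;
>>         for (int step = 0; step < 50; ++step) {
>>             a = single_run();
>>             b = first_half();
>>         }
>>         std::printf("draw 50: single=%lu split=%lu (identical)\n", a, b);
>> 
>>         std::printf("draw 51: single=%lu split=%lu (divergent)\n",
>>                     (unsigned long)single_run(), (unsigned long)second_half());
>>         return 0;
>>     }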
>> 
>> This would be the case using *any* of the constant-temperature algorithms
>> (except, I believe, velocity rescaling) and not just the Langevin method,
>> since they all involve using random numbers.  If you really aren't using
>> any CT algorithm, then I'm not sure where the random numbers are being used,
>> but I strongly suspect that this is still the cause of your observations.
>> 
>> In any case, seeing discrepancies at the 10^-7 level isn't exactly
>> catastrophic for most applications.
>> 
>> - Erik
>> 
>> C. Erik Nordgren, Ph.D.
>> Department of Chemistry
>> University of Pennsylvania
>> 
>> 
>>> Hello everyone,
>>> 
>>> This is further to a post by Harald Tepper
>>> (http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0972.html) in
>>> which he noted that one run of length t does not produce the same results
>>> as two consecutive runs of length t/2 each, which suggests that it is best
>>> to avoid splitting a run into several smaller runs. The details of my own
>>> similar test, in which I found that restarting introduces small differences
>>> in the total system energy at the 8th significant digit, are at the end of
>>> this message.
>>> 
>>> This interests me because I am using a computational facility at which I
>>> cannot run jobs for more than 8 hours.
>>> 
>>> From the discussion on shadow orbits, etc., by Frenkel & Smit
>>> (Understanding Molecular Simulation: From Algorithms to Applications,
>>> 2002, pp. 71-74) and
>>> from the discussion of single and double precision in Appendix A of the
>>> GROMACS manual (pp. 169-170), I would expect that neither slight
>>> imprecision nor slight inaccuracy leads to trajectories that are any worse
>>> than those obtained with high numerical precision and accuracy. Therefore,
>>> repeatedly terminating and restarting simulations should not decrease
>>> accuracy.
>>> 
>>> Am I correct in assuming that repeatedly terminating and restarting
>>> simulations is ok for free dynamics runs and for steered MD runs?
>>> 
>>> Thank you,
>>> Blake Charlebois
>>> 
>>> DETAILS:
>>> 
>>> System: a protein solvated in water
>>> Atoms in system: 12696
>>> Number of processors: 4 (on one machine)
>>> Timestep: 1.0 fs
>>> Temperature control: none
>>> 
>>> I tried several runs using binary coordinate and velocity files:
>>> A) a 200-fs run (initial step 0) starting from an initial set of
>>> coordinates and velocities
>>> B1) a 100-fs run (initial step 0) starting from the same initial state as A
>>> B2) a 100-fs run (initial step 100) starting from the end of B1
>>> 
>>> I used the same random number seed for all runs.
>>> 
>>> I specified an output every timestep:
>>> outputEnergies 1
>>> outputPressure 1
>>> stepspercycle 1
>>> 
>>> I monitored total system energy.
>>> 
>>> The first 100 fs of run A seems to be identical to run B1. However, the
>>> second 100 fs of run A is different from run B2. For instance, energies in
>>> A and B1 are identical, while energies in A and B2 agree to at least 7
>>> significant figures. The discrepancy does not seem to increase over 100 fs.
>>> I would have expected them to be the same to a few more digits. I suppose
>>> errors propagate quickly in such a large system.
>>> 
>>> The backbone RMSD between A and B1 seems to be zero, while the backbone
>>> RMSD between A and B2 seems to be (very roughly) 4e-7 Angstroms. It is
>>> 6e-7 for the entire protein and 1e-6 Angstroms for all atoms including
>>> solvent.
>>> 
>>> Loosely related:
>>> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0585.html
>>> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0780.html
>>> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0677.html
>>> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/0937.html
>>> 
>>> 
>> 
>> 
>> 
>
>
> -- 
> --------------------------------------------------------------------
>
>     'Wat baten kaars en bril als de uil niet zien wil?'
>     ('What use are a candle and glasses if the owl does not want to see?')
>
> Dr. Harald Tepper
> FOM Institute for Atomic and Molecular Physics [AMOLF]
> P.O.Box 41883
> 1009 DB Amsterdam
> The Netherlands
> Tel:   +31 (20) 6081389
> Fax:   +31 (20) 6684106
> mailto:H.L.Tepper_at_amolf.nl
> WWW:   http://www.hec.utah.edu/~harald/
>
> --------------------------------------------------------------------
>
>
>
>