Unpredictably Crashes upon Restarting

From: Matthew Guberman-Pfeffer (matthew.guberman-pfeffer_at_yale.edu)
Date: Fri Mar 27 2020 - 16:56:43 CDT

Dear NAMD community,

I have restarted my simulation from the same point (at 13.8 ns) and end up
with a different outcome each time. Most times the simulation crashes with
an error message, but the messages are always slightly different. I detail
the messages and what I've tried below.

1) First restart: ran from 13.8 to 14.4 ns:

ERROR: Atom 119 velocity is 63537.5 -53607.4 -29502 (limit is 12000, atom
197 of 651 on patch 16 pe 9)
ERROR: Atom 127 velocity is -63953.9 53411.6 29213.9 (limit is 12000, atom
200 of 651 on patch 16 pe 9)
ERROR: Atoms moving too fast; simulation has become unstable (2 atoms on
patch 16 pe 9).

2) Restarted from 13.8 ns, saving a frame to the DCD every 1 fs to visualize
the issue. However, the simulation did not crash, and I was forced to
terminate the job at 18.9 ns because the DCD was consuming nearly a TB of
disk space.

3) Restarted from 13.8 ns, saving frames less frequently, hoping to repeat
the previous good performance while using less disk space. But at 16.3 ns I
got the error below:

ERROR: Atom 125 velocity is -113436 -53195.6 -112022 (limit is 12000, atom
448 of 632 on patch 14 pe 19)
ERROR: Atoms moving too fast; simulation has become unstable (1 atoms on
patch 14 pe 19).

4) Restarted from 13.8 ns again. Now it crashed at 15.1 ns with:

ERROR: Atom 120 velocity is 20172.4 -16771.5 -221899 (limit is 12000, atom
83 of 609 on patch 16 pe 9)
ERROR: Atom 127 velocity is -20158.6 16781.3 221594 (limit is 12000, atom
93 of 609 on patch 16 pe 9)
ERROR: Atoms moving too fast; simulation has become unstable (2 atoms on
patch 16 pe 9).

5) Restarted from 13.8 ns. The simulation now crashed at 18.8 ns with:

ERROR: Margin is too small for 1 atoms during timestep 18841762.
ERROR: Incorrect nonbonded forces and energies may be calculated!
ERROR: Atom 284 velocity is -17748.9 688.258 -64061.4 (limit is 12000, atom
124 of 616 on patch 17 pe 22)
ERROR: Atoms moving too fast; simulation has become unstable (1 atoms on
patch 17 pe 22).
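
To compare the offending atoms across restarts, the velocity errors above can
be pulled out of the logs with a short script. This is only a minimal sketch
in Python, assuming the log lines have exactly the format quoted above; the
function name parse_velocity_errors and the record fields are my own and are
not part of NAMD.

```python
import re

# Matches NAMD lines of the form quoted above, e.g.
#   ERROR: Atom 125 velocity is -113436 -53195.6 -112022
#   (limit is 12000, atom 448 of 632 on patch 14 pe 19)
# The pattern is an assumption based only on the messages in this post.
VELOCITY_ERROR = re.compile(
    r"ERROR: Atom (\d+) velocity is\s+(\S+)\s+(\S+)\s+(\S+)\s+"
    r"\(limit is ([\d.]+), atom \d+ of \d+ on patch (\d+) pe (\d+)\)"
)


def parse_velocity_errors(log_text):
    """Return one dict per 'Atom N velocity is ...' error in log_text."""
    # Long lines may be wrapped (as in this email), so collapse all
    # whitespace, including newlines, to single spaces before matching.
    flat = " ".join(log_text.split())
    records = []
    for m in VELOCITY_ERROR.finditer(flat):
        records.append({
            "atom": int(m.group(1)),
            "velocity": tuple(float(m.group(i)) for i in (2, 3, 4)),
            "limit": float(m.group(5)),
            "patch": int(m.group(6)),
            "pe": int(m.group(7)),
        })
    return records


if __name__ == "__main__":
    # Error text copied from restart 4 above, line wrapping included.
    sample = """ERROR: Atom 120 velocity is 20172.4 -16771.5 -221899 (limit is 12000, atom
83 of 609 on patch 16 pe 9)
ERROR: Atom 127 velocity is -20158.6 16781.3 221594 (limit is 12000, atom
93 of 609 on patch 16 pe 9)"""
    for rec in parse_velocity_errors(sample):
        print(rec["atom"], rec["patch"], rec["velocity"])
```

Run over the messages above, this makes it easier to see that atom 127
appears in restarts 1 and 4, each time paired with a second atom moving with
a nearly equal and opposite velocity.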

I get the point that the simulation is unstable. But why does it become
unstable after 13+ ns? Why do the time at which the simulation crashes and
the error message vary from one restart to the next? More importantly, what
can I try to resolve whatever is preventing this simulation from continuing?

Best,
Matthew
