NAMD3 runs failing with no error - XST failing to load and jumping to negative timestep

From: Nicole Richardson (RCHNIC009_at_myuct.ac.za)
Date: Thu Mar 10 2022 - 01:44:29 CST

Hi all,

I've been running MD simulations of solvated carbohydrate-based molecules with NAMD3 on NVIDIA A100 GPUs. I have successfully run about 2100ns of simulation, broken into 100ns chunks that each restart from the last timestep of the previous one. I am trying to extend this simulation to 2500ns. However, when I try to run the next 100ns (2100-2200ns), my job completes in about 40s (a chunk usually takes four days) with no error message.

When I look at the log file, everything appears to start up and load normally from the last step (2100000000). However, when the simulation itself starts, the XST file doesn't load normally and the step counter jumps to -2094967296 before the job completes with no error messages.

What I see in the failed run log files:

TCL: Running for 100000000 steps
PRESSURE: 2100000000 179.257 -48.3392 14.7691 -48.3391 -23.1117 26.2091 14.7699 26.2087 23.974
GPRESSURE: 2100000000 58.5932 12.5161 -7.17997 -23.4519 -15.1422 7.11123 7.1262 -73.2121 80.6467
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG

ENERGY: 2100000000 91254.0209 51736.6118 1220.2276 3.7103 -1126784.1720 124629.9070 0.0000 0.0000 239952.8038 -617986.8906 300.7945 -857939.6943 -615711.6734 300.7945 60.0397 41.3659 2572854.6858 60.0397 41.3659

OPENING EXTENDED SYSTEM TRAJECTORY FILE
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP -2094967296
CLOSING EXTENDED SYSTEM TRAJECTORY FILE
WRITING COORDINATES TO OUTPUT FILE AT STEP -2094967296
COORDINATE DCD FILE /scratch/rchnic009/pn10b_6RU/run22/pn10b_6RU_run22.dcd WAS NOT CREATED
The last position output (seq=-2) takes 0.019 seconds, 0.000 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP -2094967296
The last velocity output (seq=-2) takes 0.015 seconds, 0.000 MB of memory in use
====================================================

WallClock: 39.436554 CPUTime: 38.389771 Memory: 0.000000 MB
[Partition 0][Node 0] End of program

What I expect from successful runs:

TCL: Running for 100000000 steps
PRESSURE: 2000000000 -220.025 89.416 -156.942 89.4157 -79.0945 16.1138 -156.942 16.1138 -18.1251
GPRESSURE: 2000000000 -75.8928 40.0892 -57.6707 66.6383 38.9224 8.28696 -36.6139 42.0393 -89.5388
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG

ENERGY: 2000000000 91331.3257 51973.8169 1205.2632 2.9099 -1126485.6683 124403.0619 0.0000 0.0000 239372.8908 -618196.4000 300.0676 -857569.2907 -615924.8763 300.0676 -105.7481 -42.1697 2579962.2125 -105.7481 -42.1697

OPENING EXTENDED SYSTEM TRAJECTORY FILE
Info: Initial time: 1 CPUs 0.00554295 s/step 15.5874 ns/day 0 MB memory
Info: Initial time: 1 CPUs 0.00311856 s/step 27.7051 ns/day 0 MB memory
Info: Initial time: 1 CPUs 0.00311102 s/step 27.7722 ns/day 0 MB memory
Info: Benchmark time: 1 CPUs 0.00305723 s/step 28.2609 ns/day 0 MB memory
Info: Benchmark time: 1 CPUs 0.00309188 s/step 27.9442 ns/day 0 MB memory
OPENING COORDINATE DCD FILE
WRITING COORDINATES TO DCD FILE /scratch/rchnic009/pn10b_6RU/run21/pn10b_6RU_run21.dcd AT STEP 2000000250
The last position output (seq=2000000250) takes 0.021 seconds, 0.000 MB of memory in use
Info: Benchmark time: 1 CPUs 0.00367223 s/step 23.5279 ns/day 0 MB memory
Info: Benchmark time: 1 CPUs 0.00301884 s/step 28.6203 ns/day 0 MB memory
Info: Benchmark time: 1 CPUs 0.00303448 s/step 28.4727 ns/day 0 MB memory
Info: Benchmark time: 1 CPUs 0.00303912 s/step 28.4293 ns/day 0 MB memory
WRITING COORDINATES TO DCD FILE /scratch/rchnic009/pn10b_6RU/run21/pn10b_6RU_run21.dcd AT STEP 2000000500
The last position output (seq=2000000500) takes 0.017 seconds, 0.000 MB of memory in use
WRITING COORDINATES TO DCD FILE /scratch/rchnic009/pn10b_6RU/run21/pn10b_6RU_run21.dcd AT STEP 2000000750
The last position output (seq=2000000750) takes 0.017 seconds, 0.000 MB of memory in use
WRITING COORDINATES TO DCD FILE /scratch/rchnic009/pn10b_6RU/run21/pn10b_6RU_run21.dcd AT STEP 2000001000
The last position output (seq=2000001000) takes 0.014 seconds, 0.000 MB of memory in use

I have looked through the relevant user manuals and the mailing list but haven't found anything that sheds light on the issue. I have experienced this exact problem across four different MD simulations on four different A100 cards. I have also tried re-running my simulation from 2000ns, which runs fine to 2100ns, but it fails again when I try to run the next 100ns. Each time it jumps to the same negative timestep.
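
One arithmetic observation that might be relevant: the negative step in the log is exactly what the target step of this run (2100000000 + 100000000 = 2200000000) becomes if it is stored in a signed 32-bit integer, whose maximum is 2147483647. The short Python sketch below only checks that arithmetic. Whether NAMD3 actually holds the step counter in a 32-bit integer is an assumption on my part, not something I have confirmed, and the helper name wrap_int32 is made up for illustration.

```python
def wrap_int32(value: int) -> int:
    """Interpret an integer as a two's-complement signed 32-bit value.

    Hypothetical helper for illustration only; it is not part of NAMD.
    """
    return (value + 2**31) % 2**32 - 2**31


INT32_MAX = 2**31 - 1          # 2147483647

start_step = 2_100_000_000     # first step of the failing run (from the log)
steps_to_run = 100_000_000     # "TCL: Running for 100000000 steps"
target_step = start_step + steps_to_run

print("target step:          ", target_step)
print("exceeds 32-bit max:   ", target_step > INT32_MAX)
print("wrapped to signed 32: ", wrap_int32(target_step))

# The previous chunk (2000000000 -> 2100000000) stays below the limit,
# which would be consistent with that run completing normally.
previous_target = 2_000_000_000 + steps_to_run
print("previous run in range:", previous_target <= INT32_MAX)
```

The wrapped value matches the step reported in the failed log (-2094967296), and the 2000-2100ns chunk stays under the limit, which fits with that run working and the next one failing.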

Has anyone else experienced and solved this issue or have any idea of what may fix this problem?

Please reach out if there is any information I may have omitted and thanks in advance for your time!
Regards
Nicole Richardson

This archive was generated by hypermail 2.1.6 : Tue Dec 13 2022 - 14:32:44 CST