From: Joseph Farran (jfarran_at_uci.edu)
Date: Wed May 07 2014 - 13:54:51 CDT
Hi All / NAMD support.
We are running NAMD 2.9 on CentOS 6.5 with Berkeley checkpoint. Jobs
checkpoint and restart just fine. However, when a job restarts on
another node, the time to finish increases 2x to 3x:
TIMING: 16000 CPU: 668.71, 0.0411388/step Wall: 668.71, 0.0411388/step, 5.53088 hours remaining, 4338.894531 MB of memory in use.
TIMING: 17000 CPU: 710.398, 0.0416875/step Wall: 710.398, 0.0416875/step, 5.59307 hours remaining, 4338.894531 MB of memory in use.
<job jumped nodes>
TIMING: 18000 CPU: 817.05, 0.106652/step Wall: 817.05, 0.106652/step, 14.2795 hours remaining, 4338.894531 MB of memory in use.
TIMING: 19000 CPU: 943.168, 0.126118/step Wall: 943.168, 0.126118/step, 16.8507 hours remaining, 4338.894531 MB of memory in use.
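For anyone who wants to quantify the slowdown from a log like the one above, here is a minimal Python sketch. It is not part of NAMD. The function names (per_step_times, slowdown) and the choice of step 18000 as the first step after the restart are assumptions made for illustration, and the regular expression only assumes the TIMING line layout shown in this message.

```python
import re

# Captures the step number and the wall-clock seconds per step from a
# NAMD "TIMING:" line, e.g. "... Wall: 668.71, 0.0411388/step ...".
WALL_PER_STEP = re.compile(
    r"TIMING:\s+(\d+)\s+CPU:.*?Wall:\s*[\d.]+,\s*([\d.]+)/step"
)

def per_step_times(log_text):
    """Return [(step, wall_seconds_per_step), ...] from TIMING output."""
    # Mail clients wrap long lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [(int(s), float(t)) for s, t in WALL_PER_STEP.findall(flat)]

def slowdown(times, restart_step):
    """Ratio of mean time/step at or after restart_step to the mean before it."""
    before = [t for s, t in times if s < restart_step]
    after = [t for s, t in times if s >= restart_step]
    return (sum(after) / len(after)) / (sum(before) / len(before))

# The four TIMING records quoted in this message (wrapped as in the mail).
LOG = """
TIMING: 16000 CPU: 668.71, 0.0411388/step Wall: 668.71,
0.0411388/step, 5.53088 hours remaining, 4338.894531 MB of memory in use.
TIMING: 17000 CPU: 710.398, 0.0416875/step Wall: 710.398,
0.0416875/step, 5.59307 hours remaining, 4338.894531 MB of memory in use.
<job jumped nodes>
TIMING: 18000 CPU: 817.05, 0.106652/step Wall: 817.05, 0.106652/step,
14.2795 hours remaining, 4338.894531 MB of memory in use.
TIMING: 19000 CPU: 943.168, 0.126118/step Wall: 943.168,
0.126118/step, 16.8507 hours remaining, 4338.894531 MB of memory in use.
"""

times = per_step_times(LOG)
# Assumption: step 18000 is the first TIMING record after the node change.
ratio = slowdown(times, restart_step=18000)
print(f"per-step slowdown after restart: {ratio:.2f}x")
```

On these four records the ratio comes out at about 2.8, which matches the 2x to 3x increase in the "hours remaining" estimate.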
The issue seems to be with memory allocation. When the job restarts
on a different but similar node, the memory allocation is lost.
Does anyone know how to save the current memory allocation and
restore it with Linux numactl?
Thanks,
Joseph