NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

From: Tristan Croll (tristan.croll_at_qut.edu.au)
Date: Tue Jul 16 2013 - 00:14:36 CDT

Hi,

We've just got NAMD 2.9 running on our local SGI Altix cluster using OpenMPI. In general it runs very well, with near-linear scaling up to 512 cores. However, I'm seeing strange crashes when I have multiple jobs queued up as a dependency chain (using the "-W depend=afterany:" flag in qsub). If the continuation job starts immediately after the previous job completes, charmrun crashes on startup. All I get in the logfile is a core dump, while the stdout record gives the message copied below.
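For reference, the way each continuation job is chained can be sketched roughly as follows. This is only an illustration of the submission pattern, not my actual script: the helper name build_qsub_command, the script name "run.pbs" and the job ID format are made up for the example; only the "-W depend=afterany:" flag is what I really use.

```python
from typing import List, Optional


def build_qsub_command(script: str, previous_job_id: Optional[str] = None) -> List[str]:
    """Build the qsub argument list for one link of a job chain.

    Hypothetical helper: when previous_job_id is given, the new job is held
    until that job ends for any reason (normal exit or walltime kill).
    """
    cmd = ["qsub"]
    if previous_job_id:
        # afterany: start once the named job has terminated, whatever its exit status
        cmd += ["-W", f"depend=afterany:{previous_job_id}"]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # "1234.lyra" is a placeholder job ID, not a real one
    print(" ".join(build_qsub_command("run.pbs", "1234.lyra")))
```

The first job in the chain is submitted without a dependency, and each later one is given the ID that qsub returned for its predecessor.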

In these runs, I've been doing what has always worked for me in the past: telling NAMD to run more steps than it can complete in the allotted time, and letting PBS kill the job once it hits the walltime limit. From watching how the above-mentioned fault develops over a series of smaller jobs, however, the most reasonable explanation seems to be that PBS is not allowing enough time for some cleanup task to finish before it starts the new run. Is this a known problem with large NAMD jobs, or is it more likely to be specific to our cluster?
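One way I could test the cleanup-time explanation would be to have each continuation job wait until a minimum gap has passed since the previous job ended before it launches charmrun. A minimal sketch of that idea is below. Everything in it is an assumption for illustration: the function names, the 60 second default, and the idea that the previous job's end time is known (for example from the modification time of its last restart file).

```python
import time
from typing import Callable


def remaining_grace(previous_end: float, now: float, grace_seconds: float) -> float:
    """Seconds still to wait so that at least grace_seconds separate the two jobs."""
    return max(0.0, previous_end + grace_seconds - now)


def wait_before_launch(
    previous_end: float,
    grace_seconds: float = 60.0,  # arbitrary guess, not a measured value
    clock: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep, if needed, until the grace period has elapsed; return the delay used."""
    delay = remaining_grace(previous_end, clock(), grace_seconds)
    if delay > 0:
        sleep(delay)
    return delay
```

If the crashes disappear with a delay like this in place and come back without it, that would point towards a cleanup or timing issue rather than a problem with the restart files themselves.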

Many thanks,

Tristan

Lyra: OpenMPI (1.4.5) module loaded.
Lyra: NAMD (2.9) module loaded.
Lyra: NAMD_2.9 Linux-x86_64-ibverbs
Charmrun> charmrun started...
Charmrun> mpiexec started
Charmrun> node programs all started
Charmrun> node programs all connected
Charmrun> started all node programs in 2.911 seconds.
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
Fatal error on PE 0> segmentation violation

Tristan Croll
Lecturer
Faculty of Health
Institute of Health and Biomedical Engineering
Queensland University of Technology
60 Musk Ave
Kelvin Grove QLD 4059 Australia
+61 7 3138 6443

