Re: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

From: Vibhor Agrawal (vibhora_at_g.clemson.edu)
Date: Tue Jul 16 2013 - 00:56:33 CDT

Hello Tristan:
Can you attach your supercomputer job script, so that the problem is clearer?

Vibhor
On Jul 16, 2013 1:31 AM, "Tristan Croll" <tristan.croll_at_qut.edu.au> wrote:

> Hi,
>
> We’ve just gotten NAMD 2.9 running on our local SGI Altix cluster using
> OpenMPI. In general it’s running great, with near-linear scaling up to 512
> cores. However, I’m running into strange crashes when I have multiple jobs
> queued up (using the “-W depend=afterany:” flag in qsub). If the
> continuation job starts immediately after completion of the previous job,
> charmrun crashes on startup. All I get in the logfile is a core dump, while
> the stdout record gives me the message copied below.
>
> In these runs, I’ve been doing what’s always worked fine for me in the
> past: telling NAMD to run more steps than it can do in the allotted time,
> and allowing PBS to kill the job once it hits the walltime. From watching
> how the above-mentioned fault develops with a series of smaller jobs,
> however, the most reasonable explanation seems to be that PBS is not
> allowing sufficient time for some cleanup task before it starts the new
> run. Is this a known problem with large NAMD jobs, or is it more likely to
> be a cluster-specific problem?
>
> Many thanks,
>
> Tristan
>
> Lyra: OpenMPI (1.4.5) module loaded.
> Lyra: NAMD (2.9) module loaded.
> Lyra: NAMD_2.9 Linux-x86_64-ibverbs
> Charmrun> charmrun started...
> Charmrun> mpiexec started
> Charmrun> node programs all started
> Charmrun> node programs all connected
> Charmrun> started all node programs in 2.911 seconds.
>
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> Fatal error on PE 0> segmentation violation
>
> Tristan Croll
> Lecturer
> Faculty of Health
> Institute of Health and Biomedical Engineering
> Queensland University of Technology
> 60 Musk Ave
> Kelvin Grove QLD 4059 Australia
> +61 7 3138 6443

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:23:28 CST