RE: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

From: Vibhor Agrawal (vibhora_at_g.clemson.edu)
Date: Tue Jul 16 2013 - 01:18:44 CDT

Make sure to give the path to the CHARMM executable, and if the system is
big enough, compile CHARMM with xxlarge.

On Jul 16, 2013 2:05 AM, "Tristan Croll" <tristan.croll_at_qut.edu.au> wrote:

> To be clear: this only crashes when it’s a queued continuation dependent
> on the completion of a previous job, only when it starts running
> *immediately* after the previous job completes (even a few seconds in ‘Q’
> status on PBS seems to make all the difference), and only when the previous
> job is killed due to exceeding its walltime. I tried putting a “sleep 60”
> command just before the execution of charmrun, but curiously that didn’t
> seem to help. There’s an obvious solution: stop being lazy and calculate a
> number of steps that will complete within the allotted walltime – but I
> thought I should put it out there for discussion anyway.
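
The "calculate a number of steps" option mentioned above could be sketched in bash roughly as follows. This is only an illustration under stated assumptions: the function names, the 60-second safety margin and the 20 ms/step figure are hypothetical, and the per-step time would have to be taken from the timing information in an earlier log for the same system and core count.

```shell
#!/bin/bash
# Hypothetical helpers (not part of NAMD or PBS): estimate how many MD steps
# fit inside a walltime budget.

# walltime_to_seconds HH:MM:SS -> seconds
walltime_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that fields such as "08" are not parsed as octal
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# steps_for_walltime WALLTIME MARGIN_SECONDS MS_PER_STEP
#   WALLTIME        walltime requested from PBS, as HH:MM:SS
#   MARGIN_SECONDS  time reserved for startup and final output (assumed value)
#   MS_PER_STEP     measured wall-clock cost of one step, in milliseconds
steps_for_walltime() {
    local total usable
    total=$(walltime_to_seconds "$1")
    usable=$(( total - $2 ))
    if (( usable <= 0 )); then
        echo 0
        return
    fi
    # Integer division rounds down, which errs on the safe side
    echo $(( usable * 1000 / $3 ))
}
```

With these assumed numbers, `steps_for_walltime 00:05:00 60 20` prints 12000, which would then be the step count given to the `run` command in the NAMD configuration file.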
>
> *From:* Tristan Croll
> *Sent:* Tuesday, 16 July 2013 4:00 PM
> *To:* 'Vibhor Agrawal'
> *Cc:* Namd List
> *Subject:* RE: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on
> checkpoint-restart
>
> Hi Vibhor,
>
> Sure – copied below. I was using this one as a test to replicate the
> error, hence the short walltime.
>
> Cheers,
>
> Tristan
>
> #!/bin/bash -l
> #PBS -N exon11
> #PBS -l select=32:ncpus=2:mem=4G:mpiprocs=2:cputype=E5-2670
> #PBS -j oe
> #PBS -m bea
> #PBS -l walltime=00:05:00
>
> module load openmpi
> module load namd/2.9-ibverbs
>
> cd $PBS_O_WORKDIR
>
> NAMD=`which namd2`
>
> charmrun ++mpiexec ++verbose +p64 $NAMD ./equil-23_test2.namd > ./equil-23_test2.log
>
> *From:* Vibhor Agrawal [mailto:vibhora_at_g.clemson.edu]
> *Sent:* Tuesday, 16 July 2013 3:57 PM
> *To:* Tristan Croll
> *Cc:* Namd List
> *Subject:* Re: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on
> checkpoint-restart
>
> Hello Tristan:
> Can you attach your supercomputer script, so that the problem is clearer?
>
> Vibhor
>
> On Jul 16, 2013 1:31 AM, "Tristan Croll" <tristan.croll_at_qut.edu.au> wrote:
>
> Hi,
>
> We’ve just gotten NAMD 2.9 running on our local SGI Altix cluster using
> OpenMPI. In general it’s running great, with near-linear scaling up to 512
> cores. However, I’m running into strange crashing errors when I have
> multiple jobs queued up (using the “-W depend=afterany:” flag in qsub). If
> the continuation job starts immediately after completion of the previous
> job, charmrun crashes out on startup. All I get in the logfile is a core
> dump, while the stdout record gives me the message copied below.
>
> In these runs, I’ve been doing what’s always worked fine for me in the
> past: telling NAMD to run more steps than it can do in the allotted time,
> and allowing PBS to kill the job once it hits the walltime. From watching
> how the abovementioned fault develops with a series of smaller jobs,
> however, it seems the most reasonable explanation is that PBS is not giving
> sufficient time for some cleanup task before it starts the new run. Is
> this a known problem with large NAMD jobs, or is it more likely to be a
> cluster-specific problem?
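
For reference, the kind of “-W depend=afterany:” chain described above could be submitted with a small bash loop like the one below. This is a sketch only: the function name `submit_chain`, the `QSUB` override variable and the script names in the usage note are hypothetical, and it assumes that qsub prints the new job ID on standard output. It reproduces the queueing setup and does not by itself address the crash.

```shell
#!/bin/bash
# QSUB can be overridden (for example with a stub for a dry run);
# by default the real qsub is used.
QSUB=${QSUB:-qsub}

# submit_chain SCRIPT...
# Submits each script in order. Every job after the first is made dependent
# on the previous one with afterany, so it becomes eligible to run once the
# previous job has ended, whether it finished normally or was killed.
submit_chain() {
    local prev="" script jobid
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$("$QSUB" "$script") || return 1
        else
            jobid=$("$QSUB" -W "depend=afterany:$prev" "$script") || return 1
        fi
        echo "$script -> $jobid"
        prev=$jobid
    done
}
```

A call such as `submit_chain part1.pbs part2.pbs part3.pbs` would queue three jobs that run one after another.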
>
> Many thanks,
>
> Tristan
>
> Lyra: OpenMPI (1.4.5) module loaded.
> Lyra: NAMD (2.9) module loaded.
> Lyra: NAMD_2.9 Linux-x86_64-ibverbs
> Charmrun> charmrun started...
> Charmrun> mpiexec started
> Charmrun> node programs all started
> Charmrun> node programs all connected
> Charmrun> started all node programs in 2.911 seconds.
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'
> (memory paranoid requires '+netpoll' at runtime).
> Fatal error on PE 0> segmentation violation
>
> Tristan Croll
> Lecturer
> Faculty of Health
> Institute of Health and Biomedical Engineering
> Queensland University of Technology
> 60 Musk Ave
> Kelvin Grove QLD 4059 Australia
> +61 7 3138 6443
>
> *This email and its attachments (if any) contain confidential information
> intended for use by the addressee and may be privileged. We do not waive
> any confidentiality, privilege or copyright associated with the email or
> the attachments. If you are not the intended addressee, you must not use,
> transmit, disclose or copy the email or any attachments. If you receive
> this email by mistake, please notify the sender immediately and delete the
> original email.*

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:23:28 CST