Re: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

From: Kenno Vanommeslaeghe (kvanomme_at_rx.umaryland.edu)
Date: Tue Jul 16 2013 - 11:29:09 CDT

Note the difference between

Charm++ : the parallelization technology used by NAMD
http://charm.cs.uiuc.edu/

CHARMM : the feature-rich molecular simulation and analysis package
http://www.charmm.org/

On 07/16/2013 02:18 AM, Vibhor Agrawal wrote:
> Make sure to give the path of charmm exe and if system is big enough
> compile charmm moi xxlarge
>
> On Jul 16, 2013 2:05 AM, "Tristan Croll" <tristan.croll_at_qut.edu.au
> <mailto:tristan.croll_at_qut.edu.au>> wrote:
>
> To be clear: this only crashes when it’s a queued continuation
> dependent on the completion of a previous job, only when it starts
> running /immediately/ after the previous job completes (even a few
> seconds in ‘Q’ status on PBS seems to make all the difference), and
> only when the previous job is killed due to exceeding its walltime. I
> tried putting a “sleep 60” command just before the execution of
> charmrun, but curiously that didn’t seem to help. There’s an obvious
> solution: stop being lazy and calculate a number of steps that will
> complete within the allotted walltime – but I thought I should put it
> out there for discussion anyway.
>
> *From:*Tristan Croll
> *Sent:* Tuesday, 16 July 2013 4:00 PM
> *To:* 'Vibhor Agrawal'
> *Cc:* Namd List
> *Subject:* RE: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on
> checkpoint-restart
>
> Hi Vibhor,
>
> Sure – copied below. I was using this one as a test to replicate the
> error, hence the short walltime.
>
> Cheers,
>
> Tristan
>
> #!/bin/bash -l
> #PBS -N exon11
> #PBS -l select=32:ncpus=2:mem=4G:mpiprocs=2:cputype=E5-2670
> #PBS -j oe
> #PBS -m bea
> #PBS -l walltime=00:05:00
>
> module load openmpi
> module load namd/2.9-ibverbs
>
> cd $PBS_O_WORKDIR
>
> NAMD=`which namd2`
>
> charmrun ++mpiexec ++verbose +p64 $NAMD ./equil-23_test2.namd > ./equil-23_test2.log
>
> *From:*Vibhor Agrawal [mailto:vibhora_at_g.clemson.edu
> <mailto:vibhora_at_g.clemson.edu>]
> *Sent:* Tuesday, 16 July 2013 3:57 PM
> *To:* Tristan Croll
> *Cc:* Namd List
> *Subject:* Re: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on
> checkpoint-restart
>
> Hello Tristan:
> Can you attach your supercomputer script, so that it may be more clear?
>
> Vibhor
>
> On Jul 16, 2013 1:31 AM, "Tristan Croll" <tristan.croll_at_qut.edu.au
> <mailto:tristan.croll_at_qut.edu.au>> wrote:
>
> Hi,
>
> We’ve just gotten NAMD 2.9 running on our local SGI Altix cluster
> using OpenMPI. In general it’s running great, with near-linear
> scaling up to 512 cores. However, I’m running into strange crashing
> errors when I have multiple jobs queued up (using the “-W
> depend=afterany:” flag in qsub). If the continuation job starts
> immediately after completion of the previous job, charmrun crashes out
> on startup. All I get in the logfile is a core dump, while the stdout
> record gives me the message copied below.
>
> In these runs, I’ve been doing what’s always worked fine for me in the
> past: telling NAMD to run more steps than it can do in the allotted
> time, and allowing PBS to kill the job once it hits the walltime.
> From watching how the above-mentioned fault develops with a series of
> smaller jobs, however, it seems the most reasonable explanation is
> that PBS is not giving sufficient time for some cleanup task before it
> starts the new run. Is this a known problem with large NAMD jobs, or
> is it more likely to be a cluster-specific problem?
>
> Many thanks,
>
> Tristan
>
> Lyra: OpenMPI (1.4.5) module loaded.
> Lyra: NAMD (2.9) module loaded.
> Lyra: NAMD_2.9 Linux-x86_64-ibverbs
> Charmrun> charmrun started...
> Charmrun> mpiexec started
> Charmrun> node programs all started
> Charmrun> node programs all connected
> Charmrun> started all node programs in 2.911 seconds.
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory
> paranoid' (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory
> paranoid' (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory
> paranoid' (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory
> paranoid' (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory
> paranoid' (memory paranoid requires '+netpoll' at runtime).
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory
> paranoid' (memory paranoid requires '+netpoll' at runtime).
> Fatal error on PE 0> segmentation violation
>
> Tristan Croll
> Lecturer
> Faculty of Health
> Institute of Health and Biomedical Engineering
> Queensland University of Technology
> 60 Musk Ave
> Kelvin Grove QLD 4059 Australia
> +61 7 3138 6443
>
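
For the workaround mentioned in the quoted message (choosing a number of steps that will complete within the allotted walltime), a minimal bash sketch follows. The function name steps_for_walltime and all four arguments are hypothetical and not part of NAMD or PBS. It assumes you already have a measured wall-clock seconds-per-step figure from an earlier run on the same node count, and that you want the result rounded down to a multiple of your restart frequency so the run ends on a restart write.

```shell
#!/bin/bash
# Hypothetical helper (not part of NAMD or PBS): estimate how many MD steps
# fit inside a walltime budget.
#   $1  walltime in seconds
#   $2  safety margin in seconds (startup, final output); an assumption
#   $3  measured wall-clock seconds per step from an earlier run
#   $4  restart frequency in steps; the result is rounded down to a multiple
steps_for_walltime() {
    awk -v wall="$1" -v margin="$2" -v sps="$3" -v freq="$4" 'BEGIN {
        usable = wall - margin
        # No usable time or nonsensical inputs: report zero steps.
        if (usable <= 0 || sps <= 0 || freq <= 0) { print 0; exit }
        steps = int(usable / sps)
        # Round down so the run finishes on a restart-file boundary.
        print int(steps / freq) * freq
    }'
}

# Example: 5-minute walltime, 60 s margin, 0.05 s/step, restart every 500 steps
steps_for_walltime 300 60 0.05 500   # prints 4500
```

The printed value could then be used as the step count in the NAMD configuration file in place of a deliberately oversized one. This only avoids the walltime kill; it does not explain the segmentation violation itself.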

This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:21:26 CST