RE: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

From: Tristan Croll (tristan.croll_at_qut.edu.au)
Date: Wed Jul 17 2013 - 19:02:25 CDT

Hi all,

It looks like this was most likely due to someone else's large multi-node job taking more resources than it had asked for through PBS, and oversubscribing CPUs. Nothing to see here.
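For anyone who hits similar symptoms, a quick pre-launch check for oversubscribed CPUs might look like the sketch below. The helper name `node_is_oversubscribed` is made up for illustration, and treating "1-minute load average greater than core count" as oversubscription is only a rough heuristic, not something NAMD or PBS defines.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) when the given load average exceeds
# the given core count, which is a rough sign that the node is oversubscribed.
#   $1 = load average (e.g. first field of /proc/loadavg)
#   $2 = number of cores on the node
node_is_oversubscribed() {
    local load="$1" cores="$2"
    # awk handles the floating-point comparison; exit 0 means "oversubscribed".
    awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l > c) }'
}

# Typical use on a Linux compute node, before launching charmrun:
#   if node_is_oversubscribed "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"; then
#       echo "warning: node already busy before job start" >&2
#   fi
```

On a multi-node job this would need to be run on every allocated node, not just the one running the batch script.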

From: Vibhor Agrawal [mailto:vibhora_at_g.clemson.edu]
Sent: Tuesday, 16 July 2013 4:19 PM
To: Tristan Croll
Cc: Namd List
Subject: RE: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

Make sure to give the full path to the charmm executable, and if the system is big enough, compile charmm mpi xxlarge.
On Jul 16, 2013 2:05 AM, "Tristan Croll" <tristan.croll_at_qut.edu.au> wrote:
To be clear: this only crashes when it's a queued continuation dependent on the completion of a previous job, only when it starts running immediately after the previous job completes (even a few seconds in 'Q' status on PBS seems to make all the difference), and only when the previous job is killed due to exceeding its walltime. I tried putting a "sleep 60" command just before the execution of charmrun, but curiously that didn't seem to help. There's an obvious solution: stop being lazy and calculate a number of steps that will complete within the allotted walltime - but I thought I should put it out there for discussion anyway.
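A minimal sketch of the "calculate a number of steps" approach follows. The function name and its parameters are hypothetical, and the per-step cost is assumed to come from a short test run on the same core count. Integer milliseconds are used so that plain bash arithmetic is enough.

```shell
#!/bin/bash
# Hypothetical helper (not part of NAMD or PBS): estimate a step count that
# should finish inside the walltime.
#   $1 = walltime in seconds
#   $2 = measured cost per step in milliseconds (integer, from a short test run)
#   $3 = percentage of the walltime to spend stepping (the rest is slack for
#        startup and output)
#   $4 = round the result down to a multiple of this many steps
steps_for_walltime() {
    local wall_s="$1" ms_per_step="$2" percent="$3" multiple="$4"
    # wall_s * 1000 ms * (percent / 100) / ms_per_step, kept in integers.
    local steps=$(( wall_s * 10 * percent / ms_per_step ))
    echo $(( steps / multiple * multiple ))
}

# Example: 5-minute walltime, 20 ms/step, use 80% of the budget, blocks of 500.
steps_for_walltime 300 20 80 500   # prints 12000
```

The result would then go into the `run` line of the NAMD configuration file. The 80% figure is an arbitrary safety margin chosen for the example.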

From: Tristan Croll
Sent: Tuesday, 16 July 2013 4:00 PM
To: 'Vibhor Agrawal'
Cc: Namd List
Subject: RE: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

Hi Vibhor,

Sure - copied below. I was using this one as a test to replicate the error, hence the short walltime.

Cheers,

Tristan

#!/bin/bash -l
#PBS -N exon11
#PBS -l select=32:ncpus=2:mem=4G:mpiprocs=2:cputype=E5-2670
#PBS -j oe
#PBS -m bea
#PBS -l walltime=00:05:00

module load openmpi
module load namd/2.9-ibverbs

cd $PBS_O_WORKDIR

NAMD=`which namd2`

charmrun ++mpiexec ++verbose +p64 $NAMD ./equil-23_test2.namd > ./equil-23_test2.log

From: Vibhor Agrawal [mailto:vibhora_at_g.clemson.edu]
Sent: Tuesday, 16 July 2013 3:57 PM
To: Tristan Croll
Cc: Namd List
Subject: Re: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

Hello Tristan:
Can you attach your supercomputer script, so that the problem may be more clear?

Vibhor
On Jul 16, 2013 1:31 AM, "Tristan Croll" <tristan.croll_at_qut.edu.au> wrote:
Hi,

We've just gotten NAMD 2.9 running on our local SGI Altix cluster using openMPI. In general it's running great, with near-linear scaling up to 512 cores. However, I'm running into strange crashing errors when I have multiple jobs queued up (using the "-W depend=afterany:" flag in qsub). If the continuation job starts immediately after completion of the previous job, charmrun crashes out on startup. All I get in the logfile is a core dump, while the stdout record gives me the message copied below.

In these runs, I've been doing what's always worked fine for me in the past: telling NAMD to run more steps than it can do in the allotted time, and allowing PBS to kill the job once it hits the walltime. From watching how the abovementioned fault develops with a series of smaller jobs, however, it seems the most reasonable explanation is that PBS is not giving sufficient time for some cleanup task before it starts the new run. Is this a known problem with large NAMD jobs, or is it more likely to be a cluster-specific problem?
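For reference, the kind of dependent-job chain described above (each job submitted with "-W depend=afterany:" on the previous one) can be sketched as below. The function name `submit_chain` and the `QSUB` override variable are assumptions made for illustration; only `qsub -W depend=afterany:<jobid>` comes from the setup described here.

```shell
#!/bin/bash
# Submit the same PBS script several times, each job depending on the
# previous one via "-W depend=afterany:<jobid>".
#   $1 = PBS job script, $2 = number of jobs in the chain
# QSUB may be set to another command for a dry run; it defaults to qsub.
submit_chain() {
    local script="$1" count="$2" prev="" jobid i
    for (( i = 1; i <= count; i++ )); do
        if [ -z "$prev" ]; then
            # First job in the chain has no dependency.
            jobid=$(${QSUB:-qsub} "$script") || return 1
        else
            # Later jobs start only after the previous one ends, however it ends.
            jobid=$(${QSUB:-qsub} -W depend=afterany:"$prev" "$script") || return 1
        fi
        echo "$jobid"
        prev="$jobid"
    done
}

# Usage (hypothetical script name):
#   submit_chain equil.pbs 3
```

Because `afterany` releases the next job whether the previous one finished cleanly or was killed at its walltime, the continuation can start within seconds of the kill, which is the situation in which the crash was seen.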

Many thanks,

Tristan

Lyra: OpenMPI (1.4.5) module loaded.
Lyra: NAMD (2.9) module loaded.
Lyra: NAMD_2.9 Linux-x86_64-ibverbs
Charmrun> charmrun started...
Charmrun> mpiexec started
Charmrun> node programs all started
Charmrun> node programs all connected
Charmrun> started all node programs in 2.911 seconds.
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
Fatal error on PE 0> segmentation violation

Tristan Croll
Lecturer
Faculty of Health
Institute of Health and Biomedical Engineering
Queensland University of Technology
60 Musk Ave
Kelvin Grove QLD 4059 Australia
+61 7 3138 6443


This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:23:28 CST