RE: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

From: Tristan Croll (tristan.croll_at_qut.edu.au)
Date: Tue Jul 16 2013 - 00:59:49 CDT

Hi Vibhor,

Sure - copied below. I was using this one as a test to replicate the error, hence the short walltime.

Cheers,

Tristan

#!/bin/bash -l
#PBS -N exon11
#PBS -l select=32:ncpus=2:mem=4G:mpiprocs=2:cputype=E5-2670
#PBS -j oe
#PBS -m bea
#PBS -l walltime=00:05:00

module load openmpi
module load namd/2.9-ibverbs

cd $PBS_O_WORKDIR

NAMD=`which namd2`

charmrun ++mpiexec ++verbose +p64 $NAMD ./equil-23_test2.namd > ./equil-23_test2.log

From: Vibhor Agrawal [mailto:vibhora_at_g.clemson.edu]
Sent: Tuesday, 16 July 2013 3:57 PM
To: Tristan Croll
Cc: Namd List
Subject: Re: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

Hello Tristan:
Can you attach your supercomputer job script, so that the problem is clearer?

Vibhor
On Jul 16, 2013 1:31 AM, "Tristan Croll" <tristan.croll_at_qut.edu.au> wrote:
Hi,

We've just gotten NAMD 2.9 running on our local SGI Altix cluster using OpenMPI. In general it's running great, with near-linear scaling up to 512 cores. However, I'm running into strange crashes when I have multiple jobs queued up (using the "-W depend=afterany:" flag in qsub). If the continuation job starts immediately after completion of the previous job, charmrun crashes on startup. All I get in the logfile is a core dump, while the stdout record gives me the message copied below.
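
As a rough illustration of the queued setup described above, a chain of continuation jobs could be submitted along these lines. This is only a sketch: the submit_chain helper and the QSUB override are hypothetical names introduced here, and it assumes that qsub prints the new job ID on standard output.

```shell
#!/bin/bash
# Sketch: submit the same PBS script several times, each run held until the
# previous one has ended (for any reason) via "-W depend=afterany:<jobid>".
# QSUB may be pointed at a stand-in command for a dry run; that override is
# an assumption of this sketch, not a PBS feature.
QSUB=${QSUB:-qsub}

submit_chain() {
    chain_script=$1      # PBS job script to submit
    chain_count=$2       # number of jobs in the chain
    chain_prev=""        # job ID of the previously submitted job
    chain_i=0
    while [ "$chain_i" -lt "$chain_count" ]; do
        if [ -z "$chain_prev" ]; then
            # First job: no dependency.
            chain_id=$("$QSUB" "$chain_script") || return 1
        else
            # Later jobs: start only after the previous job has finished.
            chain_id=$("$QSUB" -W "depend=afterany:$chain_prev" "$chain_script") || return 1
        fi
        echo "$chain_id"
        chain_prev=$chain_id
        chain_i=$((chain_i + 1))
    done
}
```

For example, "submit_chain equil.pbs 4" would queue four runs of a hypothetical equil.pbs, each one held until its predecessor ends, including when it is killed at the walltime limit.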

In these runs, I've been doing what's always worked fine for me in the past: telling NAMD to run more steps than it can do in the allotted time, and allowing PBS to kill the job once it hits the walltime. From watching how the abovementioned fault develops with a series of smaller jobs, however, it seems the most reasonable explanation is that PBS is not giving sufficient time for some cleanup task before it starts the new run. Is this a known problem with large NAMD jobs, or is it more likely to be a cluster-specific problem?
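
One way to probe the cleanup-time hypothesis would be to wrap the launch in a short retry loop, so that a startup crash is followed by a pause and another attempt. The sketch below is an illustration under stated assumptions, not a tested fix: retry_launch, its arguments, and the delay values are hypothetical. If a retry after a pause succeeds where an immediate start fails, that would point towards leftover state from the previous job.

```shell
#!/bin/bash
# Sketch: run a command, and if it exits non-zero, wait and try again.
# Arguments: <max attempts> <seconds between attempts> <command...>
retry_launch() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_try=1
    while :; do
        # Run the launch command exactly as given; stop on success.
        if "$@"; then
            return 0
        fi
        # Give up once the allowed number of attempts is used.
        if [ "$retry_try" -ge "$retry_max" ]; then
            return 1
        fi
        echo "launch attempt $retry_try failed; sleeping ${retry_delay}s" >&2
        sleep "$retry_delay"
        retry_try=$((retry_try + 1))
    done
}

# Hypothetical use inside the job script (values are placeholders):
# retry_launch 3 30 charmrun ++mpiexec ++verbose +p64 "$NAMD" ./equil-23_test2.namd > ./equil-23_test2.log
```

Note that this retries after any non-zero exit, not only startup crashes, so it is better suited to a short diagnostic job than to a production run.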

Many thanks,

Tristan

Lyra: OpenMPI (1.4.5) module loaded.
Lyra: NAMD (2.9) module loaded.
Lyra: NAMD_2.9 Linux-x86_64-ibverbs
Charmrun> charmrun started...
Charmrun> mpiexec started
Charmrun> node programs all started
Charmrun> node programs all connected
Charmrun> started all node programs in 2.911 seconds.
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
Fatal error on PE 0> segmentation violation

Tristan Croll
Lecturer
Faculty of Health
Institute of Health and Biomedical Engineering
Queensland University of Technology
60 Musk Ave
Kelvin Grove QLD 4059 Australia
+61 7 3138 6443


This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:23:28 CST