RE: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

From: Tristan Croll
Date: Tue Jul 16 2013 - 02:40:08 CDT

OK, I thought I'd try a little workaround on this, but my workaround has thrown up a new bug. I passed an environment variable PBS_WALLTIME (with a value of 24:00:00) through to NAMD, and added this Tcl script to the config file to force it to stop ~10 minutes before the allotted time limit:

run 1000

set time_start [clock seconds]
set time_elapsed 0

set timelist [split $env(PBS_WALLTIME) :]

set walltime [expr [lindex $timelist 0] * 3600]
set walltime [expr [lindex $timelist 1] * 60 + $walltime]
set walltime [expr [lindex $timelist 2] + $walltime]
set walltime [expr $walltime - 600]

while {$time_elapsed <= $walltime} {
  run 10000
  set time_elapsed [expr [clock seconds] - $time_start]
}

Strangely, NAMD crashes with the following message:

TCL: Setting parameter clock to seconds
FATAL ERROR: Setting parameter clock from script failed!

I've checked and double-checked the syntax, and each line with a clock command runs fine in VMD TkConsole. Can't for the life of me figure out what's going wrong.

From: Tristan Croll
Sent: Tuesday, 16 July 2013 4:05 PM
To: 'Vibhor Agrawal'
Cc: 'Namd List'
Subject: RE: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

To be clear: this only crashes when it's a queued continuation dependent on the completion of a previous job, only when it starts running immediately after the previous job completes (even a few seconds in 'Q' status on PBS seems to make all the difference), and only when the previous job is killed due to exceeding its walltime. I tried putting a "sleep 60" command just before the execution of charmrun, but curiously that didn't seem to help. There's an obvious solution: stop being lazy and calculate a number of steps that will complete within the allotted walltime - but I thought I should put it out there for discussion anyway.
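For reference, the "calculate a number of steps" approach could be done in the job script itself rather than in the NAMD config. This is only a rough sketch: the function names, the 600-second margin, the 0.05 s/step figure and the 10000-step block size are all illustrative assumptions, and the seconds-per-step value would have to come from a benchmark of the actual system.

```shell
#!/bin/sh
# Convert an HH:MM:SS walltime string to seconds.
# awk reads each field as a decimal number, so "08" is treated as 8.
walltime_to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Estimate how many steps fit into the walltime, leaving a safety margin.
# Arguments (all values here are assumptions supplied by the user):
#   $1 walltime as HH:MM:SS
#   $2 safety margin in seconds
#   $3 measured seconds per step
#   $4 block size to round down to (e.g. the restart frequency)
steps_within_walltime() {
  _budget=$(( $(walltime_to_seconds "$1") - $2 ))
  awk -v b="$_budget" -v sps="$3" -v blk="$4" 'BEGIN {
    n = int(b / sps / blk) * blk   # round down to a whole number of blocks
    if (n <= 0) n = 0              # no time left after the margin
    print n
  }'
}

steps_within_walltime 24:00:00 600 0.05 10000   # prints 1710000
```

The result would then be written into the "run" line of the config file before charmrun is launched, so the job ends on its own instead of being killed by PBS.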

From: Tristan Croll
Sent: Tuesday, 16 July 2013 4:00 PM
To: 'Vibhor Agrawal'
Cc: Namd List
Subject: RE: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

Hi Vibhor,

Sure - copied below. I was using this one as a test to replicate the error, hence the short walltime.



#!/bin/bash -l
#PBS -N exon11
#PBS -l select=32:ncpus=2:mem=4G:mpiprocs=2:cputype=E5-2670
#PBS -j oe
#PBS -m bea
#PBS -l walltime=00:05:00

module load openmpi
module load namd/2.9-ibverbs


NAMD=`which namd2`

charmrun ++mpiexec ++verbose +p64 $NAMD ./equil-23_test2.namd > ./equil-23_test2.log

From: Vibhor Agrawal
Sent: Tuesday, 16 July 2013 3:57 PM
To: Tristan Croll
Cc: Namd List
Subject: Re: namd-l: NAMD 2.9 ibverbs: Charmrun crashing on checkpoint-restart

Hello Tristan:
Can you attach your supercomputer job submission script, so that the problem is clearer?

On Jul 16, 2013 1:31 AM, "Tristan Croll" wrote:

We've just gotten NAMD 2.9 running on our local SGI Altix cluster using openMPI. In general it's running great, with near-linear scaling up to 512 cores. However, I'm running into strange crashing errors when I have multiple jobs queued up (using the "-W depend=afterany:" flag in qsub). If the continuation job starts immediately after completion of the previous job, charmrun crashes out on startup. All I get in the logfile is a core dump, while the stdout record gives me the message copied below.
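To make the queueing setup concrete, a chain of this kind can be submitted roughly as follows. This is a sketch, not the exact commands used here: the function name submit_chain, the script name in the usage comment and the QSUB override variable are hypothetical, and it assumes qsub prints the new job ID on stdout.

```shell
#!/bin/sh
# Submit a chain of jobs in which each job is held until the previous one
# has ended, whatever its exit status (afterany).
# Arguments: $1 job script, $2 number of jobs in the chain.
# QSUB is a hypothetical override so the function can be dry-run
# with a stand-in command; by default the real qsub is called.
submit_chain() {
  script=$1
  count=$2
  prev=$(${QSUB:-qsub} "$script") || return 1
  echo "$prev"
  i=1
  while [ "$i" -lt "$count" ]; do
    prev=$(${QSUB:-qsub} -W "depend=afterany:$prev" "$script") || return 1
    echo "$prev"
    i=$((i + 1))
  done
}

# Usage (hypothetical script name):
#   submit_chain equil.pbs 4
```

Because afterany releases the next job as soon as the previous one terminates for any reason, a job killed at its walltime hands over to its successor with no gap, which is the situation in which the crash appears.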

In these runs, I've been doing what's always worked fine for me in the past: telling NAMD to run more steps than it can do in the allotted time, and allowing PBS to kill the job once it hits the walltime. From watching how the abovementioned fault develops with a series of smaller jobs, however, it seems the most reasonable explanation is that PBS is not giving sufficient time for some cleanup task before it starts the new run. Is this a known problem with large NAMD jobs, or is it more likely to be a cluster-specific problem?

Many thanks,


Lyra: OpenMPI (1.4.5) module loaded.
Lyra: NAMD (2.9) module loaded.
Lyra: NAMD_2.9 Linux-x86_64-ibverbs
Charmrun> charmrun started...
Charmrun> mpiexec started
Charmrun> node programs all started
Charmrun> node programs all connected
Charmrun> started all node programs in 2.911 seconds.
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
Fatal error on PE 0> segmentation violation

