Note the difference between:
Charm++ : the parallelization technology used by NAMD
CHARMM : the feature-rich molecular simulation and analysis package
Make sure to give the path to the charmm executable, and if the system is big enough, compile charmm moi xxlarge.
> To be clear: this only crashes when it’s a queued continuation
> dependent on the completion of a previous job, only when it starts
> running /immediately/ after the previous job completes (even a few
> seconds in ‘Q’ status on PBS seems to make all the difference), and
> only when the previous job is killed due to exceeding its walltime. I
> tried putting a “sleep 60” command just before the execution of
> charmrun, but curiously that didn’t seem to help. There’s an obvious
> solution: stop being lazy and calculate a number of steps that will
> complete within the allotted walltime – but I thought I should put it
> out there for discussion anyway.
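
The "obvious solution" mentioned above can be sketched as a small bash helper. This is only an illustration under stated assumptions: the seconds-per-step figure has to come from your own benchmark of the system (for example the "Benchmark time" lines NAMD prints in its log), the safety margin is an arbitrary choice, and the function name steps_for_walltime is hypothetical. The resulting number would then go into the run command of the NAMD configuration file.

```shell
#!/bin/bash
# Hypothetical helper: estimate how many MD steps fit into a PBS walltime.
# Arguments:
#   $1  walltime as HH:MM:SS (same format as "#PBS -l walltime=")
#   $2  measured seconds per step (assumption: from your own benchmark)
#   $3  safety margin in seconds, reserved for startup and final output
steps_for_walltime() {
    local walltime=$1 sec_per_step=$2 margin=$3
    local h m s
    IFS=: read -r h m s <<< "$walltime"
    # Force base 10 so fields such as "08" are not parsed as octal.
    local total=$(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
    # awk does the floating-point division; int() truncates toward zero,
    # and a negative result (margin longer than walltime) is clamped to 0.
    awk -v t="$total" -v m="$margin" -v sps="$sec_per_step" \
        'BEGIN { n = int((t - m) / sps); if (n < 0) n = 0; print n }'
}

# Example: the 5-minute test job, assuming 0.25 s/step and a 60 s margin.
steps_for_walltime 00:05:00 0.25 60
```

With these assumed numbers the helper prints 960. In practice you would probably round down further to a multiple of your restart frequency, so the last restart file is written before the job ends.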
> Hi Vibhor,
Sure – copied below. I was using this one as a test to replicate the error, hence the short walltime.
> #!/bin/bash -l
> #PBS -N exon11
> #PBS -l select=32:ncpus=2:mem=4G:mpiprocs=2:cputype=E5-2670
> #PBS -j oe
> #PBS -m bea
> #PBS -l walltime=00:05:00
> module load openmpi
> module load namd/2.9-ibverbs
> cd $PBS_O_WORKDIR
> NAMD=`which namd2`
> charmrun ++mpiexec ++verbose +p64 $NAMD ./equil-23_test2.namd > ./equil-23_test2.log
> Hello Tristan:
> Hi,
> We’ve just gotten NAMD 2.9 running on our local SGI Altix cluster
> using openMPI. In general it’s running great, with near-linear
> scaling up to 512 cores. However, I’m running into strange crashing
> errors when I have multiple jobs queued up (using the “-W
> depend=afterany:” flag in qsub). If the continuation job starts
> immediately after completion of the previous job, charmrun crashes out
> on startup. All I get in the logfile is a core dump, while the stdout
> In these runs, I’ve been doing what’s always worked fine for me in the
> past: telling NAMD to run more steps than it can do in the allotted
> time, and allowing PBS to kill the job once it hits the walltime.
> From watching how the abovementioned fault develops with a series of
> smaller jobs, however, it seems the most reasonable explanation is
> that PBS is not giving sufficient time for some cleanup task before it
> starts the new run. Is this a known problem with large NAMD jobs, or
> is it more likely to be a cluster-specific problem?
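
For reference, the kind of chained submission described above can be sketched as follows. It assumes a PBS-style qsub that prints the new job ID on stdout and accepts "-W depend=afterany:<jobid>", as mentioned in the post. The submit_chain helper, the QSUB override and the script names in the usage line are hypothetical, and this only reproduces the queuing pattern; it does not address the crash itself.

```shell
#!/bin/bash
# Hypothetical helper: submit job scripts as a chain, each one depending
# on the previous job via "-W depend=afterany:<jobid>".
# QSUB can be overridden to test the logic without a real scheduler.
QSUB=${QSUB:-qsub}

submit_chain() {
    local prev="" script jobid
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job in the chain: no dependency.
            jobid=$($QSUB "$script") || return 1
        else
            # Later jobs start after the previous one ends, whether it
            # finished normally or was killed (e.g. at its walltime).
            jobid=$($QSUB -W "depend=afterany:$prev" "$script") || return 1
        fi
        echo "$script -> $jobid"
        prev=$jobid
    done
}
```

Usage would look like "submit_chain equil-23.pbs equil-24.pbs", with one job script per continuation segment.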
> Lyra: OpenMPI (1.4.5) module loaded.
> Lyra: NAMD (2.9) module loaded.
> Lyra: NAMD_2.9 Linux-x86_64-ibverbs
> Charmrun> charmrun started...
> Charmrun> mpiexec started
> Charmrun> node programs all started
> Charmrun> node programs all connected
> Charmrun> started all node programs in 2.911 seconds.
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory
> paranoid' (memory paranoid requires '+netpoll' at runtime).
Fatal error on PE 0> segmentation violation
> Tristan Croll
> Lecturer
> Faculty of Health
> Institute of Health and Biomedical Engineering
> Queensland University of Technology
> 60 Musk Ave
> Kelvin Grove QLD 4059 Australia
> +61 7 3138 6443
