NAMD 2.7b1 on Ranger/MPI Problems?

From: Jeff Wereszczynski (jmweresz_at_umich.edu)
Date: Fri Apr 17 2009 - 00:56:30 CDT

Hi All,
I have a system of ~490k atoms that I am trying to run on Ranger; however, it
runs for only 500-2000 steps before dying. Nothing of interest is printed in
the log file:

WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 2000
Signal 15 received.
Signal 15 received.
Signal 15 received.
TACC: MPI job exited with code: 1
TACC: Shutting down parallel environment.
TACC: Shutdown complete. Exiting.

Whereas in the job output/error file I get this:

TACC: Done.
193 - MPI_IPROBE : Communicator argument is not a valid communicator
Special bit pattern b5000000 in communicator is incorrect. May indicate an
out-of-order argument or a freed communicator
[193] [] Aborting Program!
Exit code -3 signaled from i182-206.ranger.tacc.utexas.edu
Killing remote processes...Abort signaled by rank 193: Aborting program !
MPI process terminated unexpectedly
DONE

Here is my job script:

#!/bin/bash
#$ -V # Inherit the submission environment
#$ -N namd # Job Name
#$ -j y # combine stderr & stdout into stdout
#$ -o namd # Name of the output file (eg. myMPI.oJobID)
#$ -pe 16way 256 # Requests 16 tasks/node, 256 cores total
#$ -q normal # Queue name
#$ -l h_rt=2:00:00 # Run time (hh:mm:ss) - 2 hours

module unload mvapich2
module unload mvapich
module swap pgi intel
module load mvapich

export VIADEV_SMP_EAGERSIZE=64
export VIADEV_SMPI_LENGTH_QUEUE=256
export VIADEV_ADAPTIVE_RDMA_LIMIT=0
export VIADEV_RENDEZVOUS_THRESHOLD=50000

ibrun tacc_affinity /share/home/00288/tg455591/NAMD_2.7b1_Linux-x86_64/namd2 namd.inp >namd.log
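
For reference, the layout implied by the "-pe 16way 256" line can be worked out with a small sketch like the one below. The pe_layout helper is hypothetical and not part of the job script; it assumes 16 cores per Ranger node and that the second number is the total core count requested.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original job script): given the two
# numbers from a "#$ -pe <N>way <M>" line, print the node and MPI task counts.
# Assumption: 16 cores per node, and M is the total number of cores requested.
pe_layout() {
    local tasks_per_node=$1
    local total_cores=$2
    local cores_per_node=16
    local nodes=$(( total_cores / cores_per_node ))
    echo "nodes=${nodes} tasks=$(( nodes * tasks_per_node ))"
}

pe_layout 16 256   # the request used in the script above
```

Under those assumptions the script above asks for 16 nodes running 256 MPI tasks in total.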

Any ideas what I might be doing wrong? I would guess from the error message
that it's some sort of MPI problem. I've tried varying the number of processors
(from 64 to 1104), commenting out the "export ...." lines that control MPI
parameters, and taking out the tacc_affinity part, but nothing seems to help.
I've never had these problems with smaller systems. Has anyone else had
these sorts of issues? Any suggestions on how to fix them?

Thanks,
Jeff
