Parallel TMD (2.7b) dies : "MPICH internal error"

From: Ali Emileh (ali.emileh_at_gmail.com)
Date: Mon Apr 20 2009 - 11:00:11 CDT

Hi,

I hope this is the right place to ask about this.
I am running a TMD simulation on Kraken using NAMD 2.7b (or at least that's
what I can tell from reading the runbatch script):

aprun -n $NUMPROCS /lustre/scratch/jphillip/NAMD_2.7b1/namd2 $CONFFILE >& $LOGFILE
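
One way to confirm the version, rather than inferring it from the binary path, is to read it from the log itself. The sketch below assumes the NAMD log contains a startup line of the form "Info: NAMD <version> for <platform>"; the function name is hypothetical, and the pattern should be adjusted if the log looks different.

```python
import re


def namd_version_from_log(log_text):
    """Return the version from the first 'Info: NAMD <version>' line, or None.

    Assumption: the log has a startup line such as
    'Info: NAMD 2.7b1 for <platform>'.
    """
    match = re.search(r"^Info: NAMD (\S+)", log_text, flags=re.MULTILINE)
    return match.group(1) if match else None
```

It could be called on the contents of the file named by $LOGFILE, for example namd_version_from_log(open(path).read()).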

It was supposed to run for 3500 ksteps, but died after ~2554 steps with the
following message:

======================
PE 0: MPICH internal error: Unable to find matching PUT_START_EVENT
[0] Un-matchable PUT_END event
EVENT: type = PTL_EVENT_PUT_END , link = 34, match_bits = 0x100000014400055f, rlen=128, mlen=128
aborting job:
Fatal event
[NID 11808]Apid 91158: initiated application termination
Application 91158 exit codes: 255
Application 91158 exit signals: Killed
Application 91158 resources: utime 0, stime 0
======================

I have had problems with parallel TMD in previous versions (always a RATTLE
algorithm failure), which I think was a known bug that was apparently fixed in
2.7b. I'm wondering whether this is another bug or whether I'm doing something
wrong. TMD is run for 3500 ksteps with no piston:

#############################################################
TMD on
TMDk 500
TMDOutputFreq 100
TMDFile target2.pdb
TMDFirstStep 5001
TMDLastStep 3500000
TMDFinalRMSD 0

#############################################################
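
For reference, my understanding of what these parameters do: the target RMSD is moved linearly from its initial value at TMDFirstStep to TMDFinalRMSD at TMDLastStep, and the restraint energy is U = (k / 2N) * (RMSD - RMSD*)^2 with k = TMDk and N the number of targeted atoms. The sketch below is only an illustration of that schedule, not NAMD's code. The function names are hypothetical, and the initial RMSD is an assumed input, since the config above does not set it.

```python
def tmd_target_rmsd(step, first_step, last_step, initial_rmsd, final_rmsd):
    """Target RMSD at `step`, linearly interpolated over the TMD window.

    Simplification: steps outside the window are clamped to the
    initial or final value.
    """
    if step <= first_step:
        return initial_rmsd
    if step >= last_step:
        return final_rmsd
    frac = (step - first_step) / (last_step - first_step)
    return initial_rmsd + frac * (final_rmsd - initial_rmsd)


def tmd_energy(k, n_atoms, rmsd, target_rmsd):
    """Restraint energy U = (k / 2N) * (RMSD - RMSD*)^2."""
    return 0.5 * k / n_atoms * (rmsd - target_rmsd) ** 2
```

With the values above, tmd_target_rmsd(step, 5001, 3500000, initial_rmsd, 0.0) gives the target at any step once an initial RMSD is supplied.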

I'd highly appreciate any kind of input.

Thank you
Ali
