FATAL error when running Replica Exchange Method on Franklin

From: Guoxiong Su (gsu3_at_mail.uh.edu)
Date: Thu May 12 2011 - 14:03:58 CDT

Dear all,

I am trying to run REMD on Franklin, a Cray XT4 cluster at NERSC. I am using 4
replicas with 8 processors per replica. The spawn command is "aprun -n
$procs_per_host $namd $conf > $log". I use the csh method since ssh is not
allowed on Franklin. The following is my output file:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
TEMPERATURE 0: 280
TEMPERATURE 1: 287
TEMPERATURE 2: 293
TEMPERATURE 3: 300
SPAWNING output/output.job0.0.log.nrc on 08631
SPAWNING output/output.job0.1.log.nrc on 08649
SPAWNING output/output.job0.2.log.nrc on 12619
SPAWNING output/output.job0.3.log.nrc on 12636
errpipe 1: ------------- Processor 0 Exiting: Called CmiAbort ------------
errpipe 1: Reason: FATAL ERROR:
errpipe 1: while executing
errpipe 1: "socket $server_host $server_port"
errpipe 1: invoked from within
errpipe 1: "set server_channel [socket $server_host $server_port]"
errpipe 1: (file "output.job0.1.log.nrc" line 6)
errpipe 1:
errpipe 1: aborting job:
errpipe 1: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
errpipe 1: [NID 08631] 2011-05-10 16:06:57 Apid 10545388: initiated application termination
disconnect errpipe 1

The NAMD log file is as follows:

Charm++> Running on MPI version: 2.1 multi-thread support: MPI_THREAD_SINGLE (max supported: MPI_THREAD_SINGLE)
Charm++> cpu topology info is being gathered.
Charm++> Running on 2 unique compute nodes (4-way SMP).
Info: NAMD CVS for CRAY-XT
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: and send feedback or bug reports to namd_at_ks.uiuc.edu
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60103 for mpi-crayxt
Info: Built Thu May 27 15:17:20 PDT 2010 by zz217 on nid00004
Info: Running on 8 processors.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00229788 s
Info: 89.0938 MB of memory in use based on /proc/self/stat
Info: Changed directory to output
Info: Configuration file is output.job0.0.log.nrc
FATAL ERROR:
    while executing
"socket $server_host $server_port"
    invoked from within
"set server_channel [socket $server_host $server_port]"
    (file "output.job0.0.log.nrc" line 6)
[0] Stack Traceback:
  [0] [0x8a03fc]
  [1] [0x400e4a]
  [2] [0x7dad20]
  [3] [0x405024]
  [4] [0x4050ba]
  [5] [0xa5ee3c]
  [6] [0x400229]
Application 10545390 exit codes: 1
Application 10545390 exit signals: Killed
Application 10545390 resources: utime 0, stime 0

I don't know why I am getting this FATAL ERROR. I would appreciate it if anyone
could help me.
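
Since the failing call in both logs is "socket $server_host $server_port", one
way to narrow this down might be to check whether a plain TCP connection to the
replica exchange server can be opened at all from the node where NAMD runs. The
following is only a minimal sketch in Python, assuming Python is available on
that node; the function name can_connect and the host and port passed to it are
hypothetical placeholders, not part of the NAMD replica exchange scripts.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Try to open a TCP connection to (host, port).

    Returns a tuple (ok, message). The host and port are whatever the
    replica exchange server reports; they are not taken from NAMD here.
    """
    try:
        # create_connection resolves the host name and connects in one step
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # OSError covers refused connections, timeouts, and
        # host name resolution failures
        return False, f"{type(exc).__name__}: {exc}"
```

Calling can_connect with the values that end up in $server_host and
$server_port would show whether the connection is refused, times out, or
fails at host name resolution, which the empty FATAL ERROR message above
does not reveal.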

Yours sincerely,

Guoxiong Su
Department of Physics
University of Houston
