Error on renaming file: Permission denied

From: Stephen Hicks (sdh33_at_cornell.edu)
Date: Mon Jun 04 2007 - 00:38:28 CDT

Hi,

I keep getting a peculiar error in the middle of my computations.
After about half a day and several hundred thousand time steps (with
restarts being recorded every few hundred), my process quits with the
message

=\/=
WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 354000
ERROR: Error on renaming file relaxed_wb_eq.restart.xsc to
relaxed_wb_eq.restart.xsc.old: Permission denied
FATAL ERROR: Error opening XSC restart file relaxed_wb_eq.restart.xsc:
Permission denied
Stack Traceback:
  [0] _ZN10Controller20outputExtendedSystemEi+0x325 [0x8217891]
  [1] _ZN10Controller9integrateEv+0x5c7 [0x821bc33]
  [2] _ZN10Controller9algorithmEv+0x518 [0x8214500]
  [3] _ZN10Controller9threadRunEPS_+0xc [0x82215a0]
  [4] $HOME/local/bin/namd2 [0x831a849]
  [5] Charm++ Runtime: Converse thread (qt_args+0x72 [0x8394b92])
=/\=

My STDERR says

=\/=
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: Error opening XSC restart file
relaxed_wb_eq.restart.xsc: Permission denied
Fatal error on PE 0> FATAL ERROR: Error opening XSC restart file
relaxed_wb_eq.restart.xsc: Permission denied
=/\=

I can't imagine why it's all of a sudden unable to write this file,
since it's clearly done it thousands of times already with no problem.
Could this error be caused by a fickle SSH connection? Are there
workarounds I could try, or ways to investigate further what the
problem is? One troubleshooting difficulty is that it seems to occur
randomly and only very rarely, so it takes a long time to happen.

I'm running on a Dual Athlon MP cluster. I request 6 nodes with 2
processors per node using the following PBS script:

=\/=
#!/bin/bash
#PBS -l nodes=6:ppn=2,pmem=1800mb,mem=1800mb,ncpus=12,cput=32:00:00
#PBS -N NAMD
#PBS -k oe
#PBS -m abe
nodefile=namd2.nodelist
uniq $PBS_NODEFILE | sed -e 's/^/host /' -e '1s/^/group main\n/' > $nodefile
charmrun ++remote-shell ssh +p12 ++nodelist $nodefile \
  $HOME/local/bin/namd2 relaxed_wb_eq.conf
=/\=

The nodefile which gets written looks like this:

=\/=
group main
host node22
host node19
host node12
host node10
host node2
host node29
=/\=
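
For reference, here is a minimal sketch of what the nodefile pipeline in
the PBS script does. It assumes GNU sed (for the \n in the replacement
text) and uses a made-up four-line stand-in for $PBS_NODEFILE; the
variable names sample_nodefile and nodelist are only for illustration
and do not appear in my actual job script.

```shell
#!/bin/bash
# Hypothetical stand-in for $PBS_NODEFILE: with ppn=2, PBS lists each
# node once per requested processor.
sample_nodefile=$(mktemp)
printf '%s\n' node22 node22 node19 node19 > "$sample_nodefile"

# Same pipeline as in the PBS script: collapse adjacent duplicates,
# prefix every line with "host ", then prepend the "group main" header
# to the first line (the \n in the replacement relies on GNU sed).
nodelist=$(uniq "$sample_nodefile" | sed -e 's/^/host /' -e '1s/^/group main\n/')

echo "$nodelist"
rm -f "$sample_nodefile"
```

With six distinct nodes in $PBS_NODEFILE, the same pipeline yields the
seven-line file shown above: one "group main" header followed by one
"host" line per node.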

and the first few lines of output from NAMD are

=\/=
Info: NAMD 2.6 for Linux-i686
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: and send feedback or bug reports to namd_at_ks.uiuc.edu
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 50900 for net-linux-iccstatic
Info: Built Wed Aug 30 12:54:30 CDT 2006 by jim on kyoto.ks.uiuc.edu
Info: 1 NAMD 2.6 Linux-i686 12 node22 shicks
Info: Running on 12 processors.
Info: 7376 kB of memory in use.
Info: Memory usage based on mallinfo
Info: Configuration file is relaxed_wb_eq.conf
TCL: Suspending until startup complete.
=/\=

Any help is greatly appreciated!

--
Steve Hicks
PhD Candidate
Cornell Physics
