From: harish vashisth (harish.vashisth_at_gmail.com)
Date: Fri Dec 17 2010 - 10:20:10 CST
Dear NAMD users,
I posted a similar query more than a month ago but did not get
any replies. I am still stuck, so I am writing again in more detail
in the hope that someone can help.
I am trying to launch an equilibration run using restart files from a
previous run, using the MPI version of NAMD 2.7 compiled from source.
The problem is that the job starts out fine and runs for a while, and
after that it stops producing any output in the log file and stops
updating the DCD file as well. There is no error message or warning
anywhere in the log file. I have attached the input config file, the
output log file, and the job submission script as a tar archive with
this email. I also looked at the individual nodes where the job is
running, and I can see 8 namd2 processes per node, which I think is
normal for a dual quad-core node.
Running "top" on individual nodes shows ~100% user and hardly any
system usage. I ran this job more than ten times, along with some other
equilibration jobs on the same nodes, and all of them stop producing
output at different time steps in the log file. The jobs never crash;
they seem to be running, but in a kind of frozen state. I am not sure
whether this has to do with the files, the way NAMD was compiled,
or some other issue.
In other words, NAMD processes run well for some apparently random
period of time.
Output files are updated every few minutes. At some point output files
will stop receiving updates. Processes on all nodes will be using 100%
CPU. 'strace' shows all processes running "poll" and timing out.
"netstat -ap" shows a reduced number of open sockets when compared to a
"running" job. It appears that all processes are waiting for all other
processes to speak. There are no errors logged to the NAMD output
files, no errors logged to any of the system logs, and no errors on the
Other applications using the same nodes and IB run without issues.
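Since the hang produces no error and the processes keep using 100% CPU, one way to notice it early is to watch the modification times of the output files. The following is only a minimal sketch of that idea, not something NAMD provides: the function names, the file names in the usage note, and the threshold are all hypothetical. Because the output directories are NFS mounted, the reported modification time may lag a little, so a generous threshold is probably sensible.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds have passed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(paths, max_idle_seconds, now=None):
    """Return True if none of the files was written within max_idle_seconds.

    All files must be idle before the job is reported as stalled, so a log
    file that is still growing is enough to count the job as alive.
    """
    return all(seconds_since_update(p, now) > max_idle_seconds for p in paths)
```

For example, with hypothetical output names, `is_stalled(["equil.log", "equil.dcd"], 30 * 60)` would report a stall only if neither file had been touched for half an hour, which is well beyond the "every few minutes" update interval described above.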
When NAMD is running it appears to run fast.
Info: Benchmark time: 64 CPUs 0.0209669 s/step 0.121336 days/ns 314.012
Info: Benchmark time: 64 CPUs 0.0206692 s/step 0.119614 days/ns 314.012
Info: Benchmark time: 64 CPUs 0.0206114 s/step 0.119279 days/ns 314.012
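To put these benchmark lines in more familiar units, the sketch below parses the fields that are visible in them and converts days/ns into ns/day. It also back-calculates the timestep implied by the s/step and days/ns columns. The regular expression and function names are my own and only cover the fields shown above; the trailing number on each line is ignored.

```python
import re

# Matches the visible fields of a NAMD "Benchmark time" log line.
BENCH = re.compile(
    r"Info: Benchmark time: (\d+) CPUs ([\d.]+) s/step ([\d.]+) days/ns"
)


def parse_benchmark(line):
    """Return (cpus, s_per_step, days_per_ns), or None if the line differs."""
    m = BENCH.search(line)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)), float(m.group(3))


def ns_per_day(days_per_ns):
    """Convert days per nanosecond into nanoseconds per day."""
    return 1.0 / days_per_ns


def implied_timestep_fs(s_per_step, days_per_ns):
    """Timestep in fs implied by the two rates (86400 s/day, 1e6 fs/ns)."""
    steps_per_ns = days_per_ns * 86400.0 / s_per_step
    return 1.0e6 / steps_per_ns


line = "Info: Benchmark time: 64 CPUs 0.0209669 s/step 0.121336 days/ns 314.012"
cpus, s_per_step, days_per_ns = parse_benchmark(line)
print(cpus, round(ns_per_day(days_per_ns), 2),
      round(implied_timestep_fs(s_per_step, days_per_ns), 2))
```

For the first line this works out to roughly 8.24 ns/day on 64 CPUs, and the two rates are consistent with a 2 fs timestep.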
Over a period of 1 week most jobs will eventually enter this state.
More details on our cluster, node type, and operating system are below:
OS openSuSE 11.1
Compiler Intel 11.1
NAMD 2.7 compiled with MPI and "-g"
Input and output directories are NFS mounted
transport: InfiniBand (0)
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 2048 (4)
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:56:29 CST