jobs get stuck using NAMDv2.7 and stop producing output in the log-file

From: harish vashisth (harish.vashisth_at_gmail.com)
Date: Fri Dec 17 2010 - 10:20:10 CST

Dear NAMD users,
   I posted a similar query more than a month ago but did not get
any replies. I am still stuck, so I am describing the problem again in
more detail in the hope that someone can help.

 Problem description

 I am trying to launch an equilibration run from the restart files of a
previous run, using an MPI build of NAMD 2.7 compiled from source. The
job starts out fine and runs for a few hours, and after that it stops
producing any output in the log file and stops updating the DCD file as
well. There is no error message or warning anywhere in the log file. I
have attached the input config file, the output log file, and the job
submission script in a tarred folder along with this email. I also
looked at the individual nodes where the job is running, and I can see 8
namd2 processes per node, which I think is normal for a dual quad-core
node.

 Running "top" on individual nodes shows ~100% user CPU and hardly any
system usage. I have run this job more than ten times, along with some
other equilibration jobs on the same nodes, and all of them stop
producing output at different time steps in the log file. The jobs never
crash; they appear to be running, but in a kind of frozen state. I am
not sure whether this has anything to do with the input files, the way
NAMD was compiled, or a hardware issue.

In other words, the NAMD processes run well for some apparently random
period of time, during which the output files are updated every few
minutes. At some point the output files stop receiving updates, while
the processes on all nodes keep using 100% CPU. 'strace' shows all
processes calling "poll" and timing out. "netstat -ap" shows a reduced
number of open sockets compared to a "running" job. It appears that
every process is waiting for the other processes to speak. There are no
errors logged to the NAMD output files, no errors logged to any of the
system logs, and no errors on the IB fabric.
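
Because the frozen jobs keep using CPU and never exit, the only visible
symptom is that the output files stop changing. A minimal sketch of a
watchdog check for that symptom is given below. The function names, the
file names in the usage comment, and the 30 minute threshold are all
hypothetical and are not taken from the attached scripts.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds have passed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(paths, max_idle_seconds, now=None):
    """Return True if none of the files changed within max_idle_seconds."""
    return all(seconds_since_update(p, now) > max_idle_seconds for p in paths)


# Hypothetical usage: flag the job if neither the log nor the DCD file
# has been written for 30 minutes.
#   is_stalled(["equil.log", "equil.dcd"], 30 * 60)
```

Since the input and output directories are NFS mounted, modification
times seen from another host may lag somewhat, so the threshold should
probably be generous compared to the normal few-minute update interval.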

Other applications using the same nodes and IB fabric run without issues.

When NAMD is running it appears to run fast.

Info: Benchmark time: 64 CPUs 0.0209669 s/step 0.121336 days/ns 314.012 MB memory
Info: Benchmark time: 64 CPUs 0.0206692 s/step 0.119614 days/ns 314.012 MB memory
Info: Benchmark time: 64 CPUs 0.0206114 s/step 0.119279 days/ns 314.012 MB memory
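
As a sanity check on these numbers, the sketch below parses one
benchmark line and recomputes days/ns from s/step. The 2 fs timestep is
an assumption inferred from the ratio of the two reported values, since
the config file is not shown here; the regular expression and function
names are illustrative only and expect the benchmark entry on a single
line.

```python
import re

# Pattern for a NAMD "Benchmark time" log line as quoted in this post.
BENCH_RE = re.compile(
    r"Info: Benchmark time: (\d+) CPUs ([\d.]+) s/step "
    r"([\d.]+) days/ns ([\d.]+)\s+MB memory"
)


def parse_benchmark(line):
    """Return the fields of a benchmark line as a dict, or None if no match."""
    m = BENCH_RE.search(line)
    if m is None:
        return None
    cpus, s_per_step, days_per_ns_reported, memory_mb = m.groups()
    return {
        "cpus": int(cpus),
        "s_per_step": float(s_per_step),
        "days_per_ns": float(days_per_ns_reported),
        "memory_mb": float(memory_mb),
    }


def days_per_ns(s_per_step, timestep_fs):
    """Convert wall-clock seconds per step into days per simulated ns."""
    steps_per_ns = 1.0e6 / timestep_fs  # 1 ns = 1e6 fs
    return s_per_step * steps_per_ns / 86400.0  # 86400 s in a day


line = ("Info: Benchmark time: 64 CPUs 0.0209669 s/step "
        "0.121336 days/ns 314.012 MB memory")
rec = parse_benchmark(line)
# Assumed 2 fs timestep (not confirmed by the post itself).
print(rec["days_per_ns"], days_per_ns(rec["s_per_step"], 2.0))
```

With the assumed 2 fs timestep the recomputed value agrees with the
reported 0.121336 days/ns to within rounding, i.e. roughly 8 ns per day
on 64 CPUs while the job is actually making progress.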

Over a period of 1 week most jobs will eventually enter this state.

More details on our cluster and type of nodes and operating system are
given below.

Cluster configuration

OS        openSuSE 11.1
MPI       openmpi-1.4.1
Compiler  Intel 11.1
OFED      1.5
NAMD      2.7 compiled with MPI and "-g"

Input and output directories are NFS mounted

IB

hca_id: qib0
        transport: InfiniBand (0)
        fw_ver: 0.0.0
        node_guid: 0011:7500:0079:b5c0
        sys_image_guid: 0011:7500:0079:b5c0
        vendor_id: 0x1175
        vendor_part_id: 29474
        hw_ver: 0x1
        board_id: InfiniPath_QLE7340
        phys_port_cnt: 1
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 4096 (5)
                        active_mtu: 2048 (4)
                        sm_lid: 1
                        port_lid: 5
                        port_lmc: 0x00

Regards,
-Harish Vashisth

