Runaway cuda-enabled namd2 processes

Date: Tue Dec 03 2013 - 21:31:42 CST


I have built and installed NAMD v2.9 on our GPU cluster (see below for the spec). I have found that when a NAMD job dies (e.g. it crashes or is qdel’ed), the parent process quits as expected but the child processes keep running. For example, 16 processes on a compute node just won’t die and are in the “D” state (according to “top”), and “nvidia-smi” shows 3 processes using all 3 GPUs on the node. The batch system is not aware of this and so schedules the next job onto the node, causing all sorts of problems (e.g. CUDA code doesn’t run, or the new namd2 processes share the GPUs with the runaway processes from the previous batch job and perform poorly).

In many cases, even “kill -9” can’t terminate the runaway processes.
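
For context, a process in the “D” state is in uninterruptible sleep inside the kernel, and signals (including SIGKILL) are not acted on until it leaves that state, which is consistent with “kill -9” having no effect. Below is a minimal sketch of how a node could be checked for such leftovers before the next job lands on it. The function name stuck_pids is hypothetical, and calling it from a Torque prologue or epilogue script is an assumption rather than a tested setup. It only detects the processes; it does not get rid of them.

```shell
#!/bin/sh
# Hypothetical helper (not part of NAMD or Torque): print the PIDs of
# processes with the given command name that are in uninterruptible
# sleep ("D" state). It reads "PID STAT COMMAND" lines on stdin.
stuck_pids() {
    # $1 = command name to look for, e.g. namd2
    # STAT may carry extra flags (e.g. "Dl"), so match on the first letter.
    awk -v name="$1" '$3 == name && $2 ~ /^D/ { print $1 }'
}

# Example use on a compute node; the empty "=" headers suppress the
# ps header line:
#   ps -e -o pid= -o stat= -o comm= | stuck_pids namd2
```

If the command prints any PIDs, the node still holds runaway namd2 processes, and a prologue script could refuse the job or flag the node for the administrators instead of letting the new job share the GPUs.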

Have you guys seen this issue and is there a solution?


Other info that may be useful:

I followed the install notes on the NAMD web site to build NAMD v2.9. Charm++ 6.4.0 was built with:

env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production

And NAMD was configured with:

./config Linux-x86_64-icc --charm-arch mpi-linux-x86_64 --with-cuda --cuda-prefix /apps/cuda/5.5/cuda

Note that CUDACC in “arch/Linux-x86_64.cuda” was changed to:

CUDACC=$(CUDADIR)/bin/nvcc -O3 --maxrregcount 32 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35 -Xcompiler "-m64" $(CUDA)

in order to support the new Kepler GPUs.

Cluster spec:

CPU: Dual Intel Xeon CPUs (16 cores on each node, 100+ nodes)
Memory: 128GB on each node
GPU: 3 NVIDIA K20m GPUs on each node
Interconnect: InfiniBand FDR-10
Compiler: Intel v12
MPI: Open MPI v1.4.5
Batch system: Torque + Moab
