Re: Runaway cuda-enabled namd2 processes

From: Norman Geist
Date: Wed Dec 04 2013 - 05:41:20 CST

Yeah, this often happens with CUDA runs. For me, a simple "pkill -9
namd2" always works. Maybe try the bus-reset feature of nvidia-smi to
decouple the processes from the GPUs.
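
A minimal sketch of the "pkill -9 namd2" approach with a follow-up check is shown here. It assumes the processes are named exactly "namd2" and that the script runs as root or as the job owner. The function names survivors_of and kill_and_report and the 5-second wait are made up for illustration.

```shell
#!/bin/sh
# Sketch only: send SIGKILL to leftover processes, then list what survived.
# Assumption: the process name is exactly "namd2".

# Reads `ps -eo pid=,stat=,comm=` style lines on stdin and prints
# "PID STATE" for every process whose command name equals $1.
survivors_of() {
    awk -v name="$1" '$3 == name { print $1, substr($2, 1, 1) }'
}

# Kill by exact name, wait briefly, then report anything still present.
kill_and_report() {
    name=${1:-namd2}
    pkill -9 -x "$name"
    sleep 5   # arbitrary grace period, adjust as needed
    left=$(ps -eo pid=,stat=,comm= | survivors_of "$name")
    if [ -n "$left" ]; then
        echo "still present after SIGKILL (PID STATE):"
        echo "$left"
        return 1
    fi
    echo "no $name processes left"
}
```

Calling "kill_and_report namd2" on the affected node prints the PIDs and one-letter states of anything that survived the signal, which shows whether a GPU reset is worth trying at all.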


Norman Geist.


From: [] On behalf of
Sent: Wednesday, 4 December 2013 04:32
Subject: namd-l: Runaway cuda-enabled namd2 processes




I have NAMD v2.9 built and installed on our GPU cluster (see below for
the spec). I have found that when a NAMD job dies (e.g. crashed, qdel'ed,
etc.), the parent process quits as expected but the child processes keep
running. For example, 16 processes on a compute node just won't
die, are in "D" state (according to "top"), and "nvidia-smi" shows that 3
processes are using all 3 GPUs on the node. The batch system is not aware of
this and thus schedules the next job onto the node, causing all sorts of
problems (e.g. CUDA code doesn't run, or the new namd2 processes share the
GPUs with the runaway processes from a previous batch job, resulting in
poor performance).
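
One way to keep new jobs off such a node is a pre-job check that looks for leftover GPU processes. The following is only a sketch. It assumes an nvidia-smi version that supports --query-compute-apps, and that the leftover binary is called "namd2". The function names gpu_pids_for and node_is_clean are made up, and how the check is hooked into Torque (prologue, health check script, or a manual "pbsnodes -o") is site-specific and not shown.

```shell
#!/bin/sh
# Sketch only: detect leftover namd2 processes that still hold a GPU.
# Assumption: nvidia-smi prints "pid, process_name" CSV lines for the query
# used below.

# Reads "pid, process_name" CSV lines on stdin and prints the PIDs whose
# process name (with any leading directory stripped) equals $1.
gpu_pids_for() {
    awk -F', *' -v name="$1" '{
        n = split($2, parts, "/")
        if (n > 0 && parts[n] == name) print $1
    }'
}

# Returns 0 if no namd2 process is using a GPU, 1 otherwise.
node_is_clean() {
    leftovers=$(nvidia-smi --query-compute-apps=pid,process_name \
                    --format=csv,noheader 2>/dev/null | gpu_pids_for namd2)
    if [ -n "$leftovers" ]; then
        # Unquoted echo joins the PIDs onto one line.
        echo "ERROR leftover namd2 on GPU:" $leftovers
        return 1
    fi
    return 0
}
```

A wrapper could call node_is_clean before a job starts and take the node offline when it fails, so that the scheduler does not place new work next to the runaway processes.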


In many cases, even "kill -9" can't terminate the runaway processes.
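
A process in "D" state (uninterruptible sleep) does not act on signals, including SIGKILL, until the kernel call it is blocked in returns, which is consistent with "kill -9" having no visible effect. The sketch below reads the state and the kernel wait channel from /proc on Linux. The function names state_from_status and describe_pid are made up, and /proc/<pid>/wchan may be empty or unreadable depending on the kernel configuration.

```shell
#!/bin/sh
# Sketch only: show the state of a process and where it is blocked.
# Assumption: Linux with /proc mounted.

# Prints the one-letter state from a /proc/<pid>/status style file.
state_from_status() {
    awk '$1 == "State:" { print $2; exit }' "$1"
}

# Prints "PID state=X wchan=NAME" for the given PID.
describe_pid() {
    pid=$1
    state=$(state_from_status "/proc/$pid/status")
    wchan=$(cat "/proc/$pid/wchan" 2>/dev/null)
    echo "$pid state=$state wchan=${wchan:-unknown}"
}
```

Running describe_pid on each stuck PID shows whether it is really in "D" state and, where available, the kernel function it is waiting in, which can help tell a driver hang from a file system hang.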


Have you guys seen this issue and is there a solution?





Other info that may be useful:


I followed the install notes on the NAMD web site to build NAMD v2.9. Charm
6.4.0 was built by:


env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production


And NAMD was built by:


./config Linux-x86_64-icc --charm-arch mpi-linux-x86_64 --with-cuda
--cuda-prefix /apps/cuda/5.5/cuda


Note that CUDACC in "arch/Linux-x86_64.cuda" was changed to:


CUDACC=$(CUDADIR)/bin/nvcc -O3 --maxrregcount 32 -gencode
arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode
arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35
-Xcompiler "-m64" $(CUDA)


in order to support the new Kepler GPUs.


Cluster spec:



CPU: Dual Intel Xeon CPUs (16 cores on each node, 100+ nodes)

Memory: 128GB on each node

GPU: 3 Nvidia K20m GPUs on each node

Interconnect: Infiniband FDR-10

Compiler: Intel v12

MPI: Open MPI v1.4.5

Batch system: Torque + Moab



This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:21:58 CST