GPUs silently stop working during simulation when oversubscribed

From: Benjamin Merget (benjamin.merget_at_uni-wuerzburg.de)
Date: Thu Feb 28 2013 - 02:51:39 CST

Hi everybody,

I have a problem with my GPU nodes: sometimes after only a few ps,
sometimes not until a few ns into the run, my 8 Tesla cards silently stop
working without the simulation explicitly crashing. The same thing happens
if only a single node with 4 Teslas is used. The only hints are that
nvidia-smi and pbsnodes show 0% GPU utilization and, more seriously, that
no further output is written, although all CPUs still claim to be running
at 100%. The only thing left to do then is to cancel the job and restart
from the last output. Oddly enough, so far this has only happened when the
GPUs were oversubscribed with 2 (or 3) tasks per GPU, but I wouldn't want
to give up the oversubscription, because it gives me a pretty decent
increase in performance.
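
For illustration, the oversubscription is nothing special: I simply start
more MPI ranks than there are GPUs on the node. A launch could look roughly
like the following sketch (rank count, device IDs and file names here are
placeholders, not my actual job script):

    # Sketch: one node with 4 GPUs, 8 MPI ranks sharing devices 0-3,
    # i.e. 2 ranks per GPU. File names are placeholders.
    mpirun -np 8 namd2 +devices 0,1,2,3 md_run.namd > md_run.log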

The specifications are:
- 2x HP ProLiant SL390s, one with 4 Tesla M2050 cards and one with 4 Tesla M2090 cards
- Servers run Ubuntu 12.04 (Precise Pangolin) Server with the Torque queueing system
- Gigabit Ethernet and InfiniBand networking

System:
- solvated tetrameric protein-ligand-cofactor complex with about 139,000 atoms
- Amber force field parameters

The NAMD 2.9 binary I'm running is a Linux-x86_64 build with MVAPICH2 MPI
and CUDA (non-SMP).

I also wrote a script that performs automatic restarts every 100 ns, but
the simulation still stopped writing output during the second 100 ns
segment.
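
Conceptually the restart logic just chains fixed-length segments, each one
reading the restart files written by the previous one, so that a hung run
can at least be resumed from the last completed segment. A minimal sketch
(segment names, rank count and device list are placeholders, not my actual
script):

    #!/bin/bash
    # Sketch: run ten 100 ns segments back to back. Each segment_<i>.namd
    # is assumed to read the restart files written by segment_<i-1>, so a
    # cancelled or hung job can be resubmitted from the last finished segment.
    for i in $(seq 1 10); do
        mpirun -np 8 namd2 +devices 0,1,2,3 segment_${i}.namd > segment_${i}.log
    done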

I'm pretty puzzled... Maybe someone out there has an idea.

Thanks a lot!
Benjamin

-- 
Benjamin Merget, M.Sc.
Sotriffer lab
Institute of Pharmacy and Food Chemistry
University of Wuerzburg
Am Hubland
D-97074 Wuerzburg
Tel.: +49 (931) 31-86687
E-Mail: Benjamin.Merget_at_uni-wuerzburg.de
