From: Benjamin Merget (benjamin.merget_at_uni-wuerzburg.de)
Date: Thu Feb 28 2013 - 02:51:39 CST
Hi everybody,
I have a problem with my GPU nodes: sometimes after only a few ps,
sometimes not until a few ns, my 8 Tesla cards silently stop working
without the simulation explicitly crashing. The same thing happens if
only one node with 4 Teslas is used. The only hints are that nvidia-smi
and pbsnodes show 0% utilization and, more severely, no more output is
written, although all CPUs still claim to be running at 100%. The only
thing left to do then is to cancel the job and restart from the last
output. Oddly enough, so far this has only happened when the GPUs were
oversubscribed with 2 (or 3) tasks per GPU, but I wouldn't want to give
up the oversubscription, because it gives me a pretty decent increase
in performance.
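To illustrate what I mean by "silently stop working", a minimal
detection sketch along these lines flags the hang by polling nvidia-smi
and checking whether the NAMD log has stopped growing (the log path,
poll interval and thresholds are placeholders, not my actual values,
and it assumes a reasonably recent nvidia-smi in PATH):

#!/usr/bin/env python
# Minimal stall-detection sketch. LOG, POLL_S and MAX_LOG_AGE_S are
# hypothetical values; assumes nvidia-smi supports --query-gpu.
import os
import subprocess
import time

LOG = "run.log"        # hypothetical NAMD stdout/log of the running job
POLL_S = 300           # check every 5 minutes
MAX_LOG_AGE_S = 900    # call it stalled if no new output for 15 minutes

def gpu_utilizations():
    # per-GPU utilization in percent, one line per GPU
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"])
    return [int(line) for line in out.split()]

while True:
    time.sleep(POLL_S)
    utils = gpu_utilizations()
    log_age = time.time() - os.path.getmtime(LOG)
    # the signature of the hang: all GPUs at 0% and the log no longer
    # growing, even though the CPU ranks still look 100% busy
    if all(u == 0 for u in utils) and log_age > MAX_LOG_AGE_S:
        print("GPUs idle and no output for %.0f s -- run looks stalled"
              % log_age)
        break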
The specifications are:
- 2x HP ProLiant sl390s with 4 Tesla M2050 and 4 Tesla M2090, respectively
- Servers run Ubuntu 12.04 Server (Precise Pangolin) with a Torque queueing system
- Gigabit and Infiniband network
System:
- tetrameric protein-ligand-cofactor complex (solvated) with about
139,000 atoms.
- Amber FF parameters
The NAMD 2.9 version I'm running is a Linux-x86_64 build with
mvapich2-MPI and CUDA (non-SMP).
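For completeness, the oversubscribed launch is essentially just more
MPI ranks than GPUs on a node, with the devices shared round-robin via
+devices. A sketch of how I assemble it (the mpirun name, namd2 path
and config file are placeholders, not my actual job script):

#!/usr/bin/env python
# Sketch of the oversubscribed launch on one 4-GPU node: 8 MPI ranks
# share 4 devices (2 per GPU). Paths and the exact mpirun invocation
# are assumptions for a CUDA (non-SMP) MVAPICH2 build.
import subprocess

gpus = [0, 1, 2, 3]
ranks_per_gpu = 2
nranks = len(gpus) * ranks_per_gpu   # 8 ranks on 4 GPUs

cmd = ["mpirun", "-np", str(nranks),
       "namd2",
       "+devices", ",".join(str(g) for g in gpus),  # CUDA devices to share
       "+idlepoll",                                  # recommended for CUDA builds
       "production.namd"]
subprocess.call(cmd)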
I also wrote a script which performs automatic restarts every 100 ns,
but the run still stopped writing output during the second 100 ns.
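The restart part of that script essentially just writes a continuation
config pointing at the latest restart files, roughly like this sketch
(file names, the step count and the sourced base config are
placeholders, not my actual setup):

#!/usr/bin/env python
# Sketch of the restart step: write a NAMD continuation config that
# picks up from the latest restart files. All names and the step count
# are placeholders; the sourced base config is assumed to end with the
# run command and to not set "temperature" (velocities are read in).
RESTART_PREFIX = "run.restart"   # outputname of the previous leg
FIRST_STEP = 50000000            # e.g. 100 ns at 2 fs/step
BASE_CONFIG = "common.namd"      # shared FF/PME/cutoff settings + run

continuation = """
bincoordinates  %s.coor
binvelocities   %s.vel
extendedSystem  %s.xsc
firsttimestep   %d
source          %s
""" % (RESTART_PREFIX, RESTART_PREFIX, RESTART_PREFIX, FIRST_STEP, BASE_CONFIG)

with open("continue.namd", "w") as f:
    f.write(continuation)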
I'm pretty puzzled... Maybe someone out there has an idea.
Thanks a lot!
Benjamin
--
Benjamin Merget, M.Sc.
Sotriffer lab
Institute of Pharmacy and Food Chemistry
University of Wuerzburg
Am Hubland
D-97074 Wuerzburg
Tel.: +49 (931) 31-86687
E-Mail: Benjamin.Merget_at_uni-wuerzburg.de