Re: GPUs silently stop working during simulation when oversubscribed

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Thu Feb 28 2013 - 03:21:00 CST

Hi Benjamin,

I don't know whether this is your problem, but I have seen exactly the same
behavior when a GPU throws an "ECC Double Bit Uncorrectable Error" and then
drops all computing tasks: the namd2 processes still run at 100% CPU, but
nothing more is written to the output. You should check the output of dmesg
and look for lines from NVRM that mention ECC errors. If you find such
messages in the dmesg log, then either you have a broken GPU or your cooling
is insufficient. The latter is possible because you have M-series Teslas,
which are passively cooled and need a properly built and cooled case.
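
A minimal sketch of that dmesg check in Python, in case you want to run it
on each node from your restart script. The function names are made up for
illustration, and it assumes only that the relevant lines carry the "NVRM"
tag and the word "ECC" on the same line; the exact message wording depends
on the driver version, so treat the filter as a starting point.

```python
import re
import subprocess

# Assumption: an ECC-related driver message contains the "NVRM" tag and the
# word "ECC" (any case) on the same line. The exact wording is driver-specific.
_ECC_WORD = re.compile(r"\becc\b", re.IGNORECASE)


def find_nvrm_ecc_lines(log_text):
    """Return the lines of a kernel log that come from NVRM and mention ECC."""
    return [
        line
        for line in log_text.splitlines()
        if "NVRM" in line and _ECC_WORD.search(line)
    ]


def read_dmesg():
    """Return the kernel ring buffer as text (may need root on some systems)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout


def report_ecc_errors():
    """Print every NVRM/ECC line found in dmesg and return how many there were."""
    hits = find_nvrm_ecc_lines(read_dmesg())
    for line in hits:
        print(line)
    return len(hits)
```

Calling report_ecc_errors() on a node prints the matching lines; a return
value of 0 means nothing matched this filter, not that the GPU is
necessarily healthy.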

If this doesn't help, try updating or reinstalling the NVIDIA driver. ^^

Good luck

Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Benjamin Merget
> Sent: Thursday, 28 February 2013 09:52
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: GPUs silently stop working during simulation when
> oversubscribed
>
> Hi everybody,
>
> I have a problem with my GPU nodes: sometimes after only a few ps,
> sometimes not until a few ns, my 8 Tesla cards silently stop working
> without the simulation explicitly crashing. If only one node with 4
> Teslas is used, the same thing happens. The only hints are that
> nvidia-smi and pbsnodes show 0% utilization and, more seriously, that
> no more output is written, although all CPUs still claim to run at
> 100%.
> The only thing left to do then is to cancel the job and restart from the
> last output. Oddly enough, this has so far only happened when the GPUs
> were oversubscribed with 2 (or 3) tasks per GPU, but I would not want to
> give up the oversubscription, because it gives me a pretty decent
> increase in performance.
>
> The specifications are:
> - 2x HP ProLiant sl390s with 4 Tesla M2050 and 4 Tesla M2090,
> respectively
> - Servers run Ubuntu Server (Precise Pangolin) with a Torque queueing system
> - Gigabit and Infiniband network
>
> System:
> - tetrameric protein-ligand-cofactor complex (solvated) with about
> 139000 atoms
> - Amber FF parameters
>
> The NAMD 2.9 version I'm running is a Linux-x86_64 build with
> mvapich2-MPI and CUDA (non-SMP).
>
> I also wrote a script that performs automatic restarts every 100 ns,
> but it still stopped writing output during the second 100 ns.
>
> I'm pretty puzzled... Maybe someone out there has an idea.
>
> Thanks a lot!
> Benjamin
>
> --
> Benjamin Merget, M.Sc.
>
> Sotriffer lab
> Institute of Pharmacy and Food Chemistry
> University of Wuerzburg
> Am Hubland
> D-97074 Wuerzburg
>
> Tel.: +49 (931) 31-86687
> E-Mail: Benjamin.Merget_at_uni-wuerzburg.de
