Re: GPUs silently stop working during simulation when oversubscribed

From: Benjamin Merget (benjamin.merget_at_uni-wuerzburg.de)
Date: Thu Feb 28 2013 - 04:04:30 CST

Hi Norman,
> I don't know if this is your problem, but I have seen exactly the same behavior
> when a GPU throws an "ECC Double Bit Uncorrectable Error" and drops all
> computing tasks: the namd2 processes still run at 100%, but nothing is written
> to the output. You should check the output of dmesg and watch for lines
> from NVRM that mention ECC errors.
I checked dmesg and found LOTS of messages like this one:
[163081.727684] NVRM: GPU at 0000:19:00: GPU-b2a01352-e0b4-db8e-9878-6fbbf3298c41

Is this what we're looking for, or is it nothing to worry about?
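
With this many messages, a small filter can help separate NVRM lines that actually mention an error from the ones that only identify the GPU. The following is a minimal sketch, not a definitive check: the function names are made up for illustration, and the assumption that error lines contain the word "ECC" or "Xid" is mine, so anything it flags still needs to be read by hand.

```python
import re
import subprocess

# Assumption: NVRM lines that report a fault contain "ECC" or "Xid" as a word.
ERROR_HINT = re.compile(r"\b(ecc|xid)\b", re.IGNORECASE)


def classify_nvrm_lines(dmesg_text):
    """Return (suspicious, other) lists of the NVRM lines in dmesg_text."""
    suspicious, other = [], []
    for line in dmesg_text.splitlines():
        if "NVRM:" not in line:
            continue  # not a message from the NVIDIA kernel module
        if ERROR_HINT.search(line):
            suspicious.append(line)
        else:
            other.append(line)
    return suspicious, other


def read_dmesg():
    """Return the kernel log as text (may need root on some systems)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `classify_nvrm_lines(read_dmesg())` on the affected node would put a line like the one above into the second list, since it names the GPU but mentions neither ECC nor Xid.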

> If you find such messages in
> the dmesg log, then you have either a broken GPU
I sure hope not, this thing is brand new... :-)
> or your cooling is
> insufficient, which is possible as you have M-series Teslas, which are
> passively cooled and need a properly built and cooled case.
I thought about that too. Unfortunately, I cannot check the GPU
temperature for M-series cards, as far as I know, except with the HP
Cluster Management Utility, which I don't have at the moment.

> If this doesn't help, try updating/reinstalling the nvidia driver. ^^
The latest driver is already installed and also recently re-installed
during a kernel update...

Cheers,
Benjamin

This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:20:58 CST