Re: GPUs silently stop working during simulation when oversubscribed

From: Norman Geist
Date: Thu Feb 28 2013 - 05:41:04 CST

> -----Original Message-----
> From: Benjamin Merget []
> Sent: Thursday, 28 February 2013 11:05
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: namd-l: GPUs silently stop working during simulation when
> oversubscribed
> Hi Norman,
> > I don't know if this matches your problem, but I have seen exactly the
> > same behavior when a GPU throws an "ECC Double Bit Uncorrectable Error"
> > and then drops all computing tasks: the namd2 processes still run at
> > 100%, but nothing more is written to the output. You should check the
> > output of dmesg and watch for lines from NVRM that mention ECC errors.
> I checked dmesg and found LOTS of messages like this one:
> [163081.727684] NVRM: GPU at 0000:19:00: GPU-b2a01352-e0b4-db8e-9878-6fbbf3298c41
> Is this what we're looking for, or nothing to worry about?

No, these messages look normal. No clue then =(
Maybe try cudamemtest, or try to figure out whether the problem is tied to
one specific node or GPU.
Additionally, the NAMD debug mode or tools like strace can show what
NAMD is doing at the moment it stops working.
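
For going through a long dmesg log, here is a rough Python sketch of the
check described above. The keywords (`ecc`, `xid`) are an assumption about
what a suspicious NVRM line contains, not a complete list, and the function
names are made up for illustration:

```python
import re
import subprocess

# Assumed keywords: NVRM lines that mention ECC or an Xid event are treated
# as suspicious. This is a guess at what to look for, not a complete list.
SUSPICIOUS = re.compile(r"\b(ecc|xid)\b", re.IGNORECASE)


def find_suspicious_nvrm_lines(lines):
    """Return the NVRM kernel log lines that mention ECC or Xid."""
    return [
        line.rstrip("\n")
        for line in lines
        if "NVRM" in line and SUSPICIOUS.search(line)
    ]


def read_dmesg():
    """Run dmesg and return its output as a list of lines (may need root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

Calling `find_suspicious_nvrm_lines(read_dmesg())` on each node should skip
plain "NVRM: GPU at ..." lines like the one quoted above and keep only lines
that match the keywords, which may also help to narrow things down to one
node or GPU.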

Good luck.

> > If you find such messages in
> > the dmesg log, then you have either a broken GPU
> I sure hope not, this thing is brand new... :-)
> > or your cooling is
> > insufficient, which is possible as you have M-series Teslas, which are
> > passively cooled and need a properly cooled and well-built case.
> I thought about that too. Unfortunately, I cannot check the GPU
> temperature for M-series cards, as far as I know, except with the HP
> Cluster Management Utility, which I don't have at the moment.
> > If this doesn't help, try updating/reinstalling the nvidia driver. ^^
> The latest driver is already installed and also recently re-installed
> during a kernel update...
> Cheers,
> Benjamin
