Re: cuda_check_local_progress polled 1000000 times

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Jun 05 2012 - 06:42:24 CDT

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Francesco Pietra
> Sent: Tuesday, 5 June 2012 09:29
> To: Norman Geist; NAMD
> Subject: Re: namd-l: cuda_check_local_progress polled 1000000 times
>
> Hi Norman:
>
> Neither applies. I used the final 2.9 and cuda-memtest did not reveal

Too bad. OK, let's go on.

> anomalies with the GPUs. I'll change to cuda version 2.8. If not back
> here, it means no problems for me with cuda version 2.8.
>
> This was a known issue with 2.9 beta versions, although - as far as I
> am concerned - it was limited to minimization. This is the first time

That's what I thought, too.

But because you are so far the only one facing this problem, we should keep open the possibility that your GPU is broken. I don't know how cuda-memtest works, but as it is called memtest, it probably won't check whether the GPU itself computes correctly, nor whether the communication across PCIe works correctly. And as the error message indicates something like a communication loss, I would not move away from that too quickly. Also, I have seen, and would expect, GPU and VRAM errors to show up more as "too fast atoms" errors. So that leaves the PCIe communication?

If you have the opportunity to test your simulation on another machine (with the same NAMD build, of course), you should try that as well.

> that I met those problems with MD. This time I am at amber parm7,
> while in the past I was at charmm27, if it is relevant at all.

Generally it is relevant, as NAMD has to treat them a little differently. But I cannot tell whether it matters here.

A really important question is: is this the only one of your systems that throws this error?
As only a few people (so I guess) are using NAMD with Amber, it could be a bug in the new energy evaluation on the GPU that has stayed hidden until now and that affects Amber differently from CHARMM.
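
To answer that question quickly across several runs, something like the following could help. It is only a minimal sketch in Python: it looks for the error text quoted in this thread inside a set of NAMD log files and returns the ones that contain it. The function name and the log file names are made up for illustration, and it assumes the message appears verbatim somewhere in each affected log.

```python
from pathlib import Path

# Error text as quoted in this thread; assumed to appear verbatim in the log.
ERROR_TEXT = "cuda_check_local_progress polled 1000000 times"


def logs_with_error(log_paths, needle=ERROR_TEXT):
    """Return the paths (as Path objects) whose contents contain `needle`."""
    hits = []
    for path in map(Path, log_paths):
        try:
            # errors="replace" keeps odd bytes in a log from aborting the scan
            text = path.read_text(errors="replace")
        except OSError:
            continue  # missing or unreadable log: skip it
        if needle in text:
            hits.append(path)
    return hits
```

For example, `logs_with_error(["amber_equil.log", "charmm_equil.log"])` (hypothetical file names) would show whether only the Amber run is affected or the CHARMM one as well.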

>
> cheers
> francesco
>
> On Tue, Jun 5, 2012 at 7:29 AM, Norman Geist
> <norman.geist_at_uni-greifswald.de> wrote:
> > Hi Francesco,
> >
> > there are two possibilities in my mind why this error occurs.
> >
> > 1. You are using a beta and the issue is fixed in the final 2.9 <-- more likely
> > 2. Your GPU is broken. <-- less likely
> >
> > Regards
> >
> > Norman Geist.
> >
> >> -----Original Message-----
> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >> Behalf Of Francesco Pietra
> >> Sent: Monday, 4 June 2012 19:21
> >> To: NAMD
> >> Subject: Re: namd-l: cuda_check_local_progress polled 1000000 times
> >>
> >> Hello:
> >> Now, with amber parm7 regular files, the system (protein in a water
> >> box at 0.M NaCl concentration, and a few calcium++ ions), was
> >> minimized with namd 2.9b.3 multicore, then heated gradually to 285K
> >> with namd-cuda 2.9 (20,000 steps). Equilibration at such temp, 1atm,
> >> crashed with same error "namd-l: cuda_check_local_progress polled
> >> 1000000 times" at step 18400, out of planned 500,000. Setting of the
> >> conf file was the same as for successful MD/amber parm 7 with
> >> namd-cuda 2.8 in the past.
> >>
> >> francesco pietra
> >>
> >> On Sat, Jun 2, 2012 at 9:06 PM, Francesco Pietra
> >> <chiendarret_at_gmail.com> wrote:
> >> > Hello:
> >> > With namd-cuda 2.9 on a shared-mem machine with two GTX-580 (Debian
> >> > amd64) minimization (ts 0.1fs, wrap all) on a new system of a protein
> >> > in a water box, crashed at step 2,296 out of planned 10,000. Changing
> >> > to 2.9b3 multicore, the minimization worked well, ending at grad 1.5.
> >> > I did not notice if this known issue at the time of beta tests had
> >> > been fixed.
> >> >
> >> > Thanks
> >> > francesco pietra
> >> >
> >> >
> >

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:37 CST