Re: cuda_check_local_progress polled 1000000 times

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Jun 06 2012 - 00:38:05 CDT

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Francesco Pietra
> Sent: Tuesday, June 5, 2012 15:35
> To: Norman Geist; NAMD
> Subject: Re: namd-l: cuda_check_local_progress polled 1000000 times
>
> On Tue, Jun 5, 2012 at 1:42 PM, Norman Geist
> <norman.geist_at_uni-greifswald.de> wrote:
> >> -----Original Message-----
> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >> Behalf Of Francesco Pietra
> >> Sent: Tuesday, June 5, 2012 09:29
> >> To: Norman Geist; NAMD
> >> Subject: Re: namd-l: cuda_check_local_progress polled 1000000 times
> >>
> >> Hi Norman:
> >>
> >> Neither applies. I used the final 2.9 and cuda-memtest did not reveal
> >
> > Bad. Ok let's go on.
> >
> >> anomalies with the GPUs. I'll change to cuda version 2.8. If not back
> >> here, it means no problems for me with cuda version 2.8.
> >>
> >> This was a known issue with 2.9 beta versions, although - as far as I
> >> am concerned - it was limited to minimization. This is the first time
> >
> > That's what I thought, too.
> >
> > But because you are so far the only one facing this problem, we
> > should keep open the possibility that your GPU is broken. I don't
> > know how cuda-memtest works, but as it is called a memtest, it
> > probably won't check whether the GPU itself is working correctly, nor
> > whether the communication across PCIe is working correctly. And as the
> > error message indicates something like a communication loss, I would
> > not move away from that too quickly. Also, I have seen, and would
> > expect, GPU and VRAM errors to show up more as "too fast atoms"
> > errors. So that leaves the PCIe communication?
>
> I was also one of the few that had problems with 2.9beta. I heard that
> cuda-memtest is a significant test. I am a biochemist, know little
> about hardware. In any event, the same system that crashed with v 2.9
> is running regularly on namd-cuda v 2.8.
> One point, if relevant: I use Debian amd64. Debian was always known
> for disliking any digression from the rules.
> >
> > If you have the opportunity to test your simulation on another
> > machine (same namd build of course), you should try that also.
> >
> >> that I met those problems with MD. This time I am at amber parm7,
> >> while in the past I was at charmm27, if it is relevant at all.
> >
> > Generally it is relevant, as namd has to treat them a little
> > differently. But I cannot tell whether it matters here.
>
> I have recently carried out very many long simulations without any
> problem with namd-cuda v 2.9 and parm7 ff for a metallo-protein (this
> is an area where there is little opportunity at present with psf). I
> had only to shift to the non-cuda for the minimization. Here, the
> system is at NaCl 0.M, might be that. In fact, there were, since the
> LEaP stage, a few Cl- out of the box. As the simulation goes on, what
> is seen out of the box are a few NaCl ion pairs. Let's see how it will go
> on. I am interested as a biochemist in this system, not because of the
> 0.M NaCl, but this is the natural environment.
> >
> > A really important question is: Is this the only one of your systems
> > that throws this error?
> > As only few people (so I guess) are using namd with amber, it could
> > be a bug in the new energy evaluation on the GPU that was hidden until
> > now and affects amber differently from charmm.
>
> I would like to have more money. In fact, even the large cluster had
> to be stopped: electrons are expensive for our pocket here. Anyway, are
> you not satisfied that, while it crashes with 2.9, it runs with
> 2.8? I am keeping my fingers crossed.

Actually yes. But we shouldn't bother the namd developers too quickly with problems that have not been shown to come directly from the code. But as the same system is running on 2.8, this could really be a bug.

Does the simulation keep crashing if you try to restart it? Is it the same error every time? Is it the only system affected?

As nobody else seems to have joined this discussion, you should maybe start a separate thread with a new subject containing a keyword like "bug report", and send the input files that cause the problem, to inform the developers about the possible bug. But check the questions above first.
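
To compare several restarts, it may help to pull the last reported step and the fatal message out of each log. A minimal sketch in Python, assuming the log contains `ENERGY:` lines whose first field is the step number and that the failure appears on a line containing `FATAL ERROR`; the function name `summarize_log` and the sample text (including its energy values) are made up for illustration and are not taken from a real run.

```python
import re

# Assumed log layout: "ENERGY:" followed by the step number as first field.
ENERGY_RE = re.compile(r"^ENERGY:\s+(\d+)")


def summarize_log(text):
    """Return (last_step, fatal_message) found in NAMD-style log text.

    last_step is the step of the last ENERGY: line seen, or None.
    fatal_message is the text after "FATAL ERROR:", or None if absent.
    """
    last_step = None
    fatal = None
    for line in text.splitlines():
        match = ENERGY_RE.match(line)
        if match:
            last_step = int(match.group(1))
        elif "FATAL ERROR" in line:
            # Keep only the message part when the usual prefix is present.
            fatal = line.split("FATAL ERROR:", 1)[-1].strip()
    return last_step, fatal


# Hypothetical log excerpt, only to show the call.
sample = """\
ENERGY:   18300  1.0 2.0
ENERGY:   18400  1.0 2.0
FATAL ERROR: cuda_check_local_progress polled 1000000 times
"""

step, message = summarize_log(sample)
```

Running this over the log of each restart shows at a glance whether the crash always carries the same message and whether it happens at a similar step.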

Regards

Norman

>
> have a nice day
> francesco
> >
> >>
> >> cheers
> >> francesco
> >>
> >> On Tue, Jun 5, 2012 at 7:29 AM, Norman Geist
> >> <norman.geist_at_uni-greifswald.de> wrote:
> >> > Hi Francesco,
> >> >
> >> > there are two possibilities in my mind why this error occurs.
> >> >
> >> > 1. You are using a beta and the issue is fixed in the final 2.9 <-- more likely
> >> > 2. Your GPU is broken. <-- less likely
> >> >
> >> > Regards
> >> >
> >> > Norman Geist.
> >> >
> >> >> -----Original Message-----
> >> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >> >> Behalf Of Francesco Pietra
> >> >> Sent: Monday, June 4, 2012 19:21
> >> >> To: NAMD
> >> >> Subject: Re: namd-l: cuda_check_local_progress polled 1000000 times
> >> >>
> >> >> Hello:
> >> >> Now, with amber parm7 regular files, the system (protein in a water
> >> >> box at 0.M NaCl concentration, and a few calcium++ ions) was
> >> >> minimized with namd 2.9b3 multicore, then heated gradually to 285K
> >> >> with namd-cuda 2.9 (20,000 steps). Equilibration at that temperature
> >> >> and 1 atm crashed with the same error "namd-l: cuda_check_local_progress
> >> >> polled 1000000 times" at step 18400, out of a planned 500,000. The
> >> >> settings in the conf file were the same as for successful MD with amber
> >> >> parm7 and namd-cuda 2.8 in the past.
> >> >>
> >> >> francesco pietra
> >> >>
> >> >> On Sat, Jun 2, 2012 at 9:06 PM, Francesco Pietra
> >> >> <chiendarret_at_gmail.com> wrote:
> >> >> > Hello:
> >> >> > With namd-cuda 2.9 on a shared-mem machine with two GTX-580 (Debian
> >> >> > amd64), minimization (ts 0.1fs, wrap all) on a new system of a protein
> >> >> > in a water box crashed at step 2,296 out of a planned 10,000. Changing
> >> >> > to 2.9b3 multicore, the minimization worked well, ending at grad 1.5.
> >> >> > I did not notice whether this issue, known at the time of the beta
> >> >> > tests, had been fixed.
> >> >> >
> >> >> > Thanks
> >> >> > francesco pietra
> >> >> >
> >> >> >
> >> >
> >
