Re: cuda_check_local_progress polled 1000000 times

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Tue Jun 05 2012 - 08:34:47 CDT

On Tue, Jun 5, 2012 at 1:42 PM, Norman Geist
<norman.geist_at_uni-greifswald.de> wrote:
>> -----Original Message-----
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> Behalf Of Francesco Pietra
>> Sent: Tuesday, June 5, 2012 09:29
>> To: Norman Geist; NAMD
>> Subject: Re: namd-l: cuda_check_local_progress polled 1000000 times
>>
>> Hi Norman:
>>
>> Neither applies. I used the final 2.9 and cuda-memtest did not reveal
>
> Bad. Ok let's go on.
>
>> anomalies with the GPUs. I'll change to cuda version 2.8. If not back
>> here, it means no problems for me with cuda version 2.8.
>>
>> This was a known issue with 2.9 beta versions, although - as far as I
>> am concerned - it was limited to minimization. This is the first time
>
> That's what I thought, too.
>
> But because you are the only one so far who is facing this problem, we should keep open the possibility that your GPU may be broken. I don't know how cuda-memtest works, but as it is called memtest, it presumably won't check whether the GPU is computing correctly, nor whether the communication across PCIe is working correctly. And as the error message indicates something like a communication loss, I would not move away from that too quickly. Also, I have seen, and would expect, GPU and VRAM errors to show up more as "too fast atoms" errors. So that leaves the PCIe communication?

I was also one of the few who had problems with 2.9beta. I have heard that
cuda-memtest is a significant test. I am a biochemist and know little
about hardware. In any event, the same system that crashed with v 2.9
is running regularly on namd-cuda v 2.8.
One point, if relevant: I use Debian amd64. Debian was always known
for disliking any digression from the rules.
>
> If you have the opportunity to test your simulation on another machine (same namd build of course), you should try that also.
>
>> that I met those problems with MD. This time I am using amber parm7,
>> while in the past I used charmm27, if it is relevant at all.
>
> Generally it is relevant, as namd has to treat them a little differently. But I cannot tell if it is here.

I have recently carried out many long simulations without any
problem with namd-cuda v 2.9 and the parm7 ff for a metallo-protein (this
is an area where there is little opportunity at present with psf). I
only had to switch to the non-cuda version for the minimization. Here, the
system is at NaCl 0.M; it might be that. In fact, since the
LEaP stage, there were a few Cl- out of the box. As the simulation goes on, what
is seen out of the box are a few NaCl ion pairs. Let's see how it goes
on. I am interested in this system as a biochemist, not because of the
0.M NaCl as such, but because this is its natural environment.
>
> A really important question is: Is this the only one of your systems that throws this error?
> As only a few people (so I guess) are using namd with amber, it could be a bug in the new energy evaluation on the gpu that was hidden until now and that affects amber differently from charmm.

I would like to have more money. In fact, even the large cluster had
to be stopped: electrons are expensive here for our pockets. Anyway, are
you not satisfied that, while it crashes with 2.9, it is running with
2.8? I am crossing my fingers.

have a nice day
francesco
>
>>
>> cheers
>> francesco
>>
>> On Tue, Jun 5, 2012 at 7:29 AM, Norman Geist
>> <norman.geist_at_uni-greifswald.de> wrote:
>> > Hi Francesco,
>> >
>> > there are two possibilities in my mind why this error occurs.
>> >
>> > 1. You are using a beta and the issue is fixed with the final 2.9 <-- more likely
>> > 2. Your GPU is broken. <-- less likely
>> >
>> > Regards
>> >
>> > Norman Geist.
>> >
>> >> -----Original Message-----
>> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> >> Behalf Of Francesco Pietra
>> >> Sent: Monday, June 4, 2012 19:21
>> >> To: NAMD
>> >> Subject: Re: namd-l: cuda_check_local_progress polled 1000000 times
>> >>
>> >> Hello:
>> >> Now, with regular amber parm7 files, the system (protein in a water
>> >> box at 0.M NaCl concentration, and a few calcium++ ions) was
>> >> minimized with namd 2.9b.3 multicore, then heated gradually to 285K
>> >> with namd-cuda 2.9 (20,000 steps). Equilibration at that temperature
>> >> and 1 atm crashed with the same error, "namd-l:
>> >> cuda_check_local_progress polled 1000000 times", at step 18400 out
>> >> of a planned 500,000. The settings in the conf file were the same as
>> >> for successful MD/amber parm7 runs with namd-cuda 2.8 in the past.
>> >>
>> >> francesco pietra
>> >>
>> >> On Sat, Jun 2, 2012 at 9:06 PM, Francesco Pietra
>> >> <chiendarret_at_gmail.com> wrote:
>> >> > Hello:
>> >> > With namd-cuda 2.9 on a shared-mem machine with two GTX-580 (Debian
>> >> > amd64), minimization (ts 0.1fs, wrap all) on a new system of a protein
>> >> > in a water box crashed at step 2,296 out of planned 10,000. Changing
>> >> > to 2.9b3 multicore, the minimization worked well, ending at grad 1.5.
>> >> > I did not notice if this known issue at the time of beta tests had
>> >> > been fixed.
>> >> >
>> >> > Thanks
>> >> > francesco pietra
>> >> >
>> >> >
>> >
>
