From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Thu Jun 28 2012 - 02:05:11 CDT
I have also observed this behavior on an external GPU cluster we use. As I have no influence on the configuration there, I don't know why it happens, but I suspect it is an issue with one of the GPUs. What seems to happen is that the GPUs stop working. After the simulation stops writing output, you can run the following command on the GPU node (on Linux) to see whether the GPUs are still working. I suppose they won't be doing anything anymore, but I don't know why.
watch "nvidia-smi -q -a | grep %"
This should show the GPU and VRAM utilization.
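For unattended checks, the same information can be pulled from a script. The following is only a minimal sketch, assuming a driver whose nvidia-smi supports the --query-gpu option (older installations may not); the function names and the 5% idle threshold are made up for illustration. Note that utilization.memory reports memory-controller activity, not the amount of VRAM in use.

```python
import subprocess

# Assumption: this nvidia-smi build supports --query-gpu with CSV output.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,utilization.memory",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Turn 'index, gpu%, mem%' lines into (index, gpu, mem) integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not a three-column row
        try:
            rows.append(tuple(int(field) for field in fields))
        except ValueError:
            continue  # e.g. "[Not Supported]" on some boards
    return rows


def idle_gpus(rows, threshold=5):
    """Return the indices of GPUs whose core utilization is below threshold (%)."""
    return [index for index, gpu, _mem in rows if gpu < threshold]


def query_gpus():
    """Run nvidia-smi once and return the parsed utilization rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)
```

Calling idle_gpus(query_gpus()) on the GPU node while the run appears hung would list the GPUs that report almost no load.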
PS: I'm using version 2.8.
Norman Geist.
> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of amin_at_imtech.res.in
> Sent: Thursday, June 28, 2012 08:13
> To: Norman Geist
> Cc: namd-l_at_ks.uiuc.edu
> Subject: Re: AW: AW: AW: namd-l: Atoms moving too fast only with CUDA
> version.
>
> Just after the minimization step, I have a heating step from 0 to 300 K
> in 3000 steps, during which I kept the CA atoms restrained. This
> restraint is absent in all the other steps. Can this be the reason?
> Also, I found that although the production run started well with the
> CUDA version, after about 1.5 million steps it stopped writing any
> output, even though all the processes were still running. I waited for
> around 4 hours and then killed the run. I restarted the run, and this
> time I got a segmentation fault after about half a million steps. I
> have restarted again, and right now it is running at around half a
> million steps. I hope it turns out to be a temporary issue.
> Amin.
>
>
>
>
> > I only use the GPU versions of NAMD, for all systems and all stages
> > of simulation, and I have never observed anything like that. But I
> > could imagine that you have used a feature that is currently broken
> > in the GPU version. Did you use something special, like restraints,
> > that you turned off after the equilibration run?
> >
> > Norman Geist.
> >
> >> -----Original Message-----
> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >> Behalf Of amin_at_imtech.res.in
> >> Sent: Wednesday, June 27, 2012 16:41
> >> To: Norman Geist
> >> Cc: broomsday_at_gmail.com; namd-l_at_ks.uiuc.edu
> >> Subject: Re: AW: AW: namd-l: Atoms moving too fast only with CUDA
> >> version.
> >>
> >> I completed the equilibration run on the CPU and then tried the
> >> production run using NAMD 2.9-CUDA, and now it works without any
> >> error. Also, my GPU memory tests showed no errors. So I believe the
> >> robustness of the integrator is the key. Thanks for the replies.
> >>
> >> Amin.
> >>
> >>
> >> > Also, as it is the initial step of your simulation, you could try
> >> > removing the restraints, constant pressure, and fixed atoms, if you
> >> > have any, and see if it works. I remember someone with the same
> >> > problem, and that was due to incorrectly defined restraints.
> >> >
> >> > Norman Geist.
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: Norman Geist [mailto:norman.geist_at_uni-greifswald.de]
> >> >> Sent: Wednesday, June 27, 2012 10:23
> >> >> To: 'amin_at_imtech.res.in'
> >> >> Cc: Namd Mailing List (namd-l_at_ks.uiuc.edu)
> >> >> Subject: AW: AW: namd-l: Atoms moving too fast only with CUDA
> >> >> version.
> >> >> > -----Original Message-----
> >> >> > From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> >> >> > On Behalf Of amin_at_imtech.res.in
> >> >> > Sent: Wednesday, June 27, 2012 09:10
> >> >> > To: Norman Geist
> >> >> > Cc: namd-l_at_ks.uiuc.edu
> >> >> > Subject: Re: AW: namd-l: Atoms moving too fast only with CUDA
> >> >> > version.
> >> >> >
> >> >> > I have only one GPU. I get the error after all the minimization
> >> >> > steps are completed, just at the first heating step.
> >> >> Yes, same for me. Minimization doesn't compute velocities, only
> >> >> forces and energies that get optimized. There is no real atom
> >> >> movement. Roughly speaking, it moves the atoms by a small amount,
> >> >> computes the energies, and checks whether the total energy is lower
> >> >> than before. If it is lower, it keeps the new positions; if not, it
> >> >> goes back. Then it starts over. So an erroneous computation during
> >> >> minimization only makes the minimizer think it has made a bad move,
> >> >> but it does not break the simulation. A force that is computed too
> >> >> high during molecular dynamics causes unusual behavior and
> >> >> velocities that are too high, which break the simulation. You
> >> >> should try that memtest thing. But if it is an error in the GPU or
> >> >> the PCIe bus (on the GPU), I think the memory test won't reveal it.
> >> >> The best option would be to try another GPU. Too bad that you only
> >> >> have one.
> >> >> Also, do other molecular systems break the same way on the GPU?
> >> >> Maybe try some of the test systems from the NAMD site.
> >> >> > Thanks.
> >> >> > Amin.
> >> >> >
> >> >> >
> >> >> > > Hi,
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > I had the same problem when I had a broken GPU. If you have
> >> >> > > multiple GPUs, try them separately to see if it only crashes
> >> >> > > when a particular GPU participates.
> >> >> > >
> >> >> > > It would also be important to know whether you get the error
> >> >> > > directly at the start or later.
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > Good luck
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > Norman Geist.
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > From: owner-namd-l_at_ks.uiuc.edu
> >> >> > > [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Aron Broom
> >> >> > > Sent: Wednesday, June 27, 2012 07:52
> >> >> > > To: amin_at_imtech.res.in
> >> >> > > Cc: namd-l_at_ks.uiuc.edu
> >> >> > > Subject: Re: namd-l: Atoms moving too fast only with CUDA
> >> >> > > version.
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > I'm not sure you necessarily did anything wrong. I would
> >> >> > > suggest that your system, even after 50,000 steps, still has
> >> >> > > some kind of problem, but the CPU integrator is robust enough
> >> >> > > to muscle through it, whereas the CUDA one is not.
> >> >> > >
> >> >> > > You should consider slowly heating your system from, say,
> >> >> > > 100 K or something of the sort, as I would imagine you have
> >> >> > > jumped straight to 300 K, which generally works but requires a
> >> >> > > decent starting point.
> >> >> > >
> >> >> > > Keep in mind that even though the minimizer in NAMD is smarter
> >> >> > > than just steepest descent, it will still be easily trapped in
> >> >> > > local minima, so doing more minimization without some kind of
> >> >> > > dynamics is unlikely to get you closer to the global minimum
> >> >> > > and away from whatever problems you have.
> >> >> > >
> >> >> > > Did you also have a look at the structure, and at which atoms
> >> >> > > are causing the problem?
> >> >> > >
> >> >> > > ~Aron
> >> >> > >
> >> >> > > On Wed, Jun 27, 2012 at 1:37 AM, <amin_at_imtech.res.in> wrote:
> >> >> > >
> >> >> > > Dear all,
> >> >> > > I am trying to run an equilibration using NAMD 2.9-CUDA on
> >> >> > > Linux. However, I keep getting the "Atoms moving too fast"
> >> >> > > error. I increased the minimization up to 50000 steps, but it
> >> >> > > doesn't help. But when I tried to run the exact same config
> >> >> > > file using the non-CUDA version, it ran without any error,
> >> >> > > even at 10000 minimization steps. The error is reproducible.
> >> >> > > Can someone please tell me what may have gone wrong?
> >> >> > >
> >> >> > > Regards.
> >> >> > >
> >> >> > > Amin.
> >> >> > >
> >> >> > >
> >> >> > > ______________________________________________________________________
> >> >> > > सूक्ष्मजीव प्रौद्योगिकी संस्थान (वैज्ञानिक औद्योगिक अनुसंधान परिषद)
> >> >> > > Institute of Microbial Technology (A CONSTITUENT ESTABLISHMENT OF CSIR)
> >> >> > > सैक्टर 39 ए, चण्डीगढ़ / Sector 39-A, Chandigarh
> >> >> > > पिन कोड/PIN CODE :160036
> >> >> > > दूरभाष/EPABX :0172 6665 201-202
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > --
> >> >> > > Aron Broom M.Sc
> >> >> > > PhD Student
> >> >> > > Department of Chemistry
> >> >> > > University of Waterloo
> >> >> > >
> >> >> > >
> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
>
>
This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:44 CST