Re: NAMD 2.9 with CUDA runs

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Sep 03 2012 - 01:30:46 CDT

Hi,

I have seen this behavior before on a broken GPU, a Tesla C2050. All the
other GPUs ran fine, but that one crashed every time with atom velocity
and constraint failures, although at more random points and not in the
same way for every system.

If the behavior really is as you describe, your GPU is likely broken. But
to be sure that it has nothing to do with your NAMD installation, could you
test with a precompiled version from the NAMD page?
Also, check the ECC error count with nvidia-smi. As you have an M-series
GPU (these cards are passively cooled and rely on the server's airflow),
also make sure that the cooling is sufficient. You can check the temperature
during the simulation with nvidia-smi as well; this would be very
interesting to know, too.
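
A minimal sketch of how such a check could be logged during a run, in
Python. It assumes a driver recent enough that nvidia-smi supports
--query-gpu (older drivers may only offer the report form,
`nvidia-smi -q -d ECC,TEMPERATURE`). The function names and the 85 C
threshold are illustrative assumptions, not part of nvidia-smi or NAMD.

```python
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu` on recent drivers.
# Assumption: the installed driver supports --query-gpu with these fields.
FIELDS = [
    "index",
    "temperature.gpu",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip lines that do not match the expected layout
        row = {}
        for name, value in zip(FIELDS, values):
            # Non-numeric entries (e.g. "[N/A]" when ECC is off) become None.
            row[name] = int(value) if value.isdigit() else None
        rows.append(row)
    return rows


def suspicious(row, max_temp_c=85):
    """Return reasons to distrust a GPU. The 85 C limit is a placeholder."""
    reasons = []
    temp = row["temperature.gpu"]
    if temp is not None and temp > max_temp_c:
        reasons.append("temperature %d C" % temp)
    for key in FIELDS[2:]:
        if row[key]:  # any non-zero ECC counter is worth a look
            reasons.append("%s = %d" % (key, row[key]))
    return reasons


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_gpu_csv(subprocess.check_output(cmd).decode())
```

Calling query_gpus() every minute or so while NAMD is running, and printing
suspicious(row) for each row, should show whether the temperature climbs or
the ECC counters grow in the run-up to the crash.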

Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Chris Harrison
> Sent: Sunday, 2 September 2012 03:02
> To: Peter Jones
> Cc: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: NAMD 2.9 with CUDA runs
>
> Dear Peter,
>
> Do I understand correctly:
>
> 1) On the GPU, you ran for 100M steps, restarting every 320K steps.
> You do not experience a "high atom velocity" crash if you restart every
> 320K steps. If you exceed 326K steps you get a "high atom velocity"
> crash. And this crash is REPRODUCIBLE every time at ~326K steps?
>
> 2) On the CPU, you have run for >326K steps without a crash. You do not
> experience a "high atom velocity" crash at any point on the CPU. What
> is the maximum number of steps the system ran on the CPU?
>
>
> Best,
> Chris
>
>
>
> Peter Jones <pm-jones_at_bigpond.com> writes:
> > Date: Sat, 1 Sep 2012 12:41:25 +1000
> > From: Peter Jones <pm-jones_at_bigpond.com>
> > To: namd-l_at_ks.uiuc.edu
> > Subject: Re: namd-l: NAMD 2.9 with CUDA runs
> > X-Mailer: Apple Mail (2.1278)
> >
> > Hi,
> >
> > I am the researcher having problems running NAMD 2.9 with CUDA. I'm
> > finding that for all of a number of completely different and
> > independent systems (protein and water with 50K to 200K atoms), the
> > simulations crash after 320K steps with errors concerning rattle
> > constraints and atoms moving too fast. This occurs directly after
> > writing the dcd trajectory at 325K steps, the trajectory being written
> > every 5K steps. These simulations all run normally on other machines,
> > although I do not have access to another gpu-accelerated machine for
> > that comparison. I can run the simulations by checkpointing and ending
> > at 320K steps and then resubmitting the job automatically via pbs.
> > These simulations have run this way for over 100M steps without
> > problems, and the trajectories all appear normal.
> >
> > Regards,
> > Peter Jones
> >
> >
> >
> >
> >
> >
> > On 01/09/2012, at 4:14 AM, Chris Harrison wrote:
> >
> > > Ashley,
> > >
> > > How reproducible is the error and does it occur on other GPU boards?
> > > I ask b/c if you have a system where it occurs reproducibly at ~320K
> > > steps or very close to that we would ask you to send us the inputs so
> > > we can use it to track down the problem.
> > >
> > > Best,
> > > Chris
> > >
> > >
> > >
> > > Ashley Chew <ashley.chew_at_uwa.edu.au> writes:
> > >> Date: Fri, 31 Aug 2012 17:02:48 +0800
> > >> From: Ashley Chew <ashley.chew_at_uwa.edu.au>
> > >> To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>
> > >> Subject: namd-l: NAMD 2.9 with CUDA runs
> > >>
> > >> Hi, this is my first post regarding NAMD.
> > >>
> > >> I was wondering if anyone in the community is having problems
> > >> with NAMD built with CUDA (using a single Tesla M2075 6GB; the node
> > >> has 72GB of RAM) once a run passes a certain point (in his case,
> > >> past 320K steps).
> > >>
> > >> In our case one of the researchers noticed that the errors returned
> > >> in the output are common internal errors to do with unstable
> > >> simulations, but if they checkpoint and stop the runs prior to 320K
> > >> steps, and then restart from the restart files internally generated
> > >> by NAMD, the restarted simulation runs past the previous crash point.
> > >>
> > >> I have even rebuilt NAMD from the CVS 20120828 build with fftw3
> > >> (which works), but it did pretty much the same thing once it passed
> > >> a certain point.
> > >>
> > >> Ashley Chew
> > >> HPC System Administrator
> > >> iVEC_at_UWA (MBDP: M024)
> > >> The University of Western Australia
> > >> 35 Stirling Highway
> > >> CRAWLEY WA 6009
> > >>
> > >> E: ashley.chew_at_uwa.edu.au
> > >> P: +61 8 6488 8742
> > >> F: +61 8 6488 1015
> > >>
> > >>
> > >> CRICOS Provider Code: 00126G
> > >>
> > > --
> > > Chris Harrison, Ph.D.
> > > NIH Center for Macromolecular Modeling and Bioinformatics
> > > Theoretical and Computational Biophysics Group
> > > Beckman Institute for Advanced Science and Technology
> > > University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
> > >
> > > http://www.ks.uiuc.edu/Research/namd Voice: 773-570-6078
> > > http://www.ks.uiuc.edu/~char Fax: 217-244-6078
> > >
> > >
> >
> >
>
>

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:22:00 CST