Re: NAMD 2.9 with CUDA runs

From: Chris Harrison (charris5_at_gmail.com)
Date: Sat Sep 01 2012 - 20:02:28 CDT

Dear Peter,

Do I understand correctly:

1) On the GPU, you ran for 100M steps, restarting every 320K steps.
You do not experience a "high atom velocity" crash if you restart every
320K steps. If you exceed 326K steps you get a "high atom velocity"
crash. And this crash is REPRODUCIBLE every time at ~326K steps?

2) On the CPU, you have run for >326K steps without a crash. You do not
experience a "high atom velocity" crash at any point on the CPU. What
is the maximum number of steps the system ran on the CPU?

Best,
Chris
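
The restart workaround discussed in this thread (stop each run at 320K steps, then resubmit from the restart files) can be sketched as a segment plan. This is only an illustration in Python, not part of NAMD or of the actual PBS scripts used; the function name plan_segments and its parameters are hypothetical, and requiring each full segment to end on a DCD write step is an assumption.

```python
def plan_segments(total_steps, segment_steps=320_000, dcd_freq=5_000):
    """Split a long run into consecutive (first_step, last_step) segments.

    Hypothetical helper: each segment would be one job that starts from the
    restart files written at the end of the previous segment.
    """
    if segment_steps <= 0 or dcd_freq <= 0:
        raise ValueError("segment_steps and dcd_freq must be positive")
    # Assumption: a segment should end exactly on a trajectory write step.
    if segment_steps % dcd_freq != 0:
        raise ValueError("segment_steps must be a multiple of dcd_freq")

    segments = []
    start = 0
    while start < total_steps:
        # The final segment may be shorter than segment_steps.
        end = min(start + segment_steps, total_steps)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    for first, last in plan_segments(1_000_000):
        print(f"run steps {first} to {last}")
```

With the defaults above, a 100M-step run splits into 313 segments, the last one covering only 160K steps.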

Peter Jones <pm-jones_at_bigpond.com> writes:
> Date: Sat, 1 Sep 2012 12:41:25 +1000
> From: Peter Jones <pm-jones_at_bigpond.com>
> To: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: NAMD 2.9 with CUDA runs
>
> Hi,
>
> I am the researcher having problems running NAMD 2.9 with CUDA. For a number of completely different and independent systems (protein and water, 50K to 200K atoms), the simulations crash after 320K steps with errors concerning RATTLE constraints and atoms moving too fast. The crash occurs directly after the DCD trajectory frame is written at 325K steps; the trajectory is written every 5K steps. These simulations all run normally on other machines, although I do not have access to another GPU-accelerated machine for comparison. As a workaround, I checkpoint and end each run at 320K steps and then resubmit the job automatically via PBS. Run this way, the simulations have completed over 100M steps without problems, and the trajectories all appear normal.
>
> Regards,
> Peter Jones
>
> On 01/09/2012, at 4:14 AM, Chris Harrison wrote:
>
> > Ashley,
> >
> > How reproducible is the error, and does it occur on other GPU boards? I
> > ask because if you have a system where it occurs reproducibly at ~320K
> > steps, or very close to that, we would ask you to send us the inputs so
> > we can use them to track down the problem.
> >
> > Best,
> > Chris
> >
> > Ashley Chew <ashley.chew_at_uwa.edu.au> writes:
> >> Date: Fri, 31 Aug 2012 17:02:48 +0800
> >> From: Ashley Chew <ashley.chew_at_uwa.edu.au>
> >> To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>
> >> Subject: namd-l: NAMD 2.9 with CUDA runs
> >>
> >> Hi, this is my first post regarding NAMD.
> >>
> >> I was wondering whether anyone in the community is having problems with NAMD built with CUDA (using a single Tesla M2075 6 GB; the node has 72 GB of RAM) once a run passes a certain point (in our researcher's case, past 320K steps).
> >>
> >> In our case, one of our researchers noticed that the errors returned in the output are the common internal errors associated with unstable simulations. However, if they checkpoint and stop the runs before 320K steps and then restart from the restart files generated by NAMD, the restarted simulation runs past the previous crash point.
> >>
> >> I have even rebuilt NAMD from the CVS 20120828 source with fftw3 (which works), but it behaved much the same once it passed a certain point.
> >>
> >> Ashley Chew
> >> HPC System Administrator
> >> iVEC_at_UWA (MBDP: M024)
> >> The University of Western Australia
> >> 35 Stirling Highway
> >> CRAWLEY WA 6009
> >>
> >> E: ashley.chew_at_uwa.edu.au
> >> P: +61 8 6488 8742
> >> F: +61 8 6488 1015
> >>
> >>
> >> CRICOS Provider Code: 00126G
> >>
> >
> > --
> > Chris Harrison, Ph.D.
> > NIH Center for Macromolecular Modeling and Bioinformatics
> > Theoretical and Computational Biophysics Group
> > Beckman Institute for Advanced Science and Technology
> > University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
> >
> > http://www.ks.uiuc.edu/Research/namd       Voice: 773-570-6078
> > http://www.ks.uiuc.edu/~char               Fax:   217-244-6078

--
Chris Harrison, Ph.D.
NIH Center for Macromolecular Modeling and Bioinformatics
Theoretical and Computational Biophysics Group
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
http://www.ks.uiuc.edu/Research/namd       Voice: 773-570-6078
http://www.ks.uiuc.edu/~char               Fax:   217-244-6078

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:22:00 CST