Re: 4 sequential jobs work on laptops, but only first one works on supercomputer

From: JT (jtibbitt_at_odu.edu)
Date: Tue Apr 14 2009 - 01:24:25 CDT

Well Axel, you might be right regarding the possible bad starting
configuration. Since the system works fine on the laptops, but not on
the cluster, I said to myself, "No way, the system must be fine."
Then Anna suggested trying out another system using the same
procedures, and what do you know, it works. All three minimizations
and the equilibration work in both serial and parallel.

For the previous system, the one that crashes, it is interesting that
fixing the polymer and minimizing the solvent works on the cluster.
It only crashes when minimizing the polymer inside with the solvent
fixed. So it looks like the initial conformation of the polymer needs
a closer look.
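
One way to look at the starting conformation is to scan the polymer
coordinates for atom pairs that sit almost on top of each other, since
overlaps like that can give the very high energies Axel describes. A
minimal sketch in Python, assuming a standard fixed-column PDB file.
The names `pdb_line`, `parse_pdb_atoms` and `close_contacts`, the 0.8 A
cutoff, and the three sample atoms are all made up for illustration and
are not taken from the failing system:

```python
import math
from itertools import combinations


def pdb_line(serial, name, resid, x, y, z):
    """Format one fixed-column PDB ATOM record (resname POL, chain A are placeholders)."""
    return (f"ATOM  {serial:5d} {name:<4s} POL A{resid:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_pdb_atoms(lines):
    """Return (serial, name, (x, y, z)) for each ATOM/HETATM record."""
    atoms = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            serial = int(line[6:11])
            name = line[12:16].strip()
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((serial, name, xyz))
    return atoms


def close_contacts(atoms, cutoff=0.8):
    """List atom pairs closer than `cutoff` angstroms (brute force, O(N^2))."""
    hits = []
    for (s1, n1, p1), (s2, n2, p2) in combinations(atoms, 2):
        d = math.dist(p1, p2)
        if d < cutoff:
            hits.append((s1, n1, s2, n2, d))
    return hits


# Hypothetical three-atom example; atoms 1 and 3 nearly overlap.
sample = [
    pdb_line(1, "C1", 1, 0.000, 0.000, 0.000),
    pdb_line(2, "C2", 1, 1.530, 0.000, 0.000),
    pdb_line(3, "C3", 2, 0.100, 0.100, 0.000),
]
for s1, n1, s2, n2, d in close_contacts(parse_pdb_atoms(sample)):
    print(f"atoms {s1} ({n1}) and {s2} ({n2}) are only {d:.3f} A apart")
```

For a real structure the sample list would be replaced by the lines of
the polymer PDB file. The brute-force pair loop is fine for a single
polymer but would be slow for the full solvated system.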
Jeff

On Apr 14, 2009, at 12:12 AM, JT wrote:

> Axel,
> The system seems to be alright because the jobs run fine on two
> different laptops using different NAMD binaries (OSX and XP). The
> system minimizes and the equilibration runs fine there. So I think the
> problem lies with the cluster environment. Perhaps NAMD was compiled
> incorrectly on the cluster. When running there in single-processor
> mode, there is a memory leak generating many gigabytes of 'nan' type
> output unless killed, while in parallel mode it at least crashes
> sooner with a 'segmentation violation' in the cluster log output. I
> just found a thread from a couple of years back reporting a similar
> problem:
>
> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/5770.html
>
> This sort of stuff is all new to me. Amit, who handles the cluster
> compiles here, is helping out. We may try compiling again on one of
> the other supercomputers, or recompiling on this one. Or maybe
> 2.7b1 has a fix?
> Jeff
>
>
>
> On Apr 13, 2009, at 11:36 PM, Axel Kohlmeyer wrote:
>
>> jeff,
>>
>> have you looked whether the configuration that you
>> are using is actually reasonable?
>>
>> all the symptoms you describe would happen when
>> you have an extremely high energy structure that
>> minimization cannot get you out of, e.g. with
>> "entangled bonds".
>>
>> all the overflow output and "nan"s are typical
>> in such cases and they mess up everything including
>> neighbor lists. in order to keep performance high,
>> codes like NAMD have no error checking in their
>> inner loops, so once you have a "bad" number, it
>> keeps spreading throughout the whole system.
>>
>> i would try a few steps MD with writing out every
>> MD step and see if some atoms become "ballistic".
>>
>> cheers,
>> axel.
>>
>> On Mon, 2009-04-13 at 22:59 -0400, JT wrote:
>>> Here is a little more information about the crash occurring during
>>> minimization. Running using only single processor mode creates some
>>> sort of memory leak. It writes an enormous amount of junk to the
>>> NAMD output that looks like:
>>>
>>> Info: Finished startup with 27120 kB of memory in use.
>>> TCL: Minimizing for 2000 steps
>>> ETITLE:      TS           BOND          ANGLE          DIHED          IMPRP
>>>           ELECT            VDW       BOUNDARY           MISC        KINETIC
>>>           TOTAL           TEMP         TOTAL2         TOTAL3        TEMPAVG
>>>        PRESSURE      GPRESSURE         VOLUME       PRESSAVG      GPRESSAVG
>>>
>>> ENERGY:       0       491.9037  99999999.9999        76.3226         8.5673
>>>    -428179.8422      1346.5543         0.0000         0.0000         0.0000
>>>   99999999.9999         0.0000  99999999.9999  99999999.9999         0.0000
>>>   99999999.9999  99999999.9999    272212.3858  99999999.9999  99999999.9999
>>>
>>> INITIAL STEP: 1e-06
>>> GRADIENT TOLERANCE: nan
>>> BRACKET: 0 nan nan nan nan
>>> RESTARTING CONJUGATE GRADIENT ALGORITHM
>>> INITIAL STEP: 5e-07
>>> GRADIENT TOLERANCE: nan
>>> BRACKET: 0 nan nan nan nan
>>> RESTARTING CONJUGATE GRADIENT ALGORITHM
>>> INITIAL STEP: 2.5e-07
>>> GRADIENT TOLERANCE: nan
>>> BRACKET: 0 nan nan nan nan
>>> NEW SEARCH DIRECTION
>>
>> --
>> =======================================================================
>> Axel Kohlmeyer   akohlmey_at_cmm.chem.upenn.edu   http://www.cmm.upenn.edu
>> Center for Molecular Modeling -- University of Pennsylvania
>> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
>> tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
>> =======================================================================
>> If you make something idiot-proof, the universe creates a better idiot.
>>
>
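
To see which energy term goes bad first in the NAMD output quoted above,
the log can be scanned for ENERGY records that contain nan or the
99999999.9999 value. A minimal sketch in Python, assuming each ETITLE:
and ENERGY: record sits on a single line in the real log file (the
wrapping in the quoted output comes from the mailer) and treating
99999999.9999 as an overflow marker. The function name
`bad_energy_terms` is hypothetical; the sample values are copied from
the quoted step-0 record:

```python
import math

OVERFLOW = 99999999.9999  # value seen in the quoted log, treated here as an overflow marker


def bad_energy_terms(log_lines):
    """Yield (timestep, [term names]) for ENERGY records with nan or overflowed terms."""
    titles = None
    for line in log_lines:
        if line.startswith("ETITLE:"):
            titles = line.split()[1:]          # column names from the latest header
        elif line.startswith("ENERGY:") and titles:
            values = [float(v) for v in line.split()[1:]]
            record = dict(zip(titles, values))
            bad = [name for name, v in record.items()
                   if name != "TS" and (math.isnan(v) or abs(v) >= OVERFLOW)]
            if bad:
                yield int(record["TS"]), bad


# Step-0 record from the quoted output, rejoined onto single lines.
sample_log = [
    "ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC "
    "TOTAL TEMP TOTAL2 TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG",
    "ENERGY: 0 491.9037 99999999.9999 76.3226 8.5673 -428179.8422 1346.5543 "
    "0.0000 0.0000 0.0000 99999999.9999 0.0000 99999999.9999 99999999.9999 "
    "0.0000 99999999.9999 99999999.9999 272212.3858 99999999.9999 99999999.9999",
]
for step, terms in bad_energy_terms(sample_log):
    print(f"step {step}: suspicious terms {', '.join(terms)}")
```

If the columns line up as assumed, the only individual energy term
flagged at step 0 is ANGLE (the rest are totals and pressures), which
would fit a distorted polymer geometry rather than a problem with the
cluster build.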
