Re: 4 sequential jobs work on laptops, but only first one works on supercomputer

From: JT (JTibbitt_at_odu.edu)
Date: Mon Apr 13 2009 - 23:12:06 CDT

Axel,
The system itself seems to be alright, because the jobs run fine on two
different laptops using different NAMD binaries (OS X and XP):
minimization and equilibration both complete there. So I think the
problem lies with the cluster environment. Perhaps NAMD was compiled
incorrectly on it. When running on the cluster in single-processor
mode, there appears to be a memory leak that generates many gigabytes
of 'nan' output unless the job is killed, while in parallel mode it at
least crashes sooner, with a 'segmentation violation' in the cluster
log output. I just found a thread from a couple of years back reporting
a similar problem:

http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/5770.html

This sort of stuff is all new to me. Amit, who handles the cluster
compiles here, is helping out. We may try compiling on one of
the other supercomputers, or recompiling on this one. Or maybe
2.7b1 has a fix?
Jeff

On Apr 13, 2009, at 11:36 PM, Axel Kohlmeyer wrote:

> jeff,
>
> have you looked whether the configuration that you
> are using is actually reasonable?
>
> all the symptoms you describe would happen when
> you have an extremely high energy structure that
> minimization cannot get you out of, e.g. with
> "entangled bonds".
>
> all the overflow output and "nan"s are typical
> in such cases and they mess up everything including
> neighborlists. in order to keep performance high,
> codes like NAMD have no error checking in their
> inner loops, so once you have a "bad" number, it
> keeps spreading throughout the whole system.
>
> i would try a few steps of MD, writing out every
> MD step, and see if some atoms become "ballistic".
>
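One way to make the "ballistic" check concrete is to compare consecutive frames and list the atoms that jump too far. This is only a minimal sketch: it assumes the frames have already been loaded into plain lists of (x, y, z) tuples in the same atom order, and both the function name `ballistic_atoms` and the `max_step` cutoff are hypothetical choices, not anything NAMD provides.

```python
import math


def ballistic_atoms(prev_frame, next_frame, max_step=1.0):
    """Return indices of atoms that moved more than max_step between frames.

    prev_frame and next_frame are sequences of (x, y, z) tuples in the same
    atom order. max_step is a hypothetical cutoff in the same length unit as
    the coordinates. An atom whose displacement is nan is always reported.
    """
    flagged = []
    for index, (before, after) in enumerate(zip(prev_frame, next_frame)):
        moved = math.dist(before, after)  # Euclidean distance, Python 3.8+
        if math.isnan(moved) or moved > max_step:
            flagged.append(index)
    return flagged
```

With a trajectory written at every step, running this over each pair of successive frames should show which atoms take off first, which is usually more informative than the state after the whole system has filled with nan values.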
> cheers,
> axel.
>
> On Mon, 2009-04-13 at 22:59 -0400, JT wrote:
>> Here is a little more information about the crash occurring during
>> minimization. Running in single-processor mode creates some sort of
>> memory leak. It writes an enormous amount of junk to the NAMD
>> output that looks like:
>>
>> Info: Finished startup with 27120 kB of memory in use.
>> TCL: Minimizing for 2000 steps
>> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP TOTAL2 TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>>
>> ENERGY: 0 491.9037 99999999.9999 76.3226 8.5673 -428179.8422 1346.5543 0.0000 0.0000 0.0000 99999999.9999 0.0000 99999999.9999 99999999.9999 0.0000 99999999.9999 99999999.9999 272212.3858 99999999.9999 99999999.9999
>>
>> INITIAL STEP: 1e-06
>> GRADIENT TOLERANCE: nan
>> BRACKET: 0 nan nan nan nan
>> RESTARTING CONJUGATE GRADIENT ALGORITHM
>> INITIAL STEP: 5e-07
>> GRADIENT TOLERANCE: nan
>> BRACKET: 0 nan nan nan nan
>> RESTARTING CONJUGATE GRADIENT ALGORITHM
>> INITIAL STEP: 2.5e-07
>> GRADIENT TOLERANCE: nan
>> BRACKET: 0 nan nan nan nan
>> NEW SEARCH DIRECTION
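To see which terms in an ENERGY record have already blown up, the values can be matched against the ETITLE column names. The sketch below assumes each ETITLE: and ENERGY: record sits on a single line, as in NAMD's own log file rather than wrapped by a mail client, and it assumes that 99999999.9999 is what the log prints in place of a value that does not fit the field, as the output quoted above suggests. The function names are made up for illustration.

```python
import math

# Value that appears in the quoted log where a number is off the scale
# (assumed here to be an overflow marker).
OVERFLOW = 99999999.9999


def bad_energy_terms(etitle_line, energy_line):
    """Return the names of terms that are overflowed, nan, or inf."""
    names = etitle_line.split()[1:]    # drop the "ETITLE:" tag
    values = energy_line.split()[1:]   # drop the "ENERGY:" tag
    bad = []
    for name, text in zip(names, values):
        if name == "TS":
            continue  # timestep counter, not an energy term
        value = float(text)  # float() also accepts "nan" and "inf"
        if math.isnan(value) or math.isinf(value) or abs(value) >= OVERFLOW:
            bad.append(name)
    return bad


def scan_log(lines):
    """Yield (timestep, bad_terms) for every ENERGY line with a bad term."""
    etitle = None
    for line in lines:
        if line.startswith("ETITLE:"):
            etitle = line
        elif line.startswith("ENERGY:") and etitle is not None:
            bad = bad_energy_terms(etitle, line)
            if bad:
                yield int(line.split()[1]), bad
```

Applied to the step-0 record above, this reports ANGLE, TOTAL, TOTAL2, TOTAL3, PRESSURE, GPRESSURE, PRESSAVG and GPRESSAVG, so ANGLE is the only individual energy term that is already off the scale before the minimizer takes its first step.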
>
> --
> =======================================================================
> Axel Kohlmeyer akohlmey_at_cmm.chem.upenn.edu http://www.cmm.upenn.edu
> Center for Molecular Modeling -- University of Pennsylvania
> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
> =======================================================================
> If you make something idiot-proof, the universe creates a better idiot.
>
