Re: System minimization: fail after 199 steps

From: Brian Bennion (brian_at_youkai.llnl.gov)
Date: Wed Nov 03 2004 - 13:05:56 CST

Hello Charles,

A little background...
NAMD requires Charm++ to compile correctly, so the natural order is that
Charm++ is compiled first and then NAMD is compiled against it. The fact
that NAMD runs at all on your system suggests that Charm++ was compiled
at some point.

I am not familiar with the Sun SPARC setup, but charmrun may be used here
to distribute the job across the nodes.

Can anyone comment here?

Steps that I can recommend:
Try minimizing just the protein alone in vacuo for 200+ steps.
Are you sure that the total system is 58,236 atoms? That seems small for
such a complex box.
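
One rough way to check whether the atom count is plausible is to compare the
number density of the system with that of bulk water. This is only a sketch:
it assumes that the 576000.0000 value repeated in the ENERGY lines quoted
below is the cell volume in cubic angstroms, and it uses bulk water
(about 33.4 molecules per cubic nanometer) as the yardstick, which ignores
the different packing of lipid and protein.

```python
# Rough sanity check: atoms per cubic angstrom for the reported system.
# Assumption: 576000.0000 in the quoted ENERGY lines is the VOLUME column (A^3).
N_ATOMS = 58_236
VOLUME_A3 = 576_000.0

# Bulk liquid water: ~0.0334 molecules per A^3, three atoms per molecule.
WATER_ATOMS_PER_A3 = 3 * 0.0334


def number_density(n_atoms: int, volume_a3: float) -> float:
    """Return the number of atoms per cubic angstrom."""
    return n_atoms / volume_a3


density = number_density(N_ATOMS, VOLUME_A3)
ratio = density / WATER_ATOMS_PER_A3
print(f"{density:.4f} atoms/A^3, {ratio:.2f} x bulk water")
```

A ratio close to 1 would mean the atom count is consistent with a fully
packed box of that volume; a ratio well below 1 would point to missing
atoms or gaps in the box.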

Can you send the whole log file from startup to crash?
There just might not be enough memory, but I would expect that to
show up earlier in the run.
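
For scale, the failed request in the "Could not malloc()" line quoted below
can be converted to GiB. This is a minimal sketch; the comparison with the
2 GiB signed 32-bit boundary is my assumption about what might be relevant,
not something the log confirms.

```python
# Size of the failed allocation, copied from the "Could not malloc()" log line.
FAILED_MALLOC_BYTES = 2_118_274_080

GIB = 2 ** 30
# Assumption: this boundary only matters if the binary is a 32-bit build.
SIGNED_32BIT_LIMIT = 2 ** 31


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / GIB


request_gib = to_gib(FAILED_MALLOC_BYTES)
headroom_bytes = SIGNED_32BIT_LIMIT - FAILED_MALLOC_BYTES
print(f"single request: {request_gib:.2f} GiB")
print(f"bytes below the 2 GiB boundary: {headroom_bytes}")
```

A single request of this size is far more than the coordinates of a system
of this size should need, so the full log may show what is being allocated
at that point.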

Thanks
Brian

On Wed, 3 Nov 2004, Charles Danko wrote:

> Hi,
>
> Thanks to Brian and Dr. Valencia for their help.
>
> The machines are a cluster of 64-bit Sun SPARC processors running
> Solaris 7. I am using bsub for multithreading. My administrator may
> let me run charm++ if you think that it may solve the problem, but
> there may be some good reason that it wasn't used before (namd was
> compiled by a colleague of mine, and I am not sure of specific issues
> he faced when putting it together).
>
> The system is a protein, lipid, and water system, in total, 58,236
> atoms constructed from a protein homology model. The system was
> assembled using VMD, the membrane 1.0 plug-in, and solvate 1.2 (to
> solvate the top and bottom where the protein was sticking out of the
> pre-equilibrated lipid-water system constructed by membrane). I
> deleted all atoms within 1A of the protein and am now trying to
> minimize the system.
>
> Based on Dr. Valencia and Dr. Bennion's suggestions I changed the
> script file. I adapted the one intended to heat the system after the
> minimization. I have included the new script file as an attachment.
> The run still crashes after 199 steps, but this time it returns a
> malloc error. Short by 2GB?
> The last part of the output is pasted below. Many of the forces are
> positive again.
>
> I have tried to fix the protein and minimize the water/lipids; the
> output is pasted below. The system lasted for 299 steps this time,
> but received the same malloc error.
>
> I have NOT deleted the atoms which fall outside of my periodic
> boundary. If you recommend it, I will do this and try to run the new
> script again. I am acting under the assumption that these atoms will
> be ignored.
> Is this correct?
>
> Because the problem seems to be a memory allocation error, I am
> thinking that the next step will be to try to convince my
> administrator to compile charm++.
> Any thoughts or suggestions?
> Do I need to recompile all of namd, or can I just compile charm++ without it?
>
> Thanks again for all of the help,
> Charles
>
> Output files:
>
> New script, no atoms fixed.
>
> BRACKET: 6.57916e-07 652.946 -2.45009e+09 -8.45313e+07 9.29531e+08
> ENERGY: 198 522579.9239 151303.8494 10858.5910 1446.8211
> -80557.5403 481695.6698 0.0000 0.0000 0.0000
> 1087327.3149 0.0000 1087327.3149 1087327.3149 0.0000
> 188642.4062 235104.9289 576000.0000 188642.4062 235104.9289
>
> BRACKET: 1.6835e-07 70.1294 -8.45313e+07 1.17645e+07 9.29531e+08
> ENERGY: 199 522585.2964 151303.6975 10858.5915 1446.8152
> -80557.4059 481690.3089 0.0000 0.0000 0.0000
> 1087327.3036 0.0000 1087327.3036 1087327.3036 0.0000
> 188639.2161 235101.2267 576000.0000 188639.2161 235101.2267
>
> LDB: LOAD: AVG 231.478 MAX 291.895 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
> LDB: LOAD: AVG 231.478 MAX 255.756 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
> LDB: LOAD: AVG 231.478 MAX 236.106 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
> Could not malloc() 2118274080 bytes--are we out of memory?Fatal error, aborting.
> Rtasks fail:
> Rtask(s) 1 : exited with signal <6>
> Rtask(s) 3 2 4 5 8 6 7 10 9 : exited with signal <15>
> Rtask(s) 1 : coredump
>
> New Script, Fixed Protein
>
> BRACKET: 1.64649e-05 26875.6 -8.15248e+09 -2.11699e+09 7.56124e+09
> ENERGY: 298 246811.3244 127002.1868 7801.5138 776.0334
> -110553.6799 343151.2350 0.0000 0.0000 0.0000
> 614988.6135 0.0000 614988.6135 614988.6135 0.0000
> 156657.9925 177933.1586 576000.0000 156657.9925 177933.1586
>
> BRACKET: 8.23246e-06 12090.3 -2.11699e+09 -9.70546e+08 7.56124e+09
> ENERGY: 299 245766.2529 126976.5313 7802.2252 775.5543
> -110592.3262 343704.2276 0.0000 0.0000 0.0000
> 614432.4651 0.0000 614432.4651 614432.4651 0.0000
> 156870.7170 178776.2517 576000.0000 156870.7170 178776.2517
>
> LDB: LOAD: AVG 212.831 MAX 217.851 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
> LDB: LOAD: AVG 212.831 MAX 216.577 MSGS: TOTAL 184 MAXC 20 MAXP 5 Refine
> Could not malloc()--are we out of memory?Fatal error, aborting.
> Rtasks fail:
> Rtask(s) 1 : exited with signal <6>
> Rtask(s) 3 2 4 5 6 8 7 9 10 : exited with signal <15>
> Rtask(s) 1 : coredump
>
> On Tue, 02 Nov 2004 13:22:50 -0600 (CST), J. Valencia
> <jonathan_at_ibt.unam.mx> wrote:
> > Also, for par_all27_prot_lipid.prm the suggested cutoff scheme is:
> > switchdist 10.0
> > cutoff 12.0
> > pairlistdist 14.0
> > This is stated almost at the end of the file.
> >
> > Good luck!
> >
> > J. Valencia.
> >
>

*****************************************************************
**Brian Bennion, Ph.D. **
**Computational and Systems Biology Division **
**Biology and Biotechnology Research Program **
**Lawrence Livermore National Laboratory **
**P.O. Box 808, L-448 bennion1_at_llnl.gov **
**7000 East Avenue phone: (925) 422-5722 **
**Livermore, CA 94550 fax: (925) 424-6605 **
*****************************************************************
