Re: System minimization: fail after 199 steps

From: Brian Bennion (brian_at_youkai.llnl.gov)
Date: Thu Nov 11 2004 - 10:18:49 CST

Hello Charles,

I don't want to rain on your parade, I am glad that you have had some
success, but the fact that turning this off allows the program to run is a
wee bit scary!

A good portion of the speed of NAMD comes from balancing its load
across the processes. I would predict that after a certain number of
steps the calculations will become exceedingly slow.
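
For readers who want to experiment, the settings discussed in this thread
might be written in a NAMD configuration file roughly as follows. Only
ldbStrategy and its values come from the thread; ldbPeriod and firstLdbStep
are assumed names for the timing options in simparameters.C, and the step
counts are placeholders, so check both against the source for your version.

```tcl
# Workaround reported below for NAMD 2.5: turn load balancing off.
ldbStrategy     none

# Alternative to experiment with: keep load balancing but change its timing.
# ASSUMPTION: ldbPeriod and firstLdbStep are the option names in
# simparameters.C; the values here are illustrative, not recommendations.
# ldbStrategy   refineonly
# firstLdbStep  1000
# ldbPeriod     4000
```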

Jim, the lead developer, is away at SC2004 in Pittsburgh. Hopefully he can
address this issue when he returns.

I am still wondering if this is not a problem with charm++. Did you get a
chance to recompile it or to check the megatests across >5 nodes?

Regards
Brian

On Thu, 11 Nov 2004, Charles Danko wrote:

> Hi, Brian,
>
> Yes! Turning off load balancing seems to work fine! None of the
> other algorithms seem to work. I assume that the program I am using
> to start NAMD will perform load balancing, but I am going to check
> with my administrator to be sure. Excellent suggestion!! Thanks
> again!
>
> Best wishes,
> Charles
>
> For those who are having the same problem, load balancing is an
> undocumented feature. Those using the SUN OS and bsub to start NAMD
> will most likely need to turn this feature off. In NAMD 2.5, it can
> be turned off like so:
> ldbStrategy none
>
> The other acceptable parameters are:
> refineonly
> alg7
> orb
> neighbor
> other - this seemed to be the same as alg7
>
> Good luck!
>
>
> On Tue, 9 Nov 2004 13:40:39 -0800 (PST), Brian Bennion
> <brian_at_youkai.llnl.gov> wrote:
> > Hello Charles
> >
> > You might be able to set the output timing to a larger number, but I don't
> > know if that will stop the initial timing entry in the log.
> >
> > The number of steps before load balancing occurs can be changed and is, to
> > my last knowledge, an undocumented feature. You can turn it off, change the
> > type of algorithm, as well as the number of steps between load-balancing
> > efforts.
> >
> > Look at the source code on the namd website under simparameters.C
> > Brian
> >
> >
> >
> >
> > On Tue, 9 Nov 2004, Charles Danko wrote:
> >
> > > Hi,
> > >
> > > I do not believe that the problem is in the amount of memory. The
> > > NAMD users guide says that the program draws, at maximum, 300MB for a
> > > system over 100,000 atoms. My system is approximately half that size.
> > > Raising the "pairlistminprocs" parameter does not seem to alter the
> > > error message. Finally, on our system each task is allowed to take
> > > 4GB of memory. Even if each processor draws 300MB, a 10 processor
> > > simulation will only draw 3GB - safely below the maximum.
> > >
> > > NAMD seems to die when it is run with more than 4 processors
> > > for this particular system. It always seems to die when the program
> > > is estimating the memory usage, and the length of time for 1ns of
> > > simulation. Is there some way that I can force it to skip these
> > > steps? I have looked through the manual but haven't seen anything
> > > useful.
> > >
> > > Simulation output for 4 processors can be found here (3MB file):
> > > http://www.campbellferrara.com/heating-4proc.out
> > >
> > > and 10 here (smaller file):
> > > http://www.campbellferrara.com/heating-10proc.out
> > >
> > > The script file that was used to run the simulation can be found here
> > > (smaller file):
> > > http://www.campbellferrara.com/heating.namd
> > >
> > > I would be grateful for any other suggestions.
> > >
> > > Thanks,
> > > Charles
> > >
> > > The details of what I have tried to reach this conclusion:
> > > 1. Minimization on the protein in a vacuum as Brian recommended. This
> > > ran through 2k steps and completed successfully.
> > > 2. I realized that there may be some overlapping water molecules since
> > > solvate does not delete these using the minmax option. I deleted all
> > > of them and then minimized the system. The minimization ran for 2k
> > > steps and completed successfully. The system minimized to a gradient
> > > of ~35. I was trying to get it under 5 as has been recommended on
> > > this mailing list, so I loaded the restart files to minimize for another
> > > thousand steps, and the system died after 199 steps.
> > > 3. The previous step had ASCII save coordinates. I wanted to see what
> > > would happen if I tried binary. I restarted a simulation from the
> > > 1500 step binary save file. The simulation died after 299 steps.
> > > 4. I tried running the simulation from the beginning for 5000 steps
> > > (the same as had just run successfully) and the system died after 199
> > > steps.
> > > 5. When the minimized system coordinates are loaded into VMD, the system
> > > looks inside out. I fixed the periodic boundary conditions settings
> > > so that the box fits around the entire system (measured with minmax
> > > and center in VMD). I started the minimization from the beginning. It
> > > died after 199 steps.
> > > 6. I spoke to my administrator about memory and he said that no
> > > application can draw more than 4GB.
> > > 7. I tried the minimization with 1 processor (generally I use 10). It
> > > took forever, but worked.
> > > 8. A heating script works on 4 processors, but dies after step 199 on
> > > 10 processors.
> > > 9. Similarly, an equilibration script seems to work with 4 processors,
> > > but dies with any number higher than 5.
> > >
> > >
> > >
> > > On Wed, 3 Nov 2004 11:05:56 -0800 (PST), Brian Bennion
> > > <brian_at_youkai.llnl.gov> wrote:
> > > > Hello Charles,
> > > >
> > > > A little background...
> > > > NAMD requires charm++ to compile correctly, so the natural order is that
> > > > charm++ is compiled first and then NAMD is compiled against it. The fact
> > > > that NAMD runs at all on your system would suggest that charm++ has been
> > > > compiled at some point.
> > > >
> > > > I am not familiar with the Sun SPARC setup, but charmrun may be used here
> > > > to propagate the job through the nodes.
> > > >
> > > > Can anyone comment here?
> > > >
> > > > Steps that I can recommend....
> > > > Try just minimizing the protein alone in vacuo for 200+ steps?
> > > > Are you sure that the total system is 58,236 atoms? That seems small for
> > > > such a complex box.
> > > >
> > > > Can you send the whole log file from startup to crash?
> > > > There just might not be enough memory? But I would think that this would
> > > > manifest itself earlier.
> > > >
> > > > Thanks
> > > > Brian
> > > >
> > > >
> > > >
> > > > On Wed, 3 Nov 2004, Charles Danko wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Thanks to Brian and Dr. Valencia for their help.
> > > > >
> > > > > The machines are a cluster of Sun SPARC 64-bit processors running
> > > > > Solaris 7. I am using bsub for multithreading. My administrator may
> > > > > let me run charm++ if you think that it may solve the problem, but
> > > > > there may be some good reason that it wasn't used before (NAMD was
> > > > > compiled by a colleague of mine, and I am not sure of specific issues
> > > > > he faced when putting it together).
> > > > >
> > > > > The system is a protein, lipid, and water system, in total, 58,236
> > > > > atoms constructed from a protein homology model. The system was
> > > > > assembled using VMD, the membrane 1.0 plug-in, and solvate 1.2 (to
> > > > > solvate the top and bottom where the protein was sticking out of the
> > > > > pre-equilibrated lipid-water system constructed by membrane). I
> > > > > deleted all atoms within 1A of the protein and am now trying to
> > > > > minimize the system.
> > > > >
> > > > > Based on Dr. Valencia and Dr. Bennion's suggestions I changed the
> > > > > script file. I adapted the one intended to heat the system after the
> > > > > minimization. I have included the new script file as an attachment.
> > > > > The run still crashes after 199 steps, but this time it returns a
> > > > > malloc error. Short by 2GB?
> > > > > The last part of the output is pasted below. Many of the forces are
> > > > > positive again.
> > > > >
> > > > > I have tried to fix the protein and minimize the water/lipids; the
> > > > > output is pasted below. The system lasted for 299 steps this time,
> > > > > but received the same malloc error.
> > > > >
> > > > > I have NOT deleted the atoms which fall outside of my periodic
> > > > > boundary. If you recommend it, I will do this and try to run the new
> > > > > script again. I am acting under the assumption that these atoms will
> > > > > be ignored.
> > > > > Is this correct?
> > > > >
> > > > > Because the problem seems to be a memory allocation error, I am
> > > > > thinking that the next step will be trying to convince my
> > > > > administrator to compile charm++.
> > > > > Any thoughts or suggestions?
> > > > > Do I need to recompile all of NAMD, or can I just compile charm++ without it?
> > > > >
> > > > > Thanks again for all of the help,
> > > > > Charles
> > > > >
> > > > > Output files:
> > > > >
> > > > > New script, no atoms fixed.
> > > > >
> > > > > BRACKET: 6.57916e-07 652.946 -2.45009e+09 -8.45313e+07 9.29531e+08
> > > > > ENERGY: 198 522579.9239 151303.8494 10858.5910 1446.8211
> > > > > -80557.5403 481695.6698 0.0000 0.0000 0.0000
> > > > > 1087327.3149 0.0000 1087327.3149 1087327.3149 0.0000
> > > > > 188642.4062 235104.9289 576000.0000 188642.4062 235104.9289
> > > > >
> > > > > BRACKET: 1.6835e-07 70.1294 -8.45313e+07 1.17645e+07 9.29531e+08
> > > > > ENERGY: 199 522585.2964 151303.6975 10858.5915 1446.8152
> > > > > -80557.4059 481690.3089 0.0000 0.0000 0.0000
> > > > > 1087327.3036 0.0000 1087327.3036 1087327.3036 0.0000
> > > > > 188639.2161 235101.2267 576000.0000 188639.2161 235101.2267
> > > > >
> > > > > LDB: LOAD: AVG 231.478 MAX 291.895 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
> > > > > LDB: LOAD: AVG 231.478 MAX 255.756 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
> > > > > LDB: LOAD: AVG 231.478 MAX 236.106 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
> > > > > Could not malloc() 2118274080 bytes--are we out of memory?Fatal error, aborting.
> > > > > Rtasks fail:
> > > > > Rtask(s) 1 : exited with signal <6>
> > > > > Rtask(s) 3 2 4 5 8 6 7 10 9 : exited with signal <15>
> > > > > Rtask(s) 1 : coredump
> > > > > >
> > > > >
> > > > > New Script, Fixed Protein
> > > > >
> > > > > BRACKET: 1.64649e-05 26875.6 -8.15248e+09 -2.11699e+09 7.56124e+09
> > > > > ENERGY: 298 246811.3244 127002.1868 7801.5138 776.0334
> > > > > -110553.6799 343151.2350 0.0000 0.0000 0.0000
> > > > > 614988.6135 0.0000 614988.6135 614988.6135 0.0000
> > > > > 156657.9925 177933.1586 576000.0000 156657.9925 177933.1586
> > > > >
> > > > > BRACKET: 8.23246e-06 12090.3 -2.11699e+09 -9.70546e+08 7.56124e+09
> > > > > ENERGY: 299 245766.2529 126976.5313 7802.2252 775.5543
> > > > > -110592.3262 343704.2276 0.0000 0.0000 0.0000
> > > > > 614432.4651 0.0000 614432.4651 614432.4651 0.0000
> > > > > 156870.7170 178776.2517 576000.0000 156870.7170 178776.2517
> > > > >
> > > > > LDB: LOAD: AVG 212.831 MAX 217.851 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
> > > > > LDB: LOAD: AVG 212.831 MAX 216.577 MSGS: TOTAL 184 MAXC 20 MAXP 5 Refine
> > > > > Could not malloc()--are we out of memory?Fatal error, aborting.
> > > > > Rtasks fail:
> > > > > Rtask(s) 1 : exited with signal <6>
> > > > > Rtask(s) 3 2 4 5 6 8 7 9 10 : exited with signal <15>
> > > > > Rtask(s) 1 : coredump
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, 02 Nov 2004 13:22:50 -0600 (CST), J. Valencia
> > > > > <jonathan_at_ibt.unam.mx> wrote:
> > > > > > Also, for par_all27_prot_lipid.prm the suggested cutoff scheme is:
> > > > > > switchdist 10.0
> > > > > > cutoff 12.0
> > > > > > pairlistdist 14.0
> > > > > > This is stated almost at the end of the file.
> > > > > >
> > > > > > Good luck!
> > > > > >
> > > > > > J. Valencia.
> > > > > >
> > > > >
> > > >
> > > > *****************************************************************
> > > > **Brian Bennion, Ph.D. **
> > > > **Computational and Systems Biology Division **
> > > > **Biology and Biotechnology Research Program **
> > > > **Lawrence Livermore National Laboratory **
> > > > **P.O. Box 808, L-448 bennion1_at_llnl.gov **
> > > > **7000 East Avenue phone: (925) 422-5722 **
> > > > **Livermore, CA 94550 fax: (925) 424-6605 **
> > > > *****************************************************************
> > > >
> > > >
> > >
> >
>


This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:37:59 CST