Re: System minimization: fail after 199 steps

From: Brian Bennion (brian_at_youkai.llnl.gov)
Date: Tue Nov 23 2004 - 12:46:15 CST

Hi Charles,

This may be an environment problem. Are you running on a Sun
cluster?
In any case, check the output of the ulimit -a command. Depending on the
OS you are running, certain resources might be maxing out according to
limits set in your environment.
Depending on your system, signal <30> could mean that a CPU resource limit
has been reached, or that a user limit has been reached, and so on.
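
As a rough companion to ulimit -a, the sketch below lists the soft and hard resource limits seen by a process through Python's standard resource module. It is only an illustration: the list of limit names is my own selection, not every platform defines all of them, and the limits seen by a batch job can differ from those in a login shell. For what it is worth, exceeding a CPU-time limit delivers SIGXCPU, which to my knowledge is signal 30 on Solaris; signal numbers differ between operating systems.

```python
import resource

# Limits roughly corresponding to what `ulimit -a` reports.
# The selection is illustrative; platforms may not define every constant.
LIMIT_NAMES = [
    "RLIMIT_CPU",     # CPU time, in seconds
    "RLIMIT_FSIZE",   # maximum size of a file the process may create
    "RLIMIT_DATA",    # data segment size
    "RLIMIT_STACK",   # stack size
    "RLIMIT_CORE",    # core file size
    "RLIMIT_NOFILE",  # number of open file descriptors
    "RLIMIT_AS",      # address space (virtual memory)
]


def format_limit(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits this platform defines."""
    limits = {}
    for name in LIMIT_NAMES:
        const = getattr(resource, name, None)
        if const is None:
            continue  # this limit does not exist on the current OS
        limits[name] = resource.getrlimit(const)
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name:<14} soft={format_limit(soft):<12} hard={format_limit(hard)}")
```

Running this inside the same batch job that launches NAMD (rather than interactively) would show the limits that actually apply to the failing run.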

Please post the output of your ulimit -a if possible.
regards
Brian

 On Tue, 23 Nov 2004, Charles Danko wrote:

> Hi, Brian
>
> That is indeed all that it said. It has happened several more times
> since your last message. NAMD seems to be running; it runs for somewhere
> in the range of 200-400k steps and then crashes with message 30. The end
> of an output file from a run on 26 processors is pasted below. In this
> run, the first step was set to 360k.
>
> Let me know if you have any other ingenious ideas. After today, I will
> be away (again) until the 30th.
>
> Best wishes,
> Charles
>
> ENERGY: 640500 17882.6626 21449.7292 7025.7441 294.4760
> -149589.2591 4163.6862 0.0000 0.0000 52368.5558
> -46404.4053 302.3977 -45769.1384 -45763.2852 301.1117
> 844.4251 -236.8373 763614.3676 -179.5831 -180.7280
>
> ENERGY: 641000 18000.3987 21489.2007 6929.6029 319.5929
> -149870.1117 4392.0121 0.0000 0.0000 51818.9544
> -46920.3500 299.2241 -46288.4331 -46279.3096 300.5616
> 1035.1246 -182.4193 763614.3676 -232.1634 -231.9390
>
> WRITING COORDINATES TO DCD FILE AT STEP 641000
> ENERGY: 641500 17947.6275 21659.5130 6908.2926 298.7991
> -149726.8557 4085.8256 0.0000 0.0000 51720.9877
> -47105.8102 298.6584 -46469.4343 -46473.1075 299.4405
> 683.6453 -321.9687 763614.3676 -223.9394 -222.7303
>
> Rtasks fail:
> Rtask(s) 2 3 4 5 7 10 8 9 11 12 6 13 1 15 17 19 18 14 16 20 21 23 22 24 25 26 :
> exited with signal <30>
> >
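
The "Rtask(s) ... exited with signal <N>" lines above are easy to summarize with a short script when many runs need checking. The following is a minimal sketch, assuming the log format is exactly as pasted in this thread; the function names are hypothetical. Note that the signal name lookup reflects the machine running the script, and signal numbers above 15 or so differ between operating systems, so the name for 30 is only meaningful when looked up on the cluster itself.

```python
import re
import signal

# Matches lines such as:
#   Rtask(s) 3 2 4 5 8 6 7 10 9 : exited with signal <15>
RTASK_SIGNAL = re.compile(
    r"Rtask\(s\)\s+([\d\s]+?)\s*:\s*exited with signal <(\d+)>"
)

# Tail of the 26-processor log quoted above, including mail quote markers.
SAMPLE = """\
> Rtasks fail:
> Rtask(s) 2 3 4 5 7 10 8 9 11 12 6 13 1 15 17 19 18 14 16 20 21 23 22 24 25 26 :
> exited with signal <30>
"""


def signal_name(number):
    """Name of a signal number on the machine running this script."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "unknown"


def failed_tasks(log_text):
    """Return {signal_number: sorted task ids} found in log_text."""
    # Mail clients wrap the task list, so strip '>' quote markers and
    # join the text onto a single line before matching.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    result = {}
    for tasks, sig in RTASK_SIGNAL.findall(flat):
        result.setdefault(int(sig), []).extend(int(t) for t in tasks.split())
    return {sig: sorted(ids) for sig, ids in result.items()}


if __name__ == "__main__":
    for sig, tasks in failed_tasks(SAMPLE).items():
        print(f"signal {sig} ({signal_name(sig)} on this host): {len(tasks)} tasks")
```

Here all 26 tasks report the same signal, which is at least consistent with a limit that applies to every process rather than a fault in a single one.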
>
>
>
> On Mon, 15 Nov 2004 14:18:20 -0800 (PST), Brian Bennion
> <brian_at_youkai.llnl.gov> wrote:
> > Hello Charles,
> >
> > My prediction was based on tens of nanoseconds. The fact you have an
> > error at 220k steps is also curious. What is message 30? Is that all the
> > log says?
> >
> > interesting indeed.
> >
> > Brian
> >
> >
> >
> >
> > On Mon, 15 Nov 2004, Charles Danko wrote:
> >
> > > Hi, Brian,
> > >
> > > I haven't recompiled charm++, and I am still using bsub to submit the
> > > job. I did talk to the person who compiled namd and charm++, and he
> > > said that when using charmrun to submit the job the program died
> > > without starting the first step. He did not keep the output files.
> > > If you would like to see the output of this for debugging NAMD then I
> > > would be happy to find the installation of charmrun and try it.
> > >
> > > However, for our short-term purposes, turning off load balancing
> > > seems to be sufficient. NAMD ran for 220k steps before it died with
> > > message (30). For the duration of the 220k step run, the program did
> > > not noticeably slow down. For lack of a more rigorous way to track
> > > simulation speed, I directed the standard output into a file and
> > > tracked the change in file size over time. Since NAMD writes
> > > roughly the same amount of information on each step, one expects a
> > > linear relationship between time and file size for a simulation of
> > > constant speed. Over the 220k steps the time/file size relationship
> > > was linear (R^2 = 0.9993). The fit with an exponential was no better.
> > > The time points I used are pasted below (from Excel).
> > >
> > > Since molecular dynamics is not my main project I do not have much
> > > time to devote every day to optimizing NAMD on my system. Thus,
> > > unless more problems come up later, I am happy with the solution of
> > > turning off the load balancing feature. With that said, I would be
> > > happy to cooperate with the developers if they want any information
> > > from my system to help them optimize NAMD for other users having
> > > similar problems.
> > >
> > > Thanks again for all the help!
> > > Charles
> > >
> > > Time File Size
> > > 11/11/04 15:16 180185
> > > 11/11/04 15:32 532519
> > > 11/11/04 16:04 1233043
> > > 11/11/04 16:17 1513125
> > > 11/11/04 16:55 2307435
> > > 11/11/04 17:45 3357264
> > > 11/11/04 21:54 8281933
> > > 11/11/04 23:35 10230066
> > > 11/12/04 10:28 22479821
> > > 11/12/04 11:12 23272536
> > > 11/12/04 12:03 24239744
> > > 11/12/04 13:43 26083328
> > > 11/12/04 15:19 27885568
> > > 11/12/04 17:44 30579787
> > > 11/14/04 11:25 74581205
> > > 11/14/04 16:06 79353189
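
The linearity check described above can be repeated without a spreadsheet. The sketch below fits a straight line to the time points in the table by ordinary least squares and reports R^2. It assumes the timestamps are month/day/two-digit-year and that the sizes are in bytes (the message does not state the unit); the function names are my own.

```python
from datetime import datetime

# (timestamp, output file size) pairs copied from the table above.
SAMPLES = [
    ("11/11/04 15:16", 180185),
    ("11/11/04 15:32", 532519),
    ("11/11/04 16:04", 1233043),
    ("11/11/04 16:17", 1513125),
    ("11/11/04 16:55", 2307435),
    ("11/11/04 17:45", 3357264),
    ("11/11/04 21:54", 8281933),
    ("11/11/04 23:35", 10230066),
    ("11/12/04 10:28", 22479821),
    ("11/12/04 11:12", 23272536),
    ("11/12/04 12:03", 24239744),
    ("11/12/04 13:43", 26083328),
    ("11/12/04 15:19", 27885568),
    ("11/12/04 17:44", 30579787),
    ("11/14/04 11:25", 74581205),
    ("11/14/04 16:06", 79353189),
]


def hours_since_start(samples):
    """Convert the timestamps to hours elapsed since the first sample."""
    times = [datetime.strptime(stamp, "%m/%d/%y %H:%M") for stamp, _ in samples]
    return [(t - times[0]).total_seconds() / 3600.0 for t in times]


def linear_fit(xs, ys):
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


hours = hours_since_start(SAMPLES)
sizes = [size for _, size in SAMPLES]
slope, intercept, r2 = linear_fit(hours, sizes)
print(f"growth: {slope:.0f} per hour, R^2 = {r2:.4f}")
```

A high R^2 alone can hide a gradual slowdown, so comparing the growth rate between consecutive samples early and late in the run is a useful extra check.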
> > >
> > >
> > > On Thu, 11 Nov 2004 08:18:49 -0800 (PST), Brian Bennion
> > > <brian_at_youkai.llnl.gov> wrote:
> > > > Hello Charles,
> > > >
> > > > I don't want to rain on your parade, I am glad that you have had some
> > > > success, but the fact that turning this off allows the program to run is a
> > > > wee bit scary!
> > > >
> > > > A good portion of the speed of NAMD comes from balancing its load
> > > > across the processes. I would predict that after a certain number of
> > > > steps the calculations will become exceedingly slow.
> > > >
> > > > Jim, the lead developer, is away at SC2004 in Pittsburgh. Hopefully he
> > > > can address this issue when he returns.
> > > >
> > > > I am still wondering if this is not a problem with charm++. Did you get a
> > > > chance to recompile it or to check the megatests across >5 nodes?
> > > >
> > > > Regards
> > > > Brian
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, 11 Nov 2004, Charles Danko wrote:
> > > >
> > > > > Hi, Brian,
> > > > >
> > > > > Yes! Turning off load balancing seems to work fine! None of the
> > > > > other algorithms seem to work. I assume that the program I am using
> > > > > to start NAMD will perform load balancing, but I am going to check
> > > > > with my administrator to be sure. Excellent suggestion!! Thanks
> > > > > again!
> > > > >
> > > > > Best wishes,
> > > > > Charles
> > > > >
> > > > > For those who are having the same problem, load balancing is an
> > > > > undocumented feature. Those using the SUN OS and bsub to start NAMD
> > > > > will most likely need to turn this feature off. In NAMD 2.5, it can
> > > > > be turned off like so:
> > > > > ldbStrategy none
> > > > >
> > > > > The other acceptable parameters are:
> > > > > refineonly
> > > > > alg7
> > > > > orb
> > > > > neighbor
> > > > > other - this seemed to be the same as alg7
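
For anyone applying this setting to several configuration files, a small helper can do the edit. This is a sketch under assumptions: the accepted values are simply those listed in the message above for NAMD 2.5, the helper name is hypothetical, and the script treats the file as plain "keyword value" lines (NAMD keywords are case-insensitive, so the match ignores case). Tcl constructs spanning several lines are not handled.

```python
import re

# Values reported in the message above for NAMD 2.5; not checked against
# any other NAMD version.
LDB_STRATEGIES = {"none", "refineonly", "alg7", "orb", "neighbor", "other"}


def set_namd_option(config_text, keyword, value):
    """Return config_text with `keyword value` set exactly once.

    An existing line for the keyword is replaced. Otherwise the new line
    is inserted before the first `run` or `minimize` command, or appended
    if there is none. Lines starting with '#' never match.
    """
    option = re.compile(rf"^\s*{re.escape(keyword)}\b", re.IGNORECASE)
    command = re.compile(r"^\s*(run|minimize)\b", re.IGNORECASE)
    new_line = f"{keyword} {value}"

    lines = config_text.splitlines()
    kept = [line for line in lines if not option.match(line)]
    if len(kept) != len(lines):
        # Put the new setting where the first old one was.
        position = next(i for i, line in enumerate(lines) if option.match(line))
    else:
        position = next(
            (i for i, line in enumerate(kept) if command.match(line)), len(kept)
        )
    kept.insert(position, new_line)
    return "\n".join(kept) + "\n"


def disable_load_balancing(config_text):
    """Convenience wrapper for the setting discussed in this thread."""
    assert "none" in LDB_STRATEGIES
    return set_namd_option(config_text, "ldbStrategy", "none")
```

For example, disable_load_balancing("timestep 1.0\nrun 1000\n") places the ldbStrategy line before the run command, so the setting is read before the simulation starts.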
> > > > >
> > > > > Good luck!
> > > > >
> > > > >
> > > > > On Tue, 9 Nov 2004 13:40:39 -0800 (PST), Brian Bennion
> > > > > <brian_at_youkai.llnl.gov> wrote:
> > > > > > Hello Charles
> > > > > >
> > > > > > You might be able to set the output timing to a larger number, but I don't
> > > > > > know if that will stop the initial timing entry in the log.
> > > > > >
> > > > > > The number of steps before load balancing occurs can be changed; this
> > > > > > is, to my knowledge, an undocumented feature. You can turn load
> > > > > > balancing off, change the type of algorithm, and change the number of
> > > > > > steps between load-balancing efforts.
> > > > > >
> > > > > > Look at the source code on the namd website under simparameters.C
> > > > > > Brian
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, 9 Nov 2004, Charles Danko wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I do not believe that the problem is in the amount of memory. The
> > > > > > > NAMD users guide says that the program draws, at maximum, 300MB for a
> > > > > > > system over 100,000 atoms. My system is approximately half that size.
> > > > > > > Raising the "pairlistminprocs" parameter does not seem to alter the
> > > > > > > error message. Finally, on our system each task is allowed to take
> > > > > > > 4GB of memory. Even if each processor draws 300MB, a 10 processor
> > > > > > > simulation will only draw 3GB - safely below the maximum.
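
The memory estimate above can be written out explicitly. The figures are the ones quoted in this thread (300MB per process, a 4GB cap per task, 10 processes); the failed request size is taken from the malloc error in the log quoted further down. This is only arithmetic on those numbers and says nothing about why the allocation was attempted.

```python
MB = 1024 ** 2
GB = 1024 ** 3

per_task_limit = 4 * GB           # per-task cap quoted in the message
per_process_estimate = 300 * MB   # upper bound quoted from the user's guide
n_procs = 10

total_estimate = per_process_estimate * n_procs

# If the cap applies per task, the relevant comparison is per process.
fits_per_task = per_process_estimate <= per_task_limit
# Even the summed estimate stays below a single 4GB cap.
fits_in_total = total_estimate <= per_task_limit

# Size of the single allocation that failed in the quoted log.
failed_request = 2118274080
failed_request_gb = failed_request / GB
below_2_pow_31 = failed_request < 2 ** 31

print(f"per process: {per_process_estimate / GB:.2f} GB, fits: {fits_per_task}")
print(f"{n_procs} processes: {total_estimate / GB:.2f} GB, fits: {fits_in_total}")
print(f"failed malloc: {failed_request_gb:.2f} GB, below 2**31 bytes: {below_2_pow_31}")
```

The single failed request is several times larger than the expected footprint of a whole process, which fits the observation that the expected memory use is not the limiting factor.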
> > > > > > >
> > > > > > > The system seems to die when NAMD is run with more than 4 processors
> > > > > > > for this particular system. It always seems to die when the program
> > > > > > > is estimating the memory usage, and the length of time for 1ns of
> > > > > > > simulation. Is there some way that I can force it to skip these
> > > > > > > steps? I have looked through the manual but haven't seen anything
> > > > > > > useful.
> > > > > > >
> > > > > > > Simulation output for 4 processors can be found here (3MB file):
> > > > > > > http://www.campbellferrara.com/heating-4proc.out
> > > > > > >
> > > > > > > and 10 here (smaller file):
> > > > > > > http://www.campbellferrara.com/heating-10proc.out
> > > > > > >
> > > > > > > The script file that was used to run the simulation can be found here
> > > > > > > (smaller file):
> > > > > > > http://www.campbellferrara.com/heating.namd
> > > > > > >
> > > > > > > I would be grateful for any other suggestions.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Charles
> > > > > > >
> > > > > > > The details of what I have tried to reach this conclusion:
> > > > > > > 1. Minimization on the protein in a vacuum as Brian recommended. This
> > > > > > > ran through 2k steps and completed successfully.
> > > > > > > 2. I realized that there may be some overlapping water molecules since
> > > > > > > solvate does not delete these using the minmax option. I deleted all
> > > > > > > of them and then minimized the system. The minimization ran for 2k
> > > > > > > steps and completed successfully. The system minimized to a gradient
> > > > > > > of ~35. I was trying to get it under 5, as has been recommended on
> > > > > > > this mailing list, so I loaded the restart files to minimize for
> > > > > > > another thousand steps, and the system died after 199 steps.
> > > > > > > 3. The previous step saved coordinates in ASCII. I wanted to see what
> > > > > > > would happen if I tried binary. I restarted a simulation from the
> > > > > > > 1500 step binary save file. The simulation died after 299 steps.
> > > > > > > 4. I tried running the simulation from the beginning for 5000 steps
> > > > > > > (the same as had just run successfully) and the system died after 199
> > > > > > > steps.
> > > > > > > 5. Loading the minimized system coordinates into VMD, the system
> > > > > > > looks inside out. I fixed the periodic boundary condition settings
> > > > > > > so that the box fits around the entire system (measured with minmax
> > > > > > > and center in VMD). I started the minimization from the beginning. It
> > > > > > > died after 199 steps.
> > > > > > > 6. I spoke to my administrator about memory and he said that no
> > > > > > > application can draw more than 4GB.
> > > > > > > 7. I tried the minimization with 1 processor (generally I use 10). It
> > > > > > > took forever, but worked.
> > > > > > > 8. A heating script works on 4 processors, but dies after step 199 on
> > > > > > > 10 processors.
> > > > > > > 9. Similarly, an equilibration script seems to work with 4 processors,
> > > > > > > but dies with any number higher than 5.
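
Item 5 in the list above sets the periodic cell from the extent of the system. A minimal sketch of that calculation follows, assuming an orthorhombic box and using the midpoint of the bounding box as the cell origin (VMD's "measure center" returns the average atom position instead, which can differ slightly). The coordinates in the example are made up for illustration, and the optional padding is my own addition.

```python
def cell_from_minmax(min_xyz, max_xyz, padding=0.0):
    """Return (basis_vectors, origin) for an orthorhombic periodic cell.

    min_xyz and max_xyz are the corners of the bounding box, e.g. as
    reported by `measure minmax` in VMD. `padding` is added on every side.
    """
    lengths = [hi - lo + 2.0 * padding for lo, hi in zip(min_xyz, max_xyz)]
    origin = [(lo + hi) / 2.0 for lo, hi in zip(min_xyz, max_xyz)]
    basis = [
        (lengths[0], 0.0, 0.0),
        (0.0, lengths[1], 0.0),
        (0.0, 0.0, lengths[2]),
    ]
    return basis, origin


def namd_cell_lines(min_xyz, max_xyz, padding=0.0):
    """Format the cell as NAMD configuration lines."""
    basis, origin = cell_from_minmax(min_xyz, max_xyz, padding)
    lines = [
        f"cellBasisVector{i}  {v[0]:.3f} {v[1]:.3f} {v[2]:.3f}"
        for i, v in enumerate(basis, start=1)
    ]
    lines.append("cellOrigin  {:.3f} {:.3f} {:.3f}".format(*origin))
    return lines


if __name__ == "__main__":
    # Hypothetical bounding box, in Angstroms.
    for line in namd_cell_lines((-40.0, -40.0, -45.0), (40.0, 40.0, 45.0)):
        print(line)
```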
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 3 Nov 2004 11:05:56 -0800 (PST), Brian Bennion
> > > > > > > <brian_at_youkai.llnl.gov> wrote:
> > > > > > > > Hello Charles,
> > > > > > > >
> > > > > > > > A little background...
> > > > > > > > Namd requires charm++ to compile correctly, so the natural order is that
> > > > > > > > charm++ is compiled first and then namd is compiled against it. The fact
> > > > > > > > that namd runs at all on your system would suggest that charm++ has been
> > > > > > > > compiled at some point.
> > > > > > > >
> > > > > > > > I am not familiar with the Sun SPARC setup, but charmrun may be used
> > > > > > > > here to propagate the job through the nodes.
> > > > > > > >
> > > > > > > > Can anyone comment here?
> > > > > > > >
> > > > > > > > Steps that I can recommend....
> > > > > > > > Try just minimizing the protein alone in vacuo for 200+ steps?
> > > > > > > > Are you sure that the total system is 58,236 atoms? That seems small for
> > > > > > > > such a complex box.
> > > > > > > >
> > > > > > > > Can you send the whole log file from startup to crash?
> > > > > > > > There just might not be enough memory? But I would think that this would
> > > > > > > > manifest itself earlier.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Brian
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, 3 Nov 2004, Charles Danko wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > Thanks to Brian and Dr. Valencia for their help.
> > > > > > > > >
> > > > > > > > > The machines are a cluster of Sun SPARC 64 bit processors running
> > > > > > > > > Solaris 7. I am using bsub for multithreading. My administrator may
> > > > > > > > > let me run charm++ if you think that it may solve the problem, but
> > > > > > > > > there may be some good reason that it wasn't used before (namd was
> > > > > > > > > compiled by a colleague of mine, and I am not sure of specific issues
> > > > > > > > > he faced when putting it together).
> > > > > > > > >
> > > > > > > > > The system is a protein, lipid, and water system, in total, 58,236
> > > > > > > > > atoms constructed from a protein homology model. The system was
> > > > > > > > > assembled using VMD, the membrane 1.0 plug-in, and solvate 1.2 (to
> > > > > > > > > solvate the top and bottom where the protein was sticking out of the
> > > > > > > > > pre-equilibrated lipid-water system constructed by membrane). I
> > > > > > > > > deleted all atoms within 1A of the protein and am now trying to
> > > > > > > > > minimize the system.
> > > > > > > > >
> > > > > > > > > Based on Dr. Valencia and Dr. Bennion's suggestions I changed the
> > > > > > > > > script file. I adapted the one intended to heat the system after the
> > > > > > > > > minimization. I have included the new script file as an attachment.
> > > > > > > > > The run still crashes after 199 steps, but this time it returns a
> > > > > > > > > malloc error. Short by 2GB?
> > > > > > > > > The last part of the output is pasted below. Many of the forces are
> > > > > > > > > positive again.
> > > > > > > > >
> > > > > > > > > I have tried to fix the protein and minimize the water/lipids; the
> > > > > > > > > output is pasted below. The system lasted for 299 steps this time,
> > > > > > > > > but received the same malloc error.
> > > > > > > > >
> > > > > > > > > I have NOT deleted the atoms which fall outside of my periodic
> > > > > > > > > boundary. If you recommend I will do this and try to run the new
> > > > > > > > > script again. I am acting under the assumption that these atoms will
> > > > > > > > > be ignored.
> > > > > > > > > Is this correct?
> > > > > > > > >
> > > > > > > > > Because the problem seems to be a memory allocation error, I am
> > > > > > > > > thinking that the next step will be to try to convince my
> > > > > > > > > administrator to compile charm++.
> > > > > > > > > Any thoughts or suggestions?
> > > > > > > > > Do I need to recompile all of namd, or can I just compile charm++ without it?
> > > > > > > > >
> > > > > > > > > Thanks again for all of the help,
> > > > > > > > > Charles
> > > > > > > > >
> > > > > > > > > Output files:
> > > > > > > > >
> > > > > > > > > New script, no atoms fixed.
> > > > > > > > >
> > > > > > > > > BRACKET: 6.57916e-07 652.946 -2.45009e+09 -8.45313e+07 9.29531e+08
> > > > > > > > > ENERGY: 198 522579.9239 151303.8494 10858.5910 1446.8211
> > > > > > > > > -80557.5403 481695.6698 0.0000 0.0000 0.0000
> > > > > > > > > 1087327.3149 0.0000 1087327.3149 1087327.3149 0.0000
> > > > > > > > > 188642.4062 235104.9289 576000.0000 188642.4062 235104.9289
> > > > > > > > >
> > > > > > > > > BRACKET: 1.6835e-07 70.1294 -8.45313e+07 1.17645e+07 9.29531e+08
> > > > > > > > > ENERGY: 199 522585.2964 151303.6975 10858.5915 1446.8152
> > > > > > > > > -80557.4059 481690.3089 0.0000 0.0000 0.0000
> > > > > > > > > 1087327.3036 0.0000 1087327.3036 1087327.3036 0.0000
> > > > > > > > > 188639.2161 235101.2267 576000.0000 188639.2161 235101.2267
> > > > > > > > >
> > > > > > > > > LDB: LOAD: AVG 231.478 MAX 291.895 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
> > > > > > > > > LDB: LOAD: AVG 231.478 MAX 255.756 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
> > > > > > > > > LDB: LOAD: AVG 231.478 MAX 236.106 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
> > > > > > > > > Could not malloc() 2118274080 bytes--are we out of memory?Fatal error, aborting.
> > > > > > > > > Rtasks fail:
> > > > > > > > > Rtask(s) 1 : exited with signal <6>
> > > > > > > > > Rtask(s) 3 2 4 5 8 6 7 10 9 : exited with signal <15>
> > > > > > > > > Rtask(s) 1 : coredump
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > New Script, Fixed Protein
> > > > > > > > >
> > > > > > > > > BRACKET: 1.64649e-05 26875.6 -8.15248e+09 -2.11699e+09 7.56124e+09
> > > > > > > > > ENERGY: 298 246811.3244 127002.1868 7801.5138 776.0334
> > > > > > > > > -110553.6799 343151.2350 0.0000 0.0000 0.0000
> > > > > > > > > 614988.6135 0.0000 614988.6135 614988.6135 0.0000
> > > > > > > > > 156657.9925 177933.1586 576000.0000 156657.9925 177933.1586
> > > > > > > > >
> > > > > > > > > BRACKET: 8.23246e-06 12090.3 -2.11699e+09 -9.70546e+08 7.56124e+09
> > > > > > > > > ENERGY: 299 245766.2529 126976.5313 7802.2252 775.5543
> > > > > > > > > -110592.3262 343704.2276 0.0000 0.0000 0.0000
> > > > > > > > > 614432.4651 0.0000 614432.4651 614432.4651 0.0000
> > > > > > > > > 156870.7170 178776.2517 576000.0000 156870.7170 178776.2517
> > > > > > > > >
> > > > > > > > > LDB: LOAD: AVG 212.831 MAX 217.851 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
> > > > > > > > > LDB: LOAD: AVG 212.831 MAX 216.577 MSGS: TOTAL 184 MAXC 20 MAXP 5 Refine
> > > > > > > > > Could not malloc()--are we out of memory?Fatal error, aborting.
> > > > > > > > > Rtasks fail:
> > > > > > > > > Rtask(s) 1 : exited with signal <6>
> > > > > > > > > Rtask(s) 3 2 4 5 6 8 7 9 10 : exited with signal <15>
> > > > > > > > > Rtask(s) 1 : coredump
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, 02 Nov 2004 13:22:50 -0600 (CST), J. Valencia
> > > > > > > > > <jonathan_at_ibt.unam.mx> wrote:
> > > > > > > > > > Also, for par_all27_prot_lipid.prm the suggested cutoff scheme is:
> > > > > > > > > > switchdist 10.0
> > > > > > > > > > cutoff 12.0
> > > > > > > > > > pairlistdist 14.0
> > > > > > > > > > This is stated almost at the end of the file.
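
The three distances above are expected to be ordered, with the switching distance inside the cutoff and the pair list distance at or beyond it. A tiny check like the following can catch a mistyped value before a run is submitted. It is a sketch of that ordering rule only, and the function name is hypothetical.

```python
def check_cutoffs(switchdist, cutoff, pairlistdist):
    """Return a list of problems with the three distances (empty if none)."""
    problems = []
    if not switchdist < cutoff:
        problems.append("switchdist should be smaller than cutoff")
    if not cutoff <= pairlistdist:
        problems.append("pairlistdist should be at least as large as cutoff")
    return problems


if __name__ == "__main__":
    # Values suggested in the message above.
    print(check_cutoffs(10.0, 12.0, 14.0) or "distances are consistent")
```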
> > > > > > > > > >
> > > > > > > > > > Good luck!
> > > > > > > > > >
> > > > > > > > > > J. Valencia.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > *****************************************************************
> > > > > > > > **Brian Bennion, Ph.D. **
> > > > > > > > **Computational and Systems Biology Division **
> > > > > > > > **Biology and Biotechnology Research Program **
> > > > > > > > **Lawrence Livermore National Laboratory **
> > > > > > > > **P.O. Box 808, L-448 bennion1_at_llnl.gov **
> > > > > > > > **7000 East Avenue phone: (925) 422-5722 **
> > > > > > > > **Livermore, CA 94550 fax: (925) 424-6605 **
> > > > > > > > *****************************************************************
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

*****************************************************************
**Brian Bennion, Ph.D. **
**Computational and Systems Biology Division **
**Biology and Biotechnology Research Program **
**Lawrence Livermore National Laboratory **
**P.O. Box 808, L-448 bennion1_at_llnl.gov **
**7000 East Avenue phone: (925) 422-5722 **
**Livermore, CA 94550 fax: (925) 424-6605 **
*****************************************************************
