Re: System minimization: fail after 199 steps

From: Charles Danko (dankoc_at_gmail.com)
Date: Mon Nov 15 2004 - 11:10:40 CST

Next message: Charles McCallum: "Re: gcc and G5 Xserves"
Previous message: Brian Bennion: "Re: fatal error running namd"
In reply to: Brian Bennion: "Re: System minimization: fail after 199 steps"

Hi, Brian,

I haven't recompiled charm++, and I am still using bsub to submit the
job. I did talk to the person who compiled NAMD and charm++, and he
said that when he used charmrun to submit the job, the program died
without starting the first step. He did not keep the output files.
If you would like to see that output for debugging NAMD, I would be
happy to find the charmrun installation and try it.

However, for our short-term purposes, turning off load balancing
seems to be sufficient. NAMD ran for 220k steps before it died with
message (30). Over the 220k-step run, the program did not noticeably
slow down. For lack of a more rigorous way to track simulation speed,
I redirected standard output to a file and tracked the change in file
size over time. Since NAMD writes roughly the same amount of
information at each step, one expects a linear relationship between
time and file size for a simulation running at constant speed. Over
the 220k steps the relationship between time and file size was linear
(R^2 = 0.9993). The fit with an exponential was no better. The time
points I used are pasted below (from Excel).

Since molecular dynamics is not my main project, I do not have much
time to devote every day to optimizing NAMD on my system. Thus,
unless more problems come up later, I am happy with the solution of
turning off the load balancing feature. That said, I would be
happy to cooperate with the developers if they want any information
from my system to help them optimize NAMD for other users having
similar problems.

Thanks again for all the help!
Charles

Time            File Size
11/11/04 15:16     180185
11/11/04 15:32     532519
11/11/04 16:04    1233043
11/11/04 16:17    1513125
11/11/04 16:55    2307435
11/11/04 17:45    3357264
11/11/04 21:54    8281933
11/11/04 23:35   10230066
11/12/04 10:28   22479821
11/12/04 11:12   23272536
11/12/04 12:03   24239744
11/12/04 13:43   26083328
11/12/04 15:19   27885568
11/12/04 17:44   30579787
11/14/04 11:25   74581205
11/14/04 16:06   79353189
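
For anyone who wants to repeat the check without Excel, here is a
minimal Python sketch of the same idea: convert the time points above
to elapsed minutes, fit a straight line by ordinary least squares, and
report R^2. The names SAMPLES and linear_fit are made up for this
illustration, and the unit of the file size is not stated in the
original numbers, so the growth rate is simply "size units per minute".

```python
from datetime import datetime

# Time points and log-file sizes copied from the table in this message.
# The unit of the file size is not stated in the post.
SAMPLES = [
    ("11/11/04 15:16", 180185),
    ("11/11/04 15:32", 532519),
    ("11/11/04 16:04", 1233043),
    ("11/11/04 16:17", 1513125),
    ("11/11/04 16:55", 2307435),
    ("11/11/04 17:45", 3357264),
    ("11/11/04 21:54", 8281933),
    ("11/11/04 23:35", 10230066),
    ("11/12/04 10:28", 22479821),
    ("11/12/04 11:12", 23272536),
    ("11/12/04 12:03", 24239744),
    ("11/12/04 13:43", 26083328),
    ("11/12/04 15:19", 27885568),
    ("11/12/04 17:44", 30579787),
    ("11/14/04 11:25", 74581205),
    ("11/14/04 16:06", 79353189),
]


def linear_fit(xs, ys):
    """Least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-variable linear fit, R^2 is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Elapsed wall-clock time in minutes since the first sample.
times = [datetime.strptime(stamp, "%m/%d/%y %H:%M") for stamp, _ in SAMPLES]
elapsed_min = [(t - times[0]).total_seconds() / 60.0 for t in times]
sizes = [float(size) for _, size in SAMPLES]

slope, intercept, r_squared = linear_fit(elapsed_min, sizes)
print(f"growth rate: {slope:.0f} per minute, R^2 = {r_squared:.4f}")
```

A roughly constant slope is what one would expect if the simulation
speed is not degrading; a run that slowed down progressively would show
the file growing more and more slowly and a poorer linear fit.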

On Thu, 11 Nov 2004 08:18:49 -0800 (PST), Brian Bennion
<brian_at_youkai.llnl.gov> wrote:
> Hello Charles,
>
> I don't want to rain on your parade, I am glad that you have had some
> success, but the fact that turning this off allows the program to run is a
> wee bit scary!
>
> A good portion of the speed of NAMD comes from balancing its load
> across the processes. I would predict that after a certain number of
> steps the calculations will become exceedingly slow.
>
> Jim, the lead developer, is away at SC2004 in Pittsburgh. Hopefully he can
> address this issue when he returns.
>
> I am still wondering if this is not a problem with charm++. Did you get a
> chance to recompile it or to check the megatests across >5 nodes?
>
> Regards
> Brian
>
>
>
>
> On Thu, 11 Nov 2004, Charles Danko wrote:
>
> > Hi, Brian,
> >
> > Yes! Turning off load balancing seems to work fine! None of the
> > other algorithms seem to work. I assume that the program I am using
> > to start NAMD will perform load balancing, but I am going to check
> > with my administrator to be sure. Excellent suggestion!! Thanks
> > again!
> >
> > Best wishes,
> > Charles
> >
> > For those who are having the same problem, load balancing is an
> > undocumented feature. Those using the SUN OS and bsub to start NAMD
> > will most likely need to turn this feature off. In NAMD 2.5, it can
> > be turned off like so:
> > ldbStrategy none
> >
> > The other acceptable parameters are:
> > refineonly
> > alg7
> > orb
> > neighbor
> > other - this seemed to be the same as alg7
> >
> > Good luck!
> >
> >
> > On Tue, 9 Nov 2004 13:40:39 -0800 (PST), Brian Bennion
> > <brian_at_youkai.llnl.gov> wrote:
> > > Hello Charles
> > >
> > > You might be able to set the output timing to a larger number, but I don't
> > > know if that will stop the initial timing entry in the log.
> > >
> > > The number of steps before load balancing occurs can be changed and is, at
> > > last knowledge, an undocumented feature. You can turn it off, change the
> > > type of algorithm, as well as the number of steps between load-balancing
> > > efforts.
> > >
> > > Look at the source code on the namd website under simparameters.C
> > > Brian
> > >
> > >
> > >
> > >
> > > On Tue, 9 Nov 2004, Charles Danko wrote:
> > >
> > > > Hi,
> > > >
> > > > I do not believe that the problem is in the amount of memory. The
> > > > NAMD users guide says that the program draws, at maximum, 300MB for a
> > > > system over 100,000 atoms. My system is approximately half that size.
> > > > Raising the "pairlistminprocs" parameter does not seem to alter the
> > > > error message. Finally, on our system each task is allowed to take
> > > > 4GB of memory. Even if each processor draws 300MB, a 10 processor
> > > > simulation will only draw 3GB - safely below the maximum.
> > > >
> > > > The system seems to die when NAMD is run with more than 4 processors
> > > > for this particular system. It always seems to die when the program
> > > > is estimating the memory usage, and the length of time for 1ns of
> > > > simulation. Is there some way that I can force it to skip these
> > > > steps? I have looked through the manual but haven't seen anything
> > > > useful.
> > > >
> > > > Simulation output for 4 processors can be found here (3MB file):
> > > > http://www.campbellferrara.com/heating-4proc.out
> > > >
> > > > and 10 here (smaller file):
> > > > http://www.campbellferrara.com/heating-10proc.out
> > > >
> > > > The script file that was used to run the simulation can be found here
> > > > (smaller file):
> > > > http://www.campbellferrara.com/heating.namd
> > > >
> > > > > I would be grateful for any other suggestions.
> > > >
> > > > Thanks,
> > > > Charles
> > > >
> > > > The details of what I have tried to reach this conclusion:
> > > > 1. Minimization on the protein in a vacuum as Brian recommended. This
> > > > ran through 2k steps and completed successfully.
> > > > 2. I realized that there may be some overlapping water molecules since
> > > > solvate does not delete these using the minmax option. I deleted all
> > > > of them and then minimized the system. The minimization ran for 2k
> > > > steps and completed successfully. The system minimized to a gradient
> > > > of ~35. I was trying to get it under 5 as has been recommended on
> > > > > this mailing list, so I loaded the restart files to minimize for another
> > > > > thousand steps, and the system died after 199 steps.
> > > > 3. The previous step had ASCII save coordinates. I wanted to see what
> > > > would happen if I tried binary. I restarted a simulation from the
> > > > 1500 step binary save file. The simulation died after 299 steps.
> > > > 4. I tried running the simulation from the beginning for 5000 steps
> > > > > (the same as had just run successfully) and the system died after 199
> > > > steps.
> > > > 5. Loading the minimized system coordinates in to VMD, the system
> > > > > looks inside out. I fixed the periodic boundary condition settings
> > > > > so that the box fits around the entire system (measured minmax
> > > > > and center in VMD). I started the minimization from the beginning. It
> > > > > died after 199 steps.
> > > > 6. I spoke to my administrator about memory and he said that no
> > > > application can draw more than 4GB.
> > > > 7. I tried the minimization with 1 processor (generally I use 10). It
> > > > took forever, but worked.
> > > > > 8. A heating script works on 4 processors, but dies after step 199 on
> > > > > 10 processors.
> > > > > 9. Similarly, an equilibration script seems to work with 4 processors,
> > > > > but dies with any number higher than 5.
> > > >
> > > >
> > > >
> > > > On Wed, 3 Nov 2004 11:05:56 -0800 (PST), Brian Bennion
> > > > <brian_at_youkai.llnl.gov> wrote:
> > > > > Hello Charles,
> > > > >
> > > > > A little background...
> > > > > Namd requires charm++ to compile correctly, so the natural order is that
> > > > > charm++ is compiled first and then namd is compiled against it. The fact
> > > > > that namd runs at all on your system would suggest that charm++ has been
> > > > > compiled at some point.
> > > > >
> > > > > > I am not familiar with the Sun SPARC setup, but charmrun may be used here
> > > > > to propagate the job through the nodes.
> > > > >
> > > > > Can anyone comment here?
> > > > >
> > > > > Steps that I can recommend....
> > > > > Try just minimizing the protein alone in vacuo for 200+ steps?
> > > > > Are you sure that the total system is 58,236 atoms? That seems small for
> > > > > such a complex box.
> > > > >
> > > > > Can you send the whole log file from startup to crash?
> > > > > There just might not be enough memory? But I would think that this would
> > > > > manifest itself earlier.
> > > > >
> > > > > Thanks
> > > > > Brian
> > > > >
> > > > >
> > > > >
> > > > > On Wed, 3 Nov 2004, Charles Danko wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Thanks to Brian and Dr. Valencia for their help.
> > > > > >
> > > > > > The machines are a cluster of Sun SPARC 64 bit processors running
> > > > > > Solaris 7. I am using bsub for multithreading. My administrator may
> > > > > > > let me run charm++ if you think that it may solve the problem, but
> > > > > > there may be some good reason that it wasn't used before (namd was
> > > > > > compiled by a colleague of mine, and I am not sure of specific issues
> > > > > > he faced when putting it together).
> > > > > >
> > > > > > The system is a protein, lipid, and water system, in total, 58,236
> > > > > > atoms constructed from a protein homology model. The system was
> > > > > > assembled using VMD, the membrane 1.0 plug-in, and solvate 1.2 (to
> > > > > > solvate the top and bottom where the protein was sticking out of the
> > > > > > pre-equilibrated lipid-water system constructed by membrane). I
> > > > > > deleted all atoms within 1A of the protein and am now trying to
> > > > > > minimize the system.
> > > > > >
> > > > > > Based on Dr. Valencia and Dr. Bennion's suggestions I changed the
> > > > > > script file. I adapted the one intended to heat the system after the
> > > > > > minimization. I have included the new script file as an attachment.
> > > > > > The run still crashes after 199 steps, but this time it returns a
> > > > > > malloc error. Short by 2GB?
> > > > > > The last part of the output is pasted below. Many of the forces are
> > > > > > positive again.
> > > > > >
> > > > > > I have tried to fix the protein and minimize the water/lipids; the
> > > > > > output is pasted below. The system lasted for 299 steps this time,
> > > > > > but received the same malloc error.
> > > > > >
> > > > > > I have NOT deleted the atoms which fall outside of my periodic
> > > > > > boundary. If you recommend I will do this and try to run the new
> > > > > > script again. I am acting under the assumption that these atoms will
> > > > > > be ignored.
> > > > > > > Is this correct?
> > > > > >
> > > > > > Because the problem seems to be a memory allocation error, I am
> > > > > > > thinking that the next step will be trying to convince my
> > > > > > > administrator to compile charm++.
> > > > > > > Any thoughts or suggestions?
> > > > > > > Do I need to recompile all of NAMD, or can I just compile charm++ without it?
> > > > > >
> > > > > > Thanks again for all of the help,
> > > > > > Charles
> > > > > >
> > > > > > Output files:
> > > > > >
> > > > > > New script, no atoms fixed.
> > > > > >
> > > > > > BRACKET: 6.57916e-07 652.946 -2.45009e+09 -8.45313e+07 9.29531e+08
> > > > > > ENERGY: 198 522579.9239 151303.8494 10858.5910 1446.8211
> > > > > > -80557.5403 481695.6698 0.0000 0.0000 0.0000
> > > > > > 1087327.3149 0.0000 1087327.3149 1087327.3149 0.0000
> > > > > > 188642.4062 235104.9289 576000.0000 188642.4062 235104.9289
> > > > > >
> > > > > > BRACKET: 1.6835e-07 70.1294 -8.45313e+07 1.17645e+07 9.29531e+08
> > > > > > ENERGY: 199 522585.2964 151303.6975 10858.5915 1446.8152
> > > > > > -80557.4059 481690.3089 0.0000 0.0000 0.0000
> > > > > > 1087327.3036 0.0000 1087327.3036 1087327.3036 0.0000
> > > > > > 188639.2161 235101.2267 576000.0000 188639.2161 235101.2267
> > > > > >
> > > > > > LDB: LOAD: AVG 231.478 MAX 291.895 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
> > > > > > LDB: LOAD: AVG 231.478 MAX 255.756 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
> > > > > > LDB: LOAD: AVG 231.478 MAX 236.106 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
> > > > > > Could not malloc() 2118274080 bytes--are we out of memory?Fatal error, aborting.
> > > > > > Rtasks fail:
> > > > > > Rtask(s) 1 : exited with signal <6>
> > > > > > Rtask(s) 3 2 4 5 8 6 7 10 9 : exited with signal <15>
> > > > > > Rtask(s) 1 : coredump
> > > > > > >
> > > > > >
> > > > > > New Script, Fixed Protein
> > > > > >
> > > > > > BRACKET: 1.64649e-05 26875.6 -8.15248e+09 -2.11699e+09 7.56124e+09
> > > > > > ENERGY: 298 246811.3244 127002.1868 7801.5138 776.0334
> > > > > > -110553.6799 343151.2350 0.0000 0.0000 0.0000
> > > > > > 614988.6135 0.0000 614988.6135 614988.6135 0.0000
> > > > > > 156657.9925 177933.1586 576000.0000 156657.9925 177933.1586
> > > > > >
> > > > > > BRACKET: 8.23246e-06 12090.3 -2.11699e+09 -9.70546e+08 7.56124e+09
> > > > > > ENERGY: 299 245766.2529 126976.5313 7802.2252 775.5543
> > > > > > -110592.3262 343704.2276 0.0000 0.0000 0.0000
> > > > > > 614432.4651 0.0000 614432.4651 614432.4651 0.0000
> > > > > > 156870.7170 178776.2517 576000.0000 156870.7170 178776.2517
> > > > > >
> > > > > > LDB: LOAD: AVG 212.831 MAX 217.851 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
> > > > > > LDB: LOAD: AVG 212.831 MAX 216.577 MSGS: TOTAL 184 MAXC 20 MAXP 5 Refine
> > > > > > Could not malloc()--are we out of memory?Fatal error, aborting.
> > > > > > Rtasks fail:
> > > > > > Rtask(s) 1 : exited with signal <6>
> > > > > > Rtask(s) 3 2 4 5 6 8 7 9 10 : exited with signal <15>
> > > > > > Rtask(s) 1 : coredump
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, 02 Nov 2004 13:22:50 -0600 (CST), J. Valencia
> > > > > > <jonathan_at_ibt.unam.mx> wrote:
> > > > > > > Also, for par_all27_prot_lipid.prm the suggested cutoff scheme is:
> > > > > > > switchdist 10.0
> > > > > > > cutoff 12.0
> > > > > > > pairlistdist 14.0
> > > > > > > This is stated almost at the end of the file.
> > > > > > >
> > > > > > > Good luck!
> > > > > > >
> > > > > > > J. Valencia.
> > > > > > >
> > > > > >
> > > > >
> > > > > *****************************************************************
> > > > > **Brian Bennion, Ph.D. **
> > > > > **Computational and Systems Biology Division **
> > > > > **Biology and Biotechnology Research Program **
> > > > > **Lawrence Livermore National Laboratory **
> > > > > **P.O. Box 808, L-448 bennion1_at_llnl.gov **
> > > > > **7000 East Avenue phone: (925) 422-5722 **
> > > > > **Livermore, CA 94550 fax: (925) 424-6605 **
> > > > > *****************************************************************
> > > > >
> > > > >
> > > >
> > >
> >
>


This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:38:59 CST