From: Brian Bennion (brian_at_youkai.llnl.gov)
Date: Tue Nov 09 2004 - 15:40:39 CST
Hello Charles
You might be able to set the output timing to a larger number, but I don't
know if that will stop the initial timing entry in the log.
The number of steps before load balancing occurs can be changed; as far as I
know, this is an undocumented feature. You can turn load balancing off, change
the type of algorithm, and change the number of steps between load-balancing
attempts.
Look at SimParameters.C in the source code on the NAMD website.
Brian
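
A rough way to check whether the crash coincides with a load-balancing step
is to scan the log for the last ENERGY line, any LDB lines after it, and the
malloc failure. The Python sketch below is only an illustration: the function
name is made up, and it assumes the log lines look like the excerpts quoted
further down in this thread.

```python
import re


def last_step_and_ldb(log_text):
    """Return (last ENERGY step, LDB strategies logged after it, malloc failure seen)."""
    last_step = None
    strategies = []
    malloc_failed = False
    for raw in log_text.splitlines():
        # Strip e-mail quote markers such as "> > > " so pasted excerpts also parse.
        line = re.sub(r"^(>\s?)+", "", raw).strip()
        match = re.match(r"ENERGY:\s+(\d+)", line)
        if match:
            last_step = int(match.group(1))
            strategies = []  # keep only LDB lines that follow the latest ENERGY line
        elif line.startswith("LDB:"):
            strategies.append(line.split()[-1])  # last field: None, Alg7, Refine, ...
        elif "Could not malloc()" in line:
            malloc_failed = True
    return last_step, strategies, malloc_failed


# Shortened excerpt of the 10-processor output quoted later in the thread.
sample = """\
ENERGY: 199 522585.2964 151303.6975 10858.5915 1446.8152
LDB: LOAD: AVG 231.478 MAX 291.895 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
LDB: LOAD: AVG 231.478 MAX 255.756 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
LDB: LOAD: AVG 231.478 MAX 236.106 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
Could not malloc() 2118274080 bytes--are we out of memory?Fatal error, aborting.
"""

print(last_step_and_ldb(sample))
```

If LDB lines do sit directly between the last ENERGY line and the malloc
failure, the load-balancing settings in SimParameters.C are a reasonable
place to look first.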
On Tue, 9 Nov 2004, Charles Danko wrote:
> Hi,
>
> I do not believe that the problem is in the amount of memory. The
> NAMD users guide says that the program draws, at maximum, 300MB for a
> system over 100,000 atoms. My system is approximately half that size.
> Raising the "pairlistminprocs" parameter does not seem to alter the
> error message. Finally, on our system each task is allowed to take
> 4GB of memory. Even if each processor draws 300MB, a 10 processor
> simulation will only draw 3GB - safely below the maximum.
>
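
The arithmetic in the paragraph above can be compared with the size of the
allocation that actually failed (the 2118274080-byte malloc() request in the
log quoted further down). A minimal sketch in Python, assuming decimal MB and
GB for the quoted figures; the variable names are illustrative only.

```python
MB = 10**6
GB = 10**9
GIB = 2**30

per_task_estimate = 300 * MB      # users guide figure cited above
tasks = 10                        # processors in the failing run
per_task_limit = 4 * GB           # per-task limit reported for this cluster
failed_request = 2_118_274_080    # bytes, from the malloc() error in the quoted log

total_estimate = tasks * per_task_estimate
request_gib = failed_request / GIB
request_vs_estimate = failed_request / per_task_estimate

print(f"estimated total for {tasks} tasks: {total_estimate / GB:.1f} GB")
print(f"failed single request: {request_gib:.2f} GiB")
print(f"failed request / per-task estimate: {request_vs_estimate:.1f}x")
print(f"request alone is under the per-task limit: {failed_request < per_task_limit}")
```

Under these assumptions the failed request is a single allocation of roughly
1.97 GiB, about seven times the 300MB per-process figure, so the 3GB total
estimate does not by itself rule out a memory problem.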
> The system seems to die when NAMD is run with more than 4 processors
> for this particular system. It always seems to die when the program
> is estimating the memory usage, and the length of time for 1ns of
> simulation. Is there some way that I can force it to skip these
> steps? I have looked through the manual but haven't seen anything
> useful.
>
> Simulation output for 4 processors can be found here (3MB file):
> http://www.campbellferrara.com/heating-4proc.out
>
> and 10 here (smaller file):
> http://www.campbellferrara.com/heating-10proc.out
>
> The script file that was used to run the simulation can be found here
> (smaller file):
> http://www.campbellferrara.com/heating.namd
>
> I would be grateful for any other suggestions.
>
> Thanks,
> Charles
>
> The details of what I have tried to reach this conclusion:
> 1. Minimization on the protein in a vacuum as Brian recommended. This
> ran through 2k steps and completed successfully.
> 2. I realized that there may be some overlapping water molecules since
> solvate does not delete these using the minmax option. I deleted all
> of them and then minimized the system. The minimization ran for 2k
> steps and completed successfully. The system minimized to a gradient
> of ~35. I was trying to get it under 5 as has been recommended on
> this mailing list, so I loaded the restart files to minimize for another
> thousand steps, and the system died after 199 steps.
> 3. The previous step had ASCII save coordinates. I wanted to see what
> would happen if I tried binary. I restarted a simulation from the
> 1500 step binary save file. The simulation died after 299 steps.
> 4. I tried running the simulation from the beginning for 5000 steps
> (the same as had just run successfully) and the system died after 199
> steps.
> 5. Loading the minimized system coordinates into VMD, the system
> looked inside out. I fixed the periodic boundary condition settings
> so that the box fits around the entire system (measured with minmax
> and center in VMD). I started the minimization from the beginning. It
> died after 199 steps.
> 6. I spoke to my administrator about memory and he said that no
> application can draw more than 4GB.
> 7. I tried the minimization with 1 processor (generally I use 10). It
> took forever, but worked.
> 8. A heating script works on 4 processors, but dies after step 199 on
> 10 processors.
> 9. Similarly, an equilibration script seems to work with 4 processors,
> but dies with any number higher than 5.
>
>
>
> On Wed, 3 Nov 2004 11:05:56 -0800 (PST), Brian Bennion
> <brian_at_youkai.llnl.gov> wrote:
> > Hello Charles,
> >
> > A little background...
> > Namd requires charm++ to compile correctly, so the natural order is that
> > charm++ is compiled first and then namd is compiled against it. The fact
> > that namd runs at all on your system would suggest that charm++ has been
> > compiled at some point.
> >
> > I am not familiar with the Sun SPARC setup, but charmrun may be used here
> > to propagate the job through the nodes.
> >
> > Can anyone comment here?
> >
> > Steps that I can recommend....
> > Try just minimizing the protein alone in vacuo for 200+ steps?
> > Are you sure that the total system is 58,236 atoms? That seems small for
> > such a complex box.
> >
> > Can you send the whole log file from startup to crash?
> > There just might not be enough memory? But I would think that this would
> > manifest itself earlier.
> >
> > Thanks
> > Brian
> >
> >
> >
> > On Wed, 3 Nov 2004, Charles Danko wrote:
> >
> > > Hi,
> > >
> > > Thanks to Brian and Dr. Valencia for their help.
> > >
> > > The machines are a cluster of Sun SPARC 64 bit processors running
> > > Solaris 7. I am using bsub for multithreading. My administrator may
> > > let me run charm++ if you think that it may solve the problem, but
> > > there may be some good reason that it wasn't used before (namd was
> > > compiled by a colleague of mine, and I am not sure of specific issues
> > > he faced when putting it together).
> > >
> > > The system is a protein, lipid, and water system, in total, 58,236
> > > atoms constructed from a protein homology model. The system was
> > > assembled using VMD, the membrane 1.0 plug-in, and solvate 1.2 (to
> > > solvate the top and bottom where the protein was sticking out of the
> > > pre-equilibrated lipid-water system constructed by membrane). I
> > > deleted all atoms within 1A of the protein and am now trying to
> > > minimize the system.
> > >
> > > Based on Dr. Valencia and Dr. Bennion's suggestions I changed the
> > > script file. I adapted the one intended to heat the system after the
> > > minimization. I have included the new script file as an attachment.
> > > The run still crashes after 199 steps, but this time it returns a
> > > malloc error. Short by 2GB?
> > > The last part of the output is pasted below. Many of the forces are
> > > positive again.
> > >
> > > I have tried to fix the protein and minimize the water/lipids; the
> > > output is pasted below. The system lasted for 299 steps this time,
> > > but received the same malloc error.
> > >
> > > I have NOT deleted the atoms which fall outside of my periodic
> > > boundary. If you recommend I will do this and try to run the new
> > > script again. I am acting under the assumption that these atoms will
> > > be ignored.
> > > Is this correct?
> > >
> > > Because the problem seems to be a memory allocation error, I am
> > > thinking that the next step will be to try to convince my
> > > administrator to compile charm++.
> > > Any thoughts or suggestions?
> > > Do I need to recompile all of namd, or can I just compile charm++ without it?
> > >
> > > Thanks again for all of the help,
> > > Charles
> > >
> > > Output files:
> > >
> > > New script, no atoms fixed.
> > >
> > > BRACKET: 6.57916e-07 652.946 -2.45009e+09 -8.45313e+07 9.29531e+08
> > > ENERGY: 198 522579.9239 151303.8494 10858.5910 1446.8211
> > > -80557.5403 481695.6698 0.0000 0.0000 0.0000
> > > 1087327.3149 0.0000 1087327.3149 1087327.3149 0.0000
> > > 188642.4062 235104.9289 576000.0000 188642.4062 235104.9289
> > >
> > > BRACKET: 1.6835e-07 70.1294 -8.45313e+07 1.17645e+07 9.29531e+08
> > > ENERGY: 199 522585.2964 151303.6975 10858.5915 1446.8152
> > > -80557.4059 481690.3089 0.0000 0.0000 0.0000
> > > 1087327.3036 0.0000 1087327.3036 1087327.3036 0.0000
> > > 188639.2161 235101.2267 576000.0000 188639.2161 235101.2267
> > >
> > > LDB: LOAD: AVG 231.478 MAX 291.895 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
> > > LDB: LOAD: AVG 231.478 MAX 255.756 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
> > > LDB: LOAD: AVG 231.478 MAX 236.106 MSGS: TOTAL 184 MAXC 20 MAXP 5 Alg7
> > > Could not malloc() 2118274080 bytes--are we out of memory?Fatal error, aborting.
> > > Rtasks fail:
> > > Rtask(s) 1 : exited with signal <6>
> > > Rtask(s) 3 2 4 5 8 6 7 10 9 : exited with signal <15>
> > > Rtask(s) 1 : coredump
> > > >
> > >
> > > New Script, Fixed Protein
> > >
> > > BRACKET: 1.64649e-05 26875.6 -8.15248e+09 -2.11699e+09 7.56124e+09
> > > ENERGY: 298 246811.3244 127002.1868 7801.5138 776.0334
> > > -110553.6799 343151.2350 0.0000 0.0000 0.0000
> > > 614988.6135 0.0000 614988.6135 614988.6135 0.0000
> > > 156657.9925 177933.1586 576000.0000 156657.9925 177933.1586
> > >
> > > BRACKET: 8.23246e-06 12090.3 -2.11699e+09 -9.70546e+08 7.56124e+09
> > > ENERGY: 299 245766.2529 126976.5313 7802.2252 775.5543
> > > -110592.3262 343704.2276 0.0000 0.0000 0.0000
> > > 614432.4651 0.0000 614432.4651 614432.4651 0.0000
> > > 156870.7170 178776.2517 576000.0000 156870.7170 178776.2517
> > >
> > > LDB: LOAD: AVG 212.831 MAX 217.851 MSGS: TOTAL 184 MAXC 20 MAXP 5 None
> > > LDB: LOAD: AVG 212.831 MAX 216.577 MSGS: TOTAL 184 MAXC 20 MAXP 5 Refine
> > > Could not malloc()--are we out of memory?Fatal error, aborting.
> > > Rtasks fail:
> > > Rtask(s) 1 : exited with signal <6>
> > > Rtask(s) 3 2 4 5 6 8 7 9 10 : exited with signal <15>
> > > Rtask(s) 1 : coredump
> > > >
> > >
> > >
> > >
> > >
> > > On Tue, 02 Nov 2004 13:22:50 -0600 (CST), J. Valencia
> > > <jonathan_at_ibt.unam.mx> wrote:
> > > > Also, for par_all27_prot_lipid.prm the suggested cutoff scheme is:
> > > > switchdist 10.0
> > > > cutoff 12.0
> > > > pairlistdist 14.0
> > > > This is stated almost at the end of the file.
> > > >
> > > > Good luck!
> > > >
> > > > J. Valencia.
> > > >
> > >
> >
> > *****************************************************************
> > **Brian Bennion, Ph.D. **
> > **Computational and Systems Biology Division **
> > **Biology and Biotechnology Research Program **
> > **Lawrence Livermore National Laboratory **
> > **P.O. Box 808, L-448 bennion1_at_llnl.gov **
> > **7000 East Avenue phone: (925) 422-5722 **
> > **Livermore, CA 94550 fax: (925) 424-6605 **
> > *****************************************************************
> >
> >
>