From: Brian Bennion (brian_at_youkai.llnl.gov)
Date: Thu Nov 11 2004 - 10:18:49 CST
Hello Charles,
I don't want to rain on your parade, and I am glad that you have had some
success, but the fact that turning this off allows the program to run is a
wee bit scary!
A good portion of the speed of NAMD comes from balancing its load
across the processes.  I would predict that after a certain number of
steps the calculations will become exceedingly slow.
Jim, the lead developer, is away at SC2004 in Pittsburgh. Hopefully he can
address this issue when he returns.
I am still wondering if this is not a problem with charm++.  Did you get a
chance to recompile it or to check the megatests across >5 nodes?
Regards
Brian
On Thu, 11 Nov 2004, Charles Danko wrote:
> Hi, Brian,
>
> Yes!  Turning off load balancing seems to work fine!  None of the
> other algorithms seem to work.  I assume that the program I am using
> to start NAMD will perform load balancing, but I am going to check
> with my administrator to be sure.  Excellent suggestion!! Thanks
> again!
>
> Best wishes,
> Charles
>
> For those who are having the same problem, load balancing is an
> undocumented feature.  Those using the SUN OS and bsub to start NAMD
> will most likely need to turn this feature off.  In NAMD 2.5, it can
> be turned off like so:
> ldbStrategy none
>
> The other acceptable parameters are:
> refineonly
> alg7
> orb
> neighbor
> other - this seemed to be the same as alg7
>
> Good luck!
>
>
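The values listed in the message above can be checked before they are written into a NAMD configuration file. What follows is a minimal Python sketch under stated assumptions, not part of NAMD: the helper name ldb_strategy_line is hypothetical, and the accepted values are only those reported in this thread for NAMD 2.5, not confirmed against the NAMD source or documentation.

```python
# Values reported in this thread for NAMD 2.5.
# Assumption: not verified against the NAMD source or documentation.
REPORTED_LDB_STRATEGIES = (
    "none", "refineonly", "alg7", "orb", "neighbor", "other",
)


def ldb_strategy_line(strategy: str) -> str:
    """Return an 'ldbStrategy <value>' config line (hypothetical helper).

    Raises ValueError for any value not reported in the thread.
    """
    value = strategy.strip()
    if value not in REPORTED_LDB_STRATEGIES:
        raise ValueError(f"unreported ldbStrategy value: {strategy!r}")
    return f"ldbStrategy {value}"


if __name__ == "__main__":
    # Emit the line that disabled load balancing for Charles.
    print(ldb_strategy_line("none"))
```

Rejecting unknown values early is a design choice for this sketch only; how NAMD itself reacts to an unrecognized value is not stated in the thread.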
> On Tue, 9 Nov 2004 13:40:39 -0800 (PST), Brian Bennion
> <brian_at_youkai.llnl.gov> wrote:
> > Hello Charles
> >
> > You might be able to set the output timing to a larger number, but I don't
> > know if that will stop the initial timing entry in the log.
> >
> > The number of steps before load balancing occurs can be changed and is, as
> > far as I know, an undocumented feature.  You can turn it off, change the
> > type of algorithm, as well as the number of steps between load-balancing
> > efforts.
> >
> > Look at the source code on the namd website under simparameters.C
> > Brian
> >
> >
> >
> >
> > On Tue, 9 Nov 2004, Charles Danko wrote:
> >
> > > Hi,
> > >
> > > I do not believe that the problem is in the amount of memory.  The
> > > NAMD users guide says that the program draws, at maximum, 300MB for a
> > > system over 100,000 atoms.  My system is approximately half that size.
> > >  Raising the "pairlistminprocs" parameter does not seem to alter the
> > > error message.  Finally, on our system each task is allowed to take
> > > 4GB of memory.  Even if each processor draws 300MB, a 10 processor
> > > simulation will only draw 3GB - safely below the maximum.
> > >
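The memory estimate in the paragraph above is simple arithmetic, and it can be written down as a short Python sketch. The function names are hypothetical. Two assumptions are made here that the message does not state: the 4GB limit is treated as a cap on the whole job (the stricter reading), and 1GB is taken as 1024MB.

```python
MB_PER_GB = 1024  # assumption: binary units; the message does not say which


def total_memory_mb(per_process_mb: float, processes: int) -> float:
    """Worst-case memory for the whole run if every process hits the peak."""
    return per_process_mb * processes


def fits_limit(per_process_mb: float, processes: int, limit_gb: float) -> bool:
    """True if the worst-case total stays at or below the job limit."""
    return total_memory_mb(per_process_mb, processes) <= limit_gb * MB_PER_GB


if __name__ == "__main__":
    # Figures from the message: 300MB per processor, 10 processors, 4GB limit.
    print(total_memory_mb(300, 10), "MB needed")
    print("fits:", fits_limit(300, 10, 4))
```

This only restates the estimate from the user's guide figure; it says nothing about what NAMD actually requested at the point of the crash.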
> > > The system seems to die when NAMD is run with more than 4 processors
> > > for this particular system.  It always seems to die when the program
> > > is estimating the memory usage, and the length of time for 1ns of
> > > simulation.  Is there some way that I can force it to skip these
> > > steps?  I have looked through the manual but haven't seen anything
> > > useful.
> > >
> > > Simulation output for 4 processors can be found here (3MB file):
> > > http://www.campbellferrara.com/heating-4proc.out
> > >
> > > and 10 here (smaller file):
> > > http://www.campbellferrara.com/heating-10proc.out
> > >
> > > The script file that was used to run the simulation can be found here
> > > (smaller file):
> > > http://www.campbellferrara.com/heating.namd
> > >
> > > I would be grateful for any other suggestions.
> > >
> > > Thanks,
> > > Charles
> > >
> > > The details of what I have tried to reach this conclusion:
> > > 1. Minimization on the protein in a vacuum as Brian recommended.  This
> > > ran through 2k steps and completed successfully.
> > > 2. I realized that there may be some overlapping water molecules since
> > > solvate does not delete these using the minmax option.  I deleted all
> > > of them and then minimized the system.  The minimization ran for 2k
> > > steps and completed successfully.  The system minimized to a gradient
> > > of ~35.  I was trying to get it under 5 as has been recommended on
> > > this mailing list, so I loaded the restart files to minimize for another
> > > thousand steps, and the system died after 199 steps.
> > > 3. The previous step had ASCII save coordinates.  I wanted to see what
> > > would happen if I tried binary.  I restarted a simulation from the
> > > 1500 step binary save file.  The simulation died after 299 steps.
> > > 4. I tried running the simulation from the beginning for 5000 steps
> > > (the same as had just run successfully) and the system died after 199
> > > steps.
> > > 5. Loading the minimized system coordinates into VMD, the system
> > > looked inside out.  I fixed the periodic boundary condition settings
> > > so that the box fits around the entire system (measured minmax
> > > and center in VMD).  I started the minimization from the beginning.  It
> > > died after 199 steps.
> > > 6. I spoke to my administrator about memory and he said that no
> > > application can draw more than 4GB.
> > > 7. I tried the minimization with 1 processor (generally I use 10).  It
> > > took forever, but worked.
> > > 8. A heating script works on 4 processors, but dies after step 199 of
> > > 10 processors.
> > > 9. Similarly, an equilibration script seems to work with 4 processors,
> > > but dies with any higher than 5.
> > >
> > >
> > >
> > > On Wed, 3 Nov 2004 11:05:56 -0800 (PST), Brian Bennion
> > > <brian_at_youkai.llnl.gov> wrote:
> > > > Hello Charles,
> > > >
> > > > A little background...
> > > > Namd requires charm++ to compile correctly, so the natural order is that
> > > > charm++ is compiled first and then namd is compiled against it.  The fact
> > > > that namd runs at all on your system would suggest that charm++ has been
> > > > compiled at some point.
> > > >
> > > > I am not familiar with the sun sparc setup, but charmrun may be used here
> > > > to propagate the job through the nodes.
> > > >
> > > > Can anyone comment here?
> > > >
> > > > Steps that I can recommend....
> > > > Try just minimizing the protein alone in vacuo for 200+ steps?
> > > > Are you sure that the total system is 58,236 atoms?  That seems small for
> > > > such a complex box.
> > > >
> > > > Can you send the whole log file from startup to crash?
> > > > There just might not be enough memory?  But I would think that this would
> > > > manifest itself earlier.
> > > >
> > > > Thanks
> > > > Brian
> > > >
> > > >
> > > >
> > > > On Wed, 3 Nov 2004, Charles Danko wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Thanks to Brian and Dr. Valencia for their help.
> > > > >
> > > > > The machines are a cluster of Sun SPARC 64 bit processors running
> > > > > Solaris 7.  I am using bsub for multithreading.  My administrator may
> > > > > let me run charm++ if you think that it may solve the problem, but
> > > > > there may be some good reason that it wasn't used before (namd was
> > > > > compiled by a colleague of mine, and I am not sure of specific issues
> > > > > he faced when putting it together).
> > > > >
> > > > > The system is a protein, lipid, and water system, in total, 58,236
> > > > > atoms constructed from a protein homology model.  The system was
> > > > > assembled using VMD, the membrane 1.0 plug-in, and solvate 1.2 (to
> > > > > solvate the top and bottom where the protein was sticking out of the
> > > > > pre-equilibrated lipid-water system constructed by membrane).  I
> > > > > deleted all atoms within 1A of the protein and am now trying to
> > > > > minimize the system.
> > > > >
> > > > > Based on Dr. Valencia and Dr. Bennion's suggestions I changed the
> > > > > script file.  I adapted the one intended to heat the system after the
> > > > > minimization.  I have included the new script file as an attachment.
> > > > > The run still crashes after 199 steps, but this time it returns a
> > > > > malloc error.  Short by 2GB?
> > > > > The last part of the output is pasted below.  Many of the forces are
> > > > > positive again.
> > > > >
> > > > > I have tried to fix the protein and minimize the water/lipids; the
> > > > > output is pasted below.  The system lasted for 299 steps this time,
> > > > > but received the same malloc error.
> > > > >
> > > > > I have NOT deleted the atoms which fall outside of my periodic
> > > > > boundary.  If you recommend I will do this and try to run the new
> > > > > script again.  I am acting under the assumption that these atoms will
> > > > > be ignored.
> > > > > Is this correct?
> > > > >
> > > > > Because the problem seems to be a memory allocation error, I am
> > > > > thinking that the next step will be to try to convince my
> > > > > administrator to compile charm++.
> > > > > Any thoughts or suggestions?
> > > > > Do I need to recompile all of namd, or can I just compile charm++ without it?
> > > > >
> > > > > Thanks again for all of the help,
> > > > > Charles
> > > > >
> > > > > Output files:
> > > > >
> > > > > New script, no atoms fixed.
> > > > >
> > > > > BRACKET: 6.57916e-07 652.946 -2.45009e+09 -8.45313e+07 9.29531e+08
> > > > > ENERGY:     198    522579.9239    151303.8494     10858.5910      1446.8211
> > > > >    -80557.5403    481695.6698         0.0000         0.0000         0.0000
> > > > >   1087327.3149         0.0000   1087327.3149   1087327.3149         0.0000
> > > > >    188642.4062    235104.9289    576000.0000    188642.4062    235104.9289
> > > > >
> > > > > BRACKET: 1.6835e-07 70.1294 -8.45313e+07 1.17645e+07 9.29531e+08
> > > > > ENERGY:     199    522585.2964    151303.6975     10858.5915      1446.8152
> > > > >    -80557.4059    481690.3089         0.0000         0.0000         0.0000
> > > > >   1087327.3036         0.0000   1087327.3036   1087327.3036         0.0000
> > > > >    188639.2161    235101.2267    576000.0000    188639.2161    235101.2267
> > > > >
> > > > > LDB:  LOAD: AVG 231.478 MAX 291.895  MSGS: TOTAL 184 MAXC 20 MAXP 5  None
> > > > > LDB:  LOAD: AVG 231.478 MAX 255.756  MSGS: TOTAL 184 MAXC 20 MAXP 5  Alg7
> > > > > LDB:  LOAD: AVG 231.478 MAX 236.106  MSGS: TOTAL 184 MAXC 20 MAXP 5  Alg7
> > > > > Could not malloc() 2118274080 bytes--are we out of memory?Fatal error, aborting.
> > > > > Rtasks fail:
> > > > > Rtask(s) 1 : exited with signal <6>
> > > > > Rtask(s) 3 2 4 5 8 6 7 10 9 : exited with signal <15>
> > > > > Rtask(s) 1  : coredump
> > > > > >
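The log excerpt above can be summarized with a small Python sketch that pulls out the load figures and the size of the failed allocation. This is an illustration only: the function names are hypothetical, and it assumes that AVG and MAX on each LDB line are per-processor load figures and that the last field names the balancing pass that produced them.

```python
import re

# Lines copied from the output quoted above.
LOG_EXCERPT = """\
LDB:  LOAD: AVG 231.478 MAX 291.895  MSGS: TOTAL 184 MAXC 20 MAXP 5  None
LDB:  LOAD: AVG 231.478 MAX 255.756  MSGS: TOTAL 184 MAXC 20 MAXP 5  Alg7
LDB:  LOAD: AVG 231.478 MAX 236.106  MSGS: TOTAL 184 MAXC 20 MAXP 5  Alg7
Could not malloc() 2118274080 bytes--are we out of memory?Fatal error, aborting.
"""

LDB_RE = re.compile(
    r"LDB:\s+LOAD:\s+AVG\s+([\d.]+)\s+MAX\s+([\d.]+).*\s(\S+)\s*$"
)
MALLOC_RE = re.compile(r"Could not malloc\(\) (\d+) bytes")


def load_imbalance(log: str):
    """Return (label, MAX/AVG ratio) for every LDB line in the log."""
    rows = []
    for line in log.splitlines():
        match = LDB_RE.search(line)
        if match:
            avg, peak = float(match.group(1)), float(match.group(2))
            rows.append((match.group(3), peak / avg))
    return rows


def failed_malloc_gib(log: str):
    """Return the size of the failed malloc() in GiB, or None if absent."""
    match = MALLOC_RE.search(log)
    return int(match.group(1)) / 2**30 if match else None


if __name__ == "__main__":
    for label, ratio in load_imbalance(LOG_EXCERPT):
        print(f"{label:>6}: MAX/AVG = {ratio:.3f}")
    print(f"failed malloc: {failed_malloc_gib(LOG_EXCERPT):.2f} GiB")
```

Under those assumptions, the MAX/AVG ratio falls with each pass, and the crash follows a single allocation request of just under 2 GiB, which is far larger than the per-processor figure discussed elsewhere in the thread.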
> > > > >
> > > > > New Script, Fixed Protein
> > > > >
> > > > > BRACKET: 1.64649e-05 26875.6 -8.15248e+09 -2.11699e+09 7.56124e+09
> > > > > ENERGY:     298    246811.3244    127002.1868      7801.5138       776.0334
> > > > >   -110553.6799    343151.2350         0.0000         0.0000         0.0000
> > > > >    614988.6135         0.0000    614988.6135    614988.6135         0.0000
> > > > >    156657.9925    177933.1586    576000.0000    156657.9925    177933.1586
> > > > >
> > > > > BRACKET: 8.23246e-06 12090.3 -2.11699e+09 -9.70546e+08 7.56124e+09
> > > > > ENERGY:     299    245766.2529    126976.5313      7802.2252       775.5543
> > > > >   -110592.3262    343704.2276         0.0000         0.0000         0.0000
> > > > >    614432.4651         0.0000    614432.4651    614432.4651         0.0000
> > > > >    156870.7170    178776.2517    576000.0000    156870.7170    178776.2517
> > > > >
> > > > > LDB:  LOAD: AVG 212.831 MAX 217.851  MSGS: TOTAL 184 MAXC 20 MAXP 5  None
> > > > > LDB:  LOAD: AVG 212.831 MAX 216.577  MSGS: TOTAL 184 MAXC 20 MAXP 5  Refine
> > > > > Could not malloc()--are we out of memory?Fatal error, aborting.
> > > > > Rtasks fail:
> > > > > Rtask(s) 1 : exited with signal <6>
> > > > > Rtask(s) 3 2 4 5 6 8 7 9 10 : exited with signal <15>
> > > > > Rtask(s) 1  : coredump
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, 02 Nov 2004 13:22:50 -0600 (CST), J. Valencia
> > > > > <jonathan_at_ibt.unam.mx> wrote:
> > > > > >   Also, for par_all27_prot_lipid.prm the suggested cutoff scheme is:
> > > > > > switchdist      10.0
> > > > > > cutoff          12.0
> > > > > > pairlistdist    14.0
> > > > > > This is stated almost at the end of the file.
> > > > > >
> > > > > > Good luck!
> > > > > >
> > > > > > J. Valencia.
> > > > > >
> > > > >
> > > >
> > > > *****************************************************************
> > > > **Brian Bennion, Ph.D.                                         **
> > > > **Computational and Systems Biology Division                   **
> > > > **Biology and Biotechnology Research Program                   **
> > > > **Lawrence Livermore National Laboratory                       **
> > > > **P.O. Box 808, L-448    bennion1_at_llnl.gov                     **
> > > > **7000 East Avenue       phone: (925) 422-5722                 **
> > > > **Livermore, CA  94550   fax:   (925) 424-6605                 **
> > > > *****************************************************************
> > > >
> > > >
> > >
> >
> >
> >
>