Re: Re: Re: clarification(s) Re: namd 2.62b FATAL ERROR: Memory allocation failed on processor 0 or higher

From: Tom Caulfield (tom.r.caulfield_at_gmail.com)
Date: Sat Sep 09 2006 - 16:35:50 CDT

Next message: Anahita Tafvizi: "question regarding the use of reaction coordinate "distance-com" in ABF method"
Previous message: Jerome Henin: "Re: History file of an ABF smiulation"
Maybe in reply to: Tom Caulfield: "Re: clarification(s) Re: namd 2.62b FATAL ERROR: Memory allocation failed on processor 0 or higher"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

I don't think the swap space (virtual memory) is being used. Usage
goes right up to the physical memory limit of about 1 GB (1031332K),
but the total job needs about 1.47 GB according to what I am
running on the mini-cluster. Is there something I can do to get the
virtual memory to be used as well?
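
For reference, one way to confirm which limits a job really inherits is to print them from inside the job itself, since the batch environment can differ from the login shell. The following is a minimal Python sketch using the standard resource module. The function and label names are illustrative only, and it assumes Python is available on the compute node.

```python
import resource

# Limits most relevant to a large allocation.
# RLIMIT_AS is not defined on every platform, so it is looked up by name.
CANDIDATES = [
    ("datasize", "RLIMIT_DATA"),
    ("address space", "RLIMIT_AS"),
    ("stack", "RLIMIT_STACK"),
]

def fmt(value):
    """Render a limit in kB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"

def report_limits():
    """Return {label: (soft, hard)} for the limits this process inherited."""
    rows = {}
    for label, name in CANDIDATES:
        if hasattr(resource, name):
            soft, hard = resource.getrlimit(getattr(resource, name))
            rows[label] = (fmt(soft), fmt(hard))
    return rows

for label, (soft, hard) in report_limits().items():
    print(f"{label:14s} soft={soft}  hard={hard}")
```

Running this through the same queue as the NAMD job, and again from the login shell, would show whether datasize or address space is capped only for batch jobs.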

I ran a single-node, single-processor job on the cluster (120) and
watched its memory use in real time with top. This is the output right
before it crashed.

node 2:
38 processes: 35 sleeping, 3 running, 0 zombie, 0 stopped
CPU0 states: 3.4% user, 12.3% system, 0.0% nice, 83.1% idle
CPU1 states: 75.1% user, 3.0% system, 0.0% nice, 21.2% idle
Mem: 1031332K av, 1021816K used, 9516K free, 0K shrd, 1028K buff   <--- this is right before it crashes
Swap: 2048276K av, 2016K used, 2046260K free 518480K cached   <--- notice that the swap (virtual) memory is never used

  PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
11397 roland 16 0 478M 478M 1380 R 82.4 47.5 0:54 namd2
  687 root 9 0 0 0 0 SW 2.1 0.0 0:12 rpciod
    5 root 9 0 0 0 0 RW 0.3 0.0 0:05 kswapd
11450 roland 9 0 1040 1040 848 R 0.1 0.1 0:00 top
    1 root 8 0 464 420 408 S 0.0 0.0 0:13 init
    2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
    3 root 18 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU0
    4 root 18 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU1
    6 root 9 0 0 0 0 SW 0.0 0.0 0:00 bdflush
    7 root 9 0 0 0 0 SW 0.0 0.0 0:03 kupdated
    8 root 9 0 0 0 0 SW 0.0 0.0 0:00 jfsIO
    9 root 9 0 0 0 0 SW 0.0 0.0 0:00 jfsCommit
   10 root 9 0 0 0 0 SW 0.0 0.0 0:00 jfsSync
  etc
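
As a rough cross-check of the numbers above, the two summary lines can be reduced to a few figures. This is only a sketch: the helper name parse_top_kb is made up for illustration, and it assumes that the "cached" value this version of top prints on the Swap line is page cache that is counted inside "used" memory.

```python
import re

def parse_top_kb(line):
    """Return {label: kilobytes} for fields such as '1031332K av' in a top summary line."""
    return {label: int(kb) for kb, label in re.findall(r"(\d+)K\s+(\w+)", line)}

# Summary lines copied from the top output in this message.
mem = parse_top_kb("Mem: 1031332K av, 1021816K used, 9516K free, 0K shrd, 1028K buff")
swap = parse_top_kb("Swap: 2048276K av, 2016K used, 2046260K free 518480K cached")

mem_used_pct = 100.0 * mem["used"] / mem["av"]
swap_used_pct = 100.0 * swap["used"] / swap["av"]

# Assumption: buffers and page cache are included in "used", so subtracting
# them approximates the memory actually held by processes and the kernel.
apps_kb = mem["used"] - mem["buff"] - swap["cached"]

print(f"physical memory used: {mem_used_pct:.1f}%")
print(f"swap used:            {swap_used_pct:.1f}%")
print(f"used minus buffers and cache: {apps_kb} kB")
```

Under that assumption, a large part of the "used" figure is cache rather than namd2 itself, which would be consistent with the 478M that top reports for the namd2 process.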

Thanks,
-Tom

On 9/9/06, Tom Caulfield <tom.r.caulfield_at_gmail.com> wrote:
> It is saying unlimited when I check it. Which I am inferring to mean
> the maximum amount for what I have in each node. I used ulimit to
> check it.
> -Tom
>
>
> On 9/8/06, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
> >
> > Check what "limit" returns for datasize when running a batch job.
> >
> > -Jim
> >
> > On Thu, 7 Sep 2006, Tom Caulfield wrote:
> >
> > > Hi Jim,
> > >
> > > Below is an email exchange I sent to our sys admin. I was able to get
> > > the single-process version of the namd2 to run on the cluster, which I
> > > was hoping was going to free up enough memory to run there. It works
> > > fine on our Itanium mini, but the larger cluster is preferred
> > > > (Linux Networx Evolocity; 120 nodes free there). I am still crashing
> > > out at about 900MB. Perhaps the swapping memory is not being utilized
> > > here?
> > >
> > > > I am noticing for the cluster that there is only about 1 GB of physical
> > > > memory per node, of which about 0.9 GB is free (some is consumed by
> > > > the OS). The swap has 2 GB free, but maybe it is unavailable?
> > >
> > > Here is the memory per node as I see it when logged directly onto that node:
> > >
> > > > Mem: 1031332K av, 121792K used, 909540K free, 0K shrd, 1076K buff
> > > > Swap: 2048276K av, 1932K used, 2046344K free 26296K cached
> > >
> > > > I did configure for single-processor NAMD jobs as Jim Phillips advised,
> > > > and there is still a crash occurring as follows (see below):
> > >
> > > Thanks,
> > >
> > > -Tom
> > >
> > > PS I was successful in getting 1 process per node (during start
> > > up...it always crashed before it could get underway).
> > >
> > > NODE n02:
> > > USER PID %CPU %MEM TIME CMD
> > > roland 25201 67.4 85.7 00:01:22 /usr/local/bin/namd2 minFix.namd
> > > NODE n03:
> > > USER PID %CPU %MEM TIME CMD
> > > roland 25133 34.9 2.1 00:00:42 /usr/local/bin/namd2 minFix.namd
> > > NODE n04:
> > > USER PID %CPU %MEM TIME CMD
> > > roland 24955 35.2 2.1 00:00:42 /usr/local/bin/namd2 minFix.namd
> > >
> > > .... up to node 120
> > >
> > >
> > > But crashed at:
> > > Info: ****************************
> > > Info: STRUCTURE SUMMARY:
> > > Info: 2524826 ATOMS
> > > Info: 1774106 BONDS
> > > Info: 1224931 ANGLES
> > > Info: 691809 DIHEDRALS
> > > Info: 43710 IMPROPERS
> > > Info: 0 EXCLUSIONS
> > > Info: 250859 FIXED ATOMS
> > > Info: 6821901 DEGREES OF FREEDOM
> > > Info: 911489 HYDROGEN GROUPS
> > > Info: 148790 HYDROGEN GROUPS WITH ALL ATOMS FIXED
> > > Info: TOTAL MASS = 1.60067e+07 amu
> > > Info: TOTAL CHARGE = 19.9999 e
> > > Info: *****************************
> > > Info: Entering startup phase 0 with 685545 kB of memory in use.
> > > Info: Entering startup phase 1 with 685545 kB of memory in use.
> > > Info: Entering startup phase 2 with 884001 kB of memory in use.
> > > Info: Entering startup phase 3 with 903729 kB of memory in use.
> > > Info: PATCH GRID IS 19 BY 21 BY 18
> > > FATAL ERROR: Memory allocation failed on processor 0.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > <old email below>
> > >
> > > On Wed, 6 Sep 2006, Thomas Caulfield wrote:
> > >
> > > You mentioned putting smp in the list of flags. I think that means
> > > that I cannot use the binaries, but have to install a new version from
> > > source? Then I can use the ++ppn 1 command (as in
> > > /usr/local/bin/charmrun ++nodelist nodelist ++ppn 1 +p 120
> > > /usr/local/bin/namd2 Config.file > logfile & ) Am I barking up the
> > > wrong tree? I haven't compiled namd from source before, but I have
> > > installed other things on the cluster (such as spider)...where is the
> > > charm-5.9 configure command line (please pardon my ignorance).
> > >
> > > Take a look in the building part of the release notes for full
> > > instructions. You'll want something like "net-linux smp tcp". The
> > > ++ppn option is really "threads per process" so you would want ++ppn 2
> > > +p 120 to run one two-thread process on each node.
> > >
> > > If you just run one process per node with a normal binary it should work.
> > >
> > > -Jim
> > >
> > > Thanks again for your input.
> > >
> > > Regards,
> > >
> > > -Tom
> > >
> > > On Sep 3, 2006, at 11:30 PM, Jim Phillips wrote:
> > >
> > > Yes, you are most certainly running out of memory because of the
> > > system size (2.5 million atoms). The molecular structure is
> > > replicated on all nodes, so running on more processors doesn't help.
> > > If you can force process 0 of NAMD to always run on the same node,
> > > then you might get away with just bumping that node up to 2 GB. Try
> > > running one process per node so every process will have 1 GB to work
> > > with. If that works then you can try building NAMD on top of the
> > > net-linux-smp version of Charm++ (just add "smp" to the list of flags
> > > on the charm-5.9 configure command line) to use the second processor
> > > on the node without using too much extra memory (run with charmrun
> > > +p120 ++ppn 2 ...).
> > > -Jim
> > > On Sun, 3 Sep 2006, Thomas Caulfield wrote:
> > > Hello All (NAMD community):
> > > > I am running a large system on a Linux Networx Evolocity II cluster
> > > > with 60 nodes (120 processors). My question is whether this is a
> > > > hardware problem or a software problem.
> > > I am running into a memory error. When I ran a smaller simulation
> > > that was scaling up to this full system one (which had 1,000,000
> > > atoms) there were no problems. Sometimes it gets to processor 6 or 7
> > > before the crash occurs.
> > > Each slave node has the following:
> > > *Evolocity (.8U wide) Intel Rackmount Compute Module, incl P/S
> > > *EIDE hard drive (120GB) 7200RPM 120GB PATA 7200 RPM
> > > * 2 Pentium Xeon 2.8 GHz, PC533 processor, 512k L2 Cache
> > > * 2 512MB PC2700 DDR Memory ECC REG Incl
> > > > * 1 Super Micro X5DPR-8G2+, 6 DIMM slots Dual Intel Xeon (533/400MHz FSB)
> > > > * Intel E7501 chipset
> > > > * (1) 64-bit 133MHz PCI-X
> > > > * Adaptec AIC-7902 Ultra320 SCSI controller
> > > * Intel 82546EB dual port Gigabit
> > > * ATI Rage XL 8MB PCI graphic controller
> > > HERE is an OVERVIEW of the ERROR: (I am assuming that this system
> > > size is just exceeding the memory capacity per node?)
> > > For Full System:
> > > Info: ****************************
> > > Info: STRUCTURE SUMMARY:
> > > Info: 2524826 ATOMS
> > > Info: 1774106 BONDS
> > > Info: 1224931 ANGLES
> > > Info: 691809 DIHEDRALS
> > > Info: 43710 IMPROPERS
> > > Info: 0 EXCLUSIONS
> > > Info: 250859 FIXED ATOMS
> > > Info: 6821901 DEGREES OF FREEDOM
> > > Info: 911489 HYDROGEN GROUPS
> > > Info: 148790 HYDROGEN GROUPS WITH ALL ATOMS FIXED
> > > Info: TOTAL MASS = 1.60067e+07 amu
> > > Info: TOTAL CHARGE = 19.9999 e
> > > Info: *****************************
> > > Info: Entering startup phase 0 with 685641 kB of memory in use.
> > > Info: Entering startup phase 1 with 685641 kB of memory in use.
> > > FATAL ERROR: Memory allocation failed on processor 0.
> > > It did work for the partial system below though:
> > > Info: ****************************
> > > Info: STRUCTURE SUMMARY:
> > > Info: 251459 ATOMS
> > > Info: 262671 BONDS
> > > Info: 470453 ANGLES
> > > Info: 693542 DIHEDRALS
> > > Info: 43830 IMPROPERS
> > > Info: 0 EXCLUSIONS
> > > Info: 106193 FIXED ATOMS
> > > Info: 435798 DEGREES OF FREEDOM
> > > Info: 149190 HYDROGEN GROUPS
> > > Info: 53056 HYDROGEN GROUPS WITH ALL ATOMS FIXED
> > > Info: TOTAL MASS = 2.21721e+06 amu
> > > Info: TOTAL CHARGE = -3835 e
> > > Info: *****************************
> > > Info: Entering startup phase 0 with 88793 kB of memory in use.
> > > Info: Entering startup phase 1 with 88793 kB of memory in use.
> > > Info: Entering startup phase 2 with 174897 kB of memory in use.
> > > Info: Entering startup phase 3 with 174897 kB of memory in use.
> > > Info: PATCH GRID IS 13 BY 11 BY 9
> > > Info: REMOVING COM VELOCITY 0 0 0
> > > Info: Entering startup phase 4 with 194193 kB of memory in use.
> > > Info: Entering startup phase 5 with 194193 kB of memory in use.
> > > Info: Entering startup phase 6 with 194193 kB of memory in use.
> > > Info: Entering startup phase 7 with 194193 kB of memory in use.
> > > Info: COULOMB TABLE R-SQUARED SPACING: 0.0625
> > > Info: COULOMB TABLE SIZE: 2309 POINTS
> > > Info: Entering startup phase 8 with 194193 kB of memory in use.
> > > Info: Finished startup with 194193 kB of memory in use.
> > > TCL: Minimizing for 50 steps
> > > > ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP
> > > <More Output continues.....aka it works in this case>
> > > Thanks for any valuable insights in advance.
> > > Best regards,
> > > -Tom Caulfield
> > > ****************************************
> > > Tom Caulfield, Ph.D. Candidate
> > > School of Chemistry & Biochemistry
> > > Cherry Emerson Bldg., RM 329
> > > > Georgia Institute of Technology
> > > Atlanta, GA 30332-0400
> > > Harvey Laboratory:
> > > http://rumour.biology.gatech.edu
> > > ****************************************
> > >
> > >
> > > ****************************************
> > > Tom Caulfield, Ph.D. Candidate
> > > School of Chemistry & Biochemistry
> > > Cherry Emerson Bldg., RM 329
> > > Georgia Institute
> > > ofTech nology
> > > Atlanta, GA 30332-0400
> > > Harvey Laboratory:
> > > http://rumour.biology.gatech.edu
> > > ****************************************
> > >
> >
>


This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:42:34 CST