Re: clarification(s) Re: namd 2.62b FATAL ERROR: Memory allocation failed on processor 0 or higher

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Fri Sep 08 2006 - 17:23:12 CDT

Check what "limit" returns for datasize when running a batch job.
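
As a rough illustration of that check (not part of the original advice), the sketch below reads the data-segment limit from inside a job using Python's standard `resource` module, which is available on Unix-like systems only. It assumes the "datasize" entry reported by the shell's `limit` built-in corresponds to `RLIMIT_DATA`. The helper names `format_limit` and `data_segment_limits` are hypothetical.

```python
import resource  # Unix-only standard-library module


def format_limit(value):
    """Render an rlimit value (in bytes) as MB, or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / (1024 * 1024):.0f} MB"


def data_segment_limits():
    """Return the (soft, hard) data-segment limits as readable strings."""
    # Assumption: this is the limit the shell reports as "datasize".
    soft, hard = resource.getrlimit(resource.RLIMIT_DATA)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = data_segment_limits()
    print(f"datasize soft limit: {soft}")
    print(f"datasize hard limit: {hard}")
```

Running this from within the batch script, rather than from an interactive login, would show whether the queueing system imposes a tighter limit than the login shell does.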

-Jim

On Thu, 7 Sep 2006, Tom Caulfield wrote:

> Hi Jim,
>
> Below is an email exchange I sent to our sys admin. I was able to get
> the single-process version of namd2 to run on the cluster, which I
> was hoping would free up enough memory to run there. It works
> fine on our Itanium mini, but the larger cluster is preferred
> (Linux Networx Evolocity; 120 nodes free there). I am still crashing
> out at about 900 MB. Perhaps swap space is not being used here?
>
> I am noticing that the cluster has only about 1 GB of physical memory
> per node, of which about 0.9 GB is free (some is consumed by the OS).
> Swap has 2 GB free, but maybe it is unavailable?
>
> Here is the memory per node as I see it when logged directly onto that node:
>
> Mem: 1031332K av, 121792K used, 909540K free, 0K shrd, 1076K buff
> Swap: 2048276K av, 1932K used, 2046344K free, 26296K cached
>
> I did configure for single-processor NAMD jobs as Jim Phillips advised,
> and a crash is still occurring, as shown below.
>
> Thanks,
>
> -Tom
>
> PS I was successful in getting 1 process per node during start-up,
> but it always crashed before it could get underway.
>
> NODE n02:
> USER PID %CPU %MEM TIME CMD
> roland 25201 67.4 85.7 00:01:22 /usr/local/bin/namd2 minFix.namd
> NODE n03:
> USER PID %CPU %MEM TIME CMD
> roland 25133 34.9 2.1 00:00:42 /usr/local/bin/namd2 minFix.namd
> NODE n04:
> USER PID %CPU %MEM TIME CMD
> roland 24955 35.2 2.1 00:00:42 /usr/local/bin/namd2 minFix.namd
>
> .... up to node 120
>
>
> But crashed at:
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 2524826 ATOMS
> Info: 1774106 BONDS
> Info: 1224931 ANGLES
> Info: 691809 DIHEDRALS
> Info: 43710 IMPROPERS
> Info: 0 EXCLUSIONS
> Info: 250859 FIXED ATOMS
> Info: 6821901 DEGREES OF FREEDOM
> Info: 911489 HYDROGEN GROUPS
> Info: 148790 HYDROGEN GROUPS WITH ALL ATOMS FIXED
> Info: TOTAL MASS = 1.60067e+07 amu
> Info: TOTAL CHARGE = 19.9999 e
> Info: *****************************
> Info: Entering startup phase 0 with 685545 kB of memory in use.
> Info: Entering startup phase 1 with 685545 kB of memory in use.
> Info: Entering startup phase 2 with 884001 kB of memory in use.
> Info: Entering startup phase 3 with 903729 kB of memory in use.
> Info: PATCH GRID IS 19 BY 21 BY 18
> FATAL ERROR: Memory allocation failed on processor 0.
>
> <old email below>
>
> On Wed, 6 Sep 2006, Thomas Caulfield wrote:
>
> You mentioned putting smp in the list of flags. I think that means
> that I cannot use the binaries, but have to install a new version from
> source? Then I can use the ++ppn 1 command (as in
> /usr/local/bin/charmrun ++nodelist nodelist ++ppn 1 +p 120
> /usr/local/bin/namd2 Config.file > logfile & ) Am I barking up the
> wrong tree? I haven't compiled NAMD from source before, but I have
> installed other things on the cluster (such as SPIDER). Where is the
> charm-5.9 configure command line? (Please pardon my ignorance.)
>
> Take a look in the building part of the release notes for full
> instructions. You'll want something like "net-linux smp tcp". The
> ++ppn option is really "threads per process" so you would want ++ppn 2
> +p 120 to run one two-thread process on each node.
>
> If you just run one process per node with a normal binary it should work.
>
> -Jim
>
> Thanks again for your input.
>
> Regards,
>
> -Tom
>
> On Sep 3, 2006, at 11:30 PM, Jim Phillips wrote:
>
> Yes, you are most certainly running out of memory because of the
> system size (2.5 million atoms). The molecular structure is
> replicated on all nodes, so running on more processors doesn't help.
> If you can force process 0 of NAMD to always run on the same node,
> then you might get away with just bumping that node up to 2 GB. Try
> running one process per node so every process will have 1 GB to work
> with. If that works then you can try building NAMD on top of the
> net-linux-smp version of Charm++ (just add "smp" to the list of flags
> on the charm-5.9 configure command line) to use the second processor
> on the node without using too much extra memory (run with charmrun
> +p120 ++ppn 2 ...).
> -Jim
>
> On Sun, 3 Sep 2006, Thomas Caulfield wrote:
>
> Hello All (NAMD community):
>
> I am running a large system on a Linux Networx Evolocity II cluster with
> 60 nodes (120 processors). My question is whether this is a
> hardware problem or a software problem.
>
> I am running into a memory error. When I ran a smaller simulation
> (1,000,000 atoms) as a step toward this full system, there were no
> problems. Sometimes it gets to processor 6 or 7 before the crash occurs.
>
> Each slave node has the following:
> * Evolocity (.8U wide) Intel Rackmount Compute Module, incl P/S
> * EIDE hard drive (120GB) 7200RPM 120GB PATA 7200 RPM
> * 2 Pentium Xeon 2.8 GHz, PC533 processor, 512k L2 Cache
> * 2 512MB PC2700 DDR Memory ECC REG Incl
> * 1 Super Micro X5DPR-8G2+, 6 DIMM slots Dual Intel Xeon (533/400MHz FSB)
> * Intel E7501 chipset
> * (1) 64-bit 133MHz PCI-X
> * Adaptec AIC-7902 Ultra320 SCSI controller
> * Intel 82546EB dual port Gigabit
> * ATI Rage XL 8MB PCI graphic controller
> HERE is an OVERVIEW of the ERROR: (I am assuming that this system
> size is just exceeding the memory capacity per node?)
> For Full System:
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 2524826 ATOMS
> Info: 1774106 BONDS
> Info: 1224931 ANGLES
> Info: 691809 DIHEDRALS
> Info: 43710 IMPROPERS
> Info: 0 EXCLUSIONS
> Info: 250859 FIXED ATOMS
> Info: 6821901 DEGREES OF FREEDOM
> Info: 911489 HYDROGEN GROUPS
> Info: 148790 HYDROGEN GROUPS WITH ALL ATOMS FIXED
> Info: TOTAL MASS = 1.60067e+07 amu
> Info: TOTAL CHARGE = 19.9999 e
> Info: *****************************
> Info: Entering startup phase 0 with 685641 kB of memory in use.
> Info: Entering startup phase 1 with 685641 kB of memory in use.
> FATAL ERROR: Memory allocation failed on processor 0.
> It did work for the partial system below though:
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 251459 ATOMS
> Info: 262671 BONDS
> Info: 470453 ANGLES
> Info: 693542 DIHEDRALS
> Info: 43830 IMPROPERS
> Info: 0 EXCLUSIONS
> Info: 106193 FIXED ATOMS
> Info: 435798 DEGREES OF FREEDOM
> Info: 149190 HYDROGEN GROUPS
> Info: 53056 HYDROGEN GROUPS WITH ALL ATOMS FIXED
> Info: TOTAL MASS = 2.21721e+06 amu
> Info: TOTAL CHARGE = -3835 e
> Info: *****************************
> Info: Entering startup phase 0 with 88793 kB of memory in use.
> Info: Entering startup phase 1 with 88793 kB of memory in use.
> Info: Entering startup phase 2 with 174897 kB of memory in use.
> Info: Entering startup phase 3 with 174897 kB of memory in use.
> Info: PATCH GRID IS 13 BY 11 BY 9
> Info: REMOVING COM VELOCITY 0 0 0
> Info: Entering startup phase 4 with 194193 kB of memory in use.
> Info: Entering startup phase 5 with 194193 kB of memory in use.
> Info: Entering startup phase 6 with 194193 kB of memory in use.
> Info: Entering startup phase 7 with 194193 kB of memory in use.
> Info: COULOMB TABLE R-SQUARED SPACING: 0.0625
> Info: COULOMB TABLE SIZE: 2309 POINTS
> Info: Entering startup phase 8 with 194193 kB of memory in use.
> Info: Finished startup with 194193 kB of memory in use.
> TCL: Minimizing for 50 steps
> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP
> <More Output continues.....aka it works in this case>
> Thanks for any valuable insights in advance.
> Best regards,
> -Tom Caulfield
> ****************************************
> Tom Caulfield, Ph.D. Candidate
> School of Chemistry & Biochemistry
> Cherry Emerson Bldg., RM 329
> Georgia Institute of Technology
> Atlanta, GA 30332-0400
> Harvey Laboratory:
> http://rumour.biology.gatech.edu
> ****************************************

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:42:34 CST