Re: namd 2.62b FATAL ERROR: Memory allocation failed on processor 0 or higher

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Sun Sep 03 2006 - 22:30:13 CDT

Yes, you are most certainly running out of memory because of the system
size (2.5 million atoms). The molecular structure is replicated on all
nodes, so running on more processors doesn't help.
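
As a rough check on the numbers quoted below, this sketch compares the memory each process can expect with what process 0 had already used when it failed. It assumes node RAM is split evenly between the processes on a node, that operating system overhead is ignored, and that NAMD's "kB" means 1024 bytes. The projection at the end assumes the full system grows during startup by the same ratio as the partial system, which is only a guess.

```python
# Figures quoted in this thread.
NODE_RAM_MB = 2 * 512       # two 512 MB DIMMs per node
FULL_PHASE1_KB = 685641     # full system, last value logged before the crash
PART_PHASE0_KB = 88793      # partial system, startup phase 0
PART_FINAL_KB = 194193      # partial system, end of startup


def share_mb(procs_per_node):
    """Assumed even split of node RAM between processes (OS overhead ignored)."""
    return NODE_RAM_MB / procs_per_node


def kb_to_mb(kb):
    """Convert kB to MB, assuming 1 kB = 1024 bytes."""
    return kb / 1024


used_mb = kb_to_mb(FULL_PHASE1_KB)

# Assumption: startup memory grows by the same factor as in the partial run.
growth = PART_FINAL_KB / PART_PHASE0_KB
projected_mb = used_mb * growth

for ppn in (2, 1):
    print(f"{ppn} process(es) per node: {share_mb(ppn):.0f} MB each, "
          f"{used_mb:.0f} MB already in use at phase 1")
print(f"rough projection for full startup: {projected_mb:.0f} MB")
```

With two processes per node, the usage logged at phase 1 is already above the per-process share, which fits the failure shown in the log.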

If you can force process 0 of NAMD to always run on the same node, then
you might get away with just bumping that node up to 2 GB. Try running
one process per node so every process will have 1 GB to work with. If
that works then you can try building NAMD on top of the net-linux-smp
version of Charm++ (just add "smp" to the list of flags on the charm-5.9
configure command line) to use the second processor on the node without
using too much extra memory (run with charmrun +p120 ++ppn 2 ...).
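
The two launch layouts can be sketched as follows. This is only an illustration: the hostnames, the `./namd2` path, the `run.namd` config name and the `nodelist` file name are all hypothetical. It also assumes each host is listed once in the nodelist, so that 60 processes land on 60 different nodes.

```python
def nodelist_text(hosts):
    """Build a charmrun nodelist: 'group main', then one 'host' line per node."""
    return "\n".join(["group main"] + [f"host {h}" for h in hosts]) + "\n"


def charmrun_argv(n_nodes, procs_per_node, smp,
                  namd2="./namd2", conf="run.namd", nodelist="nodelist"):
    """Assemble a charmrun command line (binary and file names are hypothetical)."""
    total = n_nodes * procs_per_node
    argv = ["charmrun", f"+p{total}"]
    if smp:
        # Only meaningful for an smp build of Charm++.
        argv += ["++ppn", str(procs_per_node)]
    argv += ["++nodelist", nodelist, namd2, conf]
    return argv


hosts = [f"node{i:02d}" for i in range(1, 61)]  # hypothetical hostnames

print(nodelist_text(hosts[:3]), end="")
# One process per node, non-smp build:
print(" ".join(charmrun_argv(60, 1, smp=False)))
# Two processors per node with the smp build:
print(" ".join(charmrun_argv(60, 2, smp=True)))
```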

-Jim

On Sun, 3 Sep 2006, Thomas Caulfield wrote:

> Hello All (NAMD community):
>
> I am running a large system on a Linux Networx Evolocity II cluster with 60
> nodes (120 processors). My question is whether this is a hardware problem
> or a software problem.
>
> I am running into a memory error. A smaller simulation (1,000,000 atoms),
> run while scaling up to this full system, had no problems. With the full
> system, the run sometimes gets to processor 6 or 7 before the crash occurs.
>
> Each slave node has the following:
>
> * Evolocity (.8U wide) Intel Rackmount Compute Module, incl P/S
> * EIDE hard drive (120GB), PATA, 7200 RPM
> * 2 Pentium Xeon 2.8 GHz, PC533 processor, 512k L2 Cache
> * 2 512MB PC2700 DDR Memory ECC REG Incl
> * 1 Super Micro X5DPR-8G2+, 6 DIMM slots Dual Intel Xeon (533/400MHz FSB)
> * Intel E7501 chipset
> * (1) 64-bit 133MHz PCI-X
> * Adaptec AIC-7902 Ultra320 SCSI controller
> * Intel 82546EB dual port Gigabit
> * ATI Rage XL 8MB PCI graphic controller
>
>
> HERE is an OVERVIEW of the ERROR: (I am assuming that this system size is
> just exceeding the memory capacity per node?)
>
> For Full System:
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 2524826 ATOMS
> Info: 1774106 BONDS
> Info: 1224931 ANGLES
> Info: 691809 DIHEDRALS
> Info: 43710 IMPROPERS
> Info: 0 EXCLUSIONS
> Info: 250859 FIXED ATOMS
> Info: 6821901 DEGREES OF FREEDOM
> Info: 911489 HYDROGEN GROUPS
> Info: 148790 HYDROGEN GROUPS WITH ALL ATOMS FIXED
> Info: TOTAL MASS = 1.60067e+07 amu
> Info: TOTAL CHARGE = 19.9999 e
> Info: *****************************
> Info: Entering startup phase 0 with 685641 kB of memory in use.
> Info: Entering startup phase 1 with 685641 kB of memory in use.
> FATAL ERROR: Memory allocation failed on processor 0.
>
>
> It did work for the partial system below though:
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 251459 ATOMS
> Info: 262671 BONDS
> Info: 470453 ANGLES
> Info: 693542 DIHEDRALS
> Info: 43830 IMPROPERS
> Info: 0 EXCLUSIONS
> Info: 106193 FIXED ATOMS
> Info: 435798 DEGREES OF FREEDOM
> Info: 149190 HYDROGEN GROUPS
> Info: 53056 HYDROGEN GROUPS WITH ALL ATOMS FIXED
> Info: TOTAL MASS = 2.21721e+06 amu
> Info: TOTAL CHARGE = -3835 e
> Info: *****************************
> Info: Entering startup phase 0 with 88793 kB of memory in use.
> Info: Entering startup phase 1 with 88793 kB of memory in use.
> Info: Entering startup phase 2 with 174897 kB of memory in use.
> Info: Entering startup phase 3 with 174897 kB of memory in use.
> Info: PATCH GRID IS 13 BY 11 BY 9
> Info: REMOVING COM VELOCITY 0 0 0
> Info: Entering startup phase 4 with 194193 kB of memory in use.
> Info: Entering startup phase 5 with 194193 kB of memory in use.
> Info: Entering startup phase 6 with 194193 kB of memory in use.
> Info: Entering startup phase 7 with 194193 kB of memory in use.
> Info: COULOMB TABLE R-SQUARED SPACING: 0.0625
> Info: COULOMB TABLE SIZE: 2309 POINTS
> Info: Entering startup phase 8 with 194193 kB of memory in use.
> Info: Finished startup with 194193 kB of memory in use.
> TCL: Minimizing for 50 steps
> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP
> <More Output continues.....aka it works in this case>
>
>
> Thanks for any valuable insights in advance.
>
> Best regards,
>
> -Tom Caulfield
> ****************************************
> Tom Caulfield, Ph.D. Candidate
> School of Chemistry & Biochemistry
> Cherry Emerson Bldg., RM 329
> Georgia Institute of Technology
> Atlanta, GA 30332-0400
> Harvey Laboratory:
> http://rumour.biology.gatech.edu
> ****************************************
>
>
>
