Re: namd 2.62b FATAL ERROR: Memory allocation failed on processor 0 or higher

From: Thomas Caulfield (thomas.caulfield_at_chemistry.gatech.edu)
Date: Wed Sep 06 2006 - 11:55:15 CDT

Interesting. I thought the job's memory was split across the nodes.
But it sounds like you are saying that each node needs a complete copy
of the structure, so spreading the job over more nodes brings no
memory advantage. That of course means larger jobs will hit a memory
"bump", since not many labs are going to install 4-12 GB of memory
per node. Other software can share the memory of a job: each node is
allocated a portion of the memory "map" and can swap or exchange data
as needed (albeit with some overlap).
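
A rough sketch of the arithmetic behind the two schemes, in Python. The
function names, the 1500 MB structure footprint, and the 10% overlap are
all hypothetical placeholders chosen for illustration, not measured
values and not anything taken from NAMD itself:

```python
def replicated_per_node_mb(structure_mb, nodes):
    """Every node holds a full copy, so the node count does not matter."""
    return structure_mb


def partitioned_per_node_mb(structure_mb, nodes, overlap=0.0):
    """Each node holds 1/nodes of the data, plus a fractional overlap."""
    return structure_mb / nodes * (1.0 + overlap)


if __name__ == "__main__":
    structure_mb = 1500.0  # assumed placeholder footprint, not measured
    for nodes in (1, 6, 60):
        rep = replicated_per_node_mb(structure_mb, nodes)
        part = partitioned_per_node_mb(structure_mb, nodes, overlap=0.1)
        print(f"{nodes:3d} nodes: replicated {rep:.1f} MB, "
              f"partitioned {part:.1f} MB per node")
```

Under replication the per-node requirement stays flat as nodes are
added, which is why a larger cluster alone would not get past the
"bump".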

Is there any plan for NAMD to share memory across nodes in this way?

-Tom

PS: I am going to try running it on our mini cluster (6 Altix Itaniums
with a total of 18 GB of memory), just to see how much memory it
requires to load.
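
In the meantime, a crude back-of-the-envelope estimate can be made from
the startup logs quoted below. This sketch assumes, without any
confirmation, that the full system's memory would grow between startup
phase 0 and the end of startup by the same factor as the partial
system's did. The constant and function names are made up for
illustration; only the numbers come from the logs and the node specs:

```python
# Figures copied from the NAMD startup logs quoted in this thread (kB).
SMALL_ATOMS = 251459
SMALL_PHASE0_KB = 88793
SMALL_FINISHED_KB = 194193
FULL_ATOMS = 2524826
FULL_PHASE0_KB = 685641

# Each node has two 512MB DIMMs, i.e. 1 GB in total.
NODE_MEMORY_KB = 2 * 512 * 1024


def bytes_per_atom(kb, atoms):
    """Memory in use at startup phase 0, expressed per atom."""
    return kb * 1024 / atoms


def estimate_full_finished_kb():
    """Scale the full system's phase 0 usage by the partial system's
    phase-0-to-finished growth factor (an unverified assumption)."""
    growth = SMALL_FINISHED_KB / SMALL_PHASE0_KB
    return FULL_PHASE0_KB * growth


if __name__ == "__main__":
    print("bytes/atom, partial:", bytes_per_atom(SMALL_PHASE0_KB, SMALL_ATOMS))
    print("bytes/atom, full:   ", bytes_per_atom(FULL_PHASE0_KB, FULL_ATOMS))
    estimate = estimate_full_finished_kb()
    print("estimated kB after startup:", round(estimate))
    print("exceeds one node's memory: ", estimate > NODE_MEMORY_KB)
```

If that growth assumption is anywhere near right, the estimate lands
above the 1 GB installed per node, which would be consistent with the
failure in startup phase 1. It is only an extrapolation from two data
points, so the Altix run should give a real number.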

On Sep 3, 2006, at 11:30 PM, Jim Phillips wrote:

>
> Yes, you are most certainly running out of memory because of the
> system size (2.5 million atoms). The molecular structure is
> replicated on all nodes, so running on more processors doesn't help.
>
> If you can force process 0 of NAMD to always run on the same node,
> then you might get away with just bumping that node up to 2 GB.
> Try running one process per node so every process will have 1 GB to
> work with. If that works then you can try building NAMD on top of
> the net-linux-smp version of Charm++ (just add "smp" to the list of
> flags on the charm-5.9 configure command line) to use the second
> processor on the node without using too much extra memory (run with
> charmrun +p120 ++ppn 2 ...).
>
> -Jim
>
>
> On Sun, 3 Sep 2006, Thomas Caulfield wrote:
>
>> Hello All (NAMD community):
>>
>> I am running a large system on a LinuX NetworX Evolocity II cluster
>> with 60 nodes (120 processors). My question is whether this is a
>> hardware problem or a software problem.
>>
>> I am running into a memory error. When I ran a smaller simulation
>> (1,000,000 atoms) as a step toward this full system, there were no
>> problems. Sometimes the run gets to processor 6 or 7 before the
>> crash occurs.
>>
>> Each slave node has the following:
>>
>> * Evolocity (.8U wide) Intel Rackmount Compute Module, incl P/S
>> * EIDE hard drive, 120GB PATA, 7200 RPM
>> * 2 Pentium Xeon 2.8 GHz, PC533 processor, 512k L2 Cache
>> * 2 512MB PC2700 DDR Memory ECC REG Incl
>> * 1 Super Micro X5DPR-8G2+, 6 DIMM slots Dual Intel Xeon
>> (533/400MHz FSB)
>> * Intel E7501 chipset
>> * (1) 64-bit 133MHz PCI-X
>> * Adaptec AIC-7902 Ultra320 SCSI controller
>> * Intel 82546EB dual port Gigabit
>> * ATI Rage XL 8MB PCI graphic controller
>>
>>
>> HERE is an OVERVIEW of the ERROR: (I am assuming that this system
>> size is just exceeding the memory capacity per node?)
>>
>> For Full System:
>> Info: ****************************
>> Info: STRUCTURE SUMMARY:
>> Info: 2524826 ATOMS
>> Info: 1774106 BONDS
>> Info: 1224931 ANGLES
>> Info: 691809 DIHEDRALS
>> Info: 43710 IMPROPERS
>> Info: 0 EXCLUSIONS
>> Info: 250859 FIXED ATOMS
>> Info: 6821901 DEGREES OF FREEDOM
>> Info: 911489 HYDROGEN GROUPS
>> Info: 148790 HYDROGEN GROUPS WITH ALL ATOMS FIXED
>> Info: TOTAL MASS = 1.60067e+07 amu
>> Info: TOTAL CHARGE = 19.9999 e
>> Info: *****************************
>> Info: Entering startup phase 0 with 685641 kB of memory in use.
>> Info: Entering startup phase 1 with 685641 kB of memory in use.
>> FATAL ERROR: Memory allocation failed on processor 0.
>>
>>
>> It did work for the partial system below though:
>> Info: ****************************
>> Info: STRUCTURE SUMMARY:
>> Info: 251459 ATOMS
>> Info: 262671 BONDS
>> Info: 470453 ANGLES
>> Info: 693542 DIHEDRALS
>> Info: 43830 IMPROPERS
>> Info: 0 EXCLUSIONS
>> Info: 106193 FIXED ATOMS
>> Info: 435798 DEGREES OF FREEDOM
>> Info: 149190 HYDROGEN GROUPS
>> Info: 53056 HYDROGEN GROUPS WITH ALL ATOMS FIXED
>> Info: TOTAL MASS = 2.21721e+06 amu
>> Info: TOTAL CHARGE = -3835 e
>> Info: *****************************
>> Info: Entering startup phase 0 with 88793 kB of memory in use.
>> Info: Entering startup phase 1 with 88793 kB of memory in use.
>> Info: Entering startup phase 2 with 174897 kB of memory in use.
>> Info: Entering startup phase 3 with 174897 kB of memory in use.
>> Info: PATCH GRID IS 13 BY 11 BY 9
>> Info: REMOVING COM VELOCITY 0 0 0
>> Info: Entering startup phase 4 with 194193 kB of memory in use.
>> Info: Entering startup phase 5 with 194193 kB of memory in use.
>> Info: Entering startup phase 6 with 194193 kB of memory in use.
>> Info: Entering startup phase 7 with 194193 kB of memory in use.
>> Info: COULOMB TABLE R-SQUARED SPACING: 0.0625
>> Info: COULOMB TABLE SIZE: 2309 POINTS
>> Info: Entering startup phase 8 with 194193 kB of memory in use.
>> Info: Finished startup with 194193 kB of memory in use.
>> TCL: Minimizing for 50 steps
>> ETITLE: TS BOND ANGLE DIHED IMPRP
>> ELECT VDW BOUNDARY MISC
>> KINETIC TOTAL TEMP
>> <More Output continues.....aka it works in this case>
>>
>>
>> Thanks for any valuable insights in advance.
>>
>> Best regards,
>>
>> -Tom Caulfield
>> ****************************************
>> Tom Caulfield, Ph.D. Candidate
>> School of Chemistry & Biochemistry
>> Cherry Emerson Bldg., RM 329
>> Georgia Institute of Technology
>> Atlanta, GA 30332-0400
>> Harvey Laboratory:
>> http://rumour.biology.gatech.edu
>> ****************************************
>>
>>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:42:34 CST