Re: namd 2.62b FATAL ERROR: Memory allocation failed on processor 0 or higher

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Thu Sep 07 2006 - 01:54:39 CDT

The pairlist memory is distributed across nodes, but the molecular data
is not. If you build an SMP version then for each shared-memory node you
can have a single process sharing the large molecular data with a separate
worker thread for each processor. That'll get you a factor of 2 or 4. This
will probably be the case for distributed binaries in the next release.
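
A rough illustration of where the factor of 2 or 4 comes from, as a sketch
rather than anything taken from NAMD itself. It assumes the replicated
molecular data is about the 685641 kB reported at startup phase 0 for the
2.5 million atom system quoted below (the real footprint is larger, since
the run failed while allocating more), and it ignores pairlists and other
distributed data. The function name and parameters are hypothetical.

```python
def replicated_kb_per_node(molecule_kb, cpus_per_node, smp):
    """Memory (kB) one node spends on copies of the molecular data.

    Non-SMP: one process per CPU, each holding its own full copy.
    SMP: one process per node, its worker threads share a single copy.
    """
    copies = 1 if smp else cpus_per_node
    return copies * molecule_kb


# Assumption: phase-0 "memory in use" stands in for the replicated data size.
molecule_kb = 685641

for cpus in (2, 4):
    plain = replicated_kb_per_node(molecule_kb, cpus, smp=False)
    shared = replicated_kb_per_node(molecule_kb, cpus, smp=True)
    print(f"{cpus} CPUs/node: {plain} kB non-SMP, {shared} kB SMP, "
          f"factor {plain // shared}")
```

The saving scales with the number of processors that share a node, which is
why it stops at 2 or 4 on dual- or quad-processor nodes.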

Actually distributing the molecular data is going to be more complicated.

-Jim

On Wed, 6 Sep 2006, Thomas Caulfield wrote:

> Interesting. I thought that the job split the memory over the nodes. But it
> sounds like you are saying an exact image is required on each node, so there
> is no advantage to shared memory then. Which of course means that larger
> jobs will hit a memory "bump", since not many labs are going to
> install 4-12 GB of memory per node. Other software can share the memory of a
> job, each node is allocated a space to the memory "map" and can swap,
> exchange, whatever (albeit there may be some overlap).
>
> Is there any plan to implement NAMD to shared memory?
>
> -Tom
>
> PS I am going to try running it on our mini cluster (6 Altix Itaniums with a
> total of 18GB of memory) just to see how much memory it requires to load.
>
>
> On Sep 3, 2006, at 11:30 PM, Jim Phillips wrote:
>
>>
>> Yes, you are most certainly running out of memory because of the system
>> size (2.5 million atoms). The molecular structure is replicated on all
>> nodes, so running on more processors doesn't help.
>>
>> If you can force process 0 of NAMD to always run on the same node, then you
>> might get away with just bumping that node up to 2 GB. Try running one
>> process per node so every process will have 1 GB to work with. If that
>> works then you can try building NAMD on top of the net-linux-smp version of
>> Charm++ (just add "smp" to the list of flags on the charm-5.9 configure
>> command line) to use the second processor on the node without using too
>> much extra memory (run with charmrun +p120 ++ppn 2 ...).
>>
>> -Jim
>>
>>
>> On Sun, 3 Sep 2006, Thomas Caulfield wrote:
>>
>>> Hello All (NAMD community):
>>>
>>> I am running a large system on a LinuX NetworX Evolocity II cluster with 60
>>> nodes (120 processors). My question is whether this is a hardware
>>> problem or a software problem.
>>>
>>> I am running into a memory error. When I ran a smaller simulation that
>>> was scaling up to this full system one (which had 1,000,000 atoms) there
>>> were no problems. Sometimes it gets to processor 6 or 7 before the crash
>>> occurs.
>>>
>>> Each slave node has the following:
>>>
>>> * Evolocity (.8U wide) Intel Rackmount Compute Module, incl P/S
>>> * EIDE hard drive, 120GB PATA, 7200 RPM
>>> * 2 Pentium Xeon 2.8 GHz, PC533 processor, 512k L2 Cache
>>> * 2 512MB PC2700 DDR Memory ECC REG Incl
>>> * 1 Super Micro X5DPR-8G2+, 6 DIMM slots Dual Intel Xeon (533/400MHz
>>> FSB)
>>> * Intel E7501 chipset
>>> * (1) 64-bit 133MHz PCI-X
>>> * Adaptec AIC-7902 Ultra320 SCSI controller
>>> * Intel 82546EB dual port Gigabit
>>> * ATI Rage XL 8MB PCI graphic controller
>>>
>>>
>>> HERE is an OVERVIEW of the ERROR: (I am assuming that this system size is
>>> just exceeding the memory capacity per node?)
>>>
>>> For Full System:
>>> Info: ****************************
>>> Info: STRUCTURE SUMMARY:
>>> Info: 2524826 ATOMS
>>> Info: 1774106 BONDS
>>> Info: 1224931 ANGLES
>>> Info: 691809 DIHEDRALS
>>> Info: 43710 IMPROPERS
>>> Info: 0 EXCLUSIONS
>>> Info: 250859 FIXED ATOMS
>>> Info: 6821901 DEGREES OF FREEDOM
>>> Info: 911489 HYDROGEN GROUPS
>>> Info: 148790 HYDROGEN GROUPS WITH ALL ATOMS FIXED
>>> Info: TOTAL MASS = 1.60067e+07 amu
>>> Info: TOTAL CHARGE = 19.9999 e
>>> Info: *****************************
>>> Info: Entering startup phase 0 with 685641 kB of memory in use.
>>> Info: Entering startup phase 1 with 685641 kB of memory in use.
>>> FATAL ERROR: Memory allocation failed on processor 0.
>>>
>>>
>>> It did work for the partial system below though:
>>> Info: ****************************
>>> Info: STRUCTURE SUMMARY:
>>> Info: 251459 ATOMS
>>> Info: 262671 BONDS
>>> Info: 470453 ANGLES
>>> Info: 693542 DIHEDRALS
>>> Info: 43830 IMPROPERS
>>> Info: 0 EXCLUSIONS
>>> Info: 106193 FIXED ATOMS
>>> Info: 435798 DEGREES OF FREEDOM
>>> Info: 149190 HYDROGEN GROUPS
>>> Info: 53056 HYDROGEN GROUPS WITH ALL ATOMS FIXED
>>> Info: TOTAL MASS = 2.21721e+06 amu
>>> Info: TOTAL CHARGE = -3835 e
>>> Info: *****************************
>>> Info: Entering startup phase 0 with 88793 kB of memory in use.
>>> Info: Entering startup phase 1 with 88793 kB of memory in use.
>>> Info: Entering startup phase 2 with 174897 kB of memory in use.
>>> Info: Entering startup phase 3 with 174897 kB of memory in use.
>>> Info: PATCH GRID IS 13 BY 11 BY 9
>>> Info: REMOVING COM VELOCITY 0 0 0
>>> Info: Entering startup phase 4 with 194193 kB of memory in use.
>>> Info: Entering startup phase 5 with 194193 kB of memory in use.
>>> Info: Entering startup phase 6 with 194193 kB of memory in use.
>>> Info: Entering startup phase 7 with 194193 kB of memory in use.
>>> Info: COULOMB TABLE R-SQUARED SPACING: 0.0625
>>> Info: COULOMB TABLE SIZE: 2309 POINTS
>>> Info: Entering startup phase 8 with 194193 kB of memory in use.
>>> Info: Finished startup with 194193 kB of memory in use.
>>> TCL: Minimizing for 50 steps
>>> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT
>>> VDW BOUNDARY MISC
>>> KINETIC TOTAL TEMP
>>> <More Output continues.....aka it works in this case>
>>>
>>>
>>> Thanks for any valuable insights in advance.
>>>
>>> Best regards,
>>>
>>> -Tom Caulfield
>>> ****************************************
>>> Tom Caulfield, Ph.D. Candidate
>>> School of Chemistry & Biochemistry
>>> Cherry Emerson Bldg., RM 329
>>> Georgia Institute
>>> of Technology
>>> Atlanta, GA 30332-0400
>>> Harvey Laboratory:
>>> http://rumour.biology.gatech.edu
>>> ****************************************
>>>
>
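
The "Entering startup phase" lines quoted in the thread are the easiest way
to see how far a run got and how much memory it held at each step. A minimal
sketch for pulling those numbers out of a log follows. The function name is
hypothetical, and the pattern is written against the exact line format shown
in the logs above, so other NAMD versions may need a different pattern.

```python
import re

# Matches lines such as:
#   Info: Entering startup phase 2 with 174897 kB of memory in use.
PHASE_RE = re.compile(
    r"Info: Entering startup phase (\d+) with (\d+) kB of memory in use\."
)


def startup_memory(log_text):
    """Return {phase_number: kB_in_use} for every startup-phase line found."""
    return {int(phase): int(kb) for phase, kb in PHASE_RE.findall(log_text)}


# Lines copied from the partial-system log quoted in the thread.
sample = """\
Info: Entering startup phase 0 with 88793 kB of memory in use.
Info: Entering startup phase 1 with 88793 kB of memory in use.
Info: Entering startup phase 2 with 174897 kB of memory in use.
"""

usage = startup_memory(sample)
last_phase = max(usage)
print(f"last phase reached: {last_phase}, {usage[last_phase]} kB in use")
```

Comparing the last phase reached in a failing log with that of a working log
shows at which startup step the allocation failed.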
