Re: clarification(s) Re: namd 2.62b FATAL ERROR: Memory allocation failed on processor 0 or higher

From: Tom Caulfield (tom.r.caulfield_at_gmail.com)
Date: Thu Sep 07 2006 - 20:25:27 CDT

Hi Jim,

Below is an email exchange I sent to our sys admin. I was able to get
the single-process version of namd2 to run on the cluster, which I
was hoping would free up enough memory to run there. It works
fine on our Itanium mini, but the larger cluster is preferred
(Evolocity LinuX NetworX; 120 nodes free there). I am still crashing
out at about 900MB. Perhaps the swap memory is not being utilized
here?

I am noticing for the cluster that there is only about 1GB physical
per node, of which about 0.9GB is free (since some is consumed
by the OS). The swap has 2GB free, but maybe it is unavailable?

Here is the memory per node as I see it when logged directly onto that node:

Mem: 1031332K av, 121792K used, 909540K free, 0K shrd, 1076K buff
Swap: 2048276K av, 1932K used, 2046344K free 26296K cached
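
To make the numbers above easier to compare, here is a minimal sketch that parses the two summary lines and converts the free amounts to megabytes. The function name `parse_memory_summary` is made up for illustration, and the conversion assumes 1 MB = 1024 K.

```python
import re

# Memory summary lines copied from the node (values in kilobytes, "K").
TOP_LINES = """\
Mem: 1031332K av, 121792K used, 909540K free, 0K shrd, 1076K buff
Swap: 2048276K av, 1932K used, 2046344K free 26296K cached
"""

def parse_memory_summary(text):
    """Return {'Mem': {'av': ..., ...}, 'Swap': {...}} with values in K."""
    summary = {}
    for line in text.splitlines():
        label, _, rest = line.partition(":")
        if not rest:
            continue  # skip lines without a "Label:" prefix
        summary[label.strip()] = {
            name: int(value) for value, name in re.findall(r"(\d+)K (\w+)", rest)
        }
    return summary

summary = parse_memory_summary(TOP_LINES)
free_ram_mb = summary["Mem"]["free"] / 1024    # assumption: 1 MB = 1024 K
free_swap_mb = summary["Swap"]["free"] / 1024
print(f"free physical: {free_ram_mb:.0f} MB, free swap: {free_swap_mb:.0f} MB")
```

With these inputs the free physical memory comes to roughly 888 MB and the free swap to roughly 1998 MB, which matches the "about 0.9GB" and "2GB" figures quoted above.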

I did configure for single processor NAMD jobs as Jim Phillips advised
and there is still a crash occurring as follows (see below):

Thanks,

-Tom

PS I was successful in getting 1 process per node (during start
up...it always crashed before it could get underway).

NODE n02:
USER PID %CPU %MEM TIME CMD
roland 25201 67.4 85.7 00:01:22 /usr/local/bin/namd2 minFix.namd
NODE n03:
USER PID %CPU %MEM TIME CMD
roland 25133 34.9 2.1 00:00:42 /usr/local/bin/namd2 minFix.namd
NODE n04:
USER PID %CPU %MEM TIME CMD
roland 24955 35.2 2.1 00:00:42 /usr/local/bin/namd2 minFix.namd

.... up to node 120

But crashed at:
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 2524826 ATOMS
Info: 1774106 BONDS
Info: 1224931 ANGLES
Info: 691809 DIHEDRALS
Info: 43710 IMPROPERS
Info: 0 EXCLUSIONS
Info: 250859 FIXED ATOMS
Info: 6821901 DEGREES OF FREEDOM
Info: 911489 HYDROGEN GROUPS
Info: 148790 HYDROGEN GROUPS WITH ALL ATOMS FIXED
Info: TOTAL MASS = 1.60067e+07 amu
Info: TOTAL CHARGE = 19.9999 e
Info: *****************************
Info: Entering startup phase 0 with 685545 kB of memory in use.
Info: Entering startup phase 1 with 685545 kB of memory in use.
Info: Entering startup phase 2 with 884001 kB of memory in use.
Info: Entering startup phase 3 with 903729 kB of memory in use.
Info: PATCH GRID IS 19 BY 21 BY 18
FATAL ERROR: Memory allocation failed on processor 0.
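
As a rough way to read the log above, the sketch below extracts the "memory in use" figure for each startup phase and compares the last one with the free physical memory reported for the node. The regular expression is written against the log lines shown here, and treating NAMD's "kB" and the node's "K" as the same unit is an assumption.

```python
import re

# Startup lines copied from the failing run's log above.
LOG = """\
Info: Entering startup phase 0 with 685545 kB of memory in use.
Info: Entering startup phase 1 with 685545 kB of memory in use.
Info: Entering startup phase 2 with 884001 kB of memory in use.
Info: Entering startup phase 3 with 903729 kB of memory in use.
"""

FREE_PHYSICAL_KB = 909540  # "free" column of the Mem: line for this node

PHASE_RE = re.compile(r"Entering startup phase (\d+) with (\d+) kB of memory in use")

def startup_memory(log_text):
    """Return a list of (phase, kB in use) pairs in log order."""
    return [(int(p), int(kb)) for p, kb in PHASE_RE.findall(log_text)]

phases = startup_memory(LOG)
previous = None
for phase, kb in phases:
    growth = 0 if previous is None else kb - previous
    print(f"phase {phase}: {kb} kB (+{growth} kB)")
    previous = kb

# Compare the last reported figure with the node's free physical memory.
last_kb = phases[-1][1]
headroom_kb = FREE_PHYSICAL_KB - last_kb
print(f"headroom vs. free physical memory: {headroom_kb} kB")
```

The last reported figure (903729 kB) is within a few megabytes of the 909540K of free physical memory, which is at least consistent with the question of whether swap is being used.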

<old email below>

On Wed, 6 Sep 2006, Thomas Caulfield wrote:

You mentioned putting smp in the list of flags. I think that means
that I cannot use the binaries, but have to install a new version from
source? Then I can use the ++ppn 1 command (as in
/usr/local/bin/charmrun ++nodelist nodelist ++ppn 1 +p 120
/usr/local/bin/namd2 Config.file > logfile & ) Am I barking up the
wrong tree? I haven't compiled namd from source before, but I have
installed other things on the cluster (such as spider)...where is the
charm-5.9 configure command line (please pardon my ignorance).

Take a look in the building part of the release notes for full
instructions. You'll want something like "net-linux smp tcp". The
++ppn option is really "threads per process" so you would want ++ppn 2
+p 120 to run one two-thread process on each node.
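
The arithmetic behind that description can be sketched as follows. This only illustrates the relationship stated above (+p as the total thread count, ++ppn as threads per process). The helper `charmrun_layout` is hypothetical, it assumes processes are spread evenly over the nodes, and the 60-node count is taken from the cluster description elsewhere in this thread.

```python
def charmrun_layout(total_threads, threads_per_process, nodes):
    """Describe how +p / ++ppn map onto nodes, per the description above.

    total_threads       -> the value given to +p
    threads_per_process -> the value given to ++ppn
    nodes               -> hypothetical count of nodes in the nodelist
    """
    if total_threads % threads_per_process:
        raise ValueError("+p should be a multiple of ++ppn")
    processes = total_threads // threads_per_process
    return {
        "processes": processes,
        # Assumes an even spread of processes across the nodes.
        "processes_per_node": processes / nodes,
        "threads_per_node": total_threads / nodes,
    }

# 60 dual-processor nodes is assumed from the cluster description in this thread.
print(charmrun_layout(120, 2, nodes=60))  # one two-thread process per node
print(charmrun_layout(120, 1, nodes=60))  # two single-thread processes per node
```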

If you just run one process per node with a normal binary it should work.

-Jim

Thanks again for your input.

Regards,

-Tom

On Sep 3, 2006, at 11:30 PM, Jim Phillips wrote:

Yes, you are most certainly running out of memory because of the
system size (2.5 million atoms). The molecular structure is
replicated on all nodes, so running on more processors doesn't help.
If you can force process 0 of NAMD to always run on the same node,
then you might get away with just bumping that node up to 2 GB. Try
running one process per node so every process will have 1 GB to work
with. If that works then you can try building NAMD on top of the
net-linux-smp version of Charm++ (just add "smp" to the list of flags
on the charm-5.9 configure command line) to use the second processor
on the node without using too much extra memory (run with charmrun
+p120 ++ppn 2 ...).
-Jim
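
A back-of-the-envelope sketch of the reasoning above: since the structure is replicated in every process, what matters is how much physical memory each process gets, not how many nodes are used. The figures come from the logs in this thread. The even split of a node's memory between its processes is a simplifying assumption, and `per_process_budget_kb` is a hypothetical helper.

```python
NODE_RAM_KB = 1031332   # "av" physical memory per node from the Mem: line
STRUCTURE_KB = 685545   # memory in use at startup phase 0 for 2,524,826 atoms

def per_process_budget_kb(node_ram_kb, processes_per_node):
    """Physical memory available to each process if a node's RAM is split evenly."""
    return node_ram_kb // processes_per_node

for ppn in (2, 1):
    budget = per_process_budget_kb(NODE_RAM_KB, ppn)
    fits = budget >= STRUCTURE_KB
    print(f"{ppn} process(es) per node: {budget} kB each, "
          f"phase-0 footprint fits: {fits}")
```

Note that the phase-0 figure is only a lower bound. In the follow-up run the memory in use grew to 903729 kB by startup phase 3, so fitting the phase-0 footprint does not by itself mean the run will complete.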

On Sun, 3 Sep 2006, Thomas Caulfield wrote:

Hello All (NAMD community):

I am running a large system on a LinuX NetworX Evolocity II cluster with 60
nodes (120 processors). My question is whether this is a
hardware problem or a software problem.

I am running into a memory error. When I ran a smaller simulation
(1,000,000 atoms) as a step in scaling up to this full system,
there were no problems. Sometimes it gets to processor 6 or 7
before the crash occurs.

Each slave node has the following:
* Evolocity (.8U wide) Intel Rackmount Compute Module, incl P/S
* EIDE hard drive (120GB), PATA, 7200 RPM
* 2 Pentium Xeon 2.8 GHz, PC533 processor, 512k L2 Cache
* 2 512MB PC2700 DDR Memory ECC REG Incl
* 1 Super Micro X5DPR-8G2+, 6 DIMM slots, Dual Intel Xeon (533/400MHz FSB)
* Intel E7501 chipset
* (1) 64-bit 133MHz PCI-X
* Adaptec AIC-7902 Ultra320 SCSI controller
* Intel 82546EB dual port Gigabit
* ATI Rage XL 8MB PCI graphic controller

Here is an overview of the error. (I am assuming that this system
size is just exceeding the memory capacity per node?)

For the full system:
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 2524826 ATOMS
Info: 1774106 BONDS
Info: 1224931 ANGLES
Info: 691809 DIHEDRALS
Info: 43710 IMPROPERS
Info: 0 EXCLUSIONS
Info: 250859 FIXED ATOMS
Info: 6821901 DEGREES OF FREEDOM
Info: 911489 HYDROGEN GROUPS
Info: 148790 HYDROGEN GROUPS WITH ALL ATOMS FIXED
Info: TOTAL MASS = 1.60067e+07 amu
Info: TOTAL CHARGE = 19.9999 e
Info: *****************************
Info: Entering startup phase 0 with 685641 kB of memory in use.
Info: Entering startup phase 1 with 685641 kB of memory in use.
FATAL ERROR: Memory allocation failed on processor 0.

It did work for the partial system below, though:
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 251459 ATOMS
Info: 262671 BONDS
Info: 470453 ANGLES
Info: 693542 DIHEDRALS
Info: 43830 IMPROPERS
Info: 0 EXCLUSIONS
Info: 106193 FIXED ATOMS
Info: 435798 DEGREES OF FREEDOM
Info: 149190 HYDROGEN GROUPS
Info: 53056 HYDROGEN GROUPS WITH ALL ATOMS FIXED
Info: TOTAL MASS = 2.21721e+06 amu
Info: TOTAL CHARGE = -3835 e
Info: *****************************
Info: Entering startup phase 0 with 88793 kB of memory in use.
Info: Entering startup phase 1 with 88793 kB of memory in use.
Info: Entering startup phase 2 with 174897 kB of memory in use.
Info: Entering startup phase 3 with 174897 kB of memory in use.
Info: PATCH GRID IS 13 BY 11 BY 9
Info: REMOVING COM VELOCITY 0 0 0
Info: Entering startup phase 4 with 194193 kB of memory in use.
Info: Entering startup phase 5 with 194193 kB of memory in use.
Info: Entering startup phase 6 with 194193 kB of memory in use.
Info: Entering startup phase 7 with 194193 kB of memory in use.
Info: COULOMB TABLE R-SQUARED SPACING: 0.0625
Info: COULOMB TABLE SIZE: 2309 POINTS
Info: Entering startup phase 8 with 194193 kB of memory in use.
Info: Finished startup with 194193 kB of memory in use.
TCL: Minimizing for 50 steps
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP
<More output continues... i.e., it works in this case>

Thanks in advance for any valuable insights.

Best regards,
-Tom Caulfield
****************************************
Tom Caulfield, Ph.D. Candidate
School of Chemistry & Biochemistry
Cherry Emerson Bldg., RM 329
Georgia Institute of Technology
Atlanta, GA 30332-0400
Harvey Laboratory:
http://rumour.biology.gatech.edu
****************************************

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:42:34 CST