From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Wed Apr 13 2011 - 08:46:26 CDT
Hi Dong,
There was a memory leak in position (restart or trajectory) output that
was fixed on April 10, so you are probably right that this is a
memory issue. As you can tell from the constant 17.3047 MB in your log,
the /proc/self/stat memory usage numbers are incorrect on BG/L. You may
want to try modifying src/memusage.C to use something that works better.
-Jim
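
One possible direction for the src/memusage.C suggestion above is to ask the operating system for the process's resource usage rather than parsing /proc/self/stat. The following is only a minimal sketch under stated assumptions: the helper names peak_rss_bytes and bytes_to_mb are hypothetical and not part of NAMD, it assumes the Linux convention that getrusage() reports ru_maxrss in kilobytes, and whether the BG/L compute node kernel fills in this field at all would need to be checked on the machine.

```cpp
#include <sys/resource.h>  // getrusage (POSIX)
#include <cstdint>

// Hypothetical helper, not part of NAMD: peak resident set size in bytes,
// or 0 if getrusage() fails or reports nothing useful.
// Assumption: ru_maxrss is in kilobytes, as on Linux.
std::uint64_t peak_rss_bytes() {
  struct rusage ru;
  if (getrusage(RUSAGE_SELF, &ru) != 0) return 0;
  if (ru.ru_maxrss <= 0) return 0;
  return static_cast<std::uint64_t>(ru.ru_maxrss) * 1024ULL;
}

// Hypothetical helper: convert bytes to MB for log output.
// Assumption: 1 MB = 1024 * 1024 bytes.
double bytes_to_mb(std::uint64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

Note that this reports the peak rather than the current usage, so it can only grow. For spotting a leak during a long run that is usually enough: a value that keeps climbing at every output step points to the problem, while a value that never moves (like the 17.3047 MB in the log below) suggests the measurement itself is not working.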
On Tue, 29 Mar 2011, Dong Luo wrote:
> I have the same problem: NAMD 2.7/2.8b1 runs for only a limited number of steps
> on BlueGeneL. The number of steps depends only on the system size and is
> quite repeatable for the same system. It looks like a memory issue. Both NAMD
> 2.7 and 2.8b1 were compiled according to http://bluegene.bnl.gov/comp/buildnamd.html.
>
> NAMD 2.7 hangs with the error message:
> "FATAL ERROR: Memory allocation failed on processor 0."
>
> NAMD 2.8b1 does not print any error message. The only warning is about
> binary file conversion. Below are parts of the log:
>
> Charm++> Running on MPI version: 2.0 multi-thread support: 0 (max supported: -1)
> [0] isomalloc.c> Disabling isomalloc because no free virtual address space
> Charm++> Running on 128 unique compute nodes (1-way SMP).
> Info: NAMD CVS-2011-03-28 for BlueGeneL-MPI
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60303 for mpi-bluegenel-xlc
> Info: Built Mon Mar 28 14:50:58 EDT 2011 by dongluo on lee
> Info: Running on 256 processors, 256 nodes, 128 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 17.6502 s
> Info: 17.3047 MB of memory in use based on /proc/self/stat
>
> Info: STRUCTURE SUMMARY:
> Info: 179021 ATOMS
> Info: 139534 BONDS
> Info: 156216 ANGLES
> Info: 163705 DIHEDRALS
> Info: 1990 IMPROPERS
> Info: 0 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 537063 DEGREES OF FREEDOM
> Info: 63886 HYDROGEN GROUPS
> Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
> Info: 63886 MIGRATION GROUPS
> Info: 4 ATOMS IN LARGEST MIGRATION GROUP
> Info: TOTAL MASS = 1.06581e+06 amu
> Info: TOTAL CHARGE = 5.83753e-05 e
> Info: MASS DENSITY = 1.04144 g/cm^3
> Info: ATOM DENSITY = 0.105341 atoms/A^3
>
> Info: Entering startup at 55.472 s, 17.3047 MB of memory in use
> Info: Startup phase 0 took 0.000834967 s, 17.3047 MB of memory in use
> Info: Startup phase 1 took 3.16115 s, 17.3047 MB of memory in use
> Info: Startup phase 2 took 0.00247623 s, 17.3047 MB of memory in use
> Info: Startup phase 3 took 0.000712251 s, 17.3047 MB of memory in use
> Info: PATCH GRID IS 15 (PERIODIC) BY 7 (PERIODIC) BY 6 (PERIODIC)
> Info: PATCH GRID IS 2-AWAY BY 1-AWAY BY 1-AWAY
> Info: LARGEST PATCH (112) HAS 330 ATOMS
> Info: Startup phase 4 took 0.566039 s, 17.3047 MB of memory in use
> Info: PME using 68 and 63 processors for FFT and reciprocal sum.
> Info: PME GRID LOCATIONS: 3 7 11 15 19 23 27 31 35 39 ...
> Info: PME TRANS LOCATIONS: 1 5 9 13 17 21 25 29 33 37 ...
> Info: Startup phase 5 took 0.111723 s, 17.3047 MB of memory in use
> Info: Startup phase 6 took 0.101994 s, 17.3047 MB of memory in use
> LDB: Central LB being created...
> Info: Startup phase 7 took 1.2017 s, 17.3047 MB of memory in use
> Info: CREATING 19170 COMPUTE OBJECTS
> Info: useSync: 0 useProxySync: 0
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 769 POINTS
>
> Info: Benchmark time: 256 CPUs 0.0585523 s/step 0.677689 days/ns 17.3047 MB memory
> Info: Benchmark time: 256 CPUs 0.0526625 s/step 0.60952 days/ns 17.3047 MB memory
> Info: Benchmark time: 256 CPUs 0.0528017 s/step 0.611131 days/ns 17.3047 MB memory
>
> The last position output (seq=23000) takes 0.122 seconds, 17.305 MB of memory in use
> WRITING VELOCITIES TO RESTART FILE AT STEP 23000
> FINISHED WRITING RESTART VELOCITIES
> The last velocity output (seq=23000) takes 0.105 seconds, 17.305 MB of memory in use
>
> The above is the last message in the log. Only 23000 steps were run before the hang.
>
> Dong
>
>
>