Re: NAMD 2.7/2.8b1 and BlueGeneL

From: Dong Luo (us917_at_yahoo.com)
Date: Wed Apr 13 2011 - 16:46:00 CDT

Hi Jim,

The memory issue seems to be fixed. namd2 newly compiled from the latest CVS code runs OK on BG/L. A test run of the same system (~180k atoms) that used to last only 23k steps has now run for 44k steps and is still going well.

As for src/memusage.C, I replaced the original memusage_proc_self_stat with:

#include <rts.h>

inline unsigned long memusage_proc_self_stat() {
  static int failed_once = 0;
  if ( failed_once ) return 0;  // no point in retrying
  size_t freeMemory;
  if (rts_getavailablememory(&freeMemory)) {
    failed_once = 1;
    return 0;
  }
  BGLPersonality personality;
  unsigned long vsz = 0;
  if (BGLPersonality_virtualNodeMode(&personality))
    vsz = 256*1024*1024 - freeMemory;
  else
    vsz = 512*1024*1024 - freeMemory;
  if (vsz <= 0) {
    vsz = 0;
    failed_once = 1;
  }
  return vsz;
}

Now the log reports:

Info: Benchmark time: 256 CPUs 0.0579277 s/step 0.67046 days/ns 394.745 MB memory
Info: Benchmark time: 256 CPUs 0.0533917 s/step 0.61796 days/ns 650.745 MB memory
Info: Benchmark time: 256 CPUs 0.0534542 s/step 0.618683 days/ns 650.745 MB memory
...
The last position output (seq=2044000) takes 0.166 seconds, 255.118 MB of memory in use
WRITING VELOCITIES TO RESTART FILE AT STEP 2044000
FINISHED WRITING RESTART VELOCITIES
The last velocity output (seq=2044000) takes 0.120 seconds, 511.118 MB of memory in use

Before, it only reported "17.305 MB". I'm not sure whether this can be attributed to my change, because I did not do a control test with the original function.

Dong

________________________________
From: Jim Phillips <jim_at_ks.uiuc.edu>
To: Dong Luo <us917_at_yahoo.com>
Cc: namd-l_at_ks.uiuc.edu
Sent: Wednesday, April 13, 2011 9:46 AM
Subject: Re: namd-l: NAMD 2.7/2.8b1 and BlueGeneL

Hi Dong,

There was a memory leak in position (restart or trajectory) output that was fixed on April 10, so you are probably right about this being a memory issue. As you can tell, the /proc/self/stat memory usage numbers are incorrect on BG/L.
You may want to try modifying src/memusage.C to use something that works better.

-Jim

On Tue, 29 Mar 2011, Dong Luo wrote:

> I have the same problem: NAMD 2.7/2.8b1 only runs for a limited number of
> steps on the BlueGeneL platform. The number of steps depends only on the
> system size and is quite repeatable for the same system. It looks like a
> memory issue. Both NAMD 2.7 and 2.8b1 were compiled according to
> http://bluegene.bnl.gov/comp/buildnamd.html.
>
> NAMD 2.7 hangs with the error message:
> "FATAL ERROR: Memory allocation failed on processor 0."
>
> NAMD 2.8b1 does not give any error message. The only warning message is
> about binary file conversion. Below are parts of the log:
>
> Charm++> Running on MPI version: 2.0 multi-thread support: 0 (max supported: -1)
> [0] isomalloc.c> Disabling isomalloc because no free virtual address space
> Charm++> Running on 128 unique compute nodes (1-way SMP).
> Info: NAMD CVS-2011-03-28 for BlueGeneL-MPI
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60303 for mpi-bluegenel-xlc
> Info: Built Mon Mar 28 14:50:58 EDT 2011 by dongluo on lee
> Info: Running on 256 processors, 256 nodes, 128 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 17.6502 s
> Info: 17.3047 MB of memory in use based on /proc/self/stat
>
> Info: STRUCTURE SUMMARY:
> Info: 179021 ATOMS
> Info: 139534 BONDS
> Info: 156216 ANGLES
> Info: 163705 DIHEDRALS
> Info: 1990 IMPROPERS
> Info: 0 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 537063 DEGREES OF FREEDOM
> Info: 63886 HYDROGEN GROUPS
> Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
> Info: 63886 MIGRATION GROUPS
> Info: 4 ATOMS IN LARGEST MIGRATION GROUP
> Info: TOTAL MASS = 1.06581e+06 amu
> Info: TOTAL CHARGE = 5.83753e-05 e
> Info: MASS DENSITY = 1.04144 g/cm^3
> Info: ATOM DENSITY = 0.105341 atoms/A^3
>
> Info: Entering startup at 55.472 s, 17.3047 MB of memory in use
> Info: Startup phase 0 took 0.000834967 s, 17.3047 MB of memory in use
> Info: Startup phase 1 took 3.16115 s, 17.3047 MB of memory in use
> Info: Startup phase 2 took 0.00247623 s, 17.3047 MB of memory in use
> Info: Startup phase 3 took 0.000712251 s, 17.3047 MB of memory in use
> Info: PATCH GRID IS 15 (PERIODIC) BY 7 (PERIODIC) BY 6 (PERIODIC)
> Info: PATCH GRID IS 2-AWAY BY 1-AWAY BY 1-AWAY
> Info: LARGEST PATCH (112) HAS 330 ATOMS
> Info: Startup phase 4 took 0.566039 s, 17.3047 MB of memory in use
> Info: PME using 68 and 63 processors for FFT and reciprocal sum.
> Info: PME GRID LOCATIONS: 3 7 11 15 19 23 27 31 35 39 ...
> Info: PME TRANS LOCATIONS: 1 5 9 13 17 21 25 29 33 37 ...
> Info: Startup phase 5 took 0.111723 s, 17.3047 MB of memory in use
> Info: Startup phase 6 took 0.101994 s, 17.3047 MB of memory in use
> LDB: Central LB being created...
> Info: Startup phase 7 took 1.2017 s, 17.3047 MB of memory in use
> Info: CREATING 19170 COMPUTE OBJECTS
> Info: useSync: 0 useProxySync: 0
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 769 POINTS
>
> Info: Benchmark time: 256 CPUs 0.0585523 s/step 0.677689 days/ns 17.3047 MB memory
> Info: Benchmark time: 256 CPUs 0.0526625 s/step 0.60952 days/ns 17.3047 MB memory
> Info: Benchmark time: 256 CPUs 0.0528017 s/step 0.611131 days/ns 17.3047 MB memory
>
> The last position output (seq=23000) takes 0.122 seconds, 17.305 MB of memory in use
> WRITING VELOCITIES TO RESTART FILE AT STEP 23000
> FINISHED WRITING RESTART VELOCITIES
> The last velocity output (seq=23000) takes 0.105 seconds, 17.305 MB of memory in use
>
> The above is the last message in the log. Only 23000 steps are run before
> hanging.
>
> Dong
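
A note on the replacement memusage_proc_self_stat posted in this thread: as written, it queries `personality` without ever filling it in, and because `vsz` is unsigned, the `vsz <= 0` check only catches an exact zero. If the free-memory figure is larger than the assumed total, the subtraction wraps around to a very large number. The sketch below separates the arithmetic into a small helper that clamps instead of wrapping. The helper name `bgl_used_bytes` is hypothetical and not part of NAMD; the 256 MB and 512 MB per-process totals are the ones assumed in the post, and the BG/L calls appear only in comments because they are an untested assumption about that platform's rts.h.

```cpp
#include <cstddef>

// Hypothetical helper, not part of NAMD: bytes in use by one process,
// given the free-memory figure reported by the runtime.
// The per-process totals (256 MB in virtual node mode, 512 MB otherwise)
// are the values assumed in the post, not measured here.
inline unsigned long bgl_used_bytes(bool virtual_node_mode,
                                    std::size_t free_bytes) {
  const unsigned long mb = 1024UL * 1024UL;
  const unsigned long total = (virtual_node_mode ? 256UL : 512UL) * mb;
  // Check before subtracting: unsigned arithmetic would otherwise wrap
  // around to a huge value instead of going negative.
  if (free_bytes >= total) return 0;
  return total - static_cast<unsigned long>(free_bytes);
}

// On BG/L the caller might look roughly like this (assumption, untested):
//   BGLPersonality personality;
//   rts_get_personality(&personality, sizeof(personality));  // fill in first
//   size_t freeMemory;
//   if (rts_getavailablememory(&freeMemory)) return 0;
//   return bgl_used_bytes(BGLPersonality_virtualNodeMode(&personality),
//                         freeMemory);
```

Keeping the subtraction in a plain function also makes it possible to check the clamping on an ordinary workstation, without access to a BG/L node.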

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:56:59 CST