Re: NAMD2.7 on BluegeneL hang at "LDB: Central LB being created..."

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Wed Apr 13 2011 - 10:39:41 CDT

There was a memory leak in trajectory/restart output that was introduced
after 2.7 and fixed April 10 (so it is in 2.8b1 also). That could explain
some of what you are seeing.

-Jim

On Tue, 12 Apr 2011, Dong Luo wrote:

> In case someone researches this problem in the future, I am adding another problem that appeared during simulations with NAMD 2.7 on BlueGene/L within the safe range (i.e., running 250k steps per simulation for the 50k-atom system). A simulation may randomly stop with the error: [0] processControlPoints() haveControlPointChangeCallback=0 frameworkShouldAdvancePhase=0.
It does not appear often, and a repeat of the same simulation normally runs fine.
A rough search shows that the error is thrown by Charm++.
Again, NAMD and Charm++ were compiled from the CVS code of 03/03/2011.

Dong

________________________________
From: Dong Luo <us917_at_yahoo.com>
To: Dong Luo <us917_at_yahoo.com>
Cc: namd-l_at_ks.uiuc.edu
Sent: Friday, March 4, 2011 11:40 AM
Subject: Re: namd-l: NAMD2.7 on BluegeneL hang at "LDB: Central LB being created..."

Well, disabling virtual node mode does not really solve the problem; it only delays the occurrence of "FATAL ERROR: Memory allocation failed on processor 0." With 128 physical nodes, the error occurred after about 270000 steps, compared to 121000 steps with 256 virtual nodes, for the test system with 50k atoms. For another, bigger simulation system with 170k atoms, this FATAL ERROR happened after only 20000 steps. NAMD 2.6 has no such problem.
Each Bluegene/L node consists of a dual-core 32-bit PPC440 processor (700 MHz) with 512 MB of main memory. Each node has a 32 KB L1 cache, a 2 KB L2 cache, and a 4 MB L3 cache.

Dong

________________________________
From: Dong Luo <us917_at_yahoo.com>
To: Chris Harrison <charris5_at_gmail.com>
Cc: namd-l_at_ks.uiuc.edu
Sent: Fri, March 4, 2011 9:08:38 AM
Subject: Re: namd-l: NAMD2.7 on BluegeneL hang at "LDB: Central LB being created..."

I was not clear. I'm using the CVS version of Charm++, but I modified the configure file to skip the MPI test; otherwise it refuses to compile.
 
But now I have run into another problem with the freshly compiled namd2.
With virtual node mode enabled, the simulation fails with "FATAL ERROR: Memory allocation failed on processor 0." at about step 121000 (repeatable). The simulation system contains only 50905 atoms. Disabling virtual node mode solves the problem but slows the calculation from "256 CPUs 0.0204817 s/step 0.237057 days/ns" to "128 CPUs 0.0299266 s/step 0.346373 days/ns". Each physical node has two CPU cores, which is why 128 nodes count as 256 CPUs in virtual node mode.
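
As a cross-check on the two timings quoted above, the days/ns figures can be reproduced from the s/step figures. This is only a minimal sketch: the 1 fs timestep is an assumption (it is not stated in the thread, but it is consistent with both reported numbers), and the function and variable names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def days_per_ns(seconds_per_step, timestep_fs=1.0):
    """Convert wall-clock seconds per MD step into days per simulated ns.

    timestep_fs is an ASSUMPTION (1 fs); the thread does not state it.
    """
    steps_per_ns = 1.0e6 / timestep_fs  # 1 ns = 1e6 fs
    return seconds_per_step * steps_per_ns / SECONDS_PER_DAY


# s/step values as reported in the message above.
virtual_node_on = 0.0204817   # 256 CPUs (virtual node mode enabled)
virtual_node_off = 0.0299266  # 128 CPUs (virtual node mode disabled)

# Gain from doubling the process count on the same 128 nodes.
speedup = virtual_node_off / virtual_node_on
efficiency = speedup / 2.0  # ideal speedup would be 2x

print(f"virtual node on : {days_per_ns(virtual_node_on):.6f} days/ns")
print(f"virtual node off: {days_per_ns(virtual_node_off):.6f} days/ns")
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under that assumption, virtual node mode gives roughly a 1.46x speedup on the same hardware, so giving it up costs a little under a third of the throughput.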
 
Dong

________________________________
From: Chris Harrison <charris5_at_gmail.com>
To: Dong Luo <us917_at_yahoo.com>
Cc: akohlmey_at_gmail.com; namd-l_at_ks.uiuc.edu
Sent: Thu, March 3, 2011 9:08:20 PM
Subject: Re: namd-l: NAMD2.7 on BluegeneL hang at "LDB: Central LB being created..."

Are you really using Charm++ 2.2?! Is there a reason? 

This may work for you, but you should really upgrade to Charm++
6.2.1 or later when possible.  Otherwise you're missing improvements
to performance from the more recent Charm++ versions.

Best,
Chris

--
Chris Harrison, Ph.D.
Theoretical and Computational Biophysics Group
NIH Resource for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
char_at_ks.uiuc.edu                          Voice: 217-244-1733
http://www.ks.uiuc.edu/%7Echar              Fax:  217-244-6078
Dong Luo <us917_at_yahoo.com> writes:
> Date: Thu, 3 Mar 2011 17:59:10 -0800 (PST)
> From: Dong Luo <us917_at_yahoo.com>
> To: Chris Harrison <charris5_at_gmail.com>, akohlmey_at_gmail.com
> Cc: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: NAMD2.7 on BluegeneL hang at "LDB: Central LB being
>  created..."
> 
> Chris, the CVS versions of namd/charm++ work. I only had to comment out the MPI 
> check in the charm++ configure file because it fails on Bluegene/L. MPI is 
> not checked in charm++ 2.2.
> 
> Axel, namd/charm++ are cross-compiled on Bluegene/L because the login host uses 
> a different OS than the cluster nodes. I have not figured out a way to test 
> charm++.
> 
> 
> Dong
> 
>  
> 
> ________________________________
> From: Chris Harrison <charris5_at_gmail.com>
> To: Dong Luo <us917_at_yahoo.com>
> Cc: namd-l_at_ks.uiuc.edu
> Sent: Thu, March 3, 2011 1:41:50 AM
> Subject: Re: namd-l: NAMD2.7 on BluegeneL hang at "LDB: Central LB being 
> created..."
> 
> We've made recent improvements to startup and load-balancing.  Can you
> try the CVS version or one of the nightly builds of namd, with the most 
> recent git archive or nightly build of charm++?
> 
> Best,
> Chris
> 
> 
> --
> Chris Harrison, Ph.D.
> Theoretical and Computational Biophysics Group
> NIH Resource for Macromolecular Modeling and Bioinformatics
> Beckman Institute for Advanced Science and Technology
> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
> 
> char_at_ks.uiuc.edu                          Voice: 217-244-1733
> http://www.ks.uiuc.edu/%7Echar
> > 
> > However, simulations using this namd2 2.7 version (with or without colvars) 
> > always hang after Startup phase 5, as shown in the log:
> > "
> > Info: REMOVING COM VELOCITY 0.0209799 0.0192793 0.000362722
> > Info: LARGEST PATCH (156) HAS 345 ATOMS
> > Info: Startup phase 3 took 0.246489 s, 17.3047 MB of memory in use
> > Info: PME using 40 and 32 processors for FFT and reciprocal sum.
> > Info: PME GRID LOCATIONS: 7 15 23 27 31 39 47 55 59 63 ...
> > Info: PME TRANS LOCATIONS: 3 11 19 29 35 43 51 61 67 75 ...
> > Info: Startup phase 4 took 0.00254185 s, 17.3047 MB of memory in use
> > Info: Startup phase 5 took 0.0261579 s, 17.3047 MB of memory in use
> > LDB: Central LB being created...
> > "
> > The namd2 2.6 version runs normally but, I assume, lacks the colvars feature.
> > 
> > Any directions?
> > 
> > Thank you.
> > 
> > Dong
> > 
> > 
> > 
> >      
> 
> 
>     

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:20:08 CST