Re: NAMD2.7 on BluegeneL hang at "LDB: Central LB being created..."

From: Dong Luo (us917_at_yahoo.com)
Date: Tue Apr 12 2011 - 09:44:09 CDT

In case someone researches this problem in the future, here is another problem that appeared during simulations with NAMD 2.7 on BlueGene/L within the safe range (i.e., running 250k steps per simulation for the 50k-atom system). Occasionally a simulation stops with the error:

[0] processControlPoints() haveControlPointChangeCallback=0 frameworkShouldAdvancePhase=0

It does not appear often, and repeating the same simulation normally runs fine. A rough search shows that the error is thrown by Charm++. Again, NAMD and Charm++ were compiled from the CVS code of 03/03/2011.

Dong

________________________________
From: Dong Luo <us917_at_yahoo.com>
To: Dong Luo <us917_at_yahoo.com>
Cc: namd-l_at_ks.uiuc.edu
Sent: Friday, March 4, 2011 11:40 AM
Subject: Re: namd-l: NAMD2.7 on BluegeneL hang at "LDB: Central LB being created..."

Well, disabling virtual node mode does not really solve the problem; it only delays the occurrence of "FATAL ERROR: Memory allocation failed on processor 0." For the test system with 50k atoms, the error occurred after about 270000 steps with 128 physical nodes, compared to 121000 steps with 256 virtual nodes. For another, bigger simulation system with 170k atoms, this FATAL ERROR happened after only 20000 steps. NAMD 2.6 has no such problem.

Each BlueGene/L node consists of dual-core 32-bit PPC440 processors (700 MHz) with 512 MB of main memory. Each node has a 32 KB L1 cache, a 2 KB L2 cache, and a 4 MB L3 cache.

Dong

________________________________
From: Dong Luo <us917_at_yahoo.com>
To: Chris Harrison <charris5_at_gmail.com>
Cc: namd-l_at_ks.uiuc.edu
Sent: Fri, March 4, 2011 9:08:38 AM
Subject: Re: namd-l: NAMD2.7 on BluegeneL hang at "LDB: Central LB being created..."

I didn't say it clearly. I'm using the CVS version of Charm++, but I modified the configure file to skip the MPI test; otherwise it refuses to compile.

But now I have run into another problem with the freshly compiled namd2. With virtual node mode enabled, the simulation fails with "FATAL ERROR: Memory allocation failed on processor 0." at about step 121000 (repeatable). The simulation system contains only 50905 atoms. Disabling virtual node mode solves the problem but slows the calculation from "256 CPUs 0.0204817 s/step 0.237057 days/ns" to "128 CPUs 0.0299266 s/step 0.346373 days/ns". Each physical node has 2 cores, which is why 128 nodes are counted as 256 CPUs in virtual node mode.

Dong

________________________________
From: Chris Harrison <charris5_at_gmail.com>
To: Dong Luo <us917_at_yahoo.com>
Cc: akohlmey_at_gmail.com; namd-l_at_ks.uiuc.edu
Sent: Thu, March 3, 2011 9:08:20 PM
Subject: Re: namd-l: NAMD2.7 on BluegeneL hang at "LDB: Central LB being created..."

Are you really using Charm++ 2.2?! Is there a reason? This may work for you, but you should really upgrade to Charm++ 6.2.1 or later when possible. Otherwise you're missing the performance improvements in the more recent Charm++ versions.

Best,
Chris

--
Chris Harrison, Ph.D.
Theoretical and Computational Biophysics Group
NIH Resource for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801

char_at_ks.uiuc.edu                          Voice: 217-244-1733
http://www.ks.uiuc.edu/%7Echar              Fax:  217-244-6078

Dong Luo <us917_at_yahoo.com> writes:
> Date: Thu, 3 Mar 2011 17:59:10 -0800 (PST)
> From: Dong Luo <us917_at_yahoo.com>
> To: Chris Harrison <charris5_at_gmail.com>, akohlmey_at_gmail.com
> Cc: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: NAMD2.7 on BluegeneL hang at "LDB: Central LB being created..."
>
> Chris, the CVS versions of namd/charm++ work. I only had to comment out the MPI
> check in the configure file of charm++ because it fails on Bluegene/L. It is
> not checked in charm++ 2.2.
>
> Axel, namd/charm++ are cross-compiled on Bluegene/L because the login host uses
> a different OS from the cluster nodes. I did not figure out a way to test
> charm++.
>
> Dong
>
> ________________________________
> From: Chris Harrison <charris5_at_gmail.com>
> To: Dong Luo <us917_at_yahoo.com>
> Cc: namd-l_at_ks.uiuc.edu
> Sent: Thu, March 3, 2011 1:41:50 AM
> Subject: Re: namd-l: NAMD2.7 on BluegeneL hang at "LDB: Central LB being created..."
>
> We've made recent improvements to startup and load-balancing. Can you
> try the CVS version or one of the nightly builds of namd, with the most
> recent git archive or nightly build of charm++?
>
> Best,
> Chris
>
> > However, the simulation (with or without colvars) using this namd2 2.7
> > version always hangs after Startup phase 5, as shown in the log:
> > "
> > Info: REMOVING COM VELOCITY 0.0209799 0.0192793 0.000362722
> > Info: LARGEST PATCH (156) HAS 345 ATOMS
> > Info: Startup phase 3 took 0.246489 s, 17.3047 MB of memory in use
> > Info: PME using 40 and 32 processors for FFT and reciprocal sum.
> > Info: PME GRID LOCATIONS: 7 15 23 27 31 39 47 55 59 63 ...
> > Info: PME TRANS LOCATIONS: 3 11 19 29 35 43 51 61 67 75 ...
> > Info: Startup phase 4 took 0.00254185 s, 17.3047 MB of memory in use
> > Info: Startup phase 5 took 0.0261579 s, 17.3047 MB of memory in use
> > LDB: Central LB being created...
> > "
> > namd2 2.6 version can run normally, but lacks the colvars function I assume.
> >
> > Any directions?
> >
> > Thank you.
> >
> > Dong
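
For readers comparing the two benchmark lines quoted in the thread, the s/step and days/ns figures can be cross-checked with a little arithmetic. The sketch below is not NAMD code; the function name days_per_ns and the 1 fs timestep are assumptions made here (the thread does not state the timestep, but 1 fs reproduces the quoted days/ns values). It also estimates how much virtual node mode gained over running one process per node.

```python
SECONDS_PER_DAY = 86_400


def days_per_ns(sec_per_step, timestep_fs=1.0):
    """Convert wall-clock seconds per step into days per simulated nanosecond.

    timestep_fs is an assumption: the thread never states it, but 1 fs
    reproduces the days/ns values quoted from the NAMD log.
    """
    steps_per_ns = 1_000_000 / timestep_fs  # 1 ns = 1,000,000 fs
    return sec_per_step * steps_per_ns / SECONDS_PER_DAY


# Timings quoted in the thread.
vn_sec_per_step = 0.0204817  # 256 CPUs, virtual node mode
co_sec_per_step = 0.0299266  # 128 CPUs, one process per node

vn_days = days_per_ns(vn_sec_per_step)
co_days = days_per_ns(co_sec_per_step)

# Speedup from doubling the process count, and the resulting efficiency.
speedup = co_sec_per_step / vn_sec_per_step
efficiency = speedup / 2

print(f"virtual node mode: {vn_days:.6f} days/ns")
print(f"one process/node:  {co_days:.6f} days/ns")
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under these assumptions, doubling the process count through virtual node mode bought roughly a 1.46x speedup, so giving it up costs real throughput but less than a factor of two.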

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:56:58 CST