Re: NAMD 2.7/2.8b2 stuck - [0] processControlPoints() haveControlPointChangeCallback=0 frameworkShouldAdvancePhase=0

From: Bjoern Olausson (namdlist_at_googlemail.com)
Date: Thu May 19 2011 - 04:00:33 CDT

Thanks a lot!

Happy cheers,
Bjoern

On Thursday 19 May 2011 00:58:16 Jim Phillips wrote:
> The fix is checked in now, so yes, it will make the gold-plated 2.8 (as
> well as the nightly builds).
>
> Half-empty periodic cell without multiple timestepping on large processor
> counts and a large enough system for pencil PME is rather uncommon.
>
> -Jim
>
> On Wed, 18 May 2011, Bjoern Olausson wrote:
> > I guess the empty patches were due to the vacuum space at +/-Z,
> > and since langevinPiston is not used in the NVT ensemble I ran straight
> > into this bug ;-)
> > Still wondering why I was the first one - Is my setup so weird?
> >
> > Thanks for fixing this.
> > Do you plan to include the fix in the 2.8 gold release?
> >
> > Cheers,
> > Bjoern
> >
> > On Tue, May 17, 2011 at 20:21, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
> >> Hi again,
> >>
> >> Thanks. This is a race condition that can occur in pencil PME (only)
> >> when all of the patches on a processor are completely empty. This
> >> situation does not occur in normal periodic simulations. The bug will
> >> not occur if langevinPiston is used and is much less likely to occur if
> >> multiple timestepping is used. The bug will always produce a hanging
> >> simulation, never incorrect results. I am working on a fix.
> >>
> >> -Jim
> >>
> >> On Mon, 16 May 2011, Bjoern Olausson wrote:
> >>> Well, the system is a symmetrical monolayer setup with some vacuum
> >>> space in +Z and -Z direction, so I would expect the global density to
> >>> be significantly lower than in a "general" solvated system.
> >>> The local density e.g. for water, after some equilibration steps,
> >>> should be around 1 g/cm^3.
> >>>
> >>> Sure I can send you the input files.
> >>>
> >>> Cheers,
> >>> Bjoern
> >>>
> >>> On Mon, May 16, 2011 at 15:57, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
> >>>> Hi again,
> >>>>
> >>>> Thanks. Is there a reason your system has half the typical density
> >>>> for a solvated periodic system? Can you point me to the input files
> >>>> so I can try to reproduce this myself?
> >>>>
> >>>> -Jim
> >>>>
> >>>> On Mon, 16 May 2011, Bjoern Olausson wrote:
> >>>>> Here are the requested test results with NAMD 2.8b2.
> >>>>>
> >>>>> One is particularly interesting. The setup with "twoAwayX yes", PME
> >>>>> and 264 cores did not fail consistently. Out of 4 tries it stalled
> >>>>> only two times.
> >>>>>
> >>>>> The timestep NAMD stalls on is not consistent either.
> >>>>>
> >>>>> Without PME there were no problems at all.
> >>>>>
> >>>>> Please find all relevant data here:
> >>>>> http://daten-transport.de/?id=8kNGP4tykfLp
> >>>>>
> >>>>> If you need more information, don't hesitate to ask.
> >>>>>
> >>>>> Cheers,
> >>>>> Bjoern
> >>>>>
> >>>>> On Saturday 14 May 2011 20:56:26 Jim Phillips wrote:
> >>>>>> 2.8b2 would be best. -Jim
> >>>>>>
> >>>>>> On Sat, 14 May 2011, Bjoern Olausson wrote:
> >>>>>>> Should I run those tests with 2.8b2 or are you satisfied with 2.7?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Bjoern
> >>>>>>>
> >>>>>>> On Fri, May 13, 2011 at 21:52, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
> >>>>>>>> Thanks. It looks like the switch from PME slabs to pencils
> >>>>>>>> happens between 144 and 156, but there's no obvious change from
> >>>>>>>> 252 to 264. The 264-core run goes for over 1000 steps so it's not
> >>>>>>>> a deterministic problem.
> >>>>>>>>
> >>>>>>>> Please try, for the two failing cases, first adding outputTiming 1
> >>>>>>>> so that we'll know what timestep it actually hangs on, and then
> >>>>>>>> turning off PME so that we can tell if there's a connection to PME
> >>>>>>>> or not.
> >>>>>>>>
> >>>>>>>> -Jim
> >>>>>>>>
> >>>>>>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> While tuning the "twoAway" options, the simulation which stalled
> >>>>>>>>> on 156 cores now stalled on 264 cores.
> >>>>>>>>> With twoAwayX, twoAwayY, twoAwayZ all set to NO it stalls on 156
> >>>>>>>>> cores.
> >>>>>>>>> With twoAwayX set to YES and twoAwayY, twoAwayZ set to NO it
> >>>>>>>>> stalls on 264 cores.
> >>>>>>>>>
> >>>>>>>>> (This was tested with NAMD 2.7, but I guess 2.8 will behave the
> >>>>>>>>> same way.) Please find the corresponding logs at the following
> >>>>>>>>> link: http://daten-transport.de/?id=7qK3HdCVnM7W
> >>>>>>>>> (namd-logs.tar.bz2, 584.5 kilobytes)
> >>>>>>>>>
> >>>>>>>>> Cheers and many thanks,
> >>>>>>>>> Bjoern
> >>>>>>>>>
> >>>>>>>>> On Fri, May 13, 2011 at 15:20, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Please send me the complete log file for the largest working and
> >>>>>>>>>> smallest hanging runs (I guess that's 144 and 156 cores).
> >>>>>>>>>>
> >>>>>>>>>> -Jim
> >>>>>>>>>>
> >>>>>>>>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> With one of my simulations I ran into the following problem.
> >>>>>>>>>>> Running simulation "B" on fewer than 156 cores works fine
> >>>>>>>>>>> (each try incremented by 12 cores).
> >>>>>>>>>>> But with 156 cores the simulation hangs after minimization.
> >>>>>>>>>>> Another, bigger simulation "A" runs fine with 156 cores but
> >>>>>>>>>>> stalls with 252.
> >>>>>>>>>>>
> >>>>>>>>>>> I am using
> >>>>>>>>>>> NAMD_2.8b2_Linux-x86_64-ibverbs-net-linux-x86_64-ibverbs-icc
> >>>>>>>>>>> currently, but the same happens with NAMD 2.7:
> >>>>>>>>>>>
> >>>>>>>>>>> Simulation A is a monolayer (Vacuum | Monolayer with attached
> >>>>>>>>>>> Protein | Water | Monolayer with attached Protein | Vacuum)
> >>>>>>>>>>>
> >>>>>>>>>>> Simulation B is the same but I removed the two proteins and
> >>>>>>>>>>> some water between the two monolayers.
> >>>>>>>>>>>
> >>>>>>>>>>> A has 163214 Atoms
> >>>>>>>>>>> B has 79687 Atoms
> >>>>>>>>>>>
> >>>>>>>>>>> I can't find a reason why it happens at a certain core count.
> >>>>>>>>>>>
> >>>>>>>>>>> LINE MINIMIZER BRACKET: DX 2.26297e-05 6.07123e-05 DU -0.112343
> >>>>>>>>>>> 0.803579 DUDX -9856.98 -88.7072 26529.9
> >>>>>>>>>>> LINE MINIMIZER REDUCING GRADIENT FROM 488884 TO 488.884
> >>>>>>>>>>> PRESSURE: 998 -3096.26 0.240235 -2.11389 0.240235 -3036.98
> >>>>>>>>>>> 30.6163 -2.11389 30.6163 -2719.13
> >>>>>>>>>>> GPRESSURE: 998 -3053.97 0.0322738 -2.31931 1.70752 -2997.23
> >>>>>>>>>>> 32.1548
> >>>>>>>>>>> 1.12647 30.6867 -2682.59
> >>>>>>>>>>> ENERGY: 998 5798.1099 9606.5134 11613.1689
> >>>>>>>>>>> 14.3917 -220491.3201 259.2408 0.0000
> >>>>>>>>>>> 0.0000 0.0000 -193199.8954 0.0000
> >>>>>>>>>>> -193199.8954 -193199.8954 0.0000 -2950.7895
> >>>>>>>>>>> -2911.2626
> >>>>>>>>>>>
> >>>>>>>>>>> PRESSURE: 999 -3101.92 0.427017 -1.88108 0.427017 -3029.82
> >>>>>>>>>>> 30.4947 -1.88108 30.4947 -2731.63
> >>>>>>>>>>> GPRESSURE: 999 -3056.02 0.387877 -3.93892 3.00918 -2994.69
> >>>>>>>>>>> 32.1866 0.17135 30.0678 -2692.69
> >>>>>>>>>>> ENERGY: 999 5831.4354 9616.9842 11604.8301
> >>>>>>>>>>> 13.8257 -220677.3820 308.1108 0.0000
> >>>>>>>>>>> 0.0000 0.0000 -193302.1958 0.0000
> >>>>>>>>>>> -193302.1958 -193302.1958 0.0000 -2954.4553
> >>>>>>>>>>> -2914.4624
> >>>>>>>>>>>
> >>>>>>>>>>> PRESSURE: 1000 -3101.92 0.427017 -1.88108 0.427017 -3029.82
> >>>>>>>>>>> 30.4947
> >>>>>>>>>>> -1.88108 30.4947 -2731.63
> >>>>>>>>>>> GPRESSURE: 1000 -3056.02 0.387877 -3.93892 3.00918 -2994.69
> >>>>>>>>>>> 32.1866
> >>>>>>>>>>> 0.171348 30.0678 -2692.69
> >>>>>>>>>>> TIMING: 1000 CPU: 24.3443, 0.0242553/step Wall: 24.388,
> >>>>>>>>>>> 0.0242993/step, 0 hours remaining, 238.144531 MB of memory in use.
> >>>>>>>>>>> ETITLE: TS BOND ANGLE DIHED
> >>>>>>>>>>> IMPRP ELECT VDW BOUNDARY
> >>>>>>>>>>> MISC KINETIC TOTAL TEMP POTENTIAL
> >>>>>>>>>>> TOTAL3 TEMPAVG PRESSURE GPRESSURE
> >>>>>>>>>>> ENERGY: 1000 5831.4354 9616.9842 11604.8301
> >>>>>>>>>>> 13.8257 -220677.3820 308.1108 0.0000
> >>>>>>>>>>> 0.0000 0.0000 -193302.1958 0.0000
> >>>>>>>>>>> -193302.1958 -193302.1958 0.0000 -2954.4553
> >>>>>>>>>>> -2914.4624
> >>>>>>>>>>>
> >>>>>>>>>>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 1000
> >>>>>>>>>>> WRITING COORDINATES TO DCD FILE AT STEP 1000
> >>>>>>>>>>> WRITING COORDINATES TO RESTART FILE AT STEP 1000
> >>>>>>>>>>> FINISHED WRITING RESTART COORDINATES
> >>>>>>>>>>> The last position output (seq=1000) takes 0.026 seconds,
> >>>>>>>>>>> 238.145 MB
> >>>>>>>>>>> of memory in use
> >>>>>>>>>>> WRITING VELOCITIES TO RESTART FILE AT STEP 1000
> >>>>>>>>>>> FINISHED WRITING RESTART VELOCITIES
> >>>>>>>>>>> The last velocity output (seq=1000) takes 0.019 seconds,
> >>>>>>>>>>> 238.145 MB
> >>>>>>>>>>> of memory in use
> >>>>>>>>>>> REINITIALIZING VELOCITIES AT STEP 1000 TO 303 KELVIN.
> >>>>>>>>>>> TCL: Running for 9000 steps
> >>>>>>>>>>> PRESSURE: 1000 -1607.18 5.85548 -10.9122 5.85548 -1546.56
> >>>>>>>>>>> 26.3568 -10.9122 26.3568 -886.287
> >>>>>>>>>>> GPRESSURE: 1000 -1469.55 7.5989 -10.7156 10.9579 -1410.74
> >>>>>>>>>>> 22.6426 -10.5674 20.7688 -1127
> >>>>>>>>>>> ETITLE: TS BOND ANGLE DIHED
> >>>>>>>>>>> IMPRP ELECT VDW BOUNDARY
> >>>>>>>>>>> MISC KINETIC TOTAL TEMP POTENTIAL
> >>>>>>>>>>> TOTAL3 TEMPAVG PRESSURE GPRESSURE
> >>>>>>>>>>> ENERGY: 1000 607.1667 6226.7038 11604.6460
> >>>>>>>>>>> 13.8497 -203337.4899 27.6364 0.0000
> >>>>>>>>>>> 0.0000 52831.6131 -132025.8742 303.3486
> >>>>>>>>>>> -184857.4873 -132057.5192 303.3486 -1346.6784
> >>>>>>>>>>> -1335.7638
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> It takes some hours until this message is printed:
> >>>>>>>>>>> [0] processControlPoints() haveControlPointChangeCallback=0
> >>>>>>>>>>> frameworkShouldAdvancePhase=0
> >>>>>>>>>>>
> >>>>>>>>>>> Any clue where I could search?
> >>>>>>>>>>> If you need more information, don't hesitate to ask.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Bjoern
> >>>>>
> >>>>> --
> >>>>> Bjoern Olausson
> >>>>> Martin-Luther-Universität Halle-Wittenberg
> >>>>> Fachbereich Biochemie/Biotechnologie
> >>>>> Kurt-Mothes-Str. 3
> >>>>> 06120 Halle/Saale
> >>>>>
> >>>>> Phone: +49-345-55-24942

-- 
Bjoern Olausson
Martin-Luther-Universität Halle-Wittenberg 
Fachbereich Biochemie/Biotechnologie
Kurt-Mothes-Str. 3
06120 Halle/Saale
Phone: +49-345-55-24942
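
The thread above suggests setting outputTiming 1 so the log shows which
timestep a run actually hangs on. As a rough illustration only, the last
reported step could be pulled out of a log like this. It is a minimal
sketch, assuming the log contains ENERGY: and TIMING: lines shaped like
those quoted in this thread; the function name last_reported_step and the
sample lines are hypothetical helpers for this example, not part of NAMD.

```python
import re

# Matches the step number that follows an ENERGY: or TIMING: tag,
# as seen in the log excerpt quoted in this thread (assumed format).
STEP_LINE = re.compile(r"^(?:ENERGY|TIMING):\s+(\d+)\b")


def last_reported_step(log_lines):
    """Return the highest step seen on ENERGY:/TIMING: lines, or None."""
    last = None
    for line in log_lines:
        match = STEP_LINE.match(line)
        if match:
            step = int(match.group(1))
            if last is None or step > last:
                last = step
    return last


# Hypothetical, shortened sample lines modelled on the quoted log.
sample_log = [
    "ENERGY: 999 5831.4354 9616.9842 11604.8301",
    "TIMING: 1000 CPU: 24.3443, 0.0242553/step Wall: 24.388,",
    "TCL: Running for 9000 steps",
    "ENERGY: 1000 607.1667 6226.7038 11604.6460",
]
print(last_reported_step(sample_log))
```

Running this against the logs of a working and a hanging run would show
how far each one got before output stopped. It only reads the log and
does not detect the hang itself.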

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:20:17 CST