Re: NAMD 2.7/2.8b2 stuck - [0] processControlPoints() haveControlPointChangeCallback=0 frameworkShouldAdvancePhase=0

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Tue May 17 2011 - 13:21:49 CDT

Hi again,

Thanks. This is a race condition that can occur in pencil PME (only) when
all of the patches on a processor are completely empty. This situation
does not occur in normal periodic simulations. The bug will not occur if
langevinPiston is used and is much less likely to occur if multiple
timestepping is used. The bug will always produce a hanging simulation,
never incorrect results. I am working on a fix.

-Jim
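
For readers following the thread: the condition described above (patches that contain no atoms at all) is easy to picture for a slab with vacuum on both sides. The following is only a minimal sketch under stated assumptions, not NAMD code. The function names, the 120 A cell length, the 6 patch layers and the toy coordinates are all hypothetical; the real patch grid is chosen by NAMD and reported in its startup log, and which processor ends up holding which patches is decided by NAMD itself and is not modeled here.

```python
def patch_occupancy(z_coords, z_min, z_len, n_patches):
    """Count atoms in each patch layer along Z.

    z_coords  : iterable of atom Z coordinates (Angstrom)
    z_min     : lower edge of the periodic cell along Z
    z_len     : cell length along Z
    n_patches : number of patch layers along Z (hypothetical value)
    """
    counts = [0] * n_patches
    for z in z_coords:
        # Wrap the coordinate into the periodic cell, then map it to a layer.
        frac = ((z - z_min) / z_len) % 1.0
        idx = min(int(frac * n_patches), n_patches - 1)
        counts[idx] += 1
    return counts


def empty_patches(counts):
    """Return the indices of patch layers that contain no atoms."""
    return [i for i, c in enumerate(counts) if c == 0]


# Toy slab: atoms only between z = -19 and z = +19 in a 120 A cell,
# i.e. vacuum above and below (hypothetical numbers).
z = [-19.0 + 0.5 * i for i in range(77)]
counts = patch_occupancy(z, z_min=-60.0, z_len=120.0, n_patches=6)
print(counts)
print(empty_patches(counts))
```

With these toy numbers the two central layers hold all 77 atoms and layers 0, 1, 4 and 5 come out empty, which is the kind of layout in which a processor could end up with only empty patches.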

On Mon, 16 May 2011, Bjoern Olausson wrote:

> Well, the system is a symmetrical monolayer setup with some vacuum
> space in the +Z and -Z directions, so I would expect the global density
> to be significantly lower than in a "general" solvated system.
> The local density e.g. for water, after some equilibration steps,
> should be around 1 g/cm^3.
>
> Sure I can send you the input files.
>
> Cheers,
> Bjoern
>
> On Mon, May 16, 2011 at 15:57, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>> Hi again,
>>
>> Thanks.  Is there a reason your system has half the typical density for a
>> solvated periodic system?  Can you point me to the input files so I can
>> try to reproduce this myself?
>>
>> -Jim
>>
>> On Mon, 16 May 2011, Bjoern Olausson wrote:
>>
>>> Here are the requested test results with NAMD 2.8b2.
>>>
>>> One is particularly interesting. The setup with "twoAwayX yes", PME and 264
>>> cores did not fail consistently. Out of 4 tries it stalled only twice.
>>>
>>> The timestep NAMD stalls on is not consistent either.
>>>
>>> Without PME there were no problems at all.
>>>
>>> Please find all relevant data here:
>>> http://daten-transport.de/?id=8kNGP4tykfLp
>>>
>>> If you need more information, don't hesitate to ask.
>>>
>>> Cheers,
>>> Bjoern
>>>
>>> On Saturday 14 May 2011 20:56:26 Jim Phillips wrote:
>>>>
>>>> 2.8b2 would be best.  -Jim
>>>>
>>>> On Sat, 14 May 2011, Bjoern Olausson wrote:
>>>>>
>>>>> Should I run those tests with 2.8b2 or are you satisfied with 2.7?
>>>>>
>>>>> Cheers,
>>>>> Bjoern
>>>>>
>>>>> On Fri, May 13, 2011 at 21:52, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>>>>
>>>>>> Thanks.  It looks like the switch from PME slabs to pencils happens
>>>>>> between 144 and 156, but there's no obvious change from 252 to 264.
>>>>>> The 264-core run goes for over 1000 steps, so it's not a
>>>>>> deterministic problem.
>>>>>>
>>>>>> For the two failing cases, please first try adding outputTiming 1 so
>>>>>> that we'll know what timestep it actually hangs on, and then try
>>>>>> turning off PME so that we can tell if there's a connection to PME
>>>>>> or not.
>>>>>>
>>>>>> -Jim
>>>>>>
>>>>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> While tuning the "twoAway" options, the simulation that stalled on
>>>>>>> 156 cores now stalled on 264 cores.
>>>>>>> With twoAwayX, twoAwayY, and twoAwayZ all set to NO, it stalls on
>>>>>>> 156 cores. With twoAwayX set to YES and twoAwayY and twoAwayZ set
>>>>>>> to NO, it stalls on 264 cores.
>>>>>>>
>>>>>>> (This was tested with NAMD 2.7, but I guess 2.8 will behave the same
>>>>>>> way.) Please find the corresponding logs at the following link:
>>>>>>> http://daten-transport.de/?id=7qK3HdCVnM7W (namd-logs.tar.bz2, 584.5
>>>>>>> kilobytes)
>>>>>>>
>>>>>>> Cheers and many thanks,
>>>>>>> Bjoern
>>>>>>>
>>>>>>> On Fri, May 13, 2011 at 15:20, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Please send me the complete log file for the largest working and
>>>>>>>> smallest hanging runs (I guess that's 144 and 156 cores).
>>>>>>>>
>>>>>>>> -Jim
>>>>>>>>
>>>>>>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> With one of my simulations I ran into the following problem.
>>>>>>>>> Running simulation "B" on fewer than 156 cores works fine (each
>>>>>>>>> try incremented by 12 cores).
>>>>>>>>> But with 156 cores the simulation hangs after minimization. Another,
>>>>>>>>> bigger simulation "A" runs fine with 156 cores but stalls with 252.
>>>>>>>>>
>>>>>>>>> I am currently using
>>>>>>>>> NAMD_2.8b2_Linux-x86_64-ibverbs-net-linux-x86_64-ibverbs-icc,
>>>>>>>>> but the same happens with NAMD 2.7.
>>>>>>>>>
>>>>>>>>> Simulation A is a monolayer (Vacuum | Monolayer with attached Protein |
>>>>>>>>> Water | Monolayer with attached Protein | Vacuum)
>>>>>>>>>
>>>>>>>>> Simulation B is the same but I removed the two proteins and some
>>>>>>>>> water between the two monolayers.
>>>>>>>>>
>>>>>>>>> A has 163214 atoms
>>>>>>>>> B has  79687 atoms
>>>>>>>>>
>>>>>>>>> I can't find a reason why it happens at a certain core count.
>>>>>>>>>
>>>>>>>>> LINE MINIMIZER BRACKET: DX 2.26297e-05 6.07123e-05 DU -0.112343
>>>>>>>>> 0.803579 DUDX -9856.98 -88.7072 26529.9
>>>>>>>>> LINE MINIMIZER REDUCING GRADIENT FROM 488884 TO 488.884
>>>>>>>>> PRESSURE: 998 -3096.26 0.240235 -2.11389 0.240235 -3036.98 30.6163
>>>>>>>>> -2.11389 30.6163 -2719.13
>>>>>>>>> GPRESSURE: 998 -3053.97 0.0322738 -2.31931 1.70752 -2997.23 32.1548
>>>>>>>>> 1.12647 30.6867 -2682.59
>>>>>>>>> ENERGY:     998      5798.1099      9606.5134     11613.1689
>>>>>>>>> 14.3917        -220491.3201       259.2408         0.0000
>>>>>>>>> 0.0000         0.0000        -193199.8954         0.0000
>>>>>>>>> -193199.8954   -193199.8954         0.0000          -2950.7895
>>>>>>>>> -2911.2626
>>>>>>>>>
>>>>>>>>> PRESSURE: 999 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>>>>>>>> -1.88108 30.4947 -2731.63
>>>>>>>>> GPRESSURE: 999 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>>>>>>>> 0.17135 30.0678 -2692.69
>>>>>>>>> ENERGY:     999      5831.4354      9616.9842     11604.8301
>>>>>>>>> 13.8257        -220677.3820       308.1108         0.0000
>>>>>>>>> 0.0000         0.0000        -193302.1958         0.0000
>>>>>>>>> -193302.1958   -193302.1958         0.0000          -2954.4553
>>>>>>>>> -2914.4624
>>>>>>>>>
>>>>>>>>> PRESSURE: 1000 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>>>>>>>> -1.88108 30.4947 -2731.63
>>>>>>>>> GPRESSURE: 1000 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>>>>>>>> 0.171348 30.0678 -2692.69
>>>>>>>>> TIMING: 1000  CPU: 24.3443, 0.0242553/step  Wall: 24.388,
>>>>>>>>> 0.0242993/step, 0 hours remaining, 238.144531 MB of memory in use.
>>>>>>>>> ETITLE:      TS           BOND          ANGLE          DIHED
>>>>>>>>> IMPRP               ELECT            VDW       BOUNDARY
>>>>>>>>> MISC KINETIC               TOTAL           TEMP      POTENTIAL
>>>>>>>>> TOTAL3        TEMPAVG            PRESSURE      GPRESSURE
>>>>>>>>> ENERGY:    1000      5831.4354      9616.9842     11604.8301
>>>>>>>>> 13.8257        -220677.3820       308.1108         0.0000
>>>>>>>>> 0.0000         0.0000        -193302.1958         0.0000
>>>>>>>>> -193302.1958   -193302.1958         0.0000          -2954.4553
>>>>>>>>> -2914.4624
>>>>>>>>>
>>>>>>>>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 1000
>>>>>>>>> WRITING COORDINATES TO DCD FILE AT STEP 1000
>>>>>>>>> WRITING COORDINATES TO RESTART FILE AT STEP 1000
>>>>>>>>> FINISHED WRITING RESTART COORDINATES
>>>>>>>>> The last position output (seq=1000) takes 0.026 seconds, 238.145 MB
>>>>>>>>> of memory in use
>>>>>>>>> WRITING VELOCITIES TO RESTART FILE AT STEP 1000
>>>>>>>>> FINISHED WRITING RESTART VELOCITIES
>>>>>>>>> The last velocity output (seq=1000) takes 0.019 seconds, 238.145 MB
>>>>>>>>> of memory in use
>>>>>>>>> REINITIALIZING VELOCITIES AT STEP 1000 TO 303 KELVIN.
>>>>>>>>> TCL: Running for 9000 steps
>>>>>>>>> PRESSURE: 1000 -1607.18 5.85548 -10.9122 5.85548 -1546.56 26.3568
>>>>>>>>> -10.9122 26.3568 -886.287
>>>>>>>>> GPRESSURE: 1000 -1469.55 7.5989 -10.7156 10.9579 -1410.74 22.6426
>>>>>>>>> -10.5674 20.7688 -1127
>>>>>>>>> ETITLE:      TS           BOND          ANGLE          DIHED
>>>>>>>>> IMPRP               ELECT            VDW       BOUNDARY
>>>>>>>>> MISC KINETIC               TOTAL           TEMP      POTENTIAL
>>>>>>>>> TOTAL3        TEMPAVG            PRESSURE      GPRESSURE
>>>>>>>>> ENERGY:    1000       607.1667      6226.7038     11604.6460
>>>>>>>>> 13.8497        -203337.4899        27.6364         0.0000
>>>>>>>>> 0.0000     52831.6131        -132025.8742       303.3486
>>>>>>>>> -184857.4873   -132057.5192       303.3486          -1346.6784
>>>>>>>>> -1335.7638
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It takes some hours until this message is printed:
>>>>>>>>> [0] processControlPoints() haveControlPointChangeCallback=0
>>>>>>>>> frameworkShouldAdvancePhase=0
>>>>>>>>>
>>>>>>>>> Any clue where I could search?
>>>>>>>>> If you need more information, don't hesitate to ask.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Bjoern
>>>
>>> --
>>> Bjoern Olausson
>>> Martin-Luther-Universität Halle-Wittenberg
>>> Fachbereich Biochemie/Biotechnologie
>>> Kurt-Mothes-Str. 3
>>> 06120 Halle/Saale
>>>
>>> Phone: +49-345-55-24942
>>
>
