Re: NAMD 2.7/2.8b2 stuck - [0] processControlPoints() haveControlPointChangeCallback=0 frameworkShouldAdvancePhase=0

From: Bjoern Olausson (namdlist_at_googlemail.com)
Date: Wed May 18 2011 - 08:44:48 CDT

I guess the empty patches were due to the vacuum space at +/-Z,
and since langevinPiston is not used in the NVT ensemble, I ran straight
into this bug ;-)
Still wondering why I was the first one - Is my setup so weird?
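
For anyone wanting to check whether their own setup has empty patches, here is a minimal sketch of the idea, not NAMD's actual decomposition code. It bins atom positions into a regular grid of boxes and lists the boxes with no atoms. Everything in it is an assumption for illustration: the function name patch_occupancy, the cell size, the grid dimensions and the toy coordinates. NAMD derives its real patch grid from the cutoff and the twoAway settings, and the mapping of patches to processors is not modelled here.

```python
from itertools import product


def patch_occupancy(coords, box, grid):
    """Count atoms per patch for an orthorhombic cell with its origin at 0.

    coords: iterable of (x, y, z) positions already wrapped into [0, box)
    box:    (Lx, Ly, Lz) cell lengths
    grid:   (nx, ny, nz) number of patches along each axis (hypothetical)
    """
    counts = {idx: 0 for idx in product(*(range(n) for n in grid))}
    for pos in coords:
        # Map each coordinate to a patch index, clamping the upper edge.
        idx = tuple(min(int(p / b * n), n - 1)
                    for p, b, n in zip(pos, box, grid))
        counts[idx] += 1
    return counts


# Toy slab system: a 40 x 40 x 80 cell with atoms only for 20 <= z < 60,
# i.e. vacuum at both ends of the Z axis.
box = (40.0, 40.0, 80.0)
grid = (2, 2, 4)
coords = [(x + 0.5, y + 0.5, z + 0.5)
          for x in range(0, 40, 4)
          for y in range(0, 40, 4)
          for z in range(20, 60, 4)]

counts = patch_occupancy(coords, box, grid)
empty = sorted(idx for idx, n in counts.items() if n == 0)
print(f"{len(empty)} of {len(counts)} patches are empty: {empty}")
```

In this toy case every empty patch lies in the first or last Z layer, which is the kind of layout where a processor could end up owning only empty patches.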

Thanks for fixing this.
Do you plan to include the fix in the 2.8 gold release?

Cheers,
Bjoern

On Tue, May 17, 2011 at 20:21, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
> Hi again,
>
> Thanks.  This is a race condition that can occur in pencil PME (only) when
> all of the patches on a processor are completely empty.  This situation does
> not occur in normal periodic simulations.  The bug will not occur if
> langevinPiston is used and is much less likely to occur if multiple
> timestepping is used.  The bug will always produce a hanging simulation,
> never incorrect results.  I am working on a fix.
>
> -Jim
>
>
> On Mon, 16 May 2011, Bjoern Olausson wrote:
>
>> Well, the system is a symmetrical monolayer setup with some vacuum
>> space in the +Z and -Z directions, so I would expect the global density
>> to be significantly lower than in a "general" solvated system.
>> The local density e.g. for water, after some equilibration steps,
>> should be around 1 g/cm^3.
>>
>> Sure I can send you the input files.
>>
>> Cheers,
>> Bjoern
>>
>> On Mon, May 16, 2011 at 15:57, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>
>>> Hi again,
>>>
>>> Thanks.  Is there a reason your system has half the typical density for a
>>> solvated periodic system?  Can you point me to the input files so I can
>>> try to reproduce this myself?
>>>
>>> -Jim
>>>
>>> On Mon, 16 May 2011, Bjoern Olausson wrote:
>>>
>>>> Here are the requested test results with NAMD 2.8b2.
>>>>
>>>> One is particularly interesting. The setup with "twoAwayX yes", PME and
>>>> 264 cores did not fail consistently. Out of 4 tries it stalled only
>>>> twice.
>>>>
>>>> The timestep NAMD stalls on is not consistent either.
>>>>
>>>> Without PME there were no problems at all.
>>>>
>>>> Please find all relevant data here:
>>>> http://daten-transport.de/?id=8kNGP4tykfLp
>>>>
>>>> If you need more information, don't hesitate to ask.
>>>>
>>>> Cheers,
>>>> Bjoern
>>>>
>>>> On Saturday 14 May 2011 20:56:26 Jim Phillips wrote:
>>>>>
>>>>> 2.8b2 would be best.  -Jim
>>>>>
>>>>> On Sat, 14 May 2011, Bjoern Olausson wrote:
>>>>>>
>>>>>> Should I run those tests with 2.8b2 or are you satisfied with 2.7?
>>>>>>
>>>>>> Cheers,
>>>>>> Bjoern
>>>>>>
>>>>>> On Fri, May 13, 2011 at 21:52, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>>>>>
>>>>>>> Thanks.  It looks like the switch from PME slabs to pencils happens
>>>>>>> between 144 and 156, but there's no obvious change from 252 to 264.
>>>>>>> The 264-core run lasts over 1000 steps, so it's not a deterministic
>>>>>>> problem.
>>>>>>>
>>>>>>> For the two failing cases, please first try adding outputTiming 1 so
>>>>>>> that we'll know what timestep it actually hangs on, and then try
>>>>>>> turning off PME so that we can tell if there's a connection to PME
>>>>>>> or not.
>>>>>>>
>>>>>>> -Jim
>>>>>>>
>>>>>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> While tuning the "twoAway" options, the simulation which stalled on
>>>>>>>> 156 cores now stalled on 264 cores:
>>>>>>>> with twoAwayX, twoAwayY, twoAwayZ all set to NO it stalls on 156
>>>>>>>> cores;
>>>>>>>> with twoAwayX set to YES and twoAwayY, twoAwayZ set to NO it stalls
>>>>>>>> on 264 cores.
>>>>>>>>
>>>>>>>> (This was tested with NAMD 2.7, but I guess 2.8 will behave the same
>>>>>>>> way.) Please find the corresponding logs at the following link:
>>>>>>>> http://daten-transport.de/?id=7qK3HdCVnM7W (namd-logs.tar.bz2, 584.5
>>>>>>>> kilobytes)
>>>>>>>>
>>>>>>>> Cheers and many thanks,
>>>>>>>> Bjoern
>>>>>>>>
>>>>>>>> On Fri, May 13, 2011 at 15:20, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Please send me the complete log file for the largest working and
>>>>>>>>> smallest hanging runs (I guess that's 144 and 156 cores).
>>>>>>>>>
>>>>>>>>> -Jim
>>>>>>>>>
>>>>>>>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> With one of my simulations I ran into the following problem.
>>>>>>>>>> Running simulation "B" on fewer than 156 cores works fine (each
>>>>>>>>>> try incremented by 12 cores), but with 156 cores the simulation
>>>>>>>>>> hangs after minimization. Another, bigger simulation "A" runs
>>>>>>>>>> fine with 156 cores but stalls with 252.
>>>>>>>>>>
>>>>>>>>>> I am using
>>>>>>>>>> NAMD_2.8b2_Linux-x86_64-ibverbs-net-linux-x86_64-ibverbs-icc
>>>>>>>>>> currently, but the same happens with NAMD 2.7:
>>>>>>>>>>
>>>>>>>>>> Simulation A is a monolayer (Vacuum | Monolayer with attached
>>>>>>>>>> Protein | Water | Monolayer with attached Protein | Vacuum)
>>>>>>>>>>
>>>>>>>>>> Simulation B is the same but I removed the two proteins and some
>>>>>>>>>> water between the two monolayers.
>>>>>>>>>>
>>>>>>>>>> A has 163214 Atoms
>>>>>>>>>> B has   79687 Atoms
>>>>>>>>>>
>>>>>>>>>> I can't find a reason why it happens at a certain core count.
>>>>>>>>>>
>>>>>>>>>> LINE MINIMIZER BRACKET: DX 2.26297e-05 6.07123e-05 DU -0.112343
>>>>>>>>>> 0.803579 DUDX -9856.98 -88.7072 26529.9
>>>>>>>>>> LINE MINIMIZER REDUCING GRADIENT FROM 488884 TO 488.884
>>>>>>>>>> PRESSURE: 998 -3096.26 0.240235 -2.11389 0.240235 -3036.98 30.6163
>>>>>>>>>> -2.11389 30.6163 -2719.13
>>>>>>>>>> GPRESSURE: 998 -3053.97 0.0322738 -2.31931 1.70752 -2997.23
>>>>>>>>>> 32.1548
>>>>>>>>>> 1.12647 30.6867 -2682.59
>>>>>>>>>> ENERGY:     998      5798.1099      9606.5134     11613.1689
>>>>>>>>>> 14.3917        -220491.3201       259.2408         0.0000
>>>>>>>>>> 0.0000         0.0000        -193199.8954         0.0000
>>>>>>>>>> -193199.8954   -193199.8954         0.0000          -2950.7895
>>>>>>>>>> -2911.2626
>>>>>>>>>>
>>>>>>>>>> PRESSURE: 999 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>>>>>>>>> -1.88108 30.4947 -2731.63
>>>>>>>>>> GPRESSURE: 999 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>>>>>>>>> 0.17135 30.0678 -2692.69
>>>>>>>>>> ENERGY:     999      5831.4354      9616.9842     11604.8301
>>>>>>>>>> 13.8257        -220677.3820       308.1108         0.0000
>>>>>>>>>> 0.0000         0.0000        -193302.1958         0.0000
>>>>>>>>>> -193302.1958   -193302.1958         0.0000          -2954.4553
>>>>>>>>>> -2914.4624
>>>>>>>>>>
>>>>>>>>>> PRESSURE: 1000 -3101.92 0.427017 -1.88108 0.427017 -3029.82
>>>>>>>>>> 30.4947
>>>>>>>>>> -1.88108 30.4947 -2731.63
>>>>>>>>>> GPRESSURE: 1000 -3056.02 0.387877 -3.93892 3.00918 -2994.69
>>>>>>>>>> 32.1866
>>>>>>>>>> 0.171348 30.0678 -2692.69
>>>>>>>>>> TIMING: 1000  CPU: 24.3443, 0.0242553/step  Wall: 24.388,
>>>>>>>>>> 0.0242993/step, 0 hours remaining, 238.144531 MB of memory in use.
>>>>>>>>>> ETITLE:      TS           BOND          ANGLE          DIHED
>>>>>>>>>> IMPRP               ELECT            VDW       BOUNDARY
>>>>>>>>>> MISC KINETIC               TOTAL           TEMP      POTENTIAL
>>>>>>>>>> TOTAL3        TEMPAVG            PRESSURE      GPRESSURE
>>>>>>>>>> ENERGY:    1000      5831.4354      9616.9842     11604.8301
>>>>>>>>>> 13.8257        -220677.3820       308.1108         0.0000
>>>>>>>>>> 0.0000         0.0000        -193302.1958         0.0000
>>>>>>>>>> -193302.1958   -193302.1958         0.0000          -2954.4553
>>>>>>>>>> -2914.4624
>>>>>>>>>>
>>>>>>>>>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 1000
>>>>>>>>>> WRITING COORDINATES TO DCD FILE AT STEP 1000
>>>>>>>>>> WRITING COORDINATES TO RESTART FILE AT STEP 1000
>>>>>>>>>> FINISHED WRITING RESTART COORDINATES
>>>>>>>>>> The last position output (seq=1000) takes 0.026 seconds, 238.145
>>>>>>>>>> MB
>>>>>>>>>> of memory in use
>>>>>>>>>> WRITING VELOCITIES TO RESTART FILE AT STEP 1000
>>>>>>>>>> FINISHED WRITING RESTART VELOCITIES
>>>>>>>>>> The last velocity output (seq=1000) takes 0.019 seconds, 238.145
>>>>>>>>>> MB
>>>>>>>>>> of memory in use
>>>>>>>>>> REINITIALIZING VELOCITIES AT STEP 1000 TO 303 KELVIN.
>>>>>>>>>> TCL: Running for 9000 steps
>>>>>>>>>> PRESSURE: 1000 -1607.18 5.85548 -10.9122 5.85548 -1546.56 26.3568
>>>>>>>>>> -10.9122 26.3568 -886.287
>>>>>>>>>> GPRESSURE: 1000 -1469.55 7.5989 -10.7156 10.9579 -1410.74 22.6426
>>>>>>>>>> -10.5674 20.7688 -1127
>>>>>>>>>> ETITLE:      TS           BOND          ANGLE          DIHED
>>>>>>>>>> IMPRP               ELECT            VDW       BOUNDARY
>>>>>>>>>> MISC KINETIC               TOTAL           TEMP      POTENTIAL
>>>>>>>>>> TOTAL3        TEMPAVG            PRESSURE      GPRESSURE
>>>>>>>>>> ENERGY:    1000       607.1667      6226.7038     11604.6460
>>>>>>>>>> 13.8497        -203337.4899        27.6364         0.0000
>>>>>>>>>> 0.0000     52831.6131        -132025.8742       303.3486
>>>>>>>>>> -184857.4873   -132057.5192       303.3486          -1346.6784
>>>>>>>>>> -1335.7638
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It takes some hours until this message is printed:
>>>>>>>>>> [0] processControlPoints() haveControlPointChangeCallback=0
>>>>>>>>>> frameworkShouldAdvancePhase=0
>>>>>>>>>>
>>>>>>>>>> Any clue where I could search?
>>>>>>>>>> If you need more information, don't hesitate to ask.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Bjoern
>>>>
>>>> --
>>>> Bjoern Olausson
>>>> Martin-Luther-Universität Halle-Wittenberg
>>>> Fachbereich Biochemie/Biotechnologie
>>>> Kurt-Mothes-Str. 3
>>>> 06120 Halle/Saale
>>>>
>>>> Phone: +49-345-55-24942
>>>
>
