Re: NAMD 2.7/2.8b2 stuck - [0] processControlPoints() haveControlPointChangeCallback=0 frameworkShouldAdvancePhase=0

From: Bjoern Olausson (namdlist_at_googlemail.com)
Date: Sat May 14 2011 - 01:12:57 CDT

Should I run those tests with 2.8b2 or are you satisfied with 2.7?

Cheers,
Bjoern

On Fri, May 13, 2011 at 21:52, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>
> Thanks.  It looks like the switch from PME slabs to pencils happens between
> 144 and 156, but there's no obvious change from 252 to 264.  The 264-core
> run lasts for over 1000 steps, so it's not a deterministic problem.
>
> For the two failing cases, please first try adding "outputTiming 1" so that
> we'll know which timestep it actually hangs on, and then try turning off PME
> so that we can tell whether there's a connection to PME.
>
> -Jim
>
> On Fri, 13 May 2011, Bjoern Olausson wrote:
>
>> Hi,
>>
>> While tuning the "twoAway" options, the simulation that stalled on
>> 156 cores now stalls on 264 cores:
>> with twoAwayX, twoAwayY, and twoAwayZ all set to NO, it stalls on 156 cores;
>> with twoAwayX set to YES and twoAwayY and twoAwayZ set to NO, it stalls
>> on 264 cores.
>>
>> (This was tested with NAMD 2.7, but I guess 2.8 will behave the same way.)
>> Please find the corresponding logs at the following link:
>> http://daten-transport.de/?id=7qK3HdCVnM7W (namd-logs.tar.bz2 584,5
>> Kilobytes)
>>
>> Cheers and many thanks,
>> Bjoern
>>
>> On Fri, May 13, 2011 at 15:20, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>
>>> Hi,
>>>
>>> Please send me the complete log file for the largest working and smallest
>>> hanging runs (I guess that's 144 and 156 cores).
>>>
>>> -Jim
>>>
>>>
>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>>
>>>> Hi,
>>>>
>>>> With one of my simulations I ran into the following problem.
>>>> Running simulation "B" on fewer than 156 cores works fine (each try
>>>> incremented by 12 cores).
>>>> But with 156 cores the simulation hangs after minimization. Another,
>>>> bigger simulation "A" runs fine with 156 cores but stalls with 252.
>>>>
>>>> I am using NAMD_2.8b2_Linux-x86_64-ibverbs-net-linux-x86_64-ibverbs-icc
>>>> currently, but the same happens with NAMD 2.7:
>>>>
>>>> Simulation A is a monolayer (Vacuum | Monolayer with attached Protein
>>>> | Water | Monolayer with attached Protein | Vacuum)
>>>> Simulation B is the same but I removed the two proteins and some water
>>>> between the two monolayers.
>>>>
>>>> A has 163214 Atoms
>>>> B has   79687 Atoms
>>>>
>>>> I can't find a reason why it happens at a particular core count.
>>>>
>>>> LINE MINIMIZER BRACKET: DX 2.26297e-05 6.07123e-05 DU -0.112343
>>>> 0.803579 DUDX -9856.98 -88.7072 26529.9
>>>> LINE MINIMIZER REDUCING GRADIENT FROM 488884 TO 488.884
>>>> PRESSURE: 998 -3096.26 0.240235 -2.11389 0.240235 -3036.98 30.6163
>>>> -2.11389 30.6163 -2719.13
>>>> GPRESSURE: 998 -3053.97 0.0322738 -2.31931 1.70752 -2997.23 32.1548
>>>> 1.12647 30.6867 -2682.59
>>>> ENERGY:     998      5798.1099      9606.5134     11613.1689
>>>> 14.3917        -220491.3201       259.2408         0.0000
>>>> 0.0000         0.0000        -193199.8954         0.0000
>>>> -193199.8954   -193199.8954         0.0000          -2950.7895
>>>> -2911.2626
>>>>
>>>> PRESSURE: 999 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>>> -1.88108 30.4947 -2731.63
>>>> GPRESSURE: 999 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>>> 0.17135 30.0678 -2692.69
>>>> ENERGY:     999      5831.4354      9616.9842     11604.8301
>>>> 13.8257        -220677.3820       308.1108         0.0000
>>>> 0.0000         0.0000        -193302.1958         0.0000
>>>> -193302.1958   -193302.1958         0.0000          -2954.4553
>>>> -2914.4624
>>>>
>>>> PRESSURE: 1000 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>>> -1.88108 30.4947 -2731.63
>>>> GPRESSURE: 1000 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>>> 0.171348 30.0678 -2692.69
>>>> TIMING: 1000  CPU: 24.3443, 0.0242553/step  Wall: 24.388,
>>>> 0.0242993/step, 0 hours remaining, 238.144531 MB of memory in use.
>>>> ETITLE:      TS           BOND          ANGLE          DIHED
>>>> IMPRP               ELECT            VDW       BOUNDARY           MISC
>>>>      KINETIC               TOTAL           TEMP      POTENTIAL
>>>>  TOTAL3        TEMPAVG            PRESSURE      GPRESSURE
>>>> ENERGY:    1000      5831.4354      9616.9842     11604.8301
>>>> 13.8257        -220677.3820       308.1108         0.0000
>>>> 0.0000         0.0000        -193302.1958         0.0000
>>>> -193302.1958   -193302.1958         0.0000          -2954.4553
>>>> -2914.4624
>>>>
>>>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 1000
>>>> WRITING COORDINATES TO DCD FILE AT STEP 1000
>>>> WRITING COORDINATES TO RESTART FILE AT STEP 1000
>>>> FINISHED WRITING RESTART COORDINATES
>>>> The last position output (seq=1000) takes 0.026 seconds, 238.145 MB of
>>>> memory in use
>>>> WRITING VELOCITIES TO RESTART FILE AT STEP 1000
>>>> FINISHED WRITING RESTART VELOCITIES
>>>> The last velocity output (seq=1000) takes 0.019 seconds, 238.145 MB of
>>>> memory in use
>>>> REINITIALIZING VELOCITIES AT STEP 1000 TO 303 KELVIN.
>>>> TCL: Running for 9000 steps
>>>> PRESSURE: 1000 -1607.18 5.85548 -10.9122 5.85548 -1546.56 26.3568
>>>> -10.9122 26.3568 -886.287
>>>> GPRESSURE: 1000 -1469.55 7.5989 -10.7156 10.9579 -1410.74 22.6426
>>>> -10.5674 20.7688 -1127
>>>> ETITLE:      TS           BOND          ANGLE          DIHED
>>>> IMPRP               ELECT            VDW       BOUNDARY           MISC
>>>>      KINETIC               TOTAL           TEMP      POTENTIAL
>>>>  TOTAL3        TEMPAVG            PRESSURE      GPRESSURE
>>>> ENERGY:    1000       607.1667      6226.7038     11604.6460
>>>> 13.8497        -203337.4899        27.6364         0.0000
>>>> 0.0000     52831.6131        -132025.8742       303.3486
>>>> -184857.4873   -132057.5192       303.3486          -1346.6784
>>>> -1335.7638
>>>>
>>>>
>>>>
>>>> It takes some hours until this message is printed:
>>>> [0] processControlPoints() haveControlPointChangeCallback=0
>>>> frameworkShouldAdvancePhase=0
>>>>
>>>> Any clue where I could search?
>>>> If you need more information, don't hesitate to ask.
>>>>
>>>> Cheers,
>>>> Bjoern
>>>>
>>>
>
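
The two diagnostic runs requested in the thread could be set up with a config fragment along the lines of the one below. This is only a sketch: outputTiming and PME are the NAMD keywords named in the thread, but the split into two labelled runs and the assumption that the rest of each hanging run's config stays unchanged are illustrative, not something the posters stated. If the original config sets other PME-related options, those may also need to be commented out when PME is off.

```tcl
# Sketch only: lines to change in a copy of the config used for each hanging run
# (156 cores with all twoAway options set to NO, 264 cores with twoAwayX set to YES).
# Everything else in the original config is assumed to stay as it was.

# Diagnostic run 1: print a TIMING line every step, so the last TIMING line
# in the log shows how far the run got before it hung.
outputTiming    1

# Diagnostic run 2: keep outputTiming 1 and switch PME off, to see whether
# the hang still occurs without PME.
# (Edit the existing PME line rather than adding a second one.)
PME             no
```

Turning PME off changes the electrostatics treatment, so the second run is useful only for locating the hang, not for production data.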
