Re: NAMD 2.7/2.8b2 stuck - [0] processControlPoints() haveControlPointChangeCallback=0 frameworkShouldAdvancePhase=0

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Fri May 13 2011 - 14:52:53 CDT

Thanks. It looks like the switch from PME slabs to pencils happens
between 144 and 156, but there's no obvious change from 252 to 264. The
264-core run goes for over 1000 steps, so it's not a deterministic problem.

For the two failing cases, please first try adding outputTiming 1 so that
we'll know what timestep it actually hangs on, and then try turning off PME
so that we can tell whether there's a connection to PME.
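
As a rough sketch, the two tests could look like this in the NAMD
configuration file. This assumes the rest of the configuration stays as it
is and that the two changes are tried in separate runs; the comments are
illustrative only.

```tcl
# Test 1: print a TIMING line every step, so the log shows the last
# timestep that completed before the hang.
outputTiming   1

# Test 2 (separate run): disable PME to see whether the hang still occurs.
PME            no
```

Turning PME off changes the electrostatics, so the second run is a
diagnostic only, not a production setting.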

-Jim

On Fri, 13 May 2011, Bjoern Olausson wrote:

> Hi,
>
> While tuning the "twoAway" options, the simulation that stalled on
> 156 cores now stalls on 264 cores:
> with twoAwayX, twoAwayY, and twoAwayZ all set to NO, it stalls on 156 cores;
> with twoAwayX set to YES and twoAwayY and twoAwayZ set to NO, it stalls
> on 264 cores.
>
> (This was tested with NAMD 2.7, but I guess 2.8 will behave the same way.)
> Please find the corresponding logs at the following link:
> http://daten-transport.de/?id=7qK3HdCVnM7W (namd-logs.tar.bz2 584,5 Kilobytes)
>
> Cheers and many thanks,
> Bjoern
>
> On Fri, May 13, 2011 at 15:20, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>> Hi,
>>
>> Please send me the complete log file for the largest working and smallest
>> hanging runs (I guess that's 144 and 156 cores).
>>
>> -Jim
>>
>>
>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>
>>> Hi,
>>>
>>> With one of my simulations I ran into the following problem.
>>> Running simulation "B" on fewer than 156 cores works fine (each try
>>> incremented by 12 cores).
>>> But with 156 cores the simulation hangs after minimization. Another,
>>> bigger simulation "A" runs fine with 156 cores but stalls with 252.
>>>
>>> I am using NAMD_2.8b2_Linux-x86_64-ibverbs-net-linux-x86_64-ibverbs-icc
>>> currently, but the same happens with NAMD 2.7:
>>>
>>> Simulation A is a monolayer (Vacuum | Monolayer with attached Protein
>>> | Water | Monolayer with attached Protein | Vacuum)
>>> Simulation B is the same but I removed the two proteins and some water
>>> between the two monolayers.
>>>
>>> A has 163214 Atoms
>>> B has   79687 Atoms
>>>
>>> I can't find a reason why it happens at a particular core count.
>>>
>>> LINE MINIMIZER BRACKET: DX 2.26297e-05 6.07123e-05 DU -0.112343
>>> 0.803579 DUDX -9856.98 -88.7072 26529.9
>>> LINE MINIMIZER REDUCING GRADIENT FROM 488884 TO 488.884
>>> PRESSURE: 998 -3096.26 0.240235 -2.11389 0.240235 -3036.98 30.6163
>>> -2.11389 30.6163 -2719.13
>>> GPRESSURE: 998 -3053.97 0.0322738 -2.31931 1.70752 -2997.23 32.1548
>>> 1.12647 30.6867 -2682.59
>>> ENERGY:     998      5798.1099      9606.5134     11613.1689
>>> 14.3917        -220491.3201       259.2408         0.0000
>>> 0.0000         0.0000        -193199.8954         0.0000
>>> -193199.8954   -193199.8954         0.0000          -2950.7895
>>> -2911.2626
>>>
>>> PRESSURE: 999 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>> -1.88108 30.4947 -2731.63
>>> GPRESSURE: 999 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>> 0.17135 30.0678 -2692.69
>>> ENERGY:     999      5831.4354      9616.9842     11604.8301
>>> 13.8257        -220677.3820       308.1108         0.0000
>>> 0.0000         0.0000        -193302.1958         0.0000
>>> -193302.1958   -193302.1958         0.0000          -2954.4553
>>> -2914.4624
>>>
>>> PRESSURE: 1000 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>> -1.88108 30.4947 -2731.63
>>> GPRESSURE: 1000 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>> 0.171348 30.0678 -2692.69
>>> TIMING: 1000  CPU: 24.3443, 0.0242553/step  Wall: 24.388,
>>> 0.0242993/step, 0 hours remaining, 238.144531 MB of memory in use.
>>> ETITLE:      TS           BOND          ANGLE          DIHED
>>> IMPRP               ELECT            VDW       BOUNDARY           MISC
>>>      KINETIC               TOTAL           TEMP      POTENTIAL
>>>  TOTAL3        TEMPAVG            PRESSURE      GPRESSURE
>>> ENERGY:    1000      5831.4354      9616.9842     11604.8301
>>> 13.8257        -220677.3820       308.1108         0.0000
>>> 0.0000         0.0000        -193302.1958         0.0000
>>> -193302.1958   -193302.1958         0.0000          -2954.4553
>>> -2914.4624
>>>
>>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 1000
>>> WRITING COORDINATES TO DCD FILE AT STEP 1000
>>> WRITING COORDINATES TO RESTART FILE AT STEP 1000
>>> FINISHED WRITING RESTART COORDINATES
>>> The last position output (seq=1000) takes 0.026 seconds, 238.145 MB of
>>> memory in use
>>> WRITING VELOCITIES TO RESTART FILE AT STEP 1000
>>> FINISHED WRITING RESTART VELOCITIES
>>> The last velocity output (seq=1000) takes 0.019 seconds, 238.145 MB of
>>> memory in use
>>> REINITIALIZING VELOCITIES AT STEP 1000 TO 303 KELVIN.
>>> TCL: Running for 9000 steps
>>> PRESSURE: 1000 -1607.18 5.85548 -10.9122 5.85548 -1546.56 26.3568
>>> -10.9122 26.3568 -886.287
>>> GPRESSURE: 1000 -1469.55 7.5989 -10.7156 10.9579 -1410.74 22.6426
>>> -10.5674 20.7688 -1127
>>> ETITLE:      TS           BOND          ANGLE          DIHED
>>> IMPRP               ELECT            VDW       BOUNDARY           MISC
>>>      KINETIC               TOTAL           TEMP      POTENTIAL
>>>  TOTAL3        TEMPAVG            PRESSURE      GPRESSURE
>>> ENERGY:    1000       607.1667      6226.7038     11604.6460
>>> 13.8497        -203337.4899        27.6364         0.0000
>>> 0.0000     52831.6131        -132025.8742       303.3486
>>> -184857.4873   -132057.5192       303.3486          -1346.6784
>>> -1335.7638
>>>
>>>
>>>
>>> It takes some hours until this message is printed:
>>> [0] processControlPoints() haveControlPointChangeCallback=0
>>> frameworkShouldAdvancePhase=0
>>>
>>> Any clue where I could look?
>>> If you need more information, don't hesitate to ask.
>>>
>>> Cheers,
>>> Bjoern
>>>
>>
>
