From: Bjoern Olausson (namdlist_at_googlemail.com)
Date: Sat May 14 2011 - 01:12:57 CDT
Should I run those tests with 2.8b2 or are you satisfied with 2.7?
Cheers,
Bjoern
On Fri, May 13, 2011 at 21:52, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>
> Thanks. It looks like the switch from PME slabs to pencils happens between
> 144 and 156, but there's no obvious change from 252 to 264. The 264-core
> runs for over 1000 steps so it's not a deterministic problem.
>
> For the two failing cases, please first try adding outputTiming 1 so that
> we'll know what timestep it actually hangs on, and then try turning off PME
> so that we can tell whether there's a connection to PME.
>
> -Jim
>
> On Fri, 13 May 2011, Bjoern Olausson wrote:
>
>> Hi,
>>
>> While tuning the "twoAway" options, the simulation which stalled on
>> 156 cores now stalled on 264 cores.
>> With twoAwayX, twoAwayY, and twoAwayZ all set to NO, it stalls on 156 cores.
>> With twoAwayX set to YES and twoAwayY and twoAwayZ set to NO, it stalls
>> on 264 cores.
>>
>> (This was tested with NAMD 2.7, but I guess 2.8 will behave the same way.)
>> Please find the corresponding logs at the following link:
>> http://daten-transport.de/?id=7qK3HdCVnM7W (namd-logs.tar.bz2 584,5
>> Kilobytes)
>>
>> Cheers and many thanks,
>> Bjoern
>>
>> On Fri, May 13, 2011 at 15:20, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>
>>> Hi,
>>>
>>> Please send me the complete log file for the largest working and smallest
>>> hanging runs (I guess that's 144 and 156 cores).
>>>
>>> -Jim
>>>
>>>
>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>>
>>>> Hi,
>>>>
>>>> With one of my simulations I ran into the following problem.
>>>> Running simulation "B" on fewer than 156 cores works fine (each try
>>>> incremented by 12 cores).
>>>> But with 156 cores the simulation hangs after minimization. Another,
>>>> bigger simulation "A" runs fine with 156 cores but stalls with 252.
>>>>
>>>> I am using NAMD_2.8b2_Linux-x86_64-ibverbs-net-linux-x86_64-ibverbs-icc
>>>> currently, but the same happens with NAMD 2.7.
>>>>
>>>> Simulation A is a monolayer (Vacuum | Monolayer with attached Protein
>>>> | Water | Monolayer with attached Protein | Vacuum)
>>>> Simulation B is the same but I removed the two proteins and some water
>>>> between the two monolayers.
>>>>
>>>> A has 163214 Atoms
>>>> B has 79687 Atoms
>>>>
>>>> I can't find a reason why it happens at a particular core count.
>>>>
>>>> LINE MINIMIZER BRACKET: DX 2.26297e-05 6.07123e-05 DU -0.112343
>>>> 0.803579 DUDX -9856.98 -88.7072 26529.9
>>>> LINE MINIMIZER REDUCING GRADIENT FROM 488884 TO 488.884
>>>> PRESSURE: 998 -3096.26 0.240235 -2.11389 0.240235 -3036.98 30.6163
>>>> -2.11389 30.6163 -2719.13
>>>> GPRESSURE: 998 -3053.97 0.0322738 -2.31931 1.70752 -2997.23 32.1548
>>>> 1.12647 30.6867 -2682.59
>>>> ENERGY: 998 5798.1099 9606.5134 11613.1689
>>>> 14.3917 -220491.3201 259.2408 0.0000
>>>> 0.0000 0.0000 -193199.8954 0.0000
>>>> -193199.8954 -193199.8954 0.0000 -2950.7895
>>>> -2911.2626
>>>>
>>>> PRESSURE: 999 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>>> -1.88108 30.4947 -2731.63
>>>> GPRESSURE: 999 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>>> 0.17135 30.0678 -2692.69
>>>> ENERGY: 999 5831.4354 9616.9842 11604.8301
>>>> 13.8257 -220677.3820 308.1108 0.0000
>>>> 0.0000 0.0000 -193302.1958 0.0000
>>>> -193302.1958 -193302.1958 0.0000 -2954.4553
>>>> -2914.4624
>>>>
>>>> PRESSURE: 1000 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>>> -1.88108 30.4947 -2731.63
>>>> GPRESSURE: 1000 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>>> 0.171348 30.0678 -2692.69
>>>> TIMING: 1000 CPU: 24.3443, 0.0242553/step Wall: 24.388,
>>>> 0.0242993/step, 0 hours remaining, 238.144531 MB of memory in use.
>>>> ETITLE: TS BOND ANGLE DIHED
>>>> IMPRP ELECT VDW BOUNDARY MISC
>>>> KINETIC TOTAL TEMP POTENTIAL
>>>> TOTAL3 TEMPAVG PRESSURE GPRESSURE
>>>> ENERGY: 1000 5831.4354 9616.9842 11604.8301
>>>> 13.8257 -220677.3820 308.1108 0.0000
>>>> 0.0000 0.0000 -193302.1958 0.0000
>>>> -193302.1958 -193302.1958 0.0000 -2954.4553
>>>> -2914.4624
>>>>
>>>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 1000
>>>> WRITING COORDINATES TO DCD FILE AT STEP 1000
>>>> WRITING COORDINATES TO RESTART FILE AT STEP 1000
>>>> FINISHED WRITING RESTART COORDINATES
>>>> The last position output (seq=1000) takes 0.026 seconds, 238.145 MB of
>>>> memory in use
>>>> WRITING VELOCITIES TO RESTART FILE AT STEP 1000
>>>> FINISHED WRITING RESTART VELOCITIES
>>>> The last velocity output (seq=1000) takes 0.019 seconds, 238.145 MB of
>>>> memory in use
>>>> REINITIALIZING VELOCITIES AT STEP 1000 TO 303 KELVIN.
>>>> TCL: Running for 9000 steps
>>>> PRESSURE: 1000 -1607.18 5.85548 -10.9122 5.85548 -1546.56 26.3568
>>>> -10.9122 26.3568 -886.287
>>>> GPRESSURE: 1000 -1469.55 7.5989 -10.7156 10.9579 -1410.74 22.6426
>>>> -10.5674 20.7688 -1127
>>>> ETITLE: TS BOND ANGLE DIHED
>>>> IMPRP ELECT VDW BOUNDARY MISC
>>>> KINETIC TOTAL TEMP POTENTIAL
>>>> TOTAL3 TEMPAVG PRESSURE GPRESSURE
>>>> ENERGY: 1000 607.1667 6226.7038 11604.6460
>>>> 13.8497 -203337.4899 27.6364 0.0000
>>>> 0.0000 52831.6131 -132025.8742 303.3486
>>>> -184857.4873 -132057.5192 303.3486 -1346.6784
>>>> -1335.7638
>>>>
>>>>
>>>>
>>>> It takes some hours until this message is printed:
>>>> [0] processControlPoints() haveControlPointChangeCallback=0
>>>> frameworkShouldAdvancePhase=0
>>>>
>>>> Any clue where I should look?
>>>> If you need more information, don't hesitate to ask.
>>>>
>>>> Cheers,
>>>> Bjoern
>>>>
>>>
>
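
The thread asks which timestep a run actually hangs on. As a rough aid, the
sketch below scans a NAMD log for the last per-step line (ENERGY:, TIMING:,
PRESSURE: or GPRESSURE:) and reports its step number. The helper name
last_reported_step and the regular expression are assumptions made for
illustration and are not part of NAMD. It expects an unwrapped log file rather
than the mail-wrapped, quoted excerpt shown in this thread.

```python
import re

# Per-step NAMD output lines begin with a tag followed by the step number,
# e.g. "ENERGY: 1000 ..." or "TIMING: 1000 CPU: ...".
STEP_LINE = re.compile(r"^\s*(ENERGY|TIMING|PRESSURE|GPRESSURE):\s+(\d+)\b")


def last_reported_step(log_text):
    """Return (tag, step) for the last per-step line in the log, or None."""
    last = None
    for line in log_text.splitlines():
        match = STEP_LINE.match(line)
        if match:
            last = (match.group(1), int(match.group(2)))
    return last


# Shortened sample based on the final lines of the log quoted in this thread.
sample = """TCL: Running for 9000 steps
PRESSURE: 1000 -1607.18 5.85548 -10.9122
ENERGY: 1000 607.1667 6226.7038 11604.6460
"""

print(last_reported_step(sample))
```

With outputTiming 1 set as suggested above, the last TIMING: line should narrow
the hang down to a single step; without it, the result is only as fine-grained
as the configured output frequency.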