From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Tue May 17 2011 - 13:21:49 CDT
Hi again,
Thanks. This is a race condition that can occur in pencil PME (only) when
all of the patches on a processor are completely empty. This situation
does not occur in normal periodic simulations. The bug will not occur if
langevinPiston is used and is much less likely to occur if multiple
timestepping is used. The bug will always produce a hanging simulation,
never incorrect results. I am working on a fix.
-Jim
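
[Editor's sketch] To illustrate why a slab system with vacuum above and below can leave some patches without any atoms, here is a minimal one-dimensional toy in Python. It is not NAMD's actual patch decomposition or processor mapping; the function name, box length, patch size, and atom positions are all hypothetical values chosen for illustration.

```python
def count_empty_patches(z_coords, box_z, patch_size):
    """Bin atoms into equal slabs along z and count slabs with no atoms.

    Toy model only: real NAMD patches form a 3-D grid and are assigned
    to processors by the load balancer, which is not modelled here.
    """
    n_patches = max(1, int(box_z // patch_size))
    width = box_z / n_patches
    counts = [0] * n_patches
    for z in z_coords:
        # Wrap the coordinate into the periodic box, then find its slab.
        idx = int((z % box_z) // width)
        counts[min(idx, n_patches - 1)] += 1
    empty = sum(1 for c in counts if c == 0)
    return counts, empty


# Hypothetical system: 160 A box, atoms only between 40 A and 120 A,
# vacuum elsewhere, and slabs roughly 16 A thick.
z = [40.0 + 0.5 * i for i in range(160)]
counts, empty = count_empty_patches(z, box_z=160.0, patch_size=16.0)
print(counts, empty)
```

In this toy the slabs that cover only vacuum hold no atoms. Whether a given processor ends up with only such empty patches depends on the core count and on how patches are distributed, which would be consistent with the hang appearing only at certain core counts.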
On Mon, 16 May 2011, Bjoern Olausson wrote:
> Well, the system is a symmetrical monolayer setup with some vacuum
> space in the +Z and -Z directions, so I would expect the global density
> to be significantly lower than in a "general" solvated system.
> The local density e.g. for water, after some equilibration steps,
> should be around 1 g/cm^3.
>
> Sure I can send you the input files.
>
> Cheers,
> Bjoern
>
> On Mon, May 16, 2011 at 15:57, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>> Hi again,
>>
>> Thanks. Is there a reason your system has half the typical density for a
>> solvated periodic system? Can you point me to the input files so I can try to
>> reproduce this myself?
>>
>> -Jim
>>
>> On Mon, 16 May 2011, Bjoern Olausson wrote:
>>
>>> Here are the requested test results with NAMD 2.8b2.
>>>
>>> One is particularly interesting. The setup with "twoAwayX yes", PME and 264
>>> cores did not fail consistently. Out of 4 tries it stalled only twice.
>>>
>>> The timestep NAMD stalls on is not consistent either.
>>>
>>> Without PME there were no problems at all.
>>>
>>> Please find all relevant data here:
>>> http://daten-transport.de/?id=8kNGP4tykfLp
>>>
>>> If you need more information, don't hesitate to ask.
>>>
>>> Cheers,
>>> Bjoern
>>>
>>> On Saturday 14 May 2011 20:56:26 Jim Phillips wrote:
>>>>
>>>> 2.8b2 would be best. -Jim
>>>>
>>>> On Sat, 14 May 2011, Bjoern Olausson wrote:
>>>>>
>>>>> Should I run those tests with 2.8b2 or are you satisfied with 2.7?
>>>>>
>>>>> Cheers,
>>>>> Bjoern
>>>>>
>>>>> On Fri, May 13, 2011 at 21:52, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>>>>
>>>>>> Thanks. It looks like the switch from PME slabs to pencils happens
>>>>>> between 144 and 156, but there's no obvious change from 252 to 264.
>>>>>> The 264-core case runs for over 1000 steps, so it's not a deterministic
>>>>>> problem.
>>>>>>
>>>>>> For the two failing cases, please first try adding outputTiming 1 so
>>>>>> that we'll know what timestep it actually hangs on, and then try
>>>>>> turning off PME so that we can tell if there's a connection to PME
>>>>>> or not.
>>>>>>
>>>>>> -Jim
>>>>>>
>>>>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> While tuning the "twoAway" options, the simulation which stalled on
>>>>>>> 156 cores now stalled on 264 cores:
>>>>>>> With twoAwayX, twoAwayY, twoAwayZ all set to NO, it stalls on 156 cores.
>>>>>>> With twoAwayX set to YES and twoAwayY, twoAwayZ set to NO, it stalls
>>>>>>> on 264 cores.
>>>>>>>
>>>>>>> (This was tested with NAMD 2.7, but I guess 2.8 will behave the same
>>>>>>> way.) Please find the corresponding logs at the following link:
>>>>>>> http://daten-transport.de/?id=7qK3HdCVnM7W (namd-logs.tar.bz2, 584.5
>>>>>>> kilobytes)
>>>>>>>
>>>>>>> Cheers and many thanks,
>>>>>>> Bjoern
>>>>>>>
>>>>>>> On Fri, May 13, 2011 at 15:20, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Please send me the complete log file for the largest working and
>>>>>>>> smallest hanging runs (I guess that's 144 and 156 cores).
>>>>>>>>
>>>>>>>> -Jim
>>>>>>>>
>>>>>>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> With one of my simulations I ran into the following problem.
>>>>>>>>> Running simulation "B" on fewer than 156 cores works fine (each
>>>>>>>>> try incremented by 12 cores).
>>>>>>>>> But with 156 cores the simulation hangs after minimization. Another,
>>>>>>>>> bigger simulation "A" runs fine with 156 cores but stalls with 252.
>>>>>>>>>
>>>>>>>>> I am using
>>>>>>>>> NAMD_2.8b2_Linux-x86_64-ibverbs-net-linux-x86_64-ibverbs-icc
>>>>>>>>> currently, but the same happens with NAMD 2.7:
>>>>>>>>>
>>>>>>>>> Simulation A is a monolayer (Vacuum | Monolayer with attached
>>>>>>>>> Protein | Water | Monolayer with attached Protein | Vacuum)
>>>>>>>>>
>>>>>>>>> Simulation B is the same but I removed the two proteins and some
>>>>>>>>> water between the two monolayers.
>>>>>>>>>
>>>>>>>>> A has 163214 Atoms
>>>>>>>>> B has 79687 Atoms
>>>>>>>>>
>>>>>>>>> I can't find a reason why it happens at a certain core count.
>>>>>>>>>
>>>>>>>>> LINE MINIMIZER BRACKET: DX 2.26297e-05 6.07123e-05 DU -0.112343
>>>>>>>>> 0.803579 DUDX -9856.98 -88.7072 26529.9
>>>>>>>>> LINE MINIMIZER REDUCING GRADIENT FROM 488884 TO 488.884
>>>>>>>>> PRESSURE: 998 -3096.26 0.240235 -2.11389 0.240235 -3036.98 30.6163
>>>>>>>>> -2.11389 30.6163 -2719.13
>>>>>>>>> GPRESSURE: 998 -3053.97 0.0322738 -2.31931 1.70752 -2997.23 32.1548
>>>>>>>>> 1.12647 30.6867 -2682.59
>>>>>>>>> ENERGY: 998 5798.1099 9606.5134 11613.1689
>>>>>>>>> 14.3917 -220491.3201 259.2408 0.0000
>>>>>>>>> 0.0000 0.0000 -193199.8954 0.0000
>>>>>>>>> -193199.8954 -193199.8954 0.0000 -2950.7895
>>>>>>>>> -2911.2626
>>>>>>>>>
>>>>>>>>> PRESSURE: 999 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>>>>>>>> -1.88108 30.4947 -2731.63
>>>>>>>>> GPRESSURE: 999 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>>>>>>>> 0.17135 30.0678 -2692.69
>>>>>>>>> ENERGY: 999 5831.4354 9616.9842 11604.8301
>>>>>>>>> 13.8257 -220677.3820 308.1108 0.0000
>>>>>>>>> 0.0000 0.0000 -193302.1958 0.0000
>>>>>>>>> -193302.1958 -193302.1958 0.0000 -2954.4553
>>>>>>>>> -2914.4624
>>>>>>>>>
>>>>>>>>> PRESSURE: 1000 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>>>>>>>> -1.88108 30.4947 -2731.63
>>>>>>>>> GPRESSURE: 1000 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>>>>>>>> 0.171348 30.0678 -2692.69
>>>>>>>>> TIMING: 1000 CPU: 24.3443, 0.0242553/step Wall: 24.388,
>>>>>>>>> 0.0242993/step, 0 hours remaining, 238.144531 MB of memory in use.
>>>>>>>>> ETITLE: TS BOND ANGLE DIHED
>>>>>>>>> IMPRP ELECT VDW BOUNDARY
>>>>>>>>> MISC KINETIC TOTAL TEMP POTENTIAL
>>>>>>>>> TOTAL3 TEMPAVG PRESSURE GPRESSURE
>>>>>>>>> ENERGY: 1000 5831.4354 9616.9842 11604.8301
>>>>>>>>> 13.8257 -220677.3820 308.1108 0.0000
>>>>>>>>> 0.0000 0.0000 -193302.1958 0.0000
>>>>>>>>> -193302.1958 -193302.1958 0.0000 -2954.4553
>>>>>>>>> -2914.4624
>>>>>>>>>
>>>>>>>>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 1000
>>>>>>>>> WRITING COORDINATES TO DCD FILE AT STEP 1000
>>>>>>>>> WRITING COORDINATES TO RESTART FILE AT STEP 1000
>>>>>>>>> FINISHED WRITING RESTART COORDINATES
>>>>>>>>> The last position output (seq=1000) takes 0.026 seconds, 238.145 MB
>>>>>>>>> of memory in use
>>>>>>>>> WRITING VELOCITIES TO RESTART FILE AT STEP 1000
>>>>>>>>> FINISHED WRITING RESTART VELOCITIES
>>>>>>>>> The last velocity output (seq=1000) takes 0.019 seconds, 238.145 MB
>>>>>>>>> of memory in use
>>>>>>>>> REINITIALIZING VELOCITIES AT STEP 1000 TO 303 KELVIN.
>>>>>>>>> TCL: Running for 9000 steps
>>>>>>>>> PRESSURE: 1000 -1607.18 5.85548 -10.9122 5.85548 -1546.56 26.3568
>>>>>>>>> -10.9122 26.3568 -886.287
>>>>>>>>> GPRESSURE: 1000 -1469.55 7.5989 -10.7156 10.9579 -1410.74 22.6426
>>>>>>>>> -10.5674 20.7688 -1127
>>>>>>>>> ETITLE: TS BOND ANGLE DIHED
>>>>>>>>> IMPRP ELECT VDW BOUNDARY
>>>>>>>>> MISC KINETIC TOTAL TEMP POTENTIAL
>>>>>>>>> TOTAL3 TEMPAVG PRESSURE GPRESSURE
>>>>>>>>> ENERGY: 1000 607.1667 6226.7038 11604.6460
>>>>>>>>> 13.8497 -203337.4899 27.6364 0.0000
>>>>>>>>> 0.0000 52831.6131 -132025.8742 303.3486
>>>>>>>>> -184857.4873 -132057.5192 303.3486 -1346.6784
>>>>>>>>> -1335.7638
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It takes some hours until this message is printed:
>>>>>>>>> [0] processControlPoints() haveControlPointChangeCallback=0
>>>>>>>>> frameworkShouldAdvancePhase=0
>>>>>>>>>
>>>>>>>>> Any clue where I could search?
>>>>>>>>> If you need more information, don't hesitate to ask.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Bjoern
>>>
>>> --
>>> Bjoern Olausson
>>> Martin-Luther-Universität Halle-Wittenberg
>>> Fachbereich Biochemie/Biotechnologie
>>> Kurt-Mothes-Str. 3
>>> 06120 Halle/Saale
>>>
>>> Phone: +49-345-55-24942
>>
>
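
[Editor's sketch] The suggestion above to set outputTiming 1 is meant to reveal the timestep at which a run hangs. Assuming the log has been saved to a file and that TIMING lines look like the ones quoted in this thread, a minimal Python helper could report the last timestep that was logged. The function name and the sample lines are illustrative only and are not part of NAMD.

```python
import re

# Matches lines such as "TIMING: 1000 CPU: 24.3443, ..." as quoted above.
TIMING_RE = re.compile(r"^TIMING:\s+(\d+)\s")


def last_timing_step(log_lines):
    """Return the step number of the last TIMING line, or None if absent."""
    last = None
    for line in log_lines:
        match = TIMING_RE.match(line)
        if match:
            last = int(match.group(1))
    return last


# Sample lines copied from the log excerpt in this thread.
sample = [
    "ENERGY: 999 5831.4354 9616.9842 11604.8301",
    "TIMING: 1000 CPU: 24.3443, 0.0242553/step Wall: 24.388,",
    "ENERGY: 1000 607.1667 6226.7038 11604.6460",
]
print(last_timing_step(sample))
```

For a real log, the lines could be read with open(path) and passed to the same function. Comparing the reported step across repeated runs would show whether the hang is tied to a particular timestep.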