From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Wed May 18 2011 - 17:58:16 CDT
The fix is checked in now, so yes, it will make the gold-plated 2.8 (as
well as the nightly builds).
A half-empty periodic cell, run without multiple timestepping on a large
processor count with a system large enough for pencil PME, is rather uncommon.
-Jim
On Wed, 18 May 2011, Bjoern Olausson wrote:
> I guess the empty patches were due to the vacuum space at +/-Z,
> and since langevinPiston is not used in the NVT ensemble, I ran straight
> into this bug ;-)
> Still wondering why I was the first one - Is my setup so weird?
>
> Thanks for fixing this.
> Do you plan to include the fix in the 2.8 gold release?
>
> Cheers,
> Bjoern
>
> On Tue, May 17, 2011 at 20:21, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>> Hi again,
>>
>> Thanks. This is a race condition that can occur in pencil PME (only) when
>> all of the patches on a processor are completely empty. This situation does
>> not occur in normal periodic simulations. The bug will not occur if
>> langevinPiston is used and is much less likely to occur if multiple
>> timestepping is used. The bug will always produce a hanging simulation,
>> never incorrect results. I am working on a fix.
>>
>> -Jim
>>
>>
>> On Mon, 16 May 2011, Bjoern Olausson wrote:
>>
>>> Well, the system is a symmetrical monolayer setup with some vacuum
>>> space in the +Z and -Z directions, so I would expect the global density
>>> to be significantly lower than in a "general" solvated system.
>>> The local density e.g. for water, after some equilibration steps,
>>> should be around 1 g/cm^3.
>>>
>>> Sure I can send you the input files.
>>>
>>> Cheers,
>>> Bjoern
>>>
>>> On Mon, May 16, 2011 at 15:57, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>>
>>>> Hi again,
>>>>
>>>> Thanks. Is there a reason your system has half the typical density for a
>>>> solvated periodic system? Can you point me to the input files so I can try
>>>> to reproduce this myself?
>>>>
>>>> -Jim
>>>>
>>>> On Mon, 16 May 2011, Bjoern Olausson wrote:
>>>>
>>>>> Here are the requested test results with NAMD 2.8b2.
>>>>>
>>>>> One is particularly interesting. The setup with "twoAwayX yes", PME and
>>>>> 264 cores did not fail consistently. Out of 4 tries it stalled only twice.
>>>>>
>>>>> The timestep NAMD stalls on is not consistent either.
>>>>>
>>>>> Without PME there were no problems at all.
>>>>>
>>>>> Please find all relevant data here:
>>>>> http://daten-transport.de/?id=8kNGP4tykfLp
>>>>>
>>>>> If you need more information, don't hesitate to ask.
>>>>>
>>>>> Cheers,
>>>>> Bjoern
>>>>>
>>>>> On Saturday 14 May 2011 20:56:26 Jim Phillips wrote:
>>>>>>
>>>>>> 2.8b2 would be best. -Jim
>>>>>>
>>>>>> On Sat, 14 May 2011, Bjoern Olausson wrote:
>>>>>>>
>>>>>>> Should I run those tests with 2.8b2 or are you satisfied with 2.7?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Bjoern
>>>>>>>
>>>>>>> On Fri, May 13, 2011 at 21:52, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>>>>>>
>>>>>>>> Thanks. It looks like the switch from PME slabs to pencils happens
>>>>>>>> between 144 and 156, but there's no obvious change from 252 to 264.
>>>>>>>> The 264-core run goes for over 1000 steps, so it's not a deterministic
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> For the two failing cases, please first try adding outputTiming 1 so
>>>>>>>> that we'll know what timestep it actually hangs on, and then try
>>>>>>>> turning off PME so that we can tell if there's a connection to PME
>>>>>>>> or not.
>>>>>>>>
>>>>>>>> -Jim
>>>>>>>>
>>>>>>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> While tuning the "twoAway" options, the simulation which stalled on
>>>>>>>>> 156 cores now stalled on 264 cores.
>>>>>>>>> With twoAwayX, twoAwayY, twoAwayZ all set to NO it stalls on 156
>>>>>>>>> cores.
>>>>>>>>> With twoAwayX set to YES and twoAwayY, twoAwayZ set to NO it stalls
>>>>>>>>> on 264 cores.
>>>>>>>>>
>>>>>>>>> (This was tested with NAMD 2.7, but I guess 2.8 will behave the same
>>>>>>>>> way.) Please find the corresponding logs at the following link:
>>>>>>>>> http://daten-transport.de/?id=7qK3HdCVnM7W (namd-logs.tar.bz2, 584.5
>>>>>>>>> kilobytes)
>>>>>>>>>
>>>>>>>>> Cheers and many thanks,
>>>>>>>>> Bjoern
>>>>>>>>>
>>>>>>>>> On Fri, May 13, 2011 at 15:20, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Please send me the complete log file for the largest working and
>>>>>>>>>> smallest hanging runs (I guess that's 144 and 156 cores).
>>>>>>>>>>
>>>>>>>>>> -Jim
>>>>>>>>>>
>>>>>>>>>> On Fri, 13 May 2011, Bjoern Olausson wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> With one of my simulations I ran into the following problem.
>>>>>>>>>>> Running simulation "B" on fewer than 156 cores works fine (each
>>>>>>>>>>> try incremented by 12 cores).
>>>>>>>>>>> But with 156 cores the simulation hangs after minimization.
>>>>>>>>>>> Another, bigger simulation "A" runs fine with 156 cores but stalls
>>>>>>>>>>> with 252.
>>>>>>>>>>>
>>>>>>>>>>> I am using
>>>>>>>>>>> NAMD_2.8b2_Linux-x86_64-ibverbs-net-linux-x86_64-ibverbs-icc
>>>>>>>>>>> currently, but the same happens with NAMD 2.7:
>>>>>>>>>>>
>>>>>>>>>>> Simulation A is a monolayer (Vacuum | Monolayer with attached Protein
>>>>>>>>>>> | Water | Monolayer with attached Protein | Vacuum)
>>>>>>>>>>>
>>>>>>>>>>> Simulation B is the same but I removed the two proteins and some
>>>>>>>>>>> water between the two monolayers.
>>>>>>>>>>>
>>>>>>>>>>> A has 163214 atoms
>>>>>>>>>>> B has 79687 atoms
>>>>>>>>>>>
>>>>>>>>>>> I can't find a reason why it happens at a certain core count.
>>>>>>>>>>>
>>>>>>>>>>> LINE MINIMIZER BRACKET: DX 2.26297e-05 6.07123e-05 DU -0.112343
>>>>>>>>>>> 0.803579 DUDX -9856.98 -88.7072 26529.9
>>>>>>>>>>> LINE MINIMIZER REDUCING GRADIENT FROM 488884 TO 488.884
>>>>>>>>>>> PRESSURE: 998 -3096.26 0.240235 -2.11389 0.240235 -3036.98 30.6163
>>>>>>>>>>> -2.11389 30.6163 -2719.13
>>>>>>>>>>> GPRESSURE: 998 -3053.97 0.0322738 -2.31931 1.70752 -2997.23
>>>>>>>>>>> 32.1548
>>>>>>>>>>> 1.12647 30.6867 -2682.59
>>>>>>>>>>> ENERGY: 998 5798.1099 9606.5134 11613.1689
>>>>>>>>>>> 14.3917 -220491.3201 259.2408 0.0000
>>>>>>>>>>> 0.0000 0.0000 -193199.8954 0.0000
>>>>>>>>>>> -193199.8954 -193199.8954 0.0000 -2950.7895
>>>>>>>>>>> -2911.2626
>>>>>>>>>>>
>>>>>>>>>>> PRESSURE: 999 -3101.92 0.427017 -1.88108 0.427017 -3029.82 30.4947
>>>>>>>>>>> -1.88108 30.4947 -2731.63
>>>>>>>>>>> GPRESSURE: 999 -3056.02 0.387877 -3.93892 3.00918 -2994.69 32.1866
>>>>>>>>>>> 0.17135 30.0678 -2692.69
>>>>>>>>>>> ENERGY: 999 5831.4354 9616.9842 11604.8301
>>>>>>>>>>> 13.8257 -220677.3820 308.1108 0.0000
>>>>>>>>>>> 0.0000 0.0000 -193302.1958 0.0000
>>>>>>>>>>> -193302.1958 -193302.1958 0.0000 -2954.4553
>>>>>>>>>>> -2914.4624
>>>>>>>>>>>
>>>>>>>>>>> PRESSURE: 1000 -3101.92 0.427017 -1.88108 0.427017 -3029.82
>>>>>>>>>>> 30.4947
>>>>>>>>>>> -1.88108 30.4947 -2731.63
>>>>>>>>>>> GPRESSURE: 1000 -3056.02 0.387877 -3.93892 3.00918 -2994.69
>>>>>>>>>>> 32.1866
>>>>>>>>>>> 0.171348 30.0678 -2692.69
>>>>>>>>>>> TIMING: 1000 CPU: 24.3443, 0.0242553/step Wall: 24.388,
>>>>>>>>>>> 0.0242993/step, 0 hours remaining, 238.144531 MB of memory in use.
>>>>>>>>>>> ETITLE: TS BOND ANGLE DIHED
>>>>>>>>>>> IMPRP ELECT VDW BOUNDARY
>>>>>>>>>>> MISC KINETIC TOTAL TEMP POTENTIAL
>>>>>>>>>>> TOTAL3 TEMPAVG PRESSURE GPRESSURE
>>>>>>>>>>> ENERGY: 1000 5831.4354 9616.9842 11604.8301
>>>>>>>>>>> 13.8257 -220677.3820 308.1108 0.0000
>>>>>>>>>>> 0.0000 0.0000 -193302.1958 0.0000
>>>>>>>>>>> -193302.1958 -193302.1958 0.0000 -2954.4553
>>>>>>>>>>> -2914.4624
>>>>>>>>>>>
>>>>>>>>>>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 1000
>>>>>>>>>>> WRITING COORDINATES TO DCD FILE AT STEP 1000
>>>>>>>>>>> WRITING COORDINATES TO RESTART FILE AT STEP 1000
>>>>>>>>>>> FINISHED WRITING RESTART COORDINATES
>>>>>>>>>>> The last position output (seq=1000) takes 0.026 seconds, 238.145
>>>>>>>>>>> MB
>>>>>>>>>>> of memory in use
>>>>>>>>>>> WRITING VELOCITIES TO RESTART FILE AT STEP 1000
>>>>>>>>>>> FINISHED WRITING RESTART VELOCITIES
>>>>>>>>>>> The last velocity output (seq=1000) takes 0.019 seconds, 238.145
>>>>>>>>>>> MB
>>>>>>>>>>> of memory in use
>>>>>>>>>>> REINITIALIZING VELOCITIES AT STEP 1000 TO 303 KELVIN.
>>>>>>>>>>> TCL: Running for 9000 steps
>>>>>>>>>>> PRESSURE: 1000 -1607.18 5.85548 -10.9122 5.85548 -1546.56 26.3568
>>>>>>>>>>> -10.9122 26.3568 -886.287
>>>>>>>>>>> GPRESSURE: 1000 -1469.55 7.5989 -10.7156 10.9579 -1410.74 22.6426
>>>>>>>>>>> -10.5674 20.7688 -1127
>>>>>>>>>>> ETITLE: TS BOND ANGLE DIHED
>>>>>>>>>>> IMPRP ELECT VDW BOUNDARY
>>>>>>>>>>> MISC KINETIC TOTAL TEMP POTENTIAL
>>>>>>>>>>> TOTAL3 TEMPAVG PRESSURE GPRESSURE
>>>>>>>>>>> ENERGY: 1000 607.1667 6226.7038 11604.6460
>>>>>>>>>>> 13.8497 -203337.4899 27.6364 0.0000
>>>>>>>>>>> 0.0000 52831.6131 -132025.8742 303.3486
>>>>>>>>>>> -184857.4873 -132057.5192 303.3486 -1346.6784
>>>>>>>>>>> -1335.7638
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It takes some hours until this message is printed:
>>>>>>>>>>> [0] processControlPoints() haveControlPointChangeCallback=0
>>>>>>>>>>> frameworkShouldAdvancePhase=0
>>>>>>>>>>>
>>>>>>>>>>> Any clue where I could search?
>>>>>>>>>>> If you need more information, don't hesitate to ask.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Bjoern
>>>>>
>>>>> --
>>>>> Bjoern Olausson
>>>>> Martin-Luther-Universität Halle-Wittenberg
>>>>> Fachbereich Biochemie/Biotechnologie
>>>>> Kurt-Mothes-Str. 3
>>>>> 06120 Halle/Saale
>>>>>
>>>>> Phone: +49-345-55-24942
>>>>
>>
>
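
Editorial note: earlier in the thread, Jim asks for outputTiming 1 to be added so that the log shows which timestep the run actually hangs on. As a minimal sketch of how such a log could then be checked, the Python below scans log lines and returns the step number from the last TIMING: or ENERGY: line it sees. The function name `last_reported_step` and the sample lines are illustrative assumptions and not part of NAMD; the line prefixes follow the log excerpt quoted above.

```python
import re

# Matches lines such as "TIMING: 1000 CPU: ..." or "ENERGY: 999 5831.4354 ...".
# PRESSURE: and GPRESSURE: lines are deliberately not matched.
STEP_LINE = re.compile(r"^(TIMING|ENERGY):\s+(\d+)\b")


def last_reported_step(log_lines):
    """Return the step number of the last TIMING/ENERGY line, or None."""
    last = None
    for line in log_lines:
        match = STEP_LINE.match(line)
        if match:
            last = int(match.group(2))
    return last


# Hypothetical sample, shortened from the kind of output quoted in the thread.
sample = [
    "ENERGY: 999 5831.4354 9616.9842 11604.8301",
    "GPRESSURE: 1000 -3056.02 0.387877 -3.93892",
    "TIMING: 1000 CPU: 24.3443, 0.0242553/step Wall: 24.388,",
    "WRITING COORDINATES TO DCD FILE AT STEP 1000",
]

if __name__ == "__main__":
    print(last_reported_step(sample))
```

For a real log, the lines could be read from the file (for example `open(path)` iterates line by line). A stalled run shows up as a last reported step that stops advancing between checks; the step at which it stops may differ from run to run, as reported in the thread.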