Re: crash with more than 96 processors (v2.7b1)

From: Grace Brannigan (gracebrannigan_at_gmail.com)
Date: Fri May 01 2009 - 16:36:11 CDT

Hi George,

When I turn off pressure control, starting from a frame that had been
equilibrated with pressure control, it starts running and then I get the
following errors:

ERROR: Stray PME grid charges detected: 38 sending to 127 for planes 98 99
ERROR: Margin is too small for 3 atoms during timestep 81.
ERROR: Incorrect nonbonded forces and energies may be calculated!

with the first error repeated many times. Again, when I run the same
configuration on 96 procs there is no problem. Were you using PME when you
got yours to work?
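
For reference, turning pressure control off amounts to disabling the
Langevin piston section of the config quoted at the bottom of this thread;
a minimal sketch (constant-temperature Langevin dynamics stays on):

# CONSTANT-P disabled for this restart
langevinPiston         off
#langevinPistonTarget  1
#langevinPistonPeriod  200
#langevinPistonDecay   100
#langevinPistonTemp    $temp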

-Grace

On Fri, May 1, 2009 at 11:28 AM, George Madalin Giambasu <giambasu_at_gmail.com> wrote:

> I just wanted to report that similar behavior is observed on BlueGene/P.
> Increasing the number of processors (>64) makes NAMD crash randomly between
> steps 1000-30000. Turning off the Langevin or Berendsen pressure control
> eliminates the crashing. My system is ~90000 atoms, Amber force field (in
> Amber format), using a rhombic-dodecahedron unit cell.
>
>
>
> George
>
>
>
>
> Grace Brannigan wrote:
>
> Hi Chris,
>
> The closest distance is between the oxygens of the two waters: 1.69 Å.
> There was no clash at the beginning of the simulation (they started at
> 2.73 Å). The system is shrinking as a result of the Langevin piston, but
> should that depend on the number of nodes?
>
> A timestep of 1.0 results in a crash as well with the same waters, after
> more steps but about the same number of femtoseconds.
>
> Changing the hgroupCutoff to 2.5 actually made the simulation crash at step
> 120 instead of 140.
>
> Increasing the margin to 1.0 doesn't change anything.
>
> Ideas?
>
> -Grace
>
> On Thu, Apr 30, 2009 at 8:41 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>
>> Grace,
>>
>> A few questions:
>> You say "close"; can you give the distance between the closest atoms in
>> Angstroms?
>>
>> Are the two closest atoms hydrogens, by chance? If so, could you try a
>> restart from something fairly close to step 140, using a timestep of 1.0?
>>
>> Also, is there a specific reason for the hgroupCutoff value of 2.8? If
>> not, could you try reducing it to 2.5 and see if that makes a difference?
>>
>> If neither of these makes a difference, could you increase the margin to
>> 0.5 or 1.0 and test that?
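>>
>> Put together, those tests amount to roughly these config lines (a sketch
>> of the suggested changes, not a tested input):
>>
>> timestep      1.0    ;# down from 2.0, restarting near step 140
>> hgroupCutoff  2.5    ;# down from 2.8
>> margin        1.0    ;# up from the default of 0.0 (or try 0.5 first)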
>>
>>
>> C.
>>
>>
>> --
>> Chris Harrison, Ph.D.
>> Theoretical and Computational Biophysics Group
>> NIH Resource for Macromolecular Modeling and Bioinformatics
>> Beckman Institute for Advanced Science and Technology
>> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>>
>> char_at_ks.uiuc.edu Voice: 217-244-1733
>> http://www.ks.uiuc.edu/~char
>> Fax: 217-244-6078
>>
>>
>>
>> On Thu, Apr 30, 2009 at 2:49 PM, Grace Brannigan <gracebrannigan_at_gmail.com> wrote:
>>
>>> Hi Chris,
>>>
>>> I did as you suggested. For the system run on 128 nodes, the energies
>>> right before the crash at step 140 are:
>>>
>>> ENERGY: 139 368.4760 1082.8205 1324.6014 45.3948 -251964.9404 25326.2532 496.9566 0.0000 6871.6446 -216448.7934 57.8604 -216443.9396 -216448.2858 57.8604 -228.2303 -274.6782 587038.2456 -228.2303 -274.6782
>>>
>>> ENERGY: 140 366.8165 1084.7263 1325.5485 46.3538 -251992.5581 26939.8959 495.4494 0.0000 99999999.9999 99999999.9999 99999999.9999 99999999.9999 nan -99999999.9999 -99999999.9999 -99999999.9999 586888.6700 -99999999.9999 -99999999.9999
>>>
>>> For comparison, the energies at the same step on 96 nodes are:
>>>
>>> ENERGY: 139 358.1118 1087.0480 1328.9915 46.5093 -252345.2854 25274.9919 497.1248 0.0000 6527.0026 -217225.5054 54.9585 -217220.9702 -217225.8113 54.9585 -302.9743 -347.3116 587059.0631 -302.9743 -347.3116
>>>
>>> Looking at the dcd file, I see two water molecules (the ones with
>>> infinite velocities at step 140) that are close right before the crash,
>>> but not overlapping.
>>>
>>> -Grace
>>>
>>>
>>>
>>> On Tue, Apr 28, 2009 at 7:30 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>>>
>>>> Could you please set the following parameters as indicated and rerun the
>>>> 128 proc job on either cluster:
>>>>
>>>> dcdfreq 1
>>>> outputEnergies 1
>>>>
>>>> The idea is to isolate, by looking at the components of the energy in
>>>> the log file and the changes in the structure from the dcd file, anything
>>>> "physical" in your simulation that may be "blowing up." If something
>>>> physical is "blowing up", you will need to do two things:
>>>>
>>>> 1. Examine the energy components in the log file at the corresponding
>>>> timestep. The component that shoots up should correspond to the physical
>>>> interaction responsible for the "physical blow-up."
>>>>
>>>> 2. Compare the dynamics and energy-component trends to the 96-processor
>>>> simulation to examine their similarity, and assess how reasonable it is
>>>> that MTS yielded dynamics different enough to crash one sim with X procs
>>>> but not one with Y procs. Basically: are the simulations comparable up to
>>>> a point, and at what point do they seriously diverge before the crash? In
>>>> which regime of MTS (based on your config parameters) does this seem to
>>>> fit? A small script for this comparison is sketched below. We need to
>>>> figure out whether we're looking at a difference in dynamics, or a "bug"
>>>> yielding a "physically realistic blow-up" that only shows up during a
>>>> parallel process like patch migration/reduction when using 128 as
>>>> opposed to 96 procs.
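>>>>
>>>> A minimal Tcl sketch of that comparison (the log file names here are
>>>> assumptions; it prints, per timestep, the difference in each ENERGY
>>>> column between the two runs):
>>>>
>>>> # Read "ENERGY:" lines from a NAMD log into a dict keyed by timestep.
>>>> proc readEnergies {fname} {
>>>>     set out [dict create]
>>>>     set fh [open $fname r]
>>>>     while {[gets $fh line] >= 0} {
>>>>         if {[lindex $line 0] eq "ENERGY:"} {
>>>>             dict set out [lindex $line 1] [lrange $line 2 end]
>>>>         }
>>>>     }
>>>>     close $fh
>>>>     return $out
>>>> }
>>>>
>>>> set run96  [readEnergies run_96procs.log]   ;# assumed file names
>>>> set run128 [readEnergies run_128procs.log]
>>>>
>>>> dict for {ts vals96} $run96 {
>>>>     if {![dict exists $run128 $ts]} continue
>>>>     set diffs {}
>>>>     foreach a $vals96 b [dict get $run128 $ts] {
>>>>         # Columns that hit nan or 99999999.9999 will not subtract cleanly.
>>>>         if {[catch {format %.4f [expr {$b - $a}]} d]} { set d n/a }
>>>>         lappend diffs $d
>>>>     }
>>>>     puts "TS $ts: $diffs"
>>>> }
>>>>
>>>> With outputEnergies 1 in both runs, this shows the step at which the two
>>>> trajectories stop tracking each other.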
>>>>
>>>> If nothing physical is blowing up and the simulation is really just
>>>> spontaneously crashing on both architectures at 128 procs, then we'll
>>>> have to dig deeper and consider running your simulation with debug flags
>>>> to trace things to the source of the crash.
>>>>
>>>>
>>>> C.
>>>>
>>>>
>>>> --
>>>> Chris Harrison, Ph.D.
>>>> Theoretical and Computational Biophysics Group
>>>> NIH Resource for Macromolecular Modeling and Bioinformatics
>>>> Beckman Institute for Advanced Science and Technology
>>>> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>>>>
>>>> char_at_ks.uiuc.edu Voice: 217-244-1733
>>>> http://www.ks.uiuc.edu/~char
>>>> Fax: 217-244-6078
>>>>
>>>>
>>>>
>>>> On Tue, Apr 28, 2009 at 5:36 PM, Grace Brannigan <grace_at_vitae.cmm.upenn.edu> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> The 128-processor job dies immediately, while the 96-processor job can
>>>>> go on forever (or at least 4 ns).
>>>>>
>>>>> Our cluster uses dual quad-core Xeon E5430 nodes with an InfiniBand
>>>>> interconnect, and yes, it dies at 128 cores on both clusters.
>>>>>
>>>>> -Grace
>>>>>
>>>>> On Tue, Apr 28, 2009 at 5:50 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>>>>>
>>>>>> Grace,
>>>>>>
>>>>>> You say "your cluster"; I'm assuming this isn't an XT5. ;)
>>>>>>
>>>>>> Can you provide some details on your cluster and clarify if you mean
>>>>>> 128 procs on both clusters, irrespective of architecture?
>>>>>>
>>>>>> Also, you have confirmed that with the lower # of procs you can run
>>>>>> past the step at which the "128 proc" job dies, correct?
>>>>>>
>>>>>>
>>>>>> C.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Chris Harrison, Ph.D.
>>>>>> Theoretical and Computational Biophysics Group
>>>>>> NIH Resource for Macromolecular Modeling and Bioinformatics
>>>>>> Beckman Institute for Advanced Science and Technology
>>>>>> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>>>>>>
>>>>>> char_at_ks.uiuc.edu Voice: 217-244-1733
>>>>>> http://www.ks.uiuc.edu/~char
>>>>>> Fax: 217-244-6078
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 28, 2009 at 2:16 PM, Grace Brannigan <grace_at_vitae.cmm.upenn.edu> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have been simulating a protein in a truncated octahedral water box
>>>>>>> (~90k atoms) using NAMD 2.7b1. On both our local cluster and Jim's
>>>>>>> kraken build, the job runs fine if I use up to 96 processors. With 128,
>>>>>>> the job crashes with an error message that is not consistent: either
>>>>>>> "bad global exclusion count", atoms with nan velocities, or just a seg
>>>>>>> fault. I haven't had any problems like this with the other jobs I've
>>>>>>> been running under v2.7b1, which, admittedly, have more conventional
>>>>>>> geometries. My conf file is below - any ideas?
>>>>>>>
>>>>>>> -Grace
>>>>>>>
>>>>>>>
>>>>>>> **********************
>>>>>>>
>>>>>>> # FILENAMES
>>>>>>> set outName [file rootname [file tail [info script]]]
>>>>>>> #set inFileNum [expr [scan [string range $outName end-1 end] "%d"] - 1]
>>>>>>> #set inName [format "%s%02u" [string range $outName 0 end-2] $inFileNum]
>>>>>>> #set inName ionized
>>>>>>> set inName min01
>>>>>>> set homedir ../../..
>>>>>>> set sourcepath ../../solvate_and_ionize/riso
>>>>>>>
>>>>>>> timestep 2.0
>>>>>>>
>>>>>>> structure $sourcepath/ionized.psf
>>>>>>> parameters $homedir/toppar/par_all27_prot_lipid.prm
>>>>>>> parameters $homedir/toppar/par_isoflurane_RS.inp
>>>>>>> paraTypeCharmm on
>>>>>>>
>>>>>>> set temp 300.0
>>>>>>> #temperature $temp
>>>>>>> # RESTRAINTS
>>>>>>>
>>>>>>> constraints on
>>>>>>> consref $sourcepath/constraints.pdb
>>>>>>> conskfile $sourcepath/constraints.pdb
>>>>>>> conskcol O
>>>>>>>
>>>>>>> # INPUT
>>>>>>>
>>>>>>> coordinates $sourcepath/ionized.pdb
>>>>>>> extendedsystem $inName.xsc
>>>>>>> binvelocities $inName.vel
>>>>>>> bincoordinates $inName.coor
>>>>>>> #cellBasisVector1 108 0 0
>>>>>>> #cellBasisVector2 0 108 0
>>>>>>> #cellBasisVector3 54 54 54
>>>>>>>
>>>>>>> # OUTPUT
>>>>>>>
>>>>>>> outputenergies 500
>>>>>>> outputtiming 500
>>>>>>> outputpressure 500
>>>>>>> binaryoutput yes
>>>>>>> outputname [format "%so" $outName]
>>>>>>> restartname $outName
>>>>>>> restartfreq 500
>>>>>>> binaryrestart yes
>>>>>>>
>>>>>>> XSTFreq 500
>>>>>>> COMmotion no
>>>>>>>
>>>>>>> # DCD TRAJECTORY
>>>>>>>
>>>>>>> DCDfile $outName.dcd
>>>>>>> DCDfreq 5000
>>>>>>>
>>>>>>> # CUT-OFFs
>>>>>>>
>>>>>>> splitpatch hydrogen
>>>>>>> hgroupcutoff 2.8
>>>>>>> stepspercycle 20
>>>>>>> switching on
>>>>>>> switchdist 10.0
>>>>>>> cutoff 12.0
>>>>>>> pairlistdist 13.0
>>>>>>>
>>>>>>> #margin 1.0
>>>>>>>
>>>>>>> wrapWater no
>>>>>>>
>>>>>>> # CONSTANT-T
>>>>>>>
>>>>>>> langevin on
>>>>>>> langevinTemp $temp
>>>>>>> langevinDamping 0.1
>>>>>>>
>>>>>>> # CONSTANT-P
>>>>>>>
>>>>>>> useFlexibleCell no
>>>>>>> useConstantRatio no
>>>>>>> useGroupPressure yes
>>>>>>>
>>>>>>> langevinPiston on
>>>>>>> langevinPistonTarget 1
>>>>>>> langevinPistonPeriod 200
>>>>>>> langevinPistonDecay 100
>>>>>>> langevinPistonTemp $temp
>>>>>>>
>>>>>>> # PME
>>>>>>>
>>>>>>> PME yes
>>>>>>> PMETolerance 10e-6
>>>>>>> PMEInterpOrder 4
>>>>>>>
>>>>>>> PMEGridSizeX 120
>>>>>>> PMEGridSizeY 120
>>>>>>> PMEGridSizeZ 96
>>>>>>>
>>>>>>> # MULTIPLE TIME-STEP
>>>>>>>
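>>>>>>> # i.e. full electrostatics every 2 steps (4 fs at a 2.0 fs timestep),
>>>>>>> # nonbonded every step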
>>>>>>> fullelectfrequency 2
>>>>>>> nonbondedfreq 1
>>>>>>>
>>>>>>> # SHAKE/RATTLE
>>>>>>>
>>>>>>> rigidBonds all
>>>>>>>
>>>>>>> # 1-4's
>>>>>>>
>>>>>>> exclude scaled1-4
>>>>>>> 1-4scaling 1.0
>>>>>>>
>>>>>>> constraintscaling 1.0
>>>>>>> run 250000
>>>>>>> constraintscaling 0.0
>>>>>>> run 1250000
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
> --
> ________________________________________________________________________
> George Madalin Giambasu PhD Student
> University of Minnesota Phone : (612) 625-6317
> Department of Chemistry Fax : (612) 626-7541
> 207 Pleasant St. SE e-mail:
> Minneapolis, MN USA 55455-0431 GeorgeMGiambasu_at_umn.edu
> York Research Group giambasu_at_gmail.com http://theory.chem.umn.edu/
> ________________________________________________________________________
>
>
