Re: crash with more than 96 processors (v2.7b1)

From: Chris Harrison (char_at_ks.uiuc.edu)
Date: Thu Apr 30 2009 - 19:41:55 CDT

Grace,

A few questions:
You say "close," can you give the distance between the closest atoms in
Angstroms?
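
If it helps, a rough sketch in VMD's Tcl console along these lines should
list any non-bonded pairs closer than 1.0 A in the frame just before the
crash (the psf/dcd names, the cutoff, and the frame index are placeholders
for your actual files; with dcdfreq 1 the frame number should roughly track
the step number):

    # sketch only: check close contacts in the frame near step 139
    mol new ionized.psf
    mol addfile out.dcd type dcd waitfor all    ;# placeholder trajectory name
    set sel [atomselect top "all" frame 139]
    # measure contacts returns two index lists of non-bonded pairs within 1.0 A
    set pairs [measure contacts 1.0 $sel]
    foreach i [lindex $pairs 0] j [lindex $pairs 1] {
        puts "atoms $i $j : [measure bond [list $i $j] frame 139] A"
    }
    $sel delete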

Are the two closest atoms hydrogens by chance? If so, could you try a
restart from something fairly close to step 140, using a timestep of 1.0?

Also, is there a specific reason for the hgroupCutoff value 2.8? If not,
could you try reducing that to 2.5 and see if that makes a difference?

If neither of these makes a difference, could you increase the margin to 0.5
or 1.0 and test that?
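
For reference, in config terms the tests above would look roughly like this
(just a sketch; the values are the ones suggested above, and everything else
stays as in your config below):

    timestep        1.0    ;# down from 2.0, for the restart near step 140
    hgroupcutoff    2.5    ;# down from 2.8
    margin          0.5    ;# or 1.0; currently commented out in your config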

C.

--
Chris Harrison, Ph.D.
Theoretical and Computational Biophysics Group
NIH Resource for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
char_at_ks.uiuc.edu                            Voice: 217-244-1733
http://www.ks.uiuc.edu/~char               Fax: 217-244-6078
On Thu, Apr 30, 2009 at 2:49 PM, Grace Brannigan
<gracebrannigan_at_gmail.com> wrote:
> Hi Chris,
>
> I did as you suggested. For the system run on 128 nodes, the energies right
> before the crash at step 140 are:
>
> ENERGY:     139       368.4760      1082.8205      1324.6014        45.3948   -251964.9404     25326.2532       496.9566         0.0000      6871.6446   -216448.7934        57.8604   -216443.9396   -216448.2858        57.8604       -228.2303      -274.6782    587038.2456      -228.2303      -274.6782
>
> ENERGY:     140       366.8165      1084.7263      1325.5485        46.3538   -251992.5581     26939.8959       495.4494         0.0000   99999999.9999   99999999.9999   99999999.9999   99999999.9999            nan  -99999999.9999  -99999999.9999  -99999999.9999    586888.6700  -99999999.9999  -99999999.9999
>
> For comparison, the energies at the same step on 96 nodes are
>
> ENERGY:     139       358.1118      1087.0480      1328.9915        46.5093   -252345.2854     25274.9919       497.1248         0.0000      6527.0026   -217225.5054        54.9585   -217220.9702   -217225.8113        54.9585       -302.9743      -347.3116    587059.0631      -302.9743      -347.3116
>
> Looking at the dcd file, there are two water molecules (the ones with
> infinite velocity at step 140) that are close right before the crash, but
> not overlapping.
>
> -Grace
>
>
>
>
> On Tue, Apr 28, 2009 at 7:30 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>
>> Could you please set the following parameters as indicated and rerun the
>> 128 proc job on either cluster:
>>
>> dcdfreq 1
>> outputEnergies 1
>>
>> The idea is to isolate, via looking at the components of the energy in the
>> log file and the changes in the structure from the dcd file, anything
>> "physical" in your simulation that may be "blowing up."  If there is
>> something physical "blowing up", you will need to do two things:
>>
>> 1. Examine the energy components in the log file at the corresponding
>> timesteps. The component that shoots up should correspond to the physical
>> interaction responsible for the "physical blowing up."
>>
>> 2. You should probably also compare the dynamics and energy-component
>> trends against the 96-processor simulation, to examine their similarity and
>> assess how plausible it is that MTS yielded dynamics different enough to
>> crash the run with X # of procs but not with Y # of procs. Basically: are
>> the simulations comparable up to a point, at what point do they diverge
>> sharply before the crash, and which regime of MTS (based on your config
>> parameters) does that divergence seem to fit? We need to figure out if we're
>> looking at a difference in dynamics or if there's a "bug" yielding a
>> "physically realistic blow up" that only shows up during a parallel process
>> like patch migration/reduction, etc., when using 128 as opposed to 96 procs.
>>
>> If nothing physical is blowing up and the simulation is really just
>> spontaneously crashing on both architectures with 128 procs, then we'll have
>> to dig deeper and consider running your simulation with debug flags and
>> tracing things to the source of the crash.
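>>
>> As a rough sketch (plain Tcl; the script name and column number are just
>> placeholders, check the column against the ETITLE line in your own log),
>> something like this pulls one energy component out of a log so the two runs
>> can be compared side by side:
>>
>>     # hypothetical usage: tclsh energycol.tcl namd_128proc.log 7
>>     set fname [lindex $argv 0]
>>     set col   [lindex $argv 1]   ;# e.g. 7 = VDW, counting ENERGY: as field 0
>>     set f [open $fname r]
>>     while {[gets $f line] >= 0} {
>>         if {[string match "ENERGY: *" $line]} {
>>             # split on whitespace and print the timestep plus the chosen column
>>             set fields [regexp -all -inline {\S+} $line]
>>             puts "[lindex $fields 1] [lindex $fields $col]"
>>         }
>>     }
>>     close $f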
>>
>>
>> C.
>>
>>
>> --
>> Chris Harrison, Ph.D.
>> Theoretical and Computational Biophysics Group
>> NIH Resource for Macromolecular Modeling and Bioinformatics
>> Beckman Institute for Advanced Science and Technology
>> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>>
>> char_at_ks.uiuc.edu                            Voice: 217-244-1733
>> http://www.ks.uiuc.edu/~char               Fax: 217-244-6078
>>
>>
>>
>> On Tue, Apr 28, 2009 at 5:36 PM, Grace Brannigan <
>> grace_at_vitae.cmm.upenn.edu> wrote:
>>
>>> Hi Chris,
>>>
>>> The 128 processor job dies immediately, while the 96 processor job can go
>>> on forever (or at least 4 ns).
>>>
>>> Our cluster has dual quad-core Xeon E5430 nodes with an InfiniBand
>>> interconnect, and yes, it dies at 128 cores on both clusters.
>>>
>>> -Grace
>>>
>>>
>>> On Tue, Apr 28, 2009 at 5:50 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>>>
>>>> Grace,
>>>>
>>>> You say "your cluster," so I'm assuming this isn't an XT5.  ;)
>>>>
>>>> Can you provide some details on your cluster and clarify if you mean 128
>>>> procs on both clusters, irrespective of architecture?
>>>>
>>>> Also, you have confirmed that with the lower # of procs you can run past
>>>> the step at which the "128 proc" job dies, correct?
>>>>
>>>>
>>>> C.
>>>>
>>>>
>>>> --
>>>> Chris Harrison, Ph.D.
>>>> Theoretical and Computational Biophysics Group
>>>> NIH Resource for Macromolecular Modeling and Bioinformatics
>>>> Beckman Institute for Advanced Science and Technology
>>>> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>>>>
>>>> char_at_ks.uiuc.edu                            Voice: 217-244-1733
>>>> http://www.ks.uiuc.edu/~char               Fax: 217-244-6078
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Apr 28, 2009 at 2:16 PM, Grace Brannigan <
>>>> grace_at_vitae.cmm.upenn.edu> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have been simulating a protein in a truncated octahedral water box
>>>>> (~90k atoms) using NAMD 2.7b1. On both our local cluster and Jim's Kraken
>>>>> build, the job runs fine if I use up to 96 processors. With 128 the job
>>>>> crashes after an error message that is not consistent: it can be "bad
>>>>> global exclusion count", atoms with nan velocities, or just a seg fault.
>>>>> I haven't had any problems like this with the other jobs I've been
>>>>> running using v2.7b1, which, admittedly, have more conventional geometries.
>>>>> My conf file is below - any ideas?
>>>>>
>>>>> -Grace
>>>>>
>>>>>
>>>>> **********************
>>>>>
>>>>> # FILENAMES
>>>>> set outName             [file rootname [file tail [info script]]]
>>>>> #set inFileNum          [expr [scan [string range $outName end-1 end] "%d"] - 1]
>>>>> #set inName             [format "%s%02u" [string range $outName 0 end-2] $inFileNum]
>>>>> #set inName           ionized
>>>>> set inName         min01
>>>>> set homedir       ../../..
>>>>> set sourcepath     ../../solvate_and_ionize/riso
>>>>>
>>>>> timestep            2.0
>>>>>
>>>>> structure           $sourcepath/ionized.psf
>>>>> parameters          $homedir/toppar/par_all27_prot_lipid.prm
>>>>> parameters         $homedir/toppar/par_isoflurane_RS.inp
>>>>> paraTypeCharmm      on
>>>>>
>>>>> set temp            300.0
>>>>> #temperature         $temp
>>>>> # RESTRAINTS
>>>>>
>>>>> constraints         on
>>>>> consref             $sourcepath/constraints.pdb
>>>>> conskfile           $sourcepath/constraints.pdb
>>>>> conskcol            O
>>>>>
>>>>> # INPUT
>>>>>
>>>>> coordinates         $sourcepath/ionized.pdb
>>>>> extendedsystem       $inName.xsc
>>>>> binvelocities        $inName.vel
>>>>> bincoordinates        $inName.coor
>>>>> #cellBasisVector1    108 0 0
>>>>> #cellBasisVector2     0 108 0
>>>>> #cellBasisVector3     54 54 54
>>>>>
>>>>> # OUTPUT
>>>>>
>>>>> outputenergies       500
>>>>> outputtiming         500
>>>>> outputpressure       500
>>>>> binaryoutput         yes
>>>>> outputname           [format "%so" $outName]
>>>>> restartname          $outName
>>>>> restartfreq         500
>>>>> binaryrestart        yes
>>>>>
>>>>> XSTFreq              500
>>>>> COMmotion         no
>>>>>
>>>>> # DCD TRAJECTORY
>>>>>
>>>>> DCDfile              $outName.dcd
>>>>> DCDfreq              5000
>>>>>
>>>>> # CUT-OFFs
>>>>>
>>>>> splitpatch           hydrogen
>>>>> hgroupcutoff         2.8
>>>>> stepspercycle        20
>>>>> switching            on
>>>>> switchdist           10.0
>>>>> cutoff               12.0
>>>>> pairlistdist         13.0
>>>>>
>>>>> #margin     1.0
>>>>>
>>>>> wrapWater        no
>>>>>
>>>>> # CONSTANT-T
>>>>>
>>>>> langevin                on
>>>>> langevinTemp            $temp
>>>>> langevinDamping         0.1
>>>>>
>>>>> # CONSTANT-P
>>>>>
>>>>> useFlexibleCell      no
>>>>> useConstantRatio     no
>>>>> useGroupPressure     yes
>>>>>
>>>>> langevinPiston       on
>>>>> langevinPistonTarget 1
>>>>> langevinPistonPeriod 200
>>>>> langevinPistonDecay  100
>>>>> langevinPistonTemp   $temp
>>>>>
>>>>> # PME
>>>>>
>>>>> PME                  yes
>>>>> PMETolerance         10e-6
>>>>> PMEInterpOrder       4
>>>>>
>>>>> PMEGridSizeX         120
>>>>> PMEGridSizeY         120
>>>>> PMEGridSizeZ         96
>>>>>
>>>>> # MULTIPLE TIME-STEP
>>>>>
>>>>> fullelectfrequency   2
>>>>> nonbondedfreq        1
>>>>>
>>>>> # SHAKE/RATTLE
>>>>>
>>>>> rigidBonds           all
>>>>>
>>>>> # 1-4's
>>>>>
>>>>> exclude              scaled1-4
>>>>> 1-4scaling           1.0
>>>>>
>>>>> constraintscaling   1.0
>>>>> run 250000
>>>>> constraintscaling   0.0
>>>>> run 1250000
>>>>>
>>>>>
>>>>
>>>
>>
>
>
