crash with more than 96 processors (v2.7b1)

From: Grace Brannigan (gracebrannigan_at_gmail.com)
Date: Thu Apr 30 2009 - 14:49:58 CDT

Hi Chris,

I did as you suggested. For the system run on 128 processors, the energies
right before and at the crash (step 140) are:

ENERGY: 139 368.4760 1082.8205 1324.6014 45.3948 -251964.9404 25326.2532 496.9566 0.0000 6871.6446 -216448.7934 57.8604 -216443.9396 -216448.2858 57.8604 -228.2303 -274.6782 587038.2456 -228.2303 -274.6782

ENERGY: 140 366.8165 1084.7263 1325.5485 46.3538 -251992.5581 26939.8959 495.4494 0.0000 99999999.9999 99999999.9999 99999999.9999 99999999.9999 nan -99999999.9999 -99999999.9999 -99999999.9999 586888.6700 -99999999.9999 -99999999.9999

For comparison, the energies at the same step on 96 processors are:

ENERGY: 139 358.1118 1087.0480 1328.9915 46.5093 -252345.2854 25274.9919 497.1248 0.0000 6527.0026 -217225.5054 54.9585 -217220.9702 -217225.8113 54.9585 -302.9743 -347.3116 587059.0631 -302.9743 -347.3116
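
For reference, a minimal Tcl sketch of how the two logs could be compared
column by column at a given step; the log filenames and the step number below
are placeholders, not the actual files:

proc energyFields {logfile step} {
    # Return the energy columns (everything after "ENERGY: <step>") for $step.
    set fh [open $logfile r]
    set fields {}
    while {[gets $fh line] >= 0} {
        if {[string match "ENERGY:*" $line] && [lindex $line 1] == $step} {
            set fields [lrange $line 2 end]
            break
        }
    }
    close $fh
    return $fields
}

# Columns printed: 128-proc value, 96-proc value, difference (128 minus 96).
foreach a [energyFields run128.log 139] b [energyFields run96.log 139] {
    puts [format "%16.4f %16.4f %16.4f" $a $b [expr {$a - $b}]]
}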

Looking at the dcd file, the two water molecules that acquire infinite
velocities at step 140 are close right before the crash, but not overlapping.
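
In case it helps, a minimal VMD Tcl sketch of that distance check; the atom
indices 1234 and 5678 are placeholders for the oxygens of the two suspect
waters, and out.dcd stands in for the 128-proc trajectory written with
dcdfreq 1 (run with: vmd -dispdev text -e check_waters.tcl):

# Load the structure and the full trajectory.
mol new ionized.psf type psf
mol addfile out.dcd type dcd waitfor all
set nframes [molinfo top get numframes]
for {set f 0} {$f < $nframes} {incr f} {
    # measure bond gives the distance (in Angstroms) between the two atoms.
    puts [format "frame %4d  d(O-O) = %.3f" $f [measure bond {1234 5678} frame $f]]
}
quit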

-Grace

On Tue, Apr 28, 2009 at 7:30 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:

> Could you please set the following parameters as indicated and rerun the
> 128 proc job on either cluster:
>
> dcdfreq 1
> outputEnergies 1
>
> The idea is to use the energy components in the log file and the structural
> changes in the dcd file to isolate anything "physical" in your simulation
> that may be "blowing up." If something physical is "blowing up", you will
> need to do two things:
>
> 1. Examine the energy components in the log file at the corresponding
> timestep. The component that shoots up should correspond to the physical
> interaction responsible for the blow up.
>
> 2. You should probably also compare the dynamics and energy-component trends
> to the 96 processor simulation, to assess their similarity and how plausible
> it is that MTS yielded dynamics different enough to crash a run with X procs
> but not a run with Y procs. Basically: are the simulations comparable up to
> a point, at what point do they seriously diverge before the crash, and which
> regime of MTS (based on your config parameters) does that divergence fall
> in? We need to figure out whether we're looking at a difference in dynamics,
> or a "bug" yielding a "physically realistic blow up" that only shows up
> during a parallel operation like patch migration/reduction, etc. when using
> 128 as opposed to 96 procs.
>
> If nothing physical is blowing up and the simulation really is just
> spontaneously crashing on both architectures with 128 procs, then we'll have
> to dig deeper and consider running your simulation with debug flags and
> tracing things back to the source of the crash.
>
>
> C.
>
>
> --
> Chris Harrison, Ph.D.
> Theoretical and Computational Biophysics Group
> NIH Resource for Macromolecular Modeling and Bioinformatics
> Beckman Institute for Advanced Science and Technology
> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>
> char_at_ks.uiuc.edu Voice: 217-244-1733
> http://www.ks.uiuc.edu/~char
> Fax: 217-244-6078
>
>
>
> On Tue, Apr 28, 2009 at 5:36 PM, Grace Brannigan <
> grace_at_vitae.cmm.upenn.edu> wrote:
>
>> Hi Chris,
>>
>> The 128 processor job dies immediately, while the 96 processor job can go
>> on forever (or at least for 4 ns).
>>
>> Our cluster has dual quad-core Xeon E5430 nodes with an InfiniBand
>> interconnect, and yes, it dies at 128 cores on both clusters.
>>
>> -Grace
>>
>>
>> On Tue, Apr 28, 2009 at 5:50 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>>
>>> Grace,
>>>
>>> You say "your cluster", so I'm assuming this isn't an XT5. ;)
>>>
>>> Can you provide some details on your cluster and clarify if you mean 128
>>> procs on both clusters, irrespective of architecture?
>>>
>>> Also, you have confirmed that with the lower number of procs you can run
>>> past the timestep at which the "128 proc" job dies, correct?
>>>
>>>
>>> C.
>>>
>>>
>>> --
>>> Chris Harrison, Ph.D.
>>> Theoretical and Computational Biophysics Group
>>> NIH Resource for Macromolecular Modeling and Bioinformatics
>>> Beckman Institute for Advanced Science and Technology
>>> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>>>
>>> char_at_ks.uiuc.edu Voice: 217-244-1733
>>> http://www.ks.uiuc.edu/~char
>>> Fax: 217-244-6078
>>>
>>>
>>>
>>>
>>> On Tue, Apr 28, 2009 at 2:16 PM, Grace Brannigan <
>>> grace_at_vitae.cmm.upenn.edu> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have been simulating a protein in a truncated octahedral water box
>>>> (~90k atoms) using NAMD 2.7b1. On both our local cluster and Jim's Kraken
>>>> build, the job runs fine if I use up to 96 processors. With 128, the job
>>>> crashes with an error message that is not consistent: it can be "bad
>>>> global exclusion count", atoms with nan velocities, or just a seg fault.
>>>> I haven't had any problems like this with the other jobs I've been
>>>> running under v2.7b1, which, admittedly, have more conventional geometries.
>>>> My conf file is below; any ideas?
>>>>
>>>> -Grace
>>>>
>>>>
>>>> **********************
>>>>
>>>> # FILENAMES
>>>> set outName [file rootname [file tail [info script]]]
>>>> #set inFileNum [expr [scan [string range $outName end-1 end] "%d"] - 1]
>>>> #set inName [format "%s%02u" [string range $outName 0 end-2] $inFileNum]
>>>> #set inName ionized
>>>> set inName min01
>>>> set homedir ../../..
>>>> set sourcepath ../../solvate_and_ionize/riso
>>>>
>>>> timestep 2.0
>>>>
>>>> structure $sourcepath/ionized.psf
>>>> parameters $homedir/toppar/par_all27_prot_lipid.prm
>>>> parameters $homedir/toppar/par_isoflurane_RS.inp
>>>> paraTypeCharmm on
>>>>
>>>> set temp 300.0
>>>> #temperature $temp
>>>> # RESTRAINTS
>>>>
>>>> constraints on
>>>> consref $sourcepath/constraints.pdb
>>>> conskfile $sourcepath/constraints.pdb
>>>> conskcol O
>>>>
>>>> # INPUT
>>>>
>>>> coordinates $sourcepath/ionized.pdb
>>>> extendedsystem $inName.xsc
>>>> binvelocities $inName.vel
>>>> bincoordinates $inName.coor
>>>> #cellBasisVector1 108 0 0
>>>> #cellBasisVector2 0 108 0
>>>> #cellBasisVector3 54 54 54
>>>>
>>>> # OUTPUT
>>>>
>>>> outputenergies 500
>>>> outputtiming 500
>>>> outputpressure 500
>>>> binaryoutput yes
>>>> outputname [format "%so" $outName]
>>>> restartname $outName
>>>> restartfreq 500
>>>> binaryrestart yes
>>>>
>>>> XSTFreq 500
>>>> COMmotion no
>>>>
>>>> # DCD TRAJECTORY
>>>>
>>>> DCDfile $outName.dcd
>>>> DCDfreq 5000
>>>>
>>>> # CUT-OFFs
>>>>
>>>> splitpatch hydrogen
>>>> hgroupcutoff 2.8
>>>> stepspercycle 20
>>>> switching on
>>>> switchdist 10.0
>>>> cutoff 12.0
>>>> pairlistdist 13.0
>>>>
>>>> #margin 1.0
>>>>
>>>> wrapWater no
>>>>
>>>> # CONSTANT-T
>>>>
>>>> langevin on
>>>> langevinTemp $temp
>>>> langevinDamping 0.1
>>>>
>>>> # CONSTANT-P
>>>>
>>>> useFlexibleCell no
>>>> useConstantRatio no
>>>> useGroupPressure yes
>>>>
>>>> langevinPiston on
>>>> langevinPistonTarget 1
>>>> langevinPistonPeriod 200
>>>> langevinPistonDecay 100
>>>> langevinPistonTemp $temp
>>>>
>>>> # PME
>>>>
>>>> PME yes
>>>> PMETolerance 10e-6
>>>> PMEInterpOrder 4
>>>>
>>>> PMEGridSizeX 120
>>>> PMEGridSizeY 120
>>>> PMEGridSizeZ 96
>>>>
>>>> # MULTIPLE TIME-STEP
>>>>
>>>> fullelectfrequency 2
>>>> nonbondedfreq 1
>>>>
>>>> # SHAKE/RATTLE
>>>>
>>>> rigidBonds all
>>>>
>>>> # 1-4's
>>>>
>>>> exclude scaled1-4
>>>> 1-4scaling 1.0
>>>>
>>>> constraintscaling 1.0
>>>> run 250000
>>>> constraintscaling 0.0
>>>> run 1250000
>>>>
>>>>
>>>
>>
>
