Re: crash with more than 96 processors (v2.7b1)

From: George Madalin Giambasu (giambasu_at_gmail.com)
Date: Fri May 01 2009 - 10:28:50 CDT

I just wanted to report that similar behavior is observed on Blue Gene/P.
Increasing the number of processors (>64) makes NAMD crash randomly
somewhere between steps 1000 and 30000. Turning off the Langevin or
Berendsen pressure control eliminates the crashes. My system is ~90,000
atoms, using the Amber force field (in Amber format) and a rhombic
dodecahedron periodic cell.
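
For reference, turning the pressure control off amounts to the following
switches in the NAMD config (standard NAMD keywords; a minimal sketch
rather than the full input):

# disable the Langevin piston barostat
langevinPiston      off
# or, if Berendsen coupling is in use instead:
BerendsenPressure   off

With either barostat disabled, the same runs finish normally on more than
64 processors.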

George

Grace Brannigan wrote:
> Hi Chris,
>
> The closest distance is between the oxygens of the two waters: 1.69 A.
> There was no clash at the beginning of the simulation (they started at
> 2.73 A). The system is shrinking as a result of the Langevin piston,
> but should that depend on the number of nodes?
>
> A timestep of 1.0 also results in a crash involving the same waters,
> after more steps but about the same number of femtoseconds.
>
> Changing the hgroupCutoff to 2.5 actually made the simulation crash at
> step 120 instead of 140.
>
> Increasing the margin to 1.0 doesn't change anything.
>
> Ideas?
>
> -Grace
>
> On Thu, Apr 30, 2009 at 8:41 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>
> Grace,
>
> A few questions:
> You say "close"; can you give the distance between the closest
> atoms in Angstroms?
>
> Are the two closest atoms hydrogens by chance? If so, could you
> try a restart from something fairly close to step 140, using a
> timestep of 1.0?
>
> Also, is there a specific reason for the hgroupCutoff value of 2.8?
> If not, could you try reducing it to 2.5 and see if that makes a
> difference?
>
> If neither of these makes a difference, could you increase the
> margin to 0.5 or 1.0 and test that?
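>
> In config terms, the changes to test would be (using the same
> keywords that appear in your conf file, with the values suggested
> above):
>
> timestep 1.0
> hgroupCutoff 2.5
> margin 1.0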
>
>
> C.
>
>
> --
> Chris Harrison, Ph.D.
> Theoretical and Computational Biophysics Group
> NIH Resource for Macromolecular Modeling and Bioinformatics
> Beckman Institute for Advanced Science and Technology
> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>
> char_at_ks.uiuc.edu
> Voice: 217-244-1733
> http://www.ks.uiuc.edu/~char
> Fax: 217-244-6078
>
>
>
> On Thu, Apr 30, 2009 at 2:49 PM, Grace Brannigan <gracebrannigan_at_gmail.com> wrote:
>
> Hi Chris,
>
> I did as you suggested. For the system run on 128 nodes, the
> energies right before the crash at step 140 are:
>
> ENERGY: 139 368.4760 1082.8205 1324.6014 45.3948 -251964.9404 25326.2532 496.9566 0.0000 6871.6446 -216448.7934 57.8604 -216443.9396 -216448.2858 57.8604 -228.2303 -274.6782 587038.2456 -228.2303 -274.6782
>
> ENERGY: 140 366.8165 1084.7263 1325.5485 46.3538 -251992.5581 26939.8959 495.4494 0.0000 99999999.9999 99999999.9999 99999999.9999 99999999.9999 nan -99999999.9999 -99999999.9999 -99999999.9999 586888.6700 -99999999.9999 -99999999.9999
>
> For comparison, the energies at the same step on 96 nodes are
>
> ENERGY: 139 358.1118 1087.0480 1328.9915 46.5093 -252345.2854 25274.9919 497.1248 0.0000 6527.0026 -217225.5054 54.9585 -217220.9702 -217225.8113 54.9585 -302.9743 -347.3116 587059.0631 -302.9743 -347.3116
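>
> (For reference, if I am reading the NAMD 2.7 ETITLE line correctly,
> the columns in these ENERGY records are:
>
> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP TOTAL2 TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>
> so at step 140 the individual potential terms are still finite, while
> the kinetic energy and every quantity derived from it overflows.)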
>
> Looking at the dcd file, there are two water molecules (the
> ones with infinite velocity at step 140) that are close right
> before the crash, but not overlapping.
>
> -Grace
>
>
>
>
> On Tue, Apr 28, 2009 at 7:30 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>
> Could you please set the following parameters as indicated
> and rerun the 128 proc job on either cluster:
>
> dcdfreq 1
> outputEnergies 1
>
> The idea is to isolate, by looking at the energy components in
> the log file and the structural changes in the dcd file,
> anything "physical" in your simulation that may be "blowing
> up." If something physical is blowing up, you will need to do
> two things:
>
> 1. Examine the energy components in the log file at the
> corresponding step. The component that shoots up should point
> to the physical interaction responsible for the blow-up.
>
> 2. You should probably also compare the dynamics and
> energy-component trends to the 96-processor simulation, to
> assess their similarity and how plausible it is that multiple
> timestepping (MTS) yielded dynamics different enough to crash
> a run with X processors but not one with Y processors.
> Basically: are the simulations comparable up to a point, at
> what point do they diverge sharply before the crash, and which
> MTS regime (given your config parameters) does that fall in?
> We need to figure out whether we are looking at a genuine
> difference in dynamics, or a "bug" that yields a "physically
> realistic blow-up" but only shows up during a parallel
> operation such as patch migration/reduction when using 128 as
> opposed to 96 procs.
>
> If there is nothing physical blowing up and the simulation is
> really just spontaneously crashing on both architectures with
> 128 procs, then we'll have to dig deeper, running your
> simulation with debug flags and tracing things to the source
> of the crash.
>
>
>
> C.
>
>
> --
> Chris Harrison, Ph.D.
> Theoretical and Computational Biophysics Group
> NIH Resource for Macromolecular Modeling and Bioinformatics
> Beckman Institute for Advanced Science and Technology
> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>
> char_at_ks.uiuc.edu
> Voice: 217-244-1733
> http://www.ks.uiuc.edu/~char
> Fax: 217-244-6078
>
>
>
> On Tue, Apr 28, 2009 at 5:36 PM, Grace Brannigan <grace_at_vitae.cmm.upenn.edu> wrote:
>
> Hi Chris,
>
> The 128-processor job dies immediately, while the 96-processor
> job can go on forever (or at least 4 ns).
>
> Our cluster has dual quad-core Xeon E5430 nodes with an
> InfiniBand interconnect, and yes, it dies at 128 cores on both
> clusters.
>
> -Grace
>
>
> On Tue, Apr 28, 2009 at 5:50 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>
> Grace,
>
> You say "your cluster", so I'm assuming this isn't an XT5. ;)
>
> Can you provide some details on your cluster, and clarify
> whether you mean 128 procs on both clusters, irrespective of
> architecture?
>
> Also, you have confirmed that with the lower number of procs
> you can run past the step at which the "128 proc" job dies,
> correct?
>
>
> C.
>
>
> --
> Chris Harrison, Ph.D.
> Theoretical and Computational Biophysics Group
> NIH Resource for Macromolecular Modeling and Bioinformatics
> Beckman Institute for Advanced Science and Technology
> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>
> char_at_ks.uiuc.edu
> Voice: 217-244-1733
> http://www.ks.uiuc.edu/~char
> Fax: 217-244-6078
>
>
>
>
> On Tue, Apr 28, 2009 at 2:16 PM, Grace Brannigan <grace_at_vitae.cmm.upenn.edu> wrote:
>
> Hi all,
>
> I have been simulating a protein in a truncated octahedral
> water box (~90k atoms) using NAMD 2.7b1. On both our local
> cluster and Jim's kraken build, the job runs fine if I use up
> to 96 processors. With 128 the job crashes with an error
> message that is not consistent: it can be "bad global
> exclusion count", atoms with nan velocities, or just a seg
> fault. I haven't had any problems like this with the other
> jobs I've been running under v2.7b1, which, admittedly, have
> more conventional geometries. My conf file is below - any
> ideas?
>
> -Grace
>
>
> **********************
>
> # FILENAMES
> set outName [file rootname [file tail [info script]]]
> #set inFileNum [expr [scan [string range $outName end-1 end] "%d"] - 1]
> #set inName [format "%s%02u" [string range $outName 0 end-2] $inFileNum]
> #set inName ionized
> set inName min01
> set homedir ../../..
> set sourcepath ../../solvate_and_ionize/riso
>
> timestep 2.0
>
> structure $sourcepath/ionized.psf
> parameters $homedir/toppar/par_all27_prot_lipid.prm
> parameters $homedir/toppar/par_isoflurane_RS.inp
> paraTypeCharmm on
>
> set temp 300.0
> #temperature $temp
> # RESTRAINTS
>
> constraints on
> consref $sourcepath/constraints.pdb
> conskfile $sourcepath/constraints.pdb
> conskcol O
>
> # INPUT
>
> coordinates $sourcepath/ionized.pdb
> extendedsystem $inName.xsc
> binvelocities $inName.vel
> bincoordinates $inName.coor
> #cellBasisVector1 108 0 0
> #cellBasisVector2 0 108 0
> #cellBasisVector3 54 54 54
>
> # OUTPUT
>
> outputenergies 500
> outputtiming 500
> outputpressure 500
> binaryoutput yes
> outputname [format "%so" $outName]
> restartname $outName
> restartfreq 500
> binaryrestart yes
>
> XSTFreq 500
> COMmotion no
>
> # DCD TRAJECTORY
>
> DCDfile $outName.dcd
> DCDfreq 5000
>
> # CUT-OFFs
>
> splitpatch hydrogen
> hgroupcutoff 2.8
> stepspercycle 20
> switching on
> switchdist 10.0
> cutoff 12.0
> pairlistdist 13.0
>
> #margin 1.0
>
> wrapWater no
>
> # CONSTANT-T
>
> langevin on
> langevinTemp $temp
> langevinDamping 0.1
>
> # CONSTANT-P
>
> useFlexibleCell no
> useConstantRatio no
> useGroupPressure yes
>
> langevinPiston on
> langevinPistonTarget 1
> langevinPistonPeriod 200
> langevinPistonDecay 100
> langevinPistonTemp $temp
>
> # PME
>
> PME yes
> PMETolerance 10e-6
> PMEInterpOrder 4
>
> PMEGridSizeX 120
> PMEGridSizeY 120
> PMEGridSizeZ 96
>
> # MULTIPLE TIME-STEP
>
> fullelectfrequency 2
> nonbondedfreq 1
>
> # SHAKE/RATTLE
>
> rigidBonds all
>
> # 1-4's
>
> exclude scaled1-4
> 1-4scaling 1.0
>
> constraintscaling 1.0
> run 250000
> constraintscaling 0.0
> run 1250000

-- 
________________________________________________________________________
George Madalin Giambasu                        PhD Student
University of Minnesota                        Phone : (612) 625-6317
Department of Chemistry                        Fax   : (612) 626-7541
207 Pleasant St. SE                            e-mail:
Minneapolis, MN USA 55455-0431                 GeorgeMGiambasu_at_umn.edu
York Research Group                            giambasu_at_gmail.com
http://theory.chem.umn.edu/
________________________________________________________________________
