Re: crash with more than 96 processors (v2.7b1)

From: George Madalin Giambasu (giambasu_at_gmail.com)
Date: Fri May 01 2009 - 10:28:50 CDT

I just wanted to report that similar behavior is observed on Blue Gene/P.
Increasing the number of processors (>64) makes NAMD crash randomly
somewhere between steps 1000 and 30000. Turning off the Langevin or
Berendsen pressure control eliminates the crashes. My system is ~90,000
atoms, using the Amber force field (in Amber format) and a rhombic
dodecahedron periodic cell.
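
For reference, turning the pressure control off amounts to the following
switches in the NAMD config (standard NAMD keywords; a minimal sketch
rather than the full input):

# disable the Langevin piston barostat
langevinPiston      off
# or, if Berendsen coupling is in use instead:
BerendsenPressure   off

With either barostat disabled, the same runs finish normally on more than
64 processors.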

George

Grace Brannigan wrote:
> Hi Chris,
>
> The closest distance is between the oxygens of the two waters: 1.69 A.
> There was no clash at the beginning of the simulation (they started at
> 2.73 A). The system is shrinking as a result of the Langevin piston,
> but should that depend on the number of nodes?
>
> A timestep of 1.0 also results in a crash involving the same waters,
> after more steps but about the same number of femtoseconds.
>
> Changing the hgroupCutoff to 2.5 actually made the simulation crash at
> step 120 instead of 140.
>
> Increasing the margin to 1.0 doesn't change anything.
>
> Ideas?
>
> -Grace
>
> On Thu, Apr 30, 2009 at 8:41 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>
> Grace,
>
> A few questions:
> You say "close"; can you give the distance between the closest
> atoms in Angstroms?
>
> Are the two closest atoms hydrogens by chance? If so, could you
> try a restart from something fairly close to step 140, using a
> timestep of 1.0?
>
> Also, is there a specific reason for the hgroupCutoff value of 2.8?
> If not, could you try reducing it to 2.5 and see if that makes a
> difference?
>
> If neither of these makes a difference, could you increase the
> margin to 0.5 or 1.0 and test that?
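>
> In config terms, the changes to test would be (using the same
> keywords that appear in your conf file, with the values suggested
> above):
>
> timestep 1.0
> hgroupCutoff 2.5
> margin 1.0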
>
>
> C.
>
>
> --
> Chris Harrison, Ph.D.
> Theoretical and Computational Biophysics Group
> NIH Resource for Macromolecular Modeling and Bioinformatics
> Beckman Institute for Advanced Science and Technology
> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>
> char_at_ks.uiuc.edu
> Voice: 217-244-1733
> http://www.ks.uiuc.edu/~char
> Fax: 217-244-6078
>
>
>
> On Thu, Apr 30, 2009 at 2:49 PM, Grace Brannigan <gracebrannigan_at_gmail.com> wrote:
>
> Hi Chris,
>
> I did as you suggested. For the system run on 128 nodes, the
> energies right before the crash at step 140 are:
>
> ENERGY: 139 368.4760 1082.8205 1324.6014 45.3948 -251964.9404 25326.2532 496.9566 0.0000 6871.6446 -216448.7934 57.8604 -216443.9396 -216448.2858 57.8604 -228.2303 -274.6782 587038.2456 -228.2303 -274.6782
>
> ENERGY: 140 366.8165 1084.7263 1325.5485 46.3538 -251992.5581 26939.8959 495.4494 0.0000 99999999.9999 99999999.9999 99999999.9999 99999999.9999 nan -99999999.9999 -99999999.9999 -99999999.9999 586888.6700 -99999999.9999 -99999999.9999
>
> For comparison, the energies at the same step on 96 nodes are
>
> ENERGY: 139 358.1118 1087.0480 1328.9915 46.5093 -252345.2854 25274.9919 497.1248 0.0000 6527.0026 -217225.5054 54.9585 -217220.9702 -217225.8113 54.9585 -302.9743 -347.3116 587059.0631 -302.9743 -347.3116
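>
> (For reference, if I am reading the NAMD 2.7 ETITLE line correctly,
> the columns in these ENERGY records are:
>
> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP TOTAL2 TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>
> so at step 140 the individual potential terms are still finite, while
> the kinetic energy and every quantity derived from it overflows.)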
>
> Looking at the dcd file, there are two water molecules (the
> ones with infinite velocity at step 140) that are close right
> before the crash, but not overlapping.
>
> -Grace
>
>
>
>
> On Tue, Apr 28, 2009 at 7:30 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>
> Could you please set the following parameters as indicated
> and rerun the 128 proc job on either cluster:
>
> dcdfreq 1
> outputEnergies 1
>
> The idea is to isolate, by looking at the energy components in
> the log file and the structural changes in the dcd file,
> anything "physical" in your simulation that may be "blowing
> up." If something physical is blowing up, you will need to do
> two things:
>
> 1. Examine the energy components in the log file at the
> corresponding step. The component that shoots up should point
> to the physical interaction responsible for the blow-up.
>
> 2. You should probably also compare the dynamics and
> energy-component trends to the 96-processor simulation, to
> assess their similarity and how plausible it is that multiple
> timestepping (MTS) yielded dynamics different enough to crash
> a run with X processors but not one with Y processors.
> Basically: are the simulations comparable up to a point, at
> what point do they diverge sharply before the crash, and which
> MTS regime (given your config parameters) does that fall in?
> We need to figure out whether we are looking at a genuine
> difference in dynamics, or a "bug" that yields a "physically
> realistic blow-up" but only shows up during a parallel
> operation such as patch migration/reduction when using 128 as
> opposed to 96 procs.
>
> If there is nothing physical blowing up and the simulation is
> really just spontaneously crashing on both architectures with
> 128 procs, then we'll have to dig deeper, running your
> simulation with debug flags and tracing things to the source
> of the crash.
>
>
>
> C.
>
>
> --
> Chris Harrison, Ph.D.
> Theoretical and Computational Biophysics Group
> NIH Resource for Macromolecular Modeling and Bioinformatics
> Beckman Institute for Advanced Science and Technology
> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>
> char_at_ks.uiuc.edu
> Voice: 217-244-1733
> http://www.ks.uiuc.edu/~char
> Fax: 217-244-6078
>
>
>
> On Tue, Apr 28, 2009 at 5:36 PM, Grace Brannigan <grace_at_vitae.cmm.upenn.edu> wrote:
>
> Hi Chris,
>
> The 128-processor job dies immediately, while the 96-processor
> job can go on forever (or at least 4 ns).
>
> Our cluster has dual quad-core Xeon E5430 nodes with an
> InfiniBand interconnect, and yes, it dies at 128 cores on both
> clusters.
>
> -Grace
>
>
> On Tue, Apr 28, 2009 at 5:50 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:
>
> Grace,
>
> You say "your cluster", so I'm assuming this isn't an XT5. ;)
>
> Can you provide some details on your cluster, and clarify
> whether you mean 128 procs on both clusters, irrespective of
> architecture?
>
> Also, you have confirmed that with the lower number of procs
> you can run past the step at which the "128 proc" job dies,
> correct?
>
>
> C.
>
>
> --
> Chris Harrison, Ph.D.
> Theoretical and Computational Biophysics Group
> NIH Resource for Macromolecular Modeling and Bioinformatics
> Beckman Institute for Advanced Science and Technology
> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>
> char_at_ks.uiuc.edu
> Voice: 217-244-1733
> http://www.ks.uiuc.edu/~char
> Fax: 217-244-6078
>
>
>
>
> On Tue, Apr 28, 2009 at 2:16 PM, Grace Brannigan <grace_at_vitae.cmm.upenn.edu> wrote:
>
> Hi all,
>
> I have been simulating a protein in a truncated octahedral
> water box (~90k atoms) using NAMD 2.7b1. On both our local
> cluster and Jim's kraken build, the job runs fine if I use up
> to 96 processors. With 128 the job crashes with an error
> message that is not consistent: it can be "bad global
> exclusion count", atoms with nan velocities, or just a seg
> fault. I haven't had any problems like this with the other
> jobs I've been running under v2.7b1, which, admittedly, have
> more conventional geometries. My conf file is below - any
> ideas?
>
> -Grace
>
>
> **********************
>
> # FILENAMES
> set outName [file rootname [file tail [info script]]]
> #set inFileNum [expr [scan [string range $outName end-1 end] "%d"] - 1]
> #set inName [format "%s%02u" [string range $outName 0 end-2] $inFileNum]
> #set inName ionized
> set inName min01
> set homedir ../../..
> set sourcepath ../../solvate_and_ionize/riso
>
> timestep 2.0
>
> structure $sourcepath/ionized.psf
> parameters $homedir/toppar/par_all27_prot_lipid.prm
> parameters $homedir/toppar/par_isoflurane_RS.inp
> paraTypeCharmm on
>
> set temp 300.0
> #temperature $temp
> # RESTRAINTS
>
> constraints on
> consref $sourcepath/constraints.pdb
> conskfile $sourcepath/constraints.pdb
> conskcol O
>
> # INPUT
>
> coordinates $sourcepath/ionized.pdb
> extendedsystem $inName.xsc
> binvelocities $inName.vel
> bincoordinates $inName.coor
> #cellBasisVector1 108 0 0
> #cellBasisVector2 0 108 0
> #cellBasisVector3 54 54 54
>
> # OUTPUT
>
> outputenergies 500
> outputtiming 500
> outputpressure 500
> binaryoutput yes
> outputname [format "%so" $outName]
> restartname $outName
> restartfreq 500
> binaryrestart yes
>
> XSTFreq 500
> COMmotion no
>
> # DCD TRAJECTORY
>
> DCDfile $outName.dcd
> DCDfreq 5000
>
> # CUT-OFFs
>
> splitpatch hydrogen
> hgroupcutoff 2.8
> stepspercycle 20
> switching on
> switchdist 10.0
> cutoff 12.0
> pairlistdist 13.0
>
> #margin 1.0
>
> wrapWater no
>
> # CONSTANT-T
>
> langevin on
> langevinTemp $temp
> langevinDamping 0.1
>
> # CONSTANT-P
>
> useFlexibleCell no
> useConstantRatio no
> useGroupPressure yes
>
> langevinPiston on
> langevinPistonTarget 1
> langevinPistonPeriod 200
> langevinPistonDecay 100
> langevinPistonTemp $temp
>
> # PME
>
> PME yes
> PMETolerance 10e-6
> PMEInterpOrder 4
>
> PMEGridSizeX 120
> PMEGridSizeY 120
> PMEGridSizeZ 96
>
> # MULTIPLE TIME-STEP
>
> fullelectfrequency 2
> nonbondedfreq 1
>
> # SHAKE/RATTLE
>
> rigidBonds all
>
> # 1-4's
>
> exclude scaled1-4
> 1-4scaling 1.0
>
> constraintscaling 1.0
> run 250000
> constraintscaling 0.0
> run 1250000

-- 
________________________________________________________________________
George Madalin Giambasu                        PhD Student
University of Minnesota                        Phone : (612) 625-6317
Department of Chemistry                        Fax   : (612) 626-7541
207 Pleasant St. SE                            e-mail:
Minneapolis, MN USA 55455-0431                 GeorgeMGiambasu_at_umn.edu
York Research Group                            giambasu_at_gmail.com
http://theory.chem.umn.edu/
________________________________________________________________________
