Re: crash with more than 96 processors (v2.7b1)

From: Grace Brannigan (grace_at_vitae.cmm.upenn.edu)
Date: Tue Apr 28 2009 - 17:36:23 CDT

Hi Chris,

The 128-processor job dies immediately, while the 96-processor job can go on
forever (or at least 4 ns).

Our cluster has dual quad-core Xeon E5430 nodes with an InfiniBand
interconnect, and yes, it dies at 128 cores on both clusters.

-Grace

On Tue, Apr 28, 2009 at 5:50 PM, Chris Harrison <char_at_ks.uiuc.edu> wrote:

> Grace,
>
> You say "your cluster," so I'm assuming this isn't an XT5. ;)
>
> Can you provide some details on your cluster and clarify if you mean 128
> procs on both clusters, irrespective of architecture?
>
> Also, you have confirmed that with the lower number of procs you can get past
> the timestep at which the "128 proc" job dies, correct?
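>
> If it helps, here is a quick Tcl sketch for pulling the last timestep a run
> reached out of its log (the log file name is just a placeholder):
>
>     # print the last ENERGY: line from a NAMD log; field 2 is the timestep
>     set f [open "job_p128.log" r]
>     set last ""
>     while {[gets $f line] >= 0} {
>         if {[string match "ENERGY:*" $line]} { set last $line }
>     }
>     close $f
>     puts $last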
>
>
> C.
>
>
> --
> Chris Harrison, Ph.D.
> Theoretical and Computational Biophysics Group
> NIH Resource for Macromolecular Modeling and Bioinformatics
> Beckman Institute for Advanced Science and Technology
> University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
>
> char_at_ks.uiuc.edu Voice: 217-244-1733
> http://www.ks.uiuc.edu/~char
> Fax: 217-244-6078
>
>
>
>
> On Tue, Apr 28, 2009 at 2:16 PM, Grace Brannigan <grace_at_vitae.cmm.upenn.edu> wrote:
>
>> Hi all,
>>
>> I have been simulating a protein in a truncated octahedral water box (~90k
>> atoms) using NAMD 2.7b1. On both our local cluster and Jim's Kraken build,
>> the job runs fine if I use up to 96 processors. With 128, the job crashes
>> with an error message that is not consistent: it can be "bad global
>> exclusion count", atoms with NaN velocities, or just a seg fault. I haven't
>> had any problems like this with the other jobs I've been running under
>> v2.7b1, which, admittedly, have more conventional geometries. My conf file
>> is below - any ideas?
>>
>> -Grace
>>
>>
>> **********************
>>
>> # FILENAMES
>> set outName [file rootname [file tail [info script]]]
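>> # $outName = this script's file name without its extension; the commented
>> # lines below would instead derive $inName from a numbered file series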
>> #set inFileNum [expr [scan [string range $outName end-1 end] "%d"] - 1]
>> #set inName [format "%s%02u" [string range $outName 0 end-2] $inFileNum]
>> #set inName ionized
>> set inName min01
>> set homedir ../../..
>> set sourcepath ../../solvate_and_ionize/riso
>>
>> timestep 2.0
>>
>> structure $sourcepath/ionized.psf
>> parameters $homedir/toppar/par_all27_prot_lipid.prm
>> parameters $homedir/toppar/par_isoflurane_RS.inp
>> paraTypeCharmm on
>>
>> set temp 300.0
>> #temperature $temp
>> # RESTRAINTS
>>
>> constraints on
>> consref $sourcepath/constraints.pdb
>> conskfile $sourcepath/constraints.pdb
>> conskcol O
>>
>> # INPUT
>>
>> coordinates $sourcepath/ionized.pdb
>> extendedsystem $inName.xsc
>> binvelocities $inName.vel
>> bincoordinates $inName.coor
>> #cellBasisVector1 108 0 0
>> #cellBasisVector2 0 108 0
>> #cellBasisVector3 54 54 54
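>> # (the basis vectors above define the truncated-octahedron cell; they stay
>> # commented out because the cell is read from $inName.xsc instead)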
>>
>> # OUTPUT
>>
>> outputenergies 500
>> outputtiming 500
>> outputpressure 500
>> binaryoutput yes
>> outputname [format "%so" $outName]
>> restartname $outName
>> restartfreq 500
>> binaryrestart yes
>>
>> XSTFreq 500
>> COMmotion no
>>
>> # DCD TRAJECTORY
>>
>> DCDfile $outName.dcd
>> DCDfreq 5000
>>
>> # CUT-OFFs
>>
>> splitpatch hydrogen
>> hgroupcutoff 2.8
>> stepspercycle 20
>> switching on
>> switchdist 10.0
>> cutoff 12.0
>> pairlistdist 13.0
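>> # pairlistdist must be at least cutoff; pairlists are reused for
>> # stepspercycle steps between rebuilds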
>>
>> #margin 1.0
>>
>> wrapWater no
>>
>> # CONSTANT-T
>>
>> langevin on
>> langevinTemp $temp
>> langevinDamping 0.1
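>> # damping coefficient is in ps^-1, so 0.1 is fairly weak coupling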
>>
>> # CONSTANT-P
>>
>> useFlexibleCell no
>> useConstantRatio no
>> useGroupPressure yes
>>
>> langevinPiston on
>> langevinPistonTarget 1
>> langevinPistonPeriod 200
>> langevinPistonDecay 100
>> langevinPistonTemp $temp
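>> # Nose-Hoover Langevin piston: target pressure in bar, period and decay in fs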
>>
>> # PME
>>
>> PME yes
>> PMETolerance 10e-6
>> PMEInterpOrder 4
>>
>> PMEGridSizeX 120
>> PMEGridSizeY 120
>> PMEGridSizeZ 96
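>> # grid sizes give ~1 A spacing and factor into small primes:
>> # 120 = 2^3*3*5, 96 = 2^5*3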
>>
>> # MULTIPLE TIME-STEP
>>
>> fullelectfrequency 2
>> nonbondedfreq 1
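>> # full electrostatics every 2 steps (4 fs with the 2 fs timestep),
>> # short-range nonbonded every step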
>>
>> # SHAKE/RATTLE
>>
>> rigidBonds all
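>> # constrain all bonds involving hydrogen; needed for the 2 fs timestep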
>>
>> # 1-4's
>>
>> exclude scaled1-4
>> 1-4scaling 1.0
>>
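>> # 250000 steps = 0.5 ns restrained, then 1250000 steps = 2.5 ns with the
>> # restraints released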
>> constraintscaling 1.0
>> run 250000
>> constraintscaling 0.0
>> run 1250000
>>
>>
>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:50:50 CST