Re: crash with more than 96 processors (v2.7b1)

From: Chris Harrison (char_at_ks.uiuc.edu)
Date: Tue Apr 28 2009 - 16:50:57 CDT

Grace,

You say "your cluster"; I'm assuming this isn't an XT5. ;)

Can you provide some details on your cluster and clarify if you mean 128
procs on both clusters, irrespective of architecture?

Also, you have confirmed that using the lower # of procs you can exceed the
timestep at which the "128 proc" job dies, correct?

C.

--
Chris Harrison, Ph.D.
Theoretical and Computational Biophysics Group
NIH Resource for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
char_at_ks.uiuc.edu                            Voice: 217-244-1733
http://www.ks.uiuc.edu/~char               Fax: 217-244-6078

On Tue, Apr 28, 2009 at 2:16 PM, Grace Brannigan
<grace_at_vitae.cmm.upenn.edu> wrote:
> Hi all,
>
> I have been simulating a protein in a truncated octahedral water box (~90k
> atoms) using NAMD 2.7b1. On both our local cluster and Jim's kraken build,
> the job runs fine if I use up to 96 processors. With 128 the job crashes
> after an error message that is not consistent: it can be "bad global
> exclusion count", atoms with NaN velocities, or just a seg fault. I
> haven't had any problems like this with the other jobs I've been running
> using v2.7b1, which, admittedly, have more conventional geometries. My conf
> file is below - any ideas?
>
> -Grace
>
>
> **********************
>
> # FILENAMES
> set outName             [file rootname [file tail [info script]]]
> #set inFileNum          [expr [scan [string range $outName end-1 end] "%d"] - 1]
> #set inName              [format "%s%02u" [string range $outName 0 end-2] $inFileNum]
> #set inName           ionized
> set inName         min01
> set homedir       ../../..
> set sourcepath     ../../solvate_and_ionize/riso
>
> timestep            2.0
>
> structure           $sourcepath/ionized.psf
> parameters          $homedir/toppar/par_all27_prot_lipid.prm
> parameters         $homedir/toppar/par_isoflurane_RS.inp
> paraTypeCharmm      on
>
> set temp            300.0
> #temperature         $temp
> # RESTRAINTS
>
> constraints         on
> consref             $sourcepath/constraints.pdb
> conskfile           $sourcepath/constraints.pdb
> conskcol            O
>
> # INPUT
>
> coordinates         $sourcepath/ionized.pdb
> extendedsystem       $inName.xsc
> binvelocities        $inName.vel
> bincoordinates        $inName.coor
> #cellBasisVector1    108 0 0
> #cellBasisVector2     0 108 0
> #cellBasisVector3     54 54 54
>
> # OUTPUT
>
> outputenergies       500
> outputtiming         500
> outputpressure       500
> binaryoutput         yes
> outputname           [format "%so" $outName]
> restartname          $outName
> restartfreq         500
> binaryrestart        yes
>
> XSTFreq              500
> COMmotion         no
>
> # DCD TRAJECTORY
>
> DCDfile              $outName.dcd
> DCDfreq              5000
>
> # CUT-OFFs
>
> splitpatch           hydrogen
> hgroupcutoff         2.8
> stepspercycle        20
> switching            on
> switchdist           10.0
> cutoff               12.0
> pairlistdist         13.0
>
> #margin     1.0
>
> wrapWater        no
>
> # CONSTANT-T
>
> langevin                on
> langevinTemp            $temp
> langevinDamping         0.1
>
> # CONSTANT-P
>
> useFlexibleCell      no
> useConstantRatio     no
> useGroupPressure     yes
>
> langevinPiston       on
> langevinPistonTarget 1
> langevinPistonPeriod 200
> langevinPistonDecay  100
> langevinPistonTemp   $temp
>
> # PME
>
> PME                  yes
> PMETolerance         10e-6
> PMEInterpOrder       4
>
> PMEGridSizeX         120
> PMEGridSizeY         120
> PMEGridSizeZ         96
>
> # MULTIPLE TIME-STEP
>
> fullelectfrequency   2
> nonbondedfreq        1
>
> # SHAKE/RATTLE
>
> rigidBonds           all
>
> # 1-4's
>
> exclude              scaled1-4
> 1-4scaling           1.0
>
> constraintscaling   1.0
> run 250000
> constraintscaling   0.0
> run 1250000
>
>
