crash with more than 96 processors (v2.7b1)

From: Grace Brannigan (grace_at_vitae.cmm.upenn.edu)
Date: Tue Apr 28 2009 - 14:16:07 CDT

Hi all,

I have been simulating a protein in a truncated octahedral water box (~90k
atoms) using NAMD 2.7b1. On both our local cluster and Jim's kraken build,
the job runs fine on up to 96 processors. With 128, the job crashes with an
error message that varies from run to run: "bad global exclusion count",
atoms with NaN velocities, or just a segfault. I
haven't had any problems like this with the other jobs I've been running
using v2.7b1, which, admittedly, have more conventional geometries. My conf
file is below - any ideas?

-Grace

**********************

# FILENAMES
set outName [file rootname [file tail [info script]]]
#set inFileNum [expr [scan [string range $outName end-1 end] "%d"] - 1]
#set inName [format "%s%02u" [string range $outName 0 end-2] $inFileNum]
#set inName ionized
set inName min01
set homedir ../../..
set sourcepath ../../solvate_and_ionize/riso

timestep 2.0

structure $sourcepath/ionized.psf
parameters $homedir/toppar/par_all27_prot_lipid.prm
parameters $homedir/toppar/par_isoflurane_RS.inp
paraTypeCharmm on

set temp 300.0
#temperature $temp
# RESTRAINTS

constraints on
consref $sourcepath/constraints.pdb
conskfile $sourcepath/constraints.pdb
conskcol O

# INPUT

coordinates $sourcepath/ionized.pdb
extendedsystem $inName.xsc
binvelocities $inName.vel
bincoordinates $inName.coor
#cellBasisVector1 108 0 0
#cellBasisVector2 0 108 0
#cellBasisVector3 54 54 54

# OUTPUT

outputenergies 500
outputtiming 500
outputpressure 500
binaryoutput yes
outputname [format "%so" $outName]
restartname $outName
restartfreq 500
binaryrestart yes

XSTFreq 500
COMmotion no

# DCD TRAJECTORY

DCDfile $outName.dcd
DCDfreq 5000

# CUT-OFFs

splitpatch hydrogen
hgroupcutoff 2.8
stepspercycle 20
switching on
switchdist 10.0
cutoff 12.0
pairlistdist 13.0

#margin 1.0

wrapWater no

# CONSTANT-T

langevin on
langevinTemp $temp
langevinDamping 0.1

# CONSTANT-P

useFlexibleCell no
useConstantRatio no
useGroupPressure yes

langevinPiston on
langevinPistonTarget 1
langevinPistonPeriod 200
langevinPistonDecay 100
langevinPistonTemp $temp

# PME

PME yes
PMETolerance 10e-6
PMEInterpOrder 4

PMEGridSizeX 120
PMEGridSizeY 120
PMEGridSizeZ 96

# MULTIPLE TIME-STEP

fullelectfrequency 2
nonbondedfreq 1

# SHAKE/RATTLE

rigidBonds all

# 1-4's

exclude scaled1-4
1-4scaling 1.0

constraintscaling 1.0
run 250000
constraintscaling 0.0
run 1250000
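
The commented-out lines in the FILENAMES section appear to derive the input name from the output name by decrementing a two-digit run number. A minimal Tcl sketch of that idea follows. It assumes run files carry a two-digit numeric suffix (inferred from the `end-1 end` range, not stated in the post), and `previousRunName` is a hypothetical helper name that is not part of the original config.

```tcl
# Hypothetical helper (not in the original config): derive the previous
# run's base name from the current one, e.g. "eq03" -> "eq02".
proc previousRunName {outName} {
    # The last two characters are assumed to hold the run number, e.g. "03".
    set suffix [string range $outName end-1 end]
    # %d parses the suffix as decimal, so "08" and "09" are not read as octal.
    scan $suffix "%d" runNum
    set prevNum [expr {$runNum - 1}]
    # Re-attach the zero-padded number to the prefix.
    return [format "%s%02d" [string range $outName 0 end-2] $prevNum]
}

# In a NAMD config, outName would come from the script name:
#   set outName [file rootname [file tail [info script]]]
puts [previousRunName "eq03"]   ;# prints "eq02"
puts [previousRunName "min10"]  ;# prints "min09"
```

Parsing the suffix with `scan` rather than passing it straight to `expr` matters here, because `expr` would treat a leading zero as an octal prefix and fail on "08" or "09".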
