Re: Re: Re: CUDA problem?

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Apr 11 2012 - 01:04:08 CDT

Hi,

there have been several people facing this error with the 2.8 CUDA version as well, as I remember; I have also seen this error on a broken GPU before. It's unlikely that several people have broken GPUs simultaneously, but you are right, we need more data.

Norman Geist.

> -----Original Message-----
> From: Jim Phillips [mailto:jim_at_ks.uiuc.edu]
> Sent: Friday, April 6, 2012 05:38
> To: Norman Geist
> Cc: 'Albert'; Namd Mailing List
> Subject: Re: Re: Re: namd-l: CUDA problem?
>
>
> This is the real error:
>
> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
> 101.085352 s on step 1778
>
> What it means is that NAMD has been waiting 101 s for the CUDA event
> indicating that the kernel has completed, and NAMD is exiting rather
> than likely hanging indefinitely. I've noticed that these errors were
> more likely with energy evaluation (hence the connection to
> minimization), certain compiler settings (-ftz), and particular
> devices on the Forge cluster at NCSA that later crashed, suggesting
> that this is some kind of hardware issue (GPU or PCIe bus) or
> driver/runtime/compiler fault. The alternative is that I've missed a
> race condition that leads to an infinite loop in the kernel.
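>
> For readers following along: the check records a CUDA event after the
> kernel launch and polls it from the host, and if the event never
> signals completion NAMD gives up after a fixed number of polls. Here
> is a minimal standalone sketch of that pattern (illustrative only,
> not NAMD's actual code; the kernel and the poll limit are made up):
>
>   #include <cstdio>
>   #include <cstdlib>
>   #include <cuda_runtime.h>
>
>   __global__ void force_kernel() { }  // stand-in for the real kernel
>
>   int main() {
>     cudaEvent_t done;
>     cudaEventCreate(&done);
>
>     force_kernel<<<1, 1>>>();
>     cudaEventRecord(done, 0);   // fires once the kernel has finished
>
>     // Poll instead of blocking so the host can do other work, and
>     // give up after too many polls rather than hang indefinitely.
>     long polls = 0;
>     while (cudaEventQuery(done) == cudaErrorNotReady) {
>       if (++polls >= 1000000L) {
>         fprintf(stderr, "FATAL ERROR: polled %ld times, giving up\n",
>                 polls);
>         return EXIT_FAILURE;
>       }
>     }
>     cudaEventDestroy(done);
>     return 0;
>   }
>
> The poll count and elapsed time are what show up in the error message
> above.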
>
> I'm really hoping someone will find a way to trigger this consistently
> since in my experience it has been too rare to identify a cause.
>
> -Jim
>
>
> On Thu, 5 Apr 2012, Norman Geist wrote:
>
> > I guess the developers will fix this soon; as 2.9b2 is a beta, bugs
> are expected and reports are welcome.
> >
> >
> >
> > Norman Geist.
> >
> >
> >
> > From: Albert [mailto:mailmd2011_at_gmail.com]
> > Sent: Thursday, April 5, 2012 08:16
> > To: Norman Geist; namd-l_at_ks.uiuc.edu
> > Subject: Re: Re: namd-l: CUDA problem?
> >
> >
> >
> > Hello:
> > thank you very much for your kind messages.
> > Is there a solution for this problem?
> >
> > best
> > A
> >
> > On 04/05/2012 08:12 AM, Norman Geist wrote:
> >
> > Hi,
> >
> >
> >
> > there seems to be something wrong within the new GPU-accelerated
> minimization, as Francesco posted the same issue and I answered him a
> few seconds ago. I first thought this could also be a hardware issue
> of a single GPU, but two people with a broken GPU is really unlikely.
> So it's the developers' turn.
> >
> >
> >
> > Best wishes
> >
> >
> >
> > Norman Geist.
> >
> >
> >
> > From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Albert
> > Sent: Wednesday, April 4, 2012 21:03
> > To: namd-l_at_ks.uiuc.edu
> > Subject: namd-l: CUDA problem?
> >
> >
> >
> > Dear all:
> > I've built a membrane system with CHARMM-GUI and used its
> equilibration protocol to relax the system. Everything goes well if I
> use the default settings, and the run finishes under CUDA mode.
> However, there is a ligand in my system and I would like to restrain
> it during step 6.1 (see the input file below). Here is what I did to
> add restraints for my ligand:
> >
> > # flag the atoms to restrain in the beta column of a reference PDB
> > set sel [atomselect top all]
> > $sel set beta 0 ;# clear the flag on every atom
> > set fix [atomselect top "protein and backbone or (resname LIG and not
> hydrogen)"]
> > $fix set beta 1 ;# mark backbone and ligand heavy atoms
> > $sel writepdb bb_rmsd.ref
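> > (Note that NAMD applies these flags only if the input file also
> enables harmonic restraints, e.g. constraints on, with consref and
> conskfile set to bb_rmsd.ref and conskcol B.)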
> >
> >
> >
> > After that, I try to run this 6.1 input file with the command:
> >
> > charmrun ++local +p4 namd2 +idlepoll step6.1_equilibration.inp > log
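> > (Here ++local keeps all processes on the local machine, +p4 starts
> four processes, and +idlepoll makes idle processors poll for GPU
> results, which the NAMD release notes recommend for CUDA builds.)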
> >
> >
> >
> >
> > A few minutes later, it stopped with the following log:
> >
> >
> > ---------log----------------
> > LINE MINIMIZER BRACKET: DX 7.96611e-05 0.000159322 DU -84.715 50.7203 DUDX -1.52698e+06 -592619 1.21989e+06
> > ENERGY: 1776 5819.8403 10258.1721 9471.5998 94.8591 -182114.5405 16169.6595 0.0000 3.2133 0.0000 -140297.1965 0.0000 -140297.1965 -140297.1965 0.0000 3492.2283 3770.7578 593110.5555 3492.2283 3770.7578
> >
> > LINE MINIMIZER BRACKET: DX 5.18225e-05 0.0001075 DU -15.3777 66.098 DUDX -592619 3098.88 1.21989e+06
> > ENERGY: 1777 5817.4042 10259.2760 9467.1949 94.8526 -182109.7937 16170.7783 0.0000 3.2124 0.0000 -140297.0753 0.0000 -140297.0753 -140297.0753 0.0000 3495.3068 3772.9724 593110.5555 3495.3068 3772.9724
> >
> > LINE MINIMIZER BRACKET: DX 5.18225e-06 0.0001075 DU -0.121147 66.098 DUDX -56148.6 3098.88 1.21989e+06
> > ------------- Processor 2 Exiting: Called CmiAbort ------------
> > Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352 s on step 1778
> >
> > FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352 s on step 1778
> > Charm++ fatal error:
> > FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352 s on step 1778
> >
> >
> > However, if I don't use CUDA mode, everything goes well and the
> simulation finishes without any error. Would you please give me some
> advice on this?
> >
> >
> > ----------step 6.1.inp-------------
> > structure ../step5_assembly.xplor_ext.psf
> > coordinates ../step5_assembly.pdb
> >
> > set temp 310;
> > set outputname step6.1_equilibration;
> >
> > # read system values written by CHARMM (need to convert uppercase to
> lowercase)
> > exec tr "\[:upper:\]" "\[:lower:\]" < ../step5_assembly.str | sed -e
> "s/ = //g" > step5_assembly.namd.str
> > source step5_assembly.namd.str
> >
> > temperature $temp;
> >
> > outputName step6.1_equilibration_a; # base name for output from this
> run
> > # NAMD writes two files at the end, final coord and vel
> > # in the format of first-dyn.coor and first-dyn.vel
> > firsttimestep 0; # last step of previous run
> > restartfreq 500; # 500 steps = every 1ps
> > dcdfreq 1000;
> > dcdUnitCell yes; # if yes, the dcd files will contain unit cell
> > # information in the style of charmm DCD files.
> > xstFreq 1000; # XSTFreq: controls how often the extended system
> configuration
> > # will be appended to the XST file
> > outputEnergies 125; # 125 steps = every 0.25ps
> > # The number of timesteps between each energy output of NAMD
> > outputTiming 1000; # The number of timesteps between each timing
> output shows
> > # time per step and time to completion
> >
> > # Force-Field Parameters
> > paraTypeCharmm on; # We're using charmm type parameter file(s)
> > # multiple definitions may be used but only one file per definition
> >
> > exec mkdir -p toppar
> > # comment out ATOM/MASS records so NAMD accepts the CHARMM files;
> > # note the inputs are read from ../toppar (reading and writing the
> > # same file through a shell redirect would empty it)
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g"
> ../toppar/par_all22_prot.prm > toppar/par_all22_prot.prm
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g"
> ../toppar/par_all27_na.prm > toppar/par_all27_na.prm
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g"
> ../toppar/par_all36_carb.prm > toppar/par_all36_carb.prm
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g"
> ../toppar/par_all36_lipid.prm > toppar/par_all36_lipid.prm
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g"
> ../toppar/par_all36_cgenff.prm > toppar/par_all36_cgenff.prm
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
> > -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g"
> ../toppar/toppar_water_ions.str > toppar/toppar_water_ions.str
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
> > -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g"
> ../toppar/toppar_all36_lipid_cholesterol.str >
> toppar/toppar_all36_lipid_cholesterol.str
> >
> > parameters toppar/par_all27_prot_na.prm;
> > parameters toppar/par_all36_lipid.prm;
> > parameters toppar/par_all22_prot.prm;
> > parameters toppar/par_all27_na.prm;
> > parameters toppar/par_all36_carb.prm;
> > parameters toppar/par_all36_cgenff.prm;
> > parameters toppar/par_all35_ethers.prm;
> > parameters toppar/lig.prm;
> >
> >
> > parameters toppar/toppar_water_ions.str;
> > parameters toppar/toppar_all36_lipid_cholesterol.str;
> >
> > # These are specified by CHARMM
> > exclude scaled1-4 # non-bonded exclusion policy: "none, 1-2, 1-3,
> 1-4, or scaled1-4"
> > # 1-2: all atom pairs that are directly bonded are ignored
> > # 1-3: atoms separated by up to 2 bonds are excluded
> > # scaled1-4: include all the 1-3 exclusions, plus modified 1-4
> > # interactions: electrostatics scaled by the 1-4scaling factor (1.0),
> > # vdW using the special 1-4 parameters in the charmm parameter file.
> > 1-4scaling 1.0
> > switching on
> > vdwForceSwitching yes; # New option for force-based switching of vdW
> > # if both switching and vdwForceSwitching are on CHARMM force
> > # switching is used for vdW forces.
> > seed 1333525265 # Specifies a specific seed
> >
> > # You have some freedom choosing the cutoff
> > cutoff 12.0; # may use smaller, maybe 10., with PME
> > switchdist 10.0; # cutoff - 2.
> > # switchdist - where you start to switch
> > # cutoff - where you stop accounting for nonbond interactions.
> > # correspondence in charmm:
> > # (cutnb,ctofnb,ctonnb = pairlistdist,cutoff,switchdist)
> > pairlistdist 16.0; # stores all the pairs within this distance; it
> should be larger
> > # than cutoff (+ 2.)
> > stepspercycle 20; # number of steps per cycle
> > pairlistsPerCycle 2; # 2 is the default
> > # a cycle represents the number of steps between atom reassignments;
> > # this means every 20/2=10 steps the pairlist will be updated
> >
> > # Integrator Parameters
> > timestep 1.0; # fs/step
> > rigidBonds all; # bond constraints: all bonds involving H are fixed
> in length
> > nonbondedFreq 1; # nonbonded forces every step
> > fullElectFrequency 1; # PME every step
> >
> >
> > # Constant Temperature Control ONLY DURING EQUILB
> > reassignFreq 500; # reassignFreq: use this to reassign velocity every
> 500 steps
> > reassignTemp $temp;
> >
> > # Periodic boundary conditions. Needed since we start from scratch...
> > cellBasisVector1 $a 0.0 0.0; # vector to the next image
> > cellBasisVector2 0.0 $b 0.0;
> > cellBasisVector3 0.0 0.0 $c;
> > cellOrigin 0.0 0.0 $zcen; # the *center* of the cell
> >
> > wrapWater on; # wrap water to central cell
> > wrapAll on; # wrap other molecules too
> > wrapNearest off; # use for non-rectangular cells (wrap to the nearest
> image)
> >
> > # PME (for full-system periodic electrostatics)
> > exec python ../checkfft.py $a $b $c > checkfft.str
> > source checkfft.str
> >
> > PME yes;
> > PMEInterpOrder 6; # interpolation order (spline order 6 in charmm)
> > PMEGridSizeX $fftx; # should be close to the cell size
> > PMEGridSizeY $ffty; # corresponds to the charmm input fftx/y/z
> > PMEGridSizeZ $fftz;
> >
> > # Pressure and volume control
> > useGroupPressure yes; # use a hydrogen-group based pseudo-molecular
> virial to calculate pressure; it
> > # has less fluctuation and is needed for rigid bonds (rigidBonds/SHAKE)
> > useFlexibleCell yes; # yes for anisotropic system like membrane
> > useConstantRatio yes; # keeps the ratio of the unit cell in the x-y
> plane constant A=B
> >
> > langevin on
> > langevinDamping 10
> > langevinTemp $temp
> > langevinHydrogen no
> >
> > # planar restraint
> > colvars on
> > exec sed -e "s/Constant \$fc/Constant 5/g" -e "s/\$bb/10.0/g" -e
> "s/\$sc/5.0/g" membrane_lipid_restraint.namd.col >
> restraints/$outputname.col
> > colvarsConfig restraints/$outputname.col
> >
> > # dihedral restraint
> > extraBonds yes
> > exec sed -e "s/\$FC/500/g" restraints/dihe.txt >
> restraints/$outputname.dihe
> > extraBondsFile restraints/$outputname.dihe
> >
> > minimize 10000
> >
> > numsteps 90000000
> > run 3000000; # 3 ns
> >
> >
> >
> >
