Re: AW: AW: CUDA problem?

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Thu Apr 05 2012 - 22:37:53 CDT

This is the real error:

FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
101.085352 s on step 1778

This means that NAMD has been waiting 101 s for the CUDA event that
signals kernel completion, and is exiting rather than (most likely)
hanging indefinitely. I've noticed that these errors were more likely
with energy evaluation (hence the connection to minimization), with
certain compiler settings (-ftz), and on particular devices on the Forge
cluster at NCSA that later crashed, suggesting that this is some kind of
hardware issue (GPU or PCIe bus) or driver/runtime/compiler fault. The
alternative is that I've missed a race condition that leads to an infinite
loop in the kernel.
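
A minimal sketch of the bounded-polling pattern that the error message
suggests, written in Python rather than taken from the NAMD source. The
names poll_until_done and ProgressTimeout, and the is_done callback that
stands in for a CUDA event query, are hypothetical; the point is only to
show giving up after a fixed number of polls instead of waiting forever:

```python
import time


class ProgressTimeout(RuntimeError):
    """Raised when the awaited event never completes (hypothetical name)."""


def poll_until_done(is_done, max_polls=1_000_000, step=None,
                    clock=time.perf_counter):
    """Call is_done() up to max_polls times; return the poll count on success.

    is_done is an assumed stand-in for a non-blocking completion check.
    If it never returns True, raise an error that reports the poll count,
    the elapsed wall time, and the step, mirroring the message format
    quoted in this thread.
    """
    start = clock()
    for polls in range(1, max_polls + 1):
        if is_done():
            return polls
    elapsed = clock() - start
    raise ProgressTimeout(
        f"polled {max_polls} times over {elapsed:.6f} s on step {step}"
    )
```

The figures in the message work out to roughly 101 microseconds per poll
on average (101.085352 s / 1000000 polls).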

I'm really hoping someone will find a way to trigger this consistently
since in my experience it has been too rare to identify a cause.
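
For anyone collecting reports, a small sketch of how NAMD logs could be
scanned for this error. It assumes only the message format quoted in
this thread; find_cuda_timeouts is a hypothetical helper, not part of
NAMD:

```python
import re

# Pattern built from the error text quoted in this thread. \s+ after
# "over" tolerates the line wrap that mail clients introduce.
FATAL_RE = re.compile(
    r"FATAL ERROR: cuda_check_remote_progress polled (\d+) times over\s+"
    r"([\d.]+) s on step (\d+)"
)


def find_cuda_timeouts(log_text):
    """Return unique (polls, seconds, step) tuples found in log_text.

    NAMD and Charm++ repeat the same message several times on abort,
    so duplicates are dropped while first-seen order is kept.
    """
    seen = []
    for match in FATAL_RE.finditer(log_text):
        record = (int(match.group(1)), float(match.group(2)),
                  int(match.group(3)))
        if record not in seen:
            seen.append(record)
    return seen
```

Running this over the logs of many repeated runs would at least show
which steps and which hosts the timeouts cluster on.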

-Jim

On Thu, 5 Apr 2012, Norman Geist wrote:

> I guess the developers will fix this soon. Since 2.9b2 is a beta, bugs are expected and reports are welcome.
>
>
>
> Norman Geist.
>
>
>
> From: Albert [mailto:mailmd2011_at_gmail.com]
> Sent: Thursday, 5 April 2012 08:16
> To: Norman Geist; namd-l_at_ks.uiuc.edu
> Subject: Re: AW: namd-l: CUDA problem?
>
>
>
> Hello:
> Thank you very much for your kind messages.
> Is there a solution for this problem?
>
> best
> A
>
> On 04/05/2012 08:12 AM, Norman Geist wrote:
>
> Hi,
>
>
>
> There seems to be something wrong with the new GPU-accelerated minimization, as Francesco posted the same issue and I answered him a few seconds ago. I first thought this could be a hardware issue with a single GPU, but two people with a broken GPU is really unlikely. So it’s the developers’ turn.
>
>
>
> Best wishes
>
>
>
> Norman Geist.
>
>
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Albert
> Sent: Wednesday, 4 April 2012 21:03
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: CUDA problem?
>
>
>
> Dear:
> I've built a membrane system with CHARMM-GUI and am using its equilibration protocol to relax the system. Everything goes well with the default settings, and the run finished in CUDA mode. However, there is a ligand in my system that I would like to restrain during step 6.1 (the input file is below). Here is what I did to add the restraint for my ligand:
>
> set sel [atomselect top all]
> $sel set beta 0
> set fix [atomselect top "protein and backbone or (resname LIG and not hydrogen)"]
> $fix set beta 1
> $sel writepdb bb_rmsd.ref
>
>
>
> After that, I tried to run the step 6.1 input with the command:
>
> charmrun ++local +p4 namd2 +idlepoll step6.1_equilibration.inp > log
>
>
>
>
> A few minutes later, it stopped with the following log:
>
>
> ---------log----------------
> LINE MINIMIZER BRACKET: DX 7.96611e-05 0.000159322 DU -84.715 50.7203 DUDX -1.52698e+06 -592619 1.21989e+06
> ENERGY: 1776 5819.8403 10258.1721 9471.5998 94.8591 -182114.5405 16169.6595 0.0000 3.2133 0.0000 -140297.1965 0.0000 -140297.1965 -140297.1965 0.0000 3492.2283 3770.7578 593110.5555 3492.2283 3770.7578
>
> LINE MINIMIZER BRACKET: DX 5.18225e-05 0.0001075 DU -15.3777 66.098 DUDX -592619 3098.88 1.21989e+06
> ENERGY: 1777 5817.4042 10259.2760 9467.1949 94.8526 -182109.7937 16170.7783 0.0000 3.2124 0.0000 -140297.0753 0.0000 -140297.0753 -140297.0753 0.0000 3495.3068 3772.9724 593110.5555 3495.3068 3772.9724
>
> LINE MINIMIZER BRACKET: DX 5.18225e-06 0.0001075 DU -0.121147 66.098 DUDX -56148.6 3098.88 1.21989e+06
> ------------- Processor 2 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352 s on step 1778
>
> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352 s on step 1778
> Charm++ fatal error:
> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352 s on step 1778
>
>
> However, if I don't use CUDA mode, everything goes well and the simulation finishes without any error. Could you please give me some advice on this?
>
>
> ----------step 6.1.inp-------------
> structure ../step5_assembly.xplor_ext.psf
> coordinates ../step5_assembly.pdb
>
> set temp 310;
> set outputname step6.1_equilibration;
>
> # read system values written by CHARMM (need to convert uppercases to lowercases)
> exec tr "\[:upper:\]" "\[:lower:\]" < ../step5_assembly.str | sed -e "s/ = //g" > step5_assembly.namd.str
> source step5_assembly.namd.str
>
> temperature $temp;
>
> outputName step6.1_equilibration_a; # base name for output from this run
> # NAMD writes two files at the end, final coord and vel
> # in the format of first-dyn.coor and first-dyn.vel
> firsttimestep 0; # last step of previous run
> restartfreq 500; # 500 steps = every 0.5 ps (timestep is 1.0 fs)
> dcdfreq 1000;
> dcdUnitCell yes; # the file will contain unit cell info in the style of
> # charmm dcd files. if yes, the dcd files will contain
> # unit cell information in the style of charmm DCD files.
> xstFreq 1000; # XSTFreq: control how often the extended system configuration
> # will be appended to the XST file
> outputEnergies 125; # 125 steps = every 0.125 ps (timestep is 1.0 fs)
> # The number of timesteps between each energy output of NAMD
> outputTiming 1000; # The number of timesteps between each timing output shows
> # time per step and time to completion
>
> # Force-Field Parameters
> paraTypeCharmm on; # We're using charmm type parameter file(s)
> # multiple definitions may be used but only one file per definition
>
> exec mkdir -p toppar
> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all22_prot.prm > toppar/par_all22_prot.prm
> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all27_na.prm > toppar/par_all27_na.prm
> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_carb.prm > toppar/par_all36_carb.prm
> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_lipid.prm > toppar/par_all36_lipid.prm
> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_cgenff.prm > toppar/par_all36_cgenff.prm
> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g" ./toppar/toppar_water_ions.str > toppar/toppar_water_ions.str
> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g" ./toppar/toppar_all36_lipid_cholesterol.str > toppar/toppar_all36_lipid_cholesterol.str
>
> parameters toppar/par_all27_prot_na.prm;
> parameters toppar/par_all36_lipid.prm;
> parameters toppar/par_all22_prot.prm;
> parameters toppar/par_all27_na.prm;
> parameters toppar/par_all36_carb.prm;
> parameters toppar/par_all36_cgenff.prm;
> parameters toppar/par_all35_ethers.prm;
> parameters toppar/lig.prm;
>
>
> parameters toppar/toppar_water_ions.str;
> parameters toppar/toppar_all36_lipid_cholesterol.str;
>
> # These are specified by CHARMM
> exclude scaled1-4 # non-bonded exclusion policy to use "none,1-2,1-3,1-4,or scaled1-4"
> # 1-2: all atoms pairs that are bonded are going to be ignored
> # 1-3: 3 consecutively bonded are excluded
> # scaled1-4: include all the 1-3, and modified 1-4 interactions
> # electrostatic scaled by 1-4scaling factor 1.0
> # vdW special 1-4 parameters in charmm parameter file.
> 1-4scaling 1.0
> switching on
> vdwForceSwitching yes; # New option for force-based switching of vdW
> # if both switching and vdwForceSwitching are on CHARMM force
> # switching is used for vdW forces.
> seed 1333525265 # Specifies a specific seed
>
> # You have some freedom choosing the cutoff
> cutoff 12.0; # may use smaller, maybe 10., with PME
> switchdist 10.0; # cutoff - 2.
> # switchdist - where you start to switch
> # cutoff - where you stop accounting for nonbond interactions.
> # correspondence in charmm:
> # (cutnb,ctofnb,ctonnb = pairlistdist,cutoff,switchdist)
> pairlistdist 16.0; # stores all the pairs within this distance; it should be larger
> # than cutoff (+ 2.)
> stepspercycle 20; # 20 redo pairlists every ten steps
> pairlistsPerCycle 2; # 2 is the default
> # cycle represents the number of steps between atom reassignments
> # this means every 20/2=10 steps the pairlist will be updated
>
> # Integrator Parameters
> timestep 1.0; # fs/step
> rigidBonds all; # Bond constraint: all bonds involving H are fixed in length
> nonbondedFreq 1; # nonbonded forces every step
> fullElectFrequency 1; # PME every step
>
>
> # Constant Temperature Control ONLY DURING EQUILB
> reassignFreq 500; # reassignFreq: use this to reassign velocity every 500 steps
> reassignTemp $temp;
>
> # Periodic Boundary conditions. Need this since for a start...
> cellBasisVector1 $a 0.0 0.0; # vector to the next image
> cellBasisVector2 0.0 $b 0.0;
> cellBasisVector3 0.0 0.0 $c;
> cellOrigin 0.0 0.0 $zcen; # the *center* of the cell
>
> wrapWater on; # wrap water to central cell
> wrapAll on; # wrap other molecules too
> wrapNearest off; # use for non-rectangular cells (wrap to the nearest image)
>
> # PME (for full-system periodic electrostatics)
> exec python ../checkfft.py $a $b $c > checkfft.str
> source checkfft.str
>
> PME yes;
> PMEInterpOrder 6; # interpolation order (spline order 6 in charmm)
> PMEGridSizeX $fftx; # should be close to the cell size
> PMEGridSizeY $ffty; # corresponds to the charmm input fftx/y/z
> PMEGridSizeZ $fftz;
>
> # Pressure and volume control
> useGroupPressure yes; # use a hydrogen-group based pseudo-molecular virial to calculate pressure; it
> # has less fluctuation and is needed for rigid bonds (rigidBonds/SHAKE)
> useFlexibleCell yes; # yes for anisotropic system like membrane
> useConstantRatio yes; # keeps the ratio of the unit cell in the x-y plane constant A=B
>
> langevin on
> langevinDamping 10
> langevinTemp $temp
> langevinHydrogen no
>
> # planar restraint
> colvars on
> exec sed -e "s/Constant \$fc/Constant 5/g" -e "s/\$bb/10.0/g" -e "s/\$sc/5.0/g" membrane_lipid_restraint.namd.col > restraints/$outputname.col
> colvarsConfig restraints/$outputname.col
>
> # dihedral restraint
> extraBonds yes
> exec sed -e "s/\$FC/500/g" restraints/dihe.txt > restraints/$outputname.dihe
> extraBondsFile restraints/$outputname.dihe
>
> minimize 10000
>
> numsteps 90000000
> run 3000000; # 3 ns
>
>
>
>
