Re: AW: AW: CUDA problem?

From: Albert (mailmd2011_at_gmail.com)
Date: Fri Apr 06 2012 - 00:07:53 CDT

To my great surprise, this issue does not occur with 2.9beta1.

On 04/06/2012 05:37 AM, Jim Phillips wrote:
>
> This is the real error:
>
> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
> 101.085352 s on step 1778
>
> What it means is that NAMD has been waiting 101s for the CUDA event
> indicating that the kernel has completed and NAMD is exiting rather
> than likely hanging indefinitely. I've noticed that these errors were
> more likely with energy evaluation (hence the connection to
> minimization), certain compiler settings (-ftz), and particular
> devices on the Forge cluster at NCSA that later crashed, suggesting
> that this is some kind of hardware issue (GPU or PCIe bus) or
> driver/runtime/compiler fault. The alternative is that I've missed a
> race condition that leads to an infinite loop in the kernel.
>
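
To make the check described above concrete, here is a minimal sketch of a bounded polling loop that gives up with an error instead of hanging. It is not NAMD's actual code: the names wait_for_event, is_done, max_polls and PollLimitExceeded are all made up for illustration, and is_done simply stands in for "has the CUDA event for the kernel completed?".

```python
import time


class PollLimitExceeded(RuntimeError):
    """Raised when the event is still pending after the poll budget is spent."""


def wait_for_event(is_done, max_polls=1_000_000, sleep_s=0.0):
    """Poll is_done() until it returns True or max_polls polls were made.

    is_done is a hypothetical stand-in for a "has the kernel finished?"
    query. Returns (polls_used, elapsed_seconds) on success and raises
    PollLimitExceeded otherwise, so the caller exits instead of hanging.
    """
    start = time.monotonic()
    for polls in range(1, max_polls + 1):
        if is_done():
            return polls, time.monotonic() - start
        if sleep_s:
            time.sleep(sleep_s)  # optional back-off between polls
    elapsed = time.monotonic() - start
    raise PollLimitExceeded(
        f"polled {max_polls} times over {elapsed:.6f} s without completion"
    )
```

With a callback that never returns True, the loop ends after max_polls polls and reports how long it waited. That is the same kind of information the FATAL ERROR line carries (1000000 polls over about 101 s).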
> I'm really hoping someone will find a way to trigger this consistently
> since in my experience it has been too rare to identify a cause.
>
> -Jim
>
>
> On Thu, 5 Apr 2012, Norman Geist wrote:
>
>> I guess the developers will fix this soon. Since 2.9b2 is a beta,
>> bugs are expected and reports are welcome.
>>
>>
>>
>> Norman Geist.
>>
>>
>>
>> From: Albert [mailto:mailmd2011_at_gmail.com]
>> Sent: Thursday, 5 April 2012 08:16
>> To: Norman Geist; namd-l_at_ks.uiuc.edu
>> Subject: Re: AW: namd-l: CUDA problem?
>>
>>
>>
>> Hello,
>> thank you very much for the kind messages.
>> Is there a solution for this problem?
>>
>> best
>> A
>>
>> On 04/05/2012 08:12 AM, Norman Geist wrote:
>>
>> Hi,
>>
>>
>>
>> there seems to be something wrong with the new GPU-accelerated
>> minimization, as Francesco posted the same issue and I answered him a
>> few seconds ago. I first thought this could also be a hardware issue
>> with a single GPU, but two people with a broken GPU is really unlikely.
>> So it's the developers' turn.
>>
>>
>>
>> Best wishes
>>
>>
>>
>> Norman Geist.
>>
>>
>>
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> behalf of Albert
>> Sent: Wednesday, 4 April 2012 21:03
>> To: namd-l_at_ks.uiuc.edu
>> Subject: namd-l: CUDA problem?
>>
>>
>>
>> Dear all,
>> I've built a membrane system with CHARMM-GUI and used its
>> equilibration protocol to relax the system. Everything goes well with
>> the default settings, and the run finished in CUDA mode. However,
>> there is a ligand in my system that I would like to restrain during
>> step 6.1 (see the input file below). Here is what I did to add the
>> restraint for my ligand:
>>
>> set sel [atomselect top all]
>> $sel set beta 0
>> set fix [atomselect top "protein and backbone or (resname LIG and not hydrogen)"]
>> $fix set beta 1
>> $sel writepdb bb_rmsd.ref
>>
>>
>>
>> After that I tried to run step6.1_equilibration.inp with the command:
>>
>> charmrun ++local +p4 namd2 +idlepoll step6.1_equilibration.inp > log
>>
>>
>>
>>
>> A few minutes later, it stopped with the following log:
>>
>>
>> ---------log----------------
>> LINE MINIMIZER BRACKET: DX 7.96611e-05 0.000159322 DU -84.715 50.7203
>> DUDX -1.52698e+06 -592619 1.21989e+06
>> ENERGY: 1776 5819.8403 10258.1721 9471.5998 94.8591 -182114.5405
>> 16169.6595 0.0000 3.2133 0.0000 -140297.1965 0.0000 -140297.1965
>> -140297.1965 0.0000 3492.2283 3770.7578 593110.5555 3492.2283 3770.7578
>>
>> LINE MINIMIZER BRACKET: DX 5.18225e-05 0.0001075 DU -15.3777 66.098
>> DUDX -592619 3098.88 1.21989e+06
>> ENERGY: 1777 5817.4042 10259.2760 9467.1949 94.8526 -182109.7937
>> 16170.7783 0.0000 3.2124 0.0000 -140297.0753 0.0000 -140297.0753
>> -140297.0753 0.0000 3495.3068 3772.9724 593110.5555 3495.3068 3772.9724
>>
>> LINE MINIMIZER BRACKET: DX 5.18225e-06 0.0001075 DU -0.121147 66.098
>> DUDX -56148.6 3098.88 1.21989e+06
>> ------------- Processor 2 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times
>> over 101.085352 s on step 1778
>>
>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>> 101.085352 s on step 1778
>> Charm++ fatal error:
>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>> 101.085352 s on step 1778
>>
>>
>> However, if I don't use CUDA mode, everything goes well and the
>> simulation finishes without any error. Could you please give me
>> some advice on this?
>>
>>
>> ----------step 6.1.inp-------------
>> structure ../step5_assembly.xplor_ext.psf
>> coordinates ../step5_assembly.pdb
>>
>> set temp 310;
>> set outputname step6.1_equilibration;
>>
>> # read system values written by CHARMM (need to convert uppercase to lowercase)
>> exec tr "\[:upper:\]" "\[:lower:\]" < ../step5_assembly.str | sed -e "s/ = //g" > step5_assembly.namd.str
>> source step5_assembly.namd.str
>>
>> temperature $temp;
>>
>> outputName step6.1_equilibration_a; # base name for output from this run
>> # NAMD writes two files at the end, final coord and vel
>> # in the format of first-dyn.coor and first-dyn.vel
>> firsttimestep 0; # last step of previous run
>> restartfreq 500; # 500 steps = every 0.5 ps (timestep is 1.0 fs)
>> dcdfreq 1000;
>> dcdUnitCell yes; # if yes, the DCD files will contain unit cell
>> # information in the style of CHARMM DCD files
>> xstFreq 1000; # XSTFreq: controls how often the extended system configuration
>> # will be appended to the XST file
>> outputEnergies 125; # 125 steps = every 0.125 ps (timestep is 1.0 fs)
>> # The number of timesteps between each energy output of NAMD
>> outputTiming 1000; # The number of timesteps between each timing output; shows
>> # time per step and time to completion
>>
>> # Force-Field Parameters
>> paraTypeCharmm on; # We're using charmm type parameter file(s)
>> # multiple definitions may be used but only one file per definition
>>
>> exec mkdir -p toppar
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all22_prot.prm > toppar/par_all22_prot.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all27_na.prm > toppar/par_all27_na.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all36_carb.prm > toppar/par_all36_carb.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all36_lipid.prm > toppar/par_all36_lipid.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all36_cgenff.prm > toppar/par_all36_cgenff.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
>> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g" ../toppar/toppar_water_ions.str > toppar/toppar_water_ions.str
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
>> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g" ../toppar/toppar_all36_lipid_cholesterol.str > toppar/toppar_all36_lipid_cholesterol.str
>>
>> parameters toppar/par_all27_prot_na.prm;
>> parameters toppar/par_all36_lipid.prm;
>> parameters toppar/par_all22_prot.prm;
>> parameters toppar/par_all27_na.prm;
>> parameters toppar/par_all36_carb.prm;
>> parameters toppar/par_all36_cgenff.prm;
>> parameters toppar/par_all35_ethers.prm;
>> parameters toppar/lig.prm;
>>
>>
>> parameters toppar/toppar_water_ions.str;
>> parameters toppar/toppar_all36_lipid_cholesterol.str;
>>
>> # These are specified by CHARMM
>> exclude scaled1-4 # non-bonded exclusion policy to use "none,1-2,1-3,1-4,or scaled1-4"
>> # 1-2: all atoms pairs that are bonded are going to be ignored
>> # 1-3: 3 consecutively bonded are excluded
>> # scaled1-4: include all the 1-3, and modified 1-4 interactions
>> # electrostatic scaled by 1-4scaling factor 1.0
>> # vdW special 1-4 parameters in charmm parameter file.
>> 1-4scaling 1.0
>> switching on
>> vdwForceSwitching yes; # New option for force-based switching of vdW
>> # if both switching and vdwForceSwitching are on CHARMM force
>> # switching is used for vdW forces.
>> seed 1333525265 # Specifies a specific seed
>>
>> # You have some freedom choosing the cutoff
>> cutoff 12.0; # may use smaller, maybe 10., with PME
>> switchdist 10.0; # cutoff - 2.
>> # switchdist - where you start to switch
>> # cutoff - where you stop accounting for nonbond interactions.
>> # correspondence in charmm:
>> # (cutnb,ctofnb,ctonnb = pairlistdist,cutoff,switchdist)
>> pairlistdist 16.0; # stores all the pairs within this distance; it should be
>> # larger than cutoff (+ 2.)
>> stepspercycle 20; # 20 redo pairlists every ten steps
>> pairlistsPerCycle 2; # 2 is the default
>> # cycle represents the number of steps between atom reassignments
>> # this means every 20/2=10 steps the pairlist will be updated
>>
>> # Integrator Parameters
>> timestep 1.0; # fs/step
>> rigidBonds all; # Bond constraint: all bonds involving H are fixed in length
>> nonbondedFreq 1; # nonbonded forces every step
>> fullElectFrequency 1; # PME every step
>>
>>
>> # Constant Temperature Control ONLY DURING EQUILB
>> reassignFreq 500; # reassignFreq: use this to reassign velocity every 500 steps
>> reassignTemp $temp;
>>
>> # Periodic Boundary conditions. Need this since for a start...
>> cellBasisVector1 $a 0.0 0.0; # vector to the next image
>> cellBasisVector2 0.0 $b 0.0;
>> cellBasisVector3 0.0 0.0 $c;
>> cellOrigin 0.0 0.0 $zcen; # the *center* of the cell
>>
>> wrapWater on; # wrap water to central cell
>> wrapAll on; # wrap other molecules too
>> wrapNearest off; # use for non-rectangular cells (wrap to the nearest image)
>>
>> # PME (for full-system periodic electrostatics)
>> exec python ../checkfft.py $a $b $c > checkfft.str
>> source checkfft.str
>>
>> PME yes;
>> PMEInterpOrder 6; # interpolation order (spline order 6 in charmm)
>> PMEGridSizeX $fftx; # should be close to the cell size
>> PMEGridSizeY $ffty; # corresponds to the charmm input fftx/y/z
>> PMEGridSizeZ $fftz;
>>
>> # Pressure and volume control
>> useGroupPressure yes; # use a hydrogen-group based pseudo-molecular virial to
>> # calculate pressure; has less fluctuation, is needed for rigid bonds (rigidBonds/SHAKE)
>> useFlexibleCell yes; # yes for anisotropic system like membrane
>> useConstantRatio yes; # keeps the ratio of the unit cell in the x-y plane constant A=B
>>
>> langevin on
>> langevinDamping 10
>> langevinTemp $temp
>> langevinHydrogen no
>>
>> # planar restraint
>> colvars on
>> exec sed -e "s/Constant \$fc/Constant 5/g" -e "s/\$bb/10.0/g" -e "s/\$sc/5.0/g" membrane_lipid_restraint.namd.col > restraints/$outputname.col
>> colvarsConfig restraints/$outputname.col
>>
>> # dihedral restraint
>> extraBonds yes
>> exec sed -e "s/\$FC/500/g" restraints/dihe.txt > restraints/$outputname.dihe
>> extraBondsFile restraints/$outputname.dihe
>>
>> minimize 10000
>>
>> numsteps 90000000
>> run 3000000; # 3ns
>>
>>
>>
>>
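
As a side note, the step and distance arithmetic in the comments of the quoted input file can be checked with a few lines. This is only a sketch: the helper names (steps_to_ps, pairlist_interval, distances_consistent) are invented here and are not part of NAMD, and the numbers are copied from the file above (timestep 1.0 fs, stepspercycle 20, pairlistsPerCycle 2, switchdist 10, cutoff 12, pairlistdist 16).

```python
def steps_to_ps(steps, timestep_fs):
    """Simulated time in picoseconds covered by `steps` MD steps."""
    return steps * timestep_fs / 1000.0


def pairlist_interval(steps_per_cycle, pairlists_per_cycle):
    """Number of steps between pairlist rebuilds within one cycle."""
    return steps_per_cycle / pairlists_per_cycle


def distances_consistent(switchdist, cutoff, pairlistdist):
    """True if switchdist < cutoff < pairlistdist, as the file's comments require."""
    return switchdist < cutoff < pairlistdist


# Values taken from the quoted step6.1 input file.
timestep_fs = 1.0
restart_interval_ps = steps_to_ps(500, timestep_fs)   # restartfreq
energy_interval_ps = steps_to_ps(125, timestep_fs)    # outputEnergies
run_length_ps = steps_to_ps(3_000_000, timestep_fs)   # run
rebuild_every = pairlist_interval(20, 2)
distances_ok = distances_consistent(10.0, 12.0, 16.0)
```

With a 1.0 fs timestep this gives 0.5 ps between restart files, 0.125 ps between energy outputs, and 3000 ps (3 ns) for the run. The pairlist is rebuilt every 10 steps and the three distances are in the expected order.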
