Re: AW: AW: CUDA problem?

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Fri Apr 06 2012 - 08:26:18 CDT

Interesting. NAMD 2.9b2 includes compute_20 ptx code to run on Kepler
while 2.9b1 only had sm_11 and sm_20 binary. Are you saying you can
reproduce the problem consistently in 2.9b2 but never in 2.9b1? If so,
what does NAMD report for the CUDA device description?

-Jim

On Fri, 6 Apr 2012, Albert wrote:

>
> To my great surprise, this issue doesn't happen with 2.9beta1.
>
>
> On 04/06/2012 05:37 AM, Jim Phillips wrote:
>>
>> This is the real error:
>>
>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>> 101.085352 s on step 1778
>>
>> What it means is that NAMD has been waiting 101 s for the CUDA event
>> indicating that the kernel has completed, and is exiting rather than most
>> likely hanging indefinitely. I've noticed that these errors were more
>> likely with energy evaluation (hence the connection to minimization),
>> certain compiler settings (-ftz), and particular devices on the Forge
>> cluster at NCSA that later crashed, suggesting that this is some kind of
>> hardware issue (GPU or PCIe bus) or a driver/runtime/compiler fault. The
>> alternative is that I've missed a race condition that leads to an infinite
>> loop in the kernel.
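>>
>> In case it helps to picture the failure, the check amounts to host-side
>> polling of a CUDA event recorded after the kernel launch, with a cap on the
>> number of polls instead of blocking forever. The sketch below is only an
>> illustration of that pattern (names and structure are made up, not the
>> actual NAMD source):
>>
>> // Illustrative sketch only -- not NAMD code.
>> #include <cstdio>
>> #include <cstdlib>
>> #include <cuda_runtime.h>
>>
>> __global__ void dummy_kernel() {}              // stand-in for the force kernel
>>
>> // Poll a completion event; give up after poll_limit polls instead of hanging.
>> static void wait_or_abort(cudaEvent_t done, long poll_limit) {
>>   long polls = 0;
>>   while (cudaEventQuery(done) == cudaErrorNotReady) {
>>     if (++polls >= poll_limit) {
>>       fprintf(stderr, "FATAL ERROR: kernel still not done after %ld polls\n", polls);
>>       exit(1);                                 // abort rather than hang indefinitely
>>     }
>>     // the real code would do other work between polls rather than spin
>>   }
>> }
>>
>> int main() {
>>   cudaEvent_t done;
>>   cudaEventCreate(&done);
>>   dummy_kernel<<<1, 1>>>();
>>   cudaEventRecord(done, 0);                    // event signals kernel completion
>>   wait_or_abort(done, 1000000);                // same poll limit as in the error message
>>   printf("kernel completed\n");
>>   cudaEventDestroy(done);
>>   return 0;
>> }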
>>
>> I'm really hoping someone will find a way to trigger this consistently
>> since in my experience it has been too rare to identify a cause.
>>
>> -Jim
>>
>>
>> On Thu, 5 Apr 2012, Norman Geist wrote:
>>
>>> I guess the developers will fix this soon; as 2.9b2 is a beta, bugs are
>>> expected and reports are welcome.
>>>
>>>
>>>
>>> Norman Geist.
>>>
>>>
>>>
>>> From: Albert [mailto:mailmd2011_at_gmail.com]
>>> Sent: Thursday, April 5, 2012 08:16
>>> To: Norman Geist; namd-l_at_ks.uiuc.edu
>>> Subject: Re: AW: namd-l: CUDA problem?
>>>
>>>
>>>
>>> Hello:
>>> thank you very much for kind messages.
>>> Is there a solution to this problem?
>>>
>>> best
>>> A
>>>
>>> On 04/05/2012 08:12 AM, Norman Geist wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> there seems to be something wrong with the new GPU-accelerated minimization,
>>> as Francesco posted the same issue and I answered him a few seconds ago. I
>>> first thought this could also be a hardware issue with a single GPU, but two
>>> people with a broken GPU is really unlikely. So it's the developers' turn.
>>>
>>>
>>>
>>> Best wishes
>>>
>>>
>>>
>>> Norman Geist.
>>>
>>>
>>>
>>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
>>> Of Albert
>>> Sent: Wednesday, April 4, 2012 21:03
>>> To: namd-l_at_ks.uiuc.edu
>>> Subject: namd-l: CUDA problem?
>>>
>>>
>>>
>>> Dear all:
>>> I've built a membrane system with CHARMM-GUI and used its equilibration
>>> protocol to relax my system. Everything goes well if I use the default
>>> settings, and the run finishes fine in CUDA mode. However, there is a ligand
>>> in my system that I would like to restrain during step 6.1 (see the input
>>> file below). Here is what I did to add restraints for my ligand:
>>>
>>> # VMD/Tcl: flag the atoms to restrain in the beta column and write a reference PDB
>>> set sel [atomselect top all]
>>> $sel set beta 0
>>> set fix [atomselect top "(protein and backbone) or (resname LIG and not hydrogen)"]
>>> $fix set beta 1
>>> $sel writepdb bb_rmsd.ref
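>>>
>>> Note that the beta column by itself only marks the atoms; the restraint is
>>> actually imposed by NAMD's harmonic constraints keywords, which would consume
>>> the reference file roughly like this sketch (the exponent and scaling here
>>> are just illustrative placeholder values):
>>>
>>> constraints on; # turn on harmonic positional restraints
>>> consexp 2; # harmonic exponent (the usual default)
>>> consref bb_rmsd.ref; # reference coordinates
>>> conskfile bb_rmsd.ref; # per-atom force constants read from this PDB
>>> conskcol B; # taken from the beta (B) column set above
>>> constraintScaling 1.0; # overall scaling of the restraint strength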
>>>
>>>
>>>
>>> After that I tried to run step6.1_equilibration.inp with the command:
>>>
>>> charmrun ++local +p4 namd2 +idlepoll step6.1_equilibration.inp > log
>>>
>>>
>>>
>>>
>>> A few minutes later, it stopped with the following log:
>>>
>>>
>>> ---------log----------------
>>> LINE MINIMIZER BRACKET: DX 7.96611e-05 0.000159322 DU -84.715 50.7203 DUDX
>>> -1.52698e+06 -592619 1.21989e+06
>>> ENERGY: 1776 5819.8403 10258.1721 9471.5998 94.8591 -182114.5405
>>> 16169.6595 0.0000 3.2133 0.0000 -140297.1965 0.0000 -140297.1965
>>> -140297.1965 0.0000 3492.2283 3770.7578 593110.5555 3492.2283 3770.7578
>>>
>>> LINE MINIMIZER BRACKET: DX 5.18225e-05 0.0001075 DU -15.3777 66.098 DUDX
>>> -592619 3098.88 1.21989e+06
>>> ENERGY: 1777 5817.4042 10259.2760 9467.1949 94.8526 -182109.7937
>>> 16170.7783 0.0000 3.2124 0.0000 -140297.0753 0.0000 -140297.0753
>>> -140297.0753 0.0000 3495.3068 3772.9724 593110.5555 3495.3068 3772.9724
>>>
>>> LINE MINIMIZER BRACKET: DX 5.18225e-06 0.0001075 DU -0.121147 66.098 DUDX
>>> -56148.6 3098.88 1.21989e+06
>>> ------------- Processor 2 Exiting: Called CmiAbort ------------
>>> Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>>> 101.085352 s on step 1778
>>>
>>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>>> 101.085352 s on step 1778
>>> Charm++ fatal error:
>>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>>> 101.085352 s on step 1778
>>>
>>>
>>> However, if I don't use CUDA mode, everything goes well and the simulation
>>> can be finished without any error. Would you please give me some advice on
>>> this?
>>>
>>>
>>> ----------step 6.1.inp-------------
>>> structure ../step5_assembly.xplor_ext.psf
>>> coordinates ../step5_assembly.pdb
>>>
>>> set temp 310;
>>> set outputname step6.1_equilibration;
>>>
>>> # read system values written by CHARMM (need to convert uppercase to lowercase)
>>> exec tr "\[:upper:\]" "\[:lower:\]" < ../step5_assembly.str | sed -e "s/ = //g" > step5_assembly.namd.str
>>> source step5_assembly.namd.str
>>>
>>> temperature $temp;
>>>
>>> outputName step6.1_equilibration_a; # base name for output from this run
>>> # NAMD writes two files at the end, final coord and vel
>>> # in the format of first-dyn.coor and first-dyn.vel
>>> firsttimestep 0; # last step of previous run
>>> restartfreq 500; # 500 steps = every 1ps
>>> dcdfreq 1000;
>>> dcdUnitCell yes; # if yes, the dcd files will contain unit cell information
>>> # in the style of charmm DCD files
>>> xstFreq 1000; # XSTFreq: controls how often the extended system configuration
>>> # will be appended to the XST file
>>> outputEnergies 125; # 125 steps = every 0.25ps
>>> # The number of timesteps between each energy output of NAMD
>>> outputTiming 1000; # The number of timesteps between each timing output;
>>> # shows time per step and time to completion
>>>
>>> # Force-Field Parameters
>>> paraTypeCharmm on; # We're using charmm type parameter file(s)
>>> # multiple definitions may be used but only one file per definition
>>>
>>> exec mkdir -p toppar
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all22_prot.prm > toppar/par_all22_prot.prm
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all27_na.prm > toppar/par_all27_na.prm
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all36_carb.prm > toppar/par_all36_carb.prm
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all36_lipid.prm > toppar/par_all36_lipid.prm
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all36_cgenff.prm > toppar/par_all36_cgenff.prm
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
>>> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g" ../toppar/toppar_water_ions.str > toppar/toppar_water_ions.str
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
>>> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g" ../toppar/toppar_all36_lipid_cholesterol.str > toppar/toppar_all36_lipid_cholesterol.str
>>>
>>> parameters toppar/par_all27_prot_na.prm;
>>> parameters toppar/par_all36_lipid.prm;
>>> parameters toppar/par_all22_prot.prm;
>>> parameters toppar/par_all27_na.prm;
>>> parameters toppar/par_all36_carb.prm;
>>> parameters toppar/par_all36_cgenff.prm;
>>> parameters toppar/par_all35_ethers.prm;
>>> parameters toppar/lig.prm;
>>>
>>>
>>> parameters toppar/toppar_water_ions.str;
>>> parameters toppar/toppar_all36_lipid_cholesterol.str;
>>>
>>> # These are specified by CHARMM
>>> exclude scaled1-4 # non-bonded exclusion policy to use: none, 1-2, 1-3, 1-4, or scaled1-4
>>> # 1-2: all atom pairs that are bonded are going to be ignored
>>> # 1-3: 3 consecutively bonded are excluded
>>> # scaled1-4: include all the 1-3, and modified 1-4 interactions
>>> # electrostatic scaled by 1-4scaling factor 1.0
>>> # vdW special 1-4 parameters in charmm parameter file.
>>> 1-4scaling 1.0
>>> switching on
>>> vdwForceSwitching yes; # New option for force-based switching of vdW
>>> # if both switching and vdwForceSwitching are on, CHARMM force
>>> # switching is used for vdW forces.
>>> seed 1333525265 # Specifies a specific seed
>>>
>>> # You have some freedom choosing the cutoff
>>> cutoff 12.0; # may use smaller, maybe 10., with PME
>>> switchdist 10.0; # cutoff - 2.
>>> # switchdist - where you start to switch
>>> # cutoff - where you stop accounting for nonbond interactions.
>>> # correspondence in charmm:
>>> # (cutnb,ctofnb,ctonnb = pairlistdist,cutoff,switchdist)
>>> pairlistdist 16.0; # stores all the pairs within this distance;
>>> # it should be larger than cutoff (+ 2.)
>>> stepspercycle 20; # number of timesteps per cycle
>>> pairlistsPerCycle 2; # 2 is the default
>>> # cycle represents the number of steps between atom reassignments
>>> # this means every 20/2=10 steps the pairlist will be updated
>>>
>>> # Integrator Parameters
>>> timestep 1.0; # fs/step
>>> rigidBonds all; # bond constraints: all bonds involving H are fixed in length
>>> nonbondedFreq 1; # nonbonded forces every step
>>> fullElectFrequency 1; # PME every step
>>>
>>>
>>> # Constant Temperature Control ONLY DURING EQUILB
>>> reassignFreq 500; # reassignFreq: use this to reassign velocities every 500 steps
>>> reassignTemp $temp;
>>>
>>> # Periodic Boundary conditions. Needed explicitly since this is a fresh start (no restart files).
>>> cellBasisVector1 $a 0.0 0.0; # vector to the next image
>>> cellBasisVector2 0.0 $b 0.0;
>>> cellBasisVector3 0.0 0.0 $c;
>>> cellOrigin 0.0 0.0 $zcen; # the *center* of the cell
>>>
>>> wrapWater on; # wrap water to central cell
>>> wrapAll on; # wrap other molecules too
>>> wrapNearest off; # use for non-rectangular cells (wrap to the nearest image)
>>>
>>> # PME (for full-system periodic electrostatics)
>>> exec python ../checkfft.py $a $b $c > checkfft.str
>>> source checkfft.str
>>>
>>> PME yes;
>>> PMEInterpOrder 6; # interpolation order (spline order 6 in charmm)
>>> PMEGridSizeX $fftx; # should be close to the cell size
>>> PMEGridSizeY $ffty; # corresponds to the charmm input fftx/y/z
>>> PMEGridSizeZ $fftz;
>>>
>>> # Pressure and volume control
>>> useGroupPressure yes; # use a hydrogen-group-based pseudo-molecular virial to calculate pressure;
>>> # it has less fluctuation and is needed for rigid bonds (rigidBonds/SHAKE)
>>> useFlexibleCell yes; # yes for anisotropic system like membrane
>>> useConstantRatio yes; # keeps the ratio of the unit cell in the x-y plane constant (A=B)
>>>
>>> langevin on
>>> langevinDamping 10
>>> langevinTemp $temp
>>> langevinHydrogen no
>>>
>>> # planar restraint
>>> colvars on
>>> exec sed -e "s/Constant \$fc/Constant 5/g" -e "s/\$bb/10.0/g" -e
>>> "s/\$sc/5.0/g" membrane_lipid_restraint.namd.col >
>>> restraints/$outputname.col
>>> colvarsConfig restraints/$outputname.col
>>>
>>> # dihedral restraint
>>> extraBonds yes
>>> exec sed -e "s/\$FC/500/g" restraints/dihe.txt >
>>> restraints/$outputname.dihe
>>> extraBondsFile restraints/$outputname.dihe
>>>
>>> minimize 10000
>>>
>>> numsteps 90000000
>>> run 3000000; # 3 ns
>>>
>>>
>>>
>>>
>

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:25 CST