Re: AW: AW: CUDA problem?

From: Eric Hill (ehh713_at_gmail.com)
Date: Tue Jan 15 2013 - 01:12:53 CST

Hello all,

I have this problem also, using several of the latest nightly X64_CUDA
builds as well as the last four stable builds. I am using an NVIDIA GTX580
card with 3 GB of memory. My simulations using a GTX260 on this machine
have always worked in the past, but after upgrading to this card I have
been experiencing this issue. I am not performing minimization, and the
simulation seems to run fine until the error occurs. An example output
is shown below:

"ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG

ENERGY: 266250 3784.0171 13350.0661 8739.4586 67.6172 -87451.4515 800.1736 0.0000 0.0000 26330.8002 -34379.3187 306.5203 -60710.1189 -34176.8848 307.3187 44.7648 37.6307 375950.5041 -5.2822 -5.4192

FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
102.629924 s on step 266375
[16] Stack Traceback:
   [16:0] CmiAbort+0x95 [0xccd29d]
   [16:1] _Z8NAMD_diePKc+0x62 [0x60529a]
   [16:2] _Z26cuda_check_remote_progressPvd+0xd6 [0x7f0360]
   [16:3] [0xcdab14]
   [16:4] CcdCallBacks+0x7d [0xcda99d]
   [16:5] CsdScheduleForever+0x113 [0xcd470b]
   [16:6] CsdScheduler+0x1c [0xcd4264]
   [16:7] _Z10slave_initiPPc+0x50 [0x60e718]
   [16:8] [0xcd315c]
   [16:9] [0xccd70f]
   [16:10] +0x7efc [0x7ff28b8f3efc]
   [16:11] clone+0x6d [0x7ff28ac88f8d]
[16] Stack Traceback:
   [16:0] [0xcce1cd]
   [16:1] CmiAbort+0xd3 [0xccd2db]
   [16:2] _Z8NAMD_diePKc+0x62 [0x60529a]
   [16:3] _Z26cuda_check_remote_progressPvd+0xd6 [0x7f0360]
   [16:4] [0xcdab14]
   [16:5] CcdCallBacks+0x7d [0xcda99d]
   [16:6] CsdScheduleForever+0x113 [0xcd470b]
   [16:7] CsdScheduler+0x1c [0xcd4264]
   [16:8] _Z10slave_initiPPc+0x50 [0x60e718]
   [16:9] [0xcd315c]
   [16:10] [0xccd70f]
   [16:11] +0x7efc [0x7ff28b8f3efc]
   [16:12] clone+0x6d [0x7ff28ac88f8d]
"
I have run CUDA_GPU_MEMTEST on this GPU and it passed, and it also has
no issues with deviceQuery (from the CUDA GPU Computing SDK) or the GPU
implementation of AMBER12. Has the cause of this error been determined
yet? It seems the cause must not be an issue with the GPU itself, since
I have the same GPU in two other machines and both run NAMD fine, but I
cannot be sure. If the cause of this error is still unknown, then I hope
this information is helpful to someone.
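For anyone digging into the abort itself: as described further down the thread, it comes from a watchdog that repeatedly polls a CUDA completion event and gives up after a fixed number of polls rather than hang forever. A minimal sketch of that pattern in Python (hypothetical names; this only illustrates the poll-count/timeout logic, it is not NAMD's actual code):

```python
import time

# NAMD's cuda_check_remote_progress reportedly gives up after this many polls
POLL_LIMIT = 1_000_000

def wait_for_kernel(query_done, poll_limit=POLL_LIMIT):
    """Poll query_done() until it returns True or poll_limit is exhausted.

    query_done stands in for querying the kernel-completion event
    (e.g. cudaEventQuery). Returns the number of polls used, or raises
    rather than spinning indefinitely, mirroring the error message above.
    """
    start = time.monotonic()
    for polls in range(1, poll_limit + 1):
        if query_done():
            return polls
    elapsed = time.monotonic() - start
    raise RuntimeError(
        f"FATAL ERROR: polled {poll_limit} times over {elapsed:.6f} s")

# Example: a simulated kernel that completes on the 5th poll
state = {"n": 0}
def fake_query():
    state["n"] += 1
    return state["n"] >= 5
```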

Best regards,
Eric H.

>
> On 04/06/2012 05:37 AM, Jim Phillips wrote:
>
> This is the real error:
>
> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
> 101.085352 s on step 1778
>
> What it means is that NAMD has been waiting 101 s for the CUDA event
> indicating that the kernel has completed, and NAMD is exiting rather
> than likely hanging indefinitely. I've noticed that these errors were
> more likely with energy evaluation (hence the connection to
> minimization), certain compiler settings (-ftz), and particular
> devices on the Forge cluster at NCSA that later crashed, suggesting
> that this is some kind of hardware issue (GPU or PCIe bus) or a
> driver/runtime/compiler fault. The alternative is that I've missed a
> race condition that leads to an infinite loop in the kernel.
>
> I'm really hoping someone will find a way to trigger this consistently,
> since in my experience it has been too rare to identify a cause.
>
> -Jim
>
> On Thu, 5 Apr 2012, Norman Geist wrote:
>
> > I guess the developers will fix this soon; as 2.9b2 is a beta, bugs
> > are expected, and reports are welcome.
> >
> > Norman Geist.
> >
> > From: Albert [mailto:mailmd2011_at_gmail.com]
> > Sent: Thursday, 5 April 2012 08:16
> > To: Norman Geist; namd-l_at_ks.uiuc.edu
> > Subject: Re: AW: namd-l: CUDA problem?
> >
> > Hello:
> > Thank you very much for your kind messages.
> > Is there a solution for this problem?
> >
> > Best,
> > A
> >
> > On 04/05/2012 08:12 AM, Norman Geist wrote:
> >
> > Hi,
> >
> > There seems to be something wrong within the new GPU-accelerated
> > minimization, as Francesco posted the same issue and I answered him a
> > few seconds ago. I first thought this could also be a hardware issue
> > with a single GPU, but two people with broken GPUs is really unlikely.
> > So it's the developers' turn.
> >
> > Best wishes
> >
> > Norman Geist.
> >
> > From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> > On behalf of Albert
> > Sent: Wednesday, 4 April 2012 21:03
> > To: namd-l_at_ks.uiuc.edu
> > Subject: namd-l: CUDA problem?
> >
> > Dear all:
> > I've built a membrane system with CHARMM-GUI and am using the
> > equilibration protocol to relax my system. Everything goes well if I
> > use the default settings, and the run finishes under CUDA mode. However,
> > there is a ligand in my system and I would like to restrain it during
> > step 6.1 (see the file below). Here is what I did to add restraints
> > for my ligand:
> >
> > set sel [atomselect top all]
> > $sel set beta 0
> > set fix [atomselect top "protein and backbone or (resname LIG and not hydrogen)"]
> > $fix set beta 1
> > $sel writepdb bb_rmsd.ref
> >
> > After that, I try to run this 6.1.inp with the command:
> >
> > charmrun ++local +p4 namd2 +idlepoll step6.1_equilibration.inp > log
> >
> > A few minutes later, it stopped with the following log:
> >
> > ---------log----------------
> > LINE MINIMIZER BRACKET: DX 7.96611e-05 0.000159322 DU -84.715 50.7203 DUDX -1.52698e+06 -592619 1.21989e+06
> > ENERGY: 1776 5819.8403 10258.1721 9471.5998 94.8591 -182114.5405 16169.6595 0.0000 3.2133 0.0000 -140297.1965 0.0000 -140297.1965 -140297.1965 0.0000 3492.2283 3770.7578 593110.5555 3492.2283 3770.7578
> >
> > LINE MINIMIZER BRACKET: DX 5.18225e-05 0.0001075 DU -15.3777 66.098 DUDX -592619 3098.88 1.21989e+06
> > ENERGY: 1777 5817.4042 10259.2760 9467.1949 94.8526 -182109.7937 16170.7783 0.0000 3.2124 0.0000 -140297.0753 0.0000 -140297.0753 -140297.0753 0.0000 3495.3068 3772.9724 593110.5555 3495.3068 3772.9724
> >
> > LINE MINIMIZER BRACKET: DX 5.18225e-06 0.0001075 DU -0.121147 66.098 DUDX -56148.6 3098.88 1.21989e+06
> > ------------- Processor 2 Exiting: Called CmiAbort ------------
> > Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352 s on step 1778
> >
> > FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352 s on step 1778
> > Charm++ fatal error:
> > FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352 s on step 1778
> >
> > However, if I don't use CUDA mode, everything goes well and the
> > simulation finishes without any error. Could you please give me
> > some advice on this?
> >
> > ----------step 6.1.inp-------------
> > structure ../step5_assembly.xplor_ext.psf
> > coordinates ../step5_assembly.pdb
> >
> > set temp 310;
> > set outputname step6.1_equilibration;
> >
> > # read system values written by CHARMM (need to convert uppercase to lowercase)
> > exec tr "\[:upper:\]" "\[:lower:\]" < ../step5_assembly.str | sed -e "s/ = //g" > step5_assembly.namd.str
> > source step5_assembly.namd.str
> >
> > temperature $temp;
> >
> > outputName step6.1_equilibration_a; # base name for output from this run
> > # NAMD writes two files at the end, final coord and vel,
> > # in the format of first-dyn.coor and first-dyn.vel
> > firsttimestep 0; # last step of previous run
> > restartfreq 500; # 500 steps = every 1 ps
> > dcdfreq 1000;
> > dcdUnitCell yes; # if yes, the DCD files will contain unit cell
> > # information in the style of CHARMM DCD files
> > xstFreq 1000; # XSTFreq: controls how often the extended system
> > # configuration will be appended to the XST file
> > outputEnergies 125; # 125 steps = every 0.25 ps
> > # the number of timesteps between each energy output of NAMD
> > outputTiming 1000; # the number of timesteps between each timing
> > # output, showing time per step and time to completion
> >
> > # Force-Field Parameters
> > paraTypeCharmm on; # we're using CHARMM-type parameter file(s);
> > # multiple definitions may be used but only one file per definition
> >
> > exec mkdir -p toppar
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all22_prot.prm > toppar/par_all22_prot.prm
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all27_na.prm > toppar/par_all27_na.prm
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_carb.prm > toppar/par_all36_carb.prm
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_lipid.prm > toppar/par_all36_lipid.prm
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_cgenff.prm > toppar/par_all36_cgenff.prm
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
> >   -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g" ./toppar/toppar_water_ions.str > toppar/toppar_water_ions.str
> > exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
> >   -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g" ./toppar/toppar_all36_lipid_cholesterol.str > toppar/toppar_all36_lipid_cholesterol.str
> >
> > parameters toppar/par_all27_prot_na.prm;
> > parameters toppar/par_all36_lipid.prm;
> > parameters toppar/par_all22_prot.prm;
> > parameters toppar/par_all27_na.prm;
> > parameters toppar/par_all36_carb.prm;
> > parameters toppar/par_all36_cgenff.prm;
> > parameters toppar/par_all35_ethers.prm;
> > parameters toppar/lig.prm;
> >
> > parameters toppar/toppar_water_ions.str;
> > parameters toppar/toppar_all36_lipid_cholesterol.str;
> >
> > # These are specified by CHARMM
> > exclude scaled1-4 # non-bonded exclusion policy to use:
> > # "none", "1-2", "1-3", "1-4", or "scaled1-4"
> > # 1-2: all atom pairs that are bonded are ignored
> > # 1-3: 3 consecutively bonded atoms are excluded
> > # scaled1-4: include all the 1-3 exclusions, plus modified 1-4 interactions:
> > # electrostatics scaled by the 1-4scaling factor 1.0,
> > # vdW uses the special 1-4 parameters in the CHARMM parameter file
> > 1-4scaling 1.0
> > switching on
> > vdwForceSwitching yes; # new option for force-based switching of vdW;
> > # if both switching and vdwForceSwitching are on, CHARMM force
> > # switching is used for vdW forces
> > seed 1333525265 # specifies a specific seed
> >
> > # You have some freedom choosing the cutoff
> > cutoff 12.0; # may use smaller, maybe 10., with PME
> > switchdist 10.0; # cutoff - 2.
> > # switchdist - where you start to switch
> > # cutoff - where you stop accounting for nonbonded interactions
> > # correspondence in CHARMM:
> > # (cutnb,ctofnb,ctonnb = pairlistdist,cutoff,switchdist)
> > pairlistdist 16.0; # stores all the pairs within this distance;
> > # should be larger than cutoff (+ 2.)
> > stepspercycle 20; # number of steps per cycle
> > pairlistsPerCycle 2; # 2 is the default
> > # a cycle represents the number of steps between atom reassignments;
> > # this means every 20/2 = 10 steps the pairlist will be updated
> >
> > # Integrator Parameters
> > timestep 1.0; # fs/step
> > rigidBonds all; # bond constraints: all bonds involving H are fixed in length
> > nonbondedFreq 1; # nonbonded forces every step
> > fullElectFrequency 1; # PME every step
> >
> > # Constant Temperature Control ONLY DURING EQUILB
> > reassignFreq 500; # reassignFreq: use this to reassign velocities every 500 steps
> > reassignTemp $temp;
> >
> > # Periodic Boundary Conditions. Needed for a start...
> > cellBasisVector1 $a 0.0 0.0; # vector to the next image
> > cellBasisVector2 0.0 $b 0.0;
> > cellBasisVector3 0.0 0.0 $c;
> > cellOrigin 0.0 0.0 $zcen; # the *center* of the cell
> >
> > wrapWater on; # wrap water to central cell
> > wrapAll on; # wrap other molecules too
> > wrapNearest off; # use for non-rectangular cells (wrap to the nearest image)
> >
> > # PME (for full-system periodic electrostatics)
> > exec python ../checkfft.py $a $b $c > checkfft.str
> > source checkfft.str
> >
> > PME yes;
> > PMEInterpOrder 6; # interpolation order (spline order 6 in CHARMM)
> > PMEGridSizeX $fftx; # should be close to the cell size
> > PMEGridSizeY $ffty; # corresponds to the CHARMM input fftx/y/z
> > PMEGridSizeZ $fftz;
> >
> > # Pressure and volume control
> > useGroupPressure yes; # use a hydrogen-group-based pseudo-molecular
> > # virial to calculate pressure; has less fluctuation and is
> > # needed for rigid bonds (rigidBonds/SHAKE)
> > useFlexibleCell yes; # yes for an anisotropic system like a membrane
> > useConstantRatio yes; # keeps the ratio of the unit cell in the x-y plane constant, A=B
> >
> > langevin on
> > langevinDamping 10
> > langevinTemp $temp
> > langevinHydrogen no
> >
> > # planar restraint
> > colvars on
> > exec sed -e "s/Constant \$fc/Constant 5/g" -e "s/\$bb/10.0/g" -e "s/\$sc/5.0/g" membrane_lipid_restraint.namd.col > restraints/$outputname.col
> > colvarsConfig restraints/$outputname.col
> >
> > # dihedral restraint
> > extraBonds yes
> > exec sed -e "s/\$FC/500/g" restraints/dihe.txt > restraints/$outputname.dihe
> > extraBondsFile restraints/$outputname.dihe
> >
> > minimize 10000
> >
> > numsteps 90000000
> > run 3000000 ; # 3 ns
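
Note: the quoted input above writes bb_rmsd.ref via VMD but never references it in the NAMD configuration, so the beta-column restraints are not actually applied. Enabling harmonic positional restraints from a beta-column reference file normally requires a block along these lines (keywords from the NAMD user guide; the scaling value is just an example):

```
constraints        on
consref            bb_rmsd.ref   ;# reference coordinates
conskfile          bb_rmsd.ref   ;# per-atom force constants read from this file
conskcol           B             ;# take the force constant from the beta (B) column
constraintScaling  1.0           ;# global multiplier on the force constants
```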

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:53 CST