AW: Re: AW: AW: CUDA problem?

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Jan 16 2013 - 00:21:57 CST

Hi,

 

Either this is a software bug, which you could check by running the same
simulation (same inputs) on another machine to see whether you hit a
specific case that triggers it, or it is a hardware, configuration, or
driver/setup issue. Either the GPU just doesn't return, or the host just
misses the answer. If the GPU doesn't return, it is a bug; if the host
misses the answer, that's a hardware problem. It could maybe be due to
conflicting interrupt settings, or to missing error correction/control on
the PCIe bus. Also, as it occurs so randomly, it could really be a race
condition, too. Therefore, it would be nice to know whether a heavily
loaded system makes the problem appear more often.
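One crude way to test the load hypothesis would be to saturate the host
cores while the suspect run is active and see whether the timeout shows up
sooner. A minimal sketch, assuming `nproc`, `seq`, and `yes` from GNU
coreutils are available:

```shell
# Start one busy-loop worker per CPU core to keep the host saturated.
NCORES=$(nproc)
for i in $(seq 1 "$NCORES"); do
  yes > /dev/null &
done
echo "started $NCORES load workers"

# ... launch the suspect NAMD run here and watch for the poll timeout ...

# Stop the workers again afterwards.
kill $(jobs -p) 2>/dev/null
```

If the error appears reliably under load but not otherwise, that would
point toward a race condition or interrupt contention rather than a
broken GPU.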

 

Maybe we should collect some data about this:

1. Is there anyone having this error with an ECC enabled host machine?

2. Is there anyone having this error with an ECC enabled GPU?

3. Which CUDA versions does the error occur with?

4. Which driver version (family) does the error occur with?

5. Which NAMD version throws the error?
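
To make the answers comparable, here is a rough sketch of how the data
could be collected on Linux. The exact commands and the log file name
(`your_run.log`) are assumptions; availability varies by system:

```shell
# Run a probe command if present, otherwise note that it is missing.
probe() {
  name=$1; shift
  if command -v "$name" >/dev/null 2>&1; then
    "$@" 2>&1
  else
    echo "(not available)"
  fi
}

echo "== GPU ECC state (questions 1+2) =="
probe nvidia-smi nvidia-smi -q -d ECC
echo "== CUDA toolkit version (question 3) =="
probe nvcc nvcc --version
echo "== Driver version (question 4) =="
cat /proc/driver/nvidia/version 2>/dev/null || echo "(not available)"
echo "== NAMD version (question 5) =="
# 'your_run.log' is a placeholder for an existing NAMD log file.
grep -m1 "Info: NAMD" your_run.log 2>/dev/null || echo "(no log found)"
```

Pasting that output along with a bug report would cover most of the
questions above in one go.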

 

IMHO memcheck doesn't say much about hardware issues on a GPU, as it only
checks the VRAM, not the GPU itself and not the communication across the
PCIe bus. When I had a broken GPU, everything looked fine: no errors, no
display problems. The simulations were also running until they crashed
with an "atoms moving too fast" error, and memcheck reported nothing as
well.

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
Behalf Of Eric Hill
Sent: Tuesday, January 15, 2013 08:13
To: Namd List
Subject: namd-l: Re: AW: AW: CUDA problem?

 

Hello all,

I have this problem also, using several of the latest nightly X64_CUDA
builds as well as the last four stable builds. I am using an NVIDIA GTX580 card
with 3GB memory. My simulations using a GTX260 on this machine have always
worked in the past, but after upgrading to this card I have been
experiencing this issue. I am not performing minimization, and the
simulation seems to run fine until the error occurs. An example output is
shown below:

"ETITLE: TS BOND ANGLE DIHED IMPRP
ELECT VDW BOUNDARY MISC KINETIC
TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG
PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG

ENERGY: 266250 3784.0171 13350.0661 8739.4586 67.6172
-87451.4515 800.1736 0.0000 0.0000 26330.8002
-34379.3187 306.5203 -60710.1189 -34176.8848 307.3187
44.7648 37.6307 375950.5041 -5.2822 -5.4192

FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 102.629924
s on step 266375
[16] Stack Traceback:
  [16:0] CmiAbort+0x95 [0xccd29d]
  [16:1] _Z8NAMD_diePKc+0x62 [0x60529a]
  [16:2] _Z26cuda_check_remote_progressPvd+0xd6 [0x7f0360]
  [16:3] [0xcdab14]
  [16:4] CcdCallBacks+0x7d [0xcda99d]
  [16:5] CsdScheduleForever+0x113 [0xcd470b]
  [16:6] CsdScheduler+0x1c [0xcd4264]
  [16:7] _Z10slave_initiPPc+0x50 [0x60e718]
  [16:8] [0xcd315c]
  [16:9] [0xccd70f]
  [16:10] +0x7efc [0x7ff28b8f3efc]
  [16:11] clone+0x6d [0x7ff28ac88f8d]
[16] Stack Traceback:
  [16:0] [0xcce1cd]
  [16:1] CmiAbort+0xd3 [0xccd2db]
  [16:2] _Z8NAMD_diePKc+0x62 [0x60529a]
  [16:3] _Z26cuda_check_remote_progressPvd+0xd6 [0x7f0360]
  [16:4] [0xcdab14]
  [16:5] CcdCallBacks+0x7d [0xcda99d]
  [16:6] CsdScheduleForever+0x113 [0xcd470b]
  [16:7] CsdScheduler+0x1c [0xcd4264]
  [16:8] _Z10slave_initiPPc+0x50 [0x60e718]
  [16:9] [0xcd315c]
  [16:10] [0xccd70f]
  [16:11] +0x7efc [0x7ff28b8f3efc]
  [16:12] clone+0x6d [0x7ff28ac88f8d]
"
I have performed CUDA_GPU_MEMTEST on this GPU and it has passed, and it also
has no issues with deviceQuery (in the CUDA GPU computing SDK) or the GPU
implementation of AMBER12. Has the cause of this error been determined yet?
It seems the cause must not be an issue with the GPU, since I have the
same GPU in two other machines and both run NAMD fine, but I cannot be sure.
If the cause of this error is still unknown then I hope this information is
helpful to someone.

Best regards,
Eric H.

On 04/06/2012 05:37 AM, Jim Phillips wrote:
>
> This is the real error:
>
> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
> 101.085352 s on step 1778
>
> What it means is that NAMD has been waiting 101s for the CUDA event
> indicating that the kernel has completed and NAMD is exiting rather
> than likely hanging indefinitely. I've noticed that these errors were
> more likely with energy evaluation (hence the connection to
> minimization), certain compiler settings (-ftz), and particular
> devices on the Forge cluster at NCSA that later crashed, suggesting
> that this is some kind of hardware issue (GPU or PCIe bus) or
> driver/runtime/compiler fault. The alternative is that I've missed a
> race condition that leads to an infinite loop in the kernel.
>
> I'm really hoping someone will find a way to trigger this consistently
> since in my experience it has been too rare to identify a cause.
>
> -Jim
>
>
> On Thu, 5 Apr 2012, Norman Geist wrote:
>
>> I guess the developers will fix this soon, as 2.9b2 is a beta; bugs
>> are expected, and reports are welcome.
>>
>>
>>
>> Norman Geist.
>>
>>
>>
>> From: Albert [mailto:mailmd2011_at_gmail.com]
>> Sent: Thursday, April 5, 2012 08:16
>> To: Norman Geist; namd-l_at_ks.uiuc.edu
>> Subject: Re: AW: namd-l: CUDA problem?
>>
>>
>>
>> Hello:
>> thank you very much for the kind messages.
>> Is there a solution for this problem?
>>
>> best
>> A
>>
>> On 04/05/2012 08:12 AM, Norman Geist wrote:
>>
>> Hi,
>>
>>
>>
>> there seems to be something wrong with the new GPU-accelerated
>> minimization, as Francesco posted the same issue and I answered him a
>> few seconds ago. I first thought this could also be a hardware issue
>> with a single GPU, but two people with a broken GPU is really unlikely.
>> So it's the developers' turn.
>>
>>
>>
>> Best wishes
>>
>>
>>
>> Norman Geist.
>>
>>
>>
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
>> On Behalf Of Albert
>> Sent: Wednesday, April 4, 2012 21:03
>> To: namd-l_at_ks.uiuc.edu
>> Subject: namd-l: CUDA problem?
>>
>>
>>
>> Dear all:
>> I've built a membrane system with CHARMM-GUI and used the
>> equilibration protocol to relax my system. Everything goes well if I
>> use the default settings, and the run finishes under CUDA mode. However,
>> there is a ligand in my system and I would like to restrain it during
>> step 6.1 (see the file below). Here is what I did to add restraints
>> for my ligand:
>>
>> set sel [atomselect top all]
>> $sel set beta 0
>> set fix [atomselect top "protein and backbone or (resname LIG and not
>> hydrogen)"]
>> $fix set beta 1
>> $sel writepdb bb_rmsd.ref
>>
>>
>>
>> after that I am trying to run this 6.1.inp by command:
>>
>> charmrun ++local +p4 namd2 +idlepoll step6.1_equilibration.inp > log
>>
>>
>>
>>
>> a few minutes later, it stopped with following logs:
>>
>>
>> ---------log----------------
>> LINE MINIMIZER BRACKET: DX 7.96611e-05 0.000159322 DU -84.715 50.7203
>> DUDX -1.52698e+06 -592619 1.21989e+06
>> ENERGY: 1776 5819.8403 10258.1721 9471.5998 94.8591 -182114.5405
>> 16169.6595 0.0000 3.2133 0.0000 -140297.1965 0.0000 -140297.1965
>> -140297.1965 0.0000 3492.2283 3770.7578 593110.5555 3492.2283 3770.7578
>>
>> LINE MINIMIZER BRACKET: DX 5.18225e-05 0.0001075 DU -15.3777 66.098
>> DUDX -592619 3098.88 1.21989e+06
>> ENERGY: 1777 5817.4042 10259.2760 9467.1949 94.8526 -182109.7937
>> 16170.7783 0.0000 3.2124 0.0000 -140297.0753 0.0000 -140297.0753
>> -140297.0753 0.0000 3495.3068 3772.9724 593110.5555 3495.3068 3772.9724
>>
>> LINE MINIMIZER BRACKET: DX 5.18225e-06 0.0001075 DU -0.121147 66.098
>> DUDX -56148.6 3098.88 1.21989e+06
>> ------------- Processor 2 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times
>> over 101.085352 s on step 1778
>>
>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>> 101.085352 s on step 1778
>> Charm++ fatal error:
>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>> 101.085352 s on step 1778
>>
>>
>> However, if I don't use CUDA mode, everything goes well and the
>> simulation finishes without any error. Would you please give me
>> some advice on this?
>>
>>
>> ----------step 6.1.inp-------------
>> structure ../step5_assembly.xplor_ext.psf
>> coordinates ../step5_assembly.pdb
>>
>> set temp 310;
>> set outputname step6.1_equilibration;
>>
>> # read system values written by CHARMM (need to convert uppercase to
>> # lowercase)
>> exec tr "\[:upper:\]" "\[:lower:\]" < ../step5_assembly.str | sed -e
>> "s/ = //g" > step5_assembly.namd.str
>> source step5_assembly.namd.str
>>
>> temperature $temp;
>>
>> outputName step6.1_equilibration_a; # base name for output from this run
>> # NAMD writes two files at the end, final coord and vel
>> # in the format of first-dyn.coor and first-dyn.vel
>> firsttimestep 0; # last step of previous run
>> restartfreq 500; # 500 steps = every 1ps
>> dcdfreq 1000;
>> dcdUnitCell yes; # if yes, the dcd files will contain unit cell
>> # information in the style of charmm DCD files
>> xstFreq 1000; # XSTFreq: controls how often the extended system
>> # configuration will be appended to the XST file
>> outputEnergies 125; # 125 steps = every 0.25ps
>> # The number of timesteps between each energy output of NAMD
>> outputTiming 1000; # The number of timesteps between each timing
>> # output; shows time per step and time to completion
>>
>> # Force-Field Parameters
>> paraTypeCharmm on; # We're using charmm type parameter file(s)
>> # multiple definitions may be used but only one file per definition
>>
>> exec mkdir -p toppar
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g"
>> ../toppar/par_all22_prot.prm > toppar/par_all22_prot.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g"
>> ../toppar/par_all27_na.prm > toppar/par_all27_na.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g"
>> ../toppar/par_all36_carb.prm > toppar/par_all36_carb.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g"
>> ../toppar/par_all36_lipid.prm > toppar/par_all36_lipid.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g"
>> ../toppar/par_all36_cgenff.prm > toppar/par_all36_cgenff.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
>> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g"
>> ../toppar/toppar_water_ions.str > toppar/toppar_water_ions.str
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
>> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g"
>> ../toppar/toppar_all36_lipid_cholesterol.str >
>> toppar/toppar_all36_lipid_cholesterol.str
>>
>> parameters toppar/par_all27_prot_na.prm;
>> parameters toppar/par_all36_lipid.prm;
>> parameters toppar/par_all22_prot.prm;
>> parameters toppar/par_all27_na.prm;
>> parameters toppar/par_all36_carb.prm;
>> parameters toppar/par_all36_cgenff.prm;
>> parameters toppar/par_all35_ethers.prm;
>> parameters toppar/lig.prm;
>>
>>
>> parameters toppar/toppar_water_ions.str;
>> parameters toppar/toppar_all36_lipid_cholesterol.str;
>>
>> # These are specified by CHARMM
>> exclude scaled1-4 # non-bonded exclusion policy to use:
>> # "none, 1-2, 1-3, 1-4, or scaled1-4"
>> # 1-2: all atom pairs that are bonded are ignored
>> # 1-3: 3 consecutively bonded atoms are excluded
>> # scaled1-4: include all the 1-3 exclusions and modify 1-4 interactions:
>> # electrostatics scaled by the 1-4scaling factor (1.0 here),
>> # vdW uses the special 1-4 parameters in the charmm parameter file
>> 1-4scaling 1.0
>> switching on
>> vdwForceSwitching yes; # New option for force-based switching of vdW
>> # if both switching and vdwForceSwitching are on CHARMM force
>> # switching is used for vdW forces.
>> seed 1333525265 # Specifies a specific seed
>>
>> # You have some freedom choosing the cutoff
>> cutoff 12.0; # may use smaller, maybe 10., with PME
>> switchdist 10.0; # cutoff - 2.
>> # switchdist - where you start to switch
>> # cutoff - where you stop accounting for nonbond interactions.
>> # correspondence in charmm:
>> # (cutnb,ctofnb,ctonnb = pairlistdist,cutoff,switchdist)
>> pairlistdist 16.0; # stores all the pairs within this distance;
>> # should be larger than cutoff (+ 2.)
>> stepspercycle 20; # steps between atom reassignments
>> pairlistsPerCycle 2; # 2 is the default
>> # cycle represents the number of steps between atom reassignments
>> # this means every 20/2=10 steps the pairlist will be updated
>>
>> # Integrator Parameters
>> timestep 1.0; # fs/step
>> rigidBonds all; # bond constraints: all bonds involving H are fixed
>> # in length
>> nonbondedFreq 1; # nonbonded forces every step
>> fullElectFrequency 1; # PME every step
>>
>>
>> # Constant Temperature Control ONLY DURING EQUILB
>> reassignFreq 500; # reassign velocities every 500 steps
>> reassignTemp $temp;
>>
>> # Periodic Boundary conditions. Need this since for a start...
>> cellBasisVector1 $a 0.0 0.0; # vector to the next image
>> cellBasisVector2 0.0 $b 0.0;
>> cellBasisVector3 0.0 0.0 $c;
>> cellOrigin 0.0 0.0 $zcen; # the *center* of the cell
>>
>> wrapWater on; # wrap water to central cell
>> wrapAll on; # wrap other molecules too
>> wrapNearest off; # use for non-rectangular cells (wrap to the
>> # nearest image)
>>
>> # PME (for full-system periodic electrostatics)
>> exec python ../checkfft.py $a $b $c > checkfft.str
>> source checkfft.str
>>
>> PME yes;
>> PMEInterpOrder 6; # interpolation order (spline order 6 in charmm)
>> PMEGridSizeX $fftx; # should be close to the cell size
>> PMEGridSizeY $ffty; # corresponds to the charmm input fftx/y/z
>> PMEGridSizeZ $fftz;
>>
>> # Pressure and volume control
>> useGroupPressure yes; # use a hydrogen-group-based pseudo-molecular
>> # virial to calculate pressure; it has less fluctuation and is
>> # needed for rigid bonds (rigidBonds/SHAKE)
>> useFlexibleCell yes; # yes for anisotropic system like membrane
>> useConstantRatio yes; # keeps the ratio of the unit cell in the
>> # x-y plane constant (A = B)
>>
>> langevin on
>> langevinDamping 10
>> langevinTemp $temp
>> langevinHydrogen no
>>
>> # planar restraint
>> colvars on
>> exec sed -e "s/Constant \$fc/Constant 5/g" -e "s/\$bb/10.0/g" -e
>> "s/\$sc/5.0/g" membrane_lipid_restraint.namd.col >
>> restraints/$outputname.col
>> colvarsConfig restraints/$outputname.col
>>
>> # dihedral restraint
>> extraBonds yes
>> exec sed -e "s/\$FC/500/g" restraints/dihe.txt >
>> restraints/$outputname.dihe
>> extraBondsFile restraints/$outputname.dihe
>>
>> minimize 10000
>>
>> numsteps 90000000
>> run 3000000 ;# 3 ns (at 1 fs/step)
>>
>>
>>
>>

 

This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:20:51 CST