Re: AW: AW: CUDA problem?

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Fri Apr 06 2012 - 08:10:47 CDT

The minimizer seems to have stalled because the gradient and energies are
inconsistent (which is not unusual due to numerical imprecision very near
a minimum). Your IMPRP and VDW energies don't look unusual compared to
the DIHED and much larger negative ELECT energies, and it is impossible to
simultaneously minimize VDW and ELECT, so you would expect one to increase
as the other decreases near a minimum. Minimization is a convergent
process, so the energy won't drop forever.
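
One way to see such a stall in a log like the one quoted below is to compare the TOTAL column across successive ENERGY: records. The Python sketch below is only an illustration under stated assumptions: each ENERGY: record sits on one unwrapped line (unlike the mail-wrapped excerpt in this thread), the column order is taken from the quoted ETITLE line, and the helper names (parse_energies, first_stalled_step) and the tolerance are hypothetical choices, not part of NAMD.

```python
# Column order copied from the ETITLE line quoted in this thread.
FIELDS = ["TS", "BOND", "ANGLE", "DIHED", "IMPRP", "ELECT", "VDW",
          "BOUNDARY", "MISC", "KINETIC", "TOTAL", "TEMP", "POTENTIAL",
          "TOTAL3", "TEMPAVG", "PRESSURE", "GPRESSURE", "VOLUME",
          "PRESSAVG", "GPRESSAVG"]

def parse_energies(lines):
    """Return one dict per ENERGY: record (assumes unwrapped log lines)."""
    records = []
    for line in lines:
        if not line.lstrip().startswith("ENERGY:"):
            continue
        values = [float(tok) for tok in line.split()[1:]]
        if len(values) != len(FIELDS):
            continue  # skip truncated or mail-wrapped records
        rec = dict(zip(FIELDS, values))
        rec["TS"] = int(rec["TS"])
        records.append(rec)
    return records

def first_stalled_step(records, tol=1e-3):
    """Step from which TOTAL no longer changes by more than tol.

    tol is an arbitrary, hypothetical threshold in the log's energy units.
    Returns None if the last pair of records still differs by more than tol.
    """
    stalled_from = None
    for prev, cur in zip(records, records[1:]):
        if abs(cur["TOTAL"] - prev["TOTAL"]) > tol:
            stalled_from = None          # still moving, reset
        elif stalled_from is None:
            stalled_from = prev["TS"]    # start of a flat stretch
    return stalled_from

# Three records from the log quoted in this thread, re-joined onto single lines.
SAMPLE = """\
ENERGY: 1 131687.0782 15972.6729 1093.8588 82.4127 -208979.1378 4016972.9223 0.0000 0.0000 0.0000 3956829.8071 0.0000 3956829.8071 3956829.8071 0.0000 1640862.7514 1655451.5178 672033.8185 1640862.7514 1655451.5178
ENERGY: 9996 121119.4585 15410.7498 1102.0830 129.7461 -213732.3772 18984.3934 0.0000 0.0000 0.0000 -56985.9464 0.0000 -56985.9464 -56985.9464 0.0000 -14603.1699 -621.2528 672033.8185 -14603.1699 -621.2528
ENERGY: 10000 121119.4585 15410.7498 1102.0830 129.7461 -213732.3772 18984.3934 0.0000 0.0000 0.0000 -56985.9464 0.0000 -56985.9464 -56985.9464 0.0000 -14603.1699 -621.2528 672033.8185 -14603.1699 -621.2528
"""

records = parse_energies(SAMPLE.splitlines())
print("flat from step:", first_stalled_step(records))
print("VDW first/last:", records[0]["VDW"], records[-1]["VDW"])
```

On these three records the sketch reports step 9996 as the start of the flat stretch. A flat TOTAL by itself does not say whether the system is at a minimum or stuck; it only locates where the log stops changing, so the per-term values before that point can be inspected.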

-Jim

On Fri, 6 Apr 2012, Francesco Pietra wrote:

> Hi:
> I wonder whether normal completion of minimization with no-cuda namd
> reported by Albert means successful minimization.
>
>
> I have now also tried Linux-x86_64-multicore (64-bit Intel/AMD single
> node), with the same files used before with the cuda 2.8 and 2.9b2
> versions. The requested 10,000 steps were completed without error
> messages, however there was no minimization at all, as shown by the
> starting and ending log below:
>
> TCL: Minimizing for 10000 steps
> ETITLE:      TS           BOND          ANGLE          DIHED
> IMPRP               ELECT            VDW       BOUNDARY           MISC
>        KINETIC               TOTAL           TEMP      POTENTIAL
>   TOTAL3        TEMPAVG            PRESSURE      GPRESSURE
> VOLUME       PRESSAVG      GPRESSAVG
>
> ENERGY:       0    131511.9006     15951.4182      1089.1031
> 80.7094        -208755.0384  69052471.1297         0.0000
> 0.0000         0.0000       68992349.2227         0.0000
> 68992349.2227  68992349.2227         0.0000       28546289.9186
> 28560982.4534    672033.8185  28546289.9186  28560982.4534
>
> MINIMIZER SLOWLY MOVING 192 ATOMS WITH BAD CONTACTS DOWNHILL
> ENERGY:       1    131687.0782     15972.6729      1093.8588
> 82.4127        -208979.1378   4016972.9223         0.0000
> 0.0000         0.0000        3956829.8071         0.0000
> 3956829.8071   3956829.8071         0.0000        1640862.7514
> 1655451.5178    672033.8185   1640862.7514   1655451.5178
>
> MINIMIZER SLOWLY MOVING 103 ATOMS WITH BAD CONTACTS DOWNHILL
> ENERGY:       2    131746.9427     15987.1900      1096.4212
> 85.2803        -209095.4978    451409.8409         0.0000
> 0.0000         0.0000         391230.1773         0.0000
> 391230.1773    391230.1773         0.0000         165310.8297
> 179644.2803    672033.8185    165310.8297    179644.2803
> ..............................
> ..............................
>
> LINE MINIMIZER BRACKET: DX 1.88138e-301 3.76275e-301 DU -4.19171e-06
> 8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
> ENERGY: 9996 121119.4585 15410.7498 1102.0830
> 129.7461 -213732.3772 18984.3934 0.0000
> 0.0000 0.0000 -56985.9464 0.0000
> -56985.9464 -56985.9464 0.0000 -14603.1699
> -621.2528 672033.8185 -14603.1699 -621.2528
>
> LINE MINIMIZER BRACKET: DX 1.88138e-302 3.76275e-301 DU -1.14297e-05
> 8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
> ENERGY: 9997 121119.4585 15410.7498 1102.0830
> 129.7461 -213732.3772 18984.3934 0.0000
> 0.0000 0.0000 -56985.9464 0.0000
> -56985.9464 -56985.9464 0.0000 -14603.1699
> -621.2528 672033.8185 -14603.1699 -621.2528
>
> LINE MINIMIZER BRACKET: DX 1.88138e-303 3.76275e-301 DU -4.75305e-05
> 8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
> ENERGY: 9998 121119.4585 15410.7498 1102.0830
> 129.7461 -213732.3771 18984.3934 0.0000
> 0.0000 0.0000 -56985.9464 0.0000
> -56985.9464 -56985.9464 0.0000 -14603.1699
> -621.2528 672033.8185 -14603.1699 -621.2528
>
> LINE MINIMIZER BRACKET: DX 1.88138e-304 3.76275e-301 DU -5.37204e-05
> 8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
> ENERGY: 9999 121119.4585 15410.7498 1102.0830
> 129.7461 -213732.3772 18984.3934 0.0000
> 0.0000 0.0000 -56985.9464 0.0000
> -56985.9464 -56985.9464 0.0000 -14603.1700
> -621.2528 672033.8185 -14603.1700 -621.2528
>
> LINE MINIMIZER BRACKET: DX 1.88138e-305 3.76275e-301 DU -2.59996e-06
> 8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
> TIMING: 10000 CPU: 1566.85, 0.153887/step Wall: 1566.85,
> 0.153887/step, 0 hours remaining, 514.656250 MB of memory in use.
> ETITLE: TS BOND ANGLE DIHED
> IMPRP ELECT VDW BOUNDARY MISC
> KINETIC TOTAL TEMP POTENTIAL
> TOTAL3 TEMPAVG PRESSURE GPRESSURE
> VOLUME PRESSAVG GPRESSAVG
>
> ENERGY: 10000 121119.4585 15410.7498 1102.0830
> 129.7461 -213732.3772 18984.3934 0.0000
> 0.0000 0.0000 -56985.9464 0.0000
> -56985.9464 -56985.9464 0.0000 -14603.1699
> -621.2528 672033.8185 -14603.1699 -621.2528
>
> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 10000
> WRITING COORDINATES TO RESTART FILE AT STEP 10000
> FINISHED WRITING RESTART COORDINATES
> WRITING VELOCITIES TO RESTART FILE AT STEP 10000
> FINISHED WRITING RESTART VELOCITIES
> WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 10000
> WRITING COORDINATES TO OUTPUT FILE AT STEP 10000
> WRITING VELOCITIES TO OUTPUT FILE AT STEP 10000
> ====================================================
>
> WallClock: 1592.629395 CPUTime: 1592.629395 Memory: 514.656250 MB
> Program finished.
>
> *************
> The gradient:
>
> LINE MINIMIZER REDUCING GRADIENT FROM 4.52147e+08 TO 452147
> MINIMIZER RESTARTING CONJUGATE GRADIENT ALGORITHM
> LINE MINIMIZER REDUCING GRADIENT FROM 4.54669e+08 TO 454669
> .....................
> .....................
> MINIMIZER RESTARTING CONJUGATE GRADIENT ALGORITHM
> LINE MINIMIZER REDUCING GRADIENT FROM 4.54669e+08 TO 454669
> LINE MINIMIZER REDUCING GRADIENT FROM 4.54669e+08 TO 454669
> LINE MINIMIZER REDUCING GRADIENT FROM 4.54665e+08 TO 454665
> LINE MINIMIZER REDUCING GRADIENT FROM 4.54509e+08 TO 454509
> LINE MINIMIZER REDUCING GRADIENT FROM 4.54096e+08 TO 454096
>
> scores very badly, i.e., the minimizer was unable to deal with a badly
> parameterized system.
>
> I wonder whether Albert got the cuda error along a successful minimization.
>
> In my case, the two metal clusters reproduce nicely the crystal data
> and min-restart-coor after the attempted 10,000 step minimization does
> not show any wrong structural element to the naked eye. The ensemble
> is in a water box, which also does not show distortions. I was using
> a 0.1 fs timestep and overall a min.conf that was successful in all
> previous cases of metalloproteins parameterized at home.
>
> My question was, and remains, how to get a clue about atom-atom
> interactions that may explain the high (and un-minimizable) VDW and
> IMPRP. My naive view is that once that adjustment in the input files is
> done, neither no-cuda, nor cuda will show problems any more. I regret
> to be unable to furnish more elements for debugging, however the
> software is not helping me by showing flying out atoms.
>
> Thanks for advice
>
> francesco
>
> On Fri, Apr 6, 2012 at 5:37 AM, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>
>> This is the real error:
>>
>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352
>> s on step 1778
>>
>> What it means is that NAMD has been waiting 101s for the CUDA event
>> indicating that the kernel has completed and NAMD is exiting rather than
>> likely hanging indefinitely.  I've noticed that these errors were more
>> likely with energy evaluation (hence the connection to minimization),
>> certain compiler settings (-ftz), and particular devices on the Forge
>> cluster at NCSA that later crashed, suggesting that this is some kind of
>> hardware issue (GPU or PCIe bus) or driver/runtime/compiler fault.  The
>> alternative is that I've missed a race condition that leads to an infinite
>> loop in the kernel.
>>
>> I'm really hoping someone will find a way to trigger this consistently since
>> in my experience it has been too rare to identify a cause.
>>
>> -Jim
>>
>>
>> On Thu, 5 Apr 2012, Norman Geist wrote:
>>
>>> I guess the developers will fix this soon. As 2.9b2 is a beta, bugs are
>>> expected, and reports are welcome.
>>>
>>>
>>>
>>> Norman Geist.
>>>
>>>
>>>
>>> From: Albert [mailto:mailmd2011_at_gmail.com]
>>> Sent: Thursday, 5 April 2012 08:16
>>> To: Norman Geist; namd-l_at_ks.uiuc.edu
>>> Subject: Re: AW: namd-l: CUDA problem?
>>>
>>>
>>>
>>> Hello:
>>>  thank you very much for the kind messages.
>>> Is there a solution for this problem?
>>>
>>> best
>>> A
>>>
>>> On 04/05/2012 08:12 AM, Norman Geist wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> there seems to be something wrong within the new gpu accelerated
>>> minimization, as Francesco posted the same issue and I answered him a few
>>> seconds ago. I first thought this could also be a hardware issue of a single
>>> gpu, but two people with a broken gpu is really unlikely. So it’s the
>>> developers' turn.
>>>
>>>
>>>
>>> Best wishes
>>>
>>>
>>>
>>> Norman Geist.
>>>
>>>
>>>
>>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
>>> Of Albert
>>> Sent: Wednesday, 4 April 2012 21:03
>>> To: namd-l_at_ks.uiuc.edu
>>> Subject: namd-l: CUDA problem?
>>>
>>>
>>>
>>> Dear:
>>>  I've built a membrane system from CHARMM GUI and use the equilibration
>>> protocol to relax my system. Everything goes well if I use the default
>>> settings, and it finished under CUDA mode. However, there is a ligand in
>>> my system and I would like to restrain it during step 6.1 (see the file
>>> below). Here is what I did to add a restraint for my ligand:
>>>
>>> set sel [atomselect top all]
>>> $sel set beta 0
>>> set fix [atomselect top "protein and backbone or (resname LIG and not
>>> hydrogen)"]
>>> $fix set beta 1
>>> $sel writepdb bb_rmsd.ref
>>>
>>>
>>>
>>> After that I tried to run this 6.1.inp with the command:
>>>
>>> charmrun ++local +p4 namd2 +idlepoll step6.1_equilibration.inp > log
>>>
>>>
>>>
>>>
>>> A few minutes later, it stopped with the following log:
>>>
>>>
>>> ---------log----------------
>>> LINE MINIMIZER BRACKET: DX 7.96611e-05 0.000159322 DU -84.715 50.7203 DUDX
>>> -1.52698e+06 -592619 1.21989e+06
>>> ENERGY: 1776 5819.8403 10258.1721 9471.5998 94.8591 -182114.5405
>>> 16169.6595 0.0000 3.2133 0.0000 -140297.1965 0.0000 -140297.1965
>>> -140297.1965 0.0000 3492.2283 3770.7578 593110.5555 3492.2283 3770.7578
>>>
>>> LINE MINIMIZER BRACKET: DX 5.18225e-05 0.0001075 DU -15.3777 66.098 DUDX
>>> -592619 3098.88 1.21989e+06
>>> ENERGY: 1777 5817.4042 10259.2760 9467.1949 94.8526 -182109.7937
>>> 16170.7783 0.0000 3.2124 0.0000 -140297.0753 0.0000 -140297.0753
>>> -140297.0753 0.0000 3495.3068 3772.9724 593110.5555 3495.3068 3772.9724
>>>
>>> LINE MINIMIZER BRACKET: DX 5.18225e-06 0.0001075 DU -0.121147 66.098 DUDX
>>> -56148.6 3098.88 1.21989e+06
>>> ------------- Processor 2 Exiting: Called CmiAbort ------------
>>> Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>>> 101.085352 s on step 1778
>>>
>>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>>> 101.085352 s on step 1778
>>> Charm++ fatal error:
>>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>>> 101.085352 s on step 1778
>>>
>>>
>>> However, if I don't use CUDA mode, everything goes well and the
>>> simulation finishes without any error. Would you please give me
>>> some advice on this?
>>>
>>>
>>> ----------step 6.1.inp-------------
>>> structure ../step5_assembly.xplor_ext.psf
>>> coordinates ../step5_assembly.pdb
>>>
>>> set temp 310;
>>> set outputname step6.1_equilibration;
>>>
>>> # read system values written by CHARMM (need to convert uppercases to
>>> lowercases)
>>> exec tr "\[:upper:\]" "\[:lower:\]" < ../step5_assembly.str | sed -e "s/ =
>>> //g" > step5_assembly.namd.str
>>> source step5_assembly.namd.str
>>>
>>> temperature $temp;
>>>
>>> outputName step6.1_equilibration_a; # base name for output from this run
>>> # NAMD writes two files at the end, final coord and vel
>>> # in the format of first-dyn.coor and first-dyn.vel
>>> firsttimestep 0; # last step of previous run
>>> restartfreq 500; # 500 steps = every 1ps
>>> dcdfreq 1000;
>>> dcdUnitCell yes; # the file will contain unit cell info in the style of
>>> # charmm dcd files. if yes, the dcd files will contain
>>> # unit cell information in the style of charmm DCD files.
>>> xstFreq 1000; # XSTFreq: control how often the extended system
>>> # configuration will be appended to the XST file
>>> outputEnergies 125; # 125 steps = every 0.25ps
>>> # The number of timesteps between each energy output of NAMD
>>> outputTiming 1000; # The number of timesteps between each timing output
>>> shows
>>> # time per step and time to completion
>>>
>>> # Force-Field Parameters
>>> paraTypeCharmm on; # We're using charmm type parameter file(s)
>>> # multiple definitions may be used but only one file per definition
>>>
>>> exec mkdir -p toppar
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all22_prot.prm >
>>> toppar/par_all22_prot.prm
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all27_na.prm >
>>> toppar/par_all27_na.prm
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_carb.prm >
>>> toppar/par_all36_carb.prm
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_lipid.prm >
>>> toppar/par_all36_lipid.prm
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_cgenff.prm >
>>> toppar/par_all36_cgenff.prm
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
>>> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g"
>>> ./toppar/toppar_water_ions.str > toppar/toppar_water_ions.str
>>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
>>> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g"
>>> ./toppar/toppar_all36_lipid_cholesterol.str >
>>> toppar/toppar_all36_lipid_cholesterol.str
>>>
>>> parameters toppar/par_all27_prot_na.prm;
>>> parameters toppar/par_all36_lipid.prm;
>>> parameters toppar/par_all22_prot.prm;
>>> parameters toppar/par_all27_na.prm;
>>> parameters toppar/par_all36_carb.prm;
>>> parameters toppar/par_all36_cgenff.prm;
>>> parameters toppar/par_all35_ethers.prm;
>>> parameters toppar/lig.prm;
>>>
>>>
>>> parameters toppar/toppar_water_ions.str;
>>> parameters toppar/toppar_all36_lipid_cholesterol.str;
>>>
>>> # These are specified by CHARMM
>>> exclude scaled1-4 # non-bonded exclusion policy to use
>>> "none,1-2,1-3,1-4,or scaled1-4"
>>> # 1-2: all atoms pairs that are bonded are going to be ignored
>>> # 1-3: 3 consecutively bonded are excluded
>>> # scaled1-4: include all the 1-3, and modified 1-4 interactions
>>> # electrostatic scaled by 1-4scaling factor 1.0
>>> # vdW special 1-4 parameters in charmm parameter file.
>>> 1-4scaling 1.0
>>> switching on
>>> vdwForceSwitching yes; # New option for force-based switching of vdW
>>> # if both switching and vdwForceSwitching are on CHARMM force
>>> # switching is used for vdW forces.
>>> seed 1333525265 # Specifies a specific seed
>>>
>>> # You have some freedom choosing the cutoff
>>> cutoff 12.0; # may use smaller, maybe 10., with PME
>>> switchdist 10.0; # cutoff - 2.
>>> # switchdist - where you start to switch
>>> # cutoff - where you stop accounting for nonbond interactions.
>>> # correspondence in charmm:
>>> # (cutnb,ctofnb,ctonnb = pairlistdist,cutoff,switchdist)
>>> pairlistdist 16.0; # stores all the pairs within the distance; it
>>> # should be larger than cutoff( + 2.)
>>> stepspercycle 20; # 20 redo pairlists every ten steps
>>> pairlistsPerCycle 2; # 2 is the default
>>> # cycle represents the number of steps between atom reassignments
>>> # this means every 20/2=10 steps the pairlist will be updated
>>>
>>> # Integrator Parameters
>>> timestep 1.0; # fs/step
>>> rigidBonds all; # Bond constraint: all bonds involving H are fixed in
>>> # length
>>> nonbondedFreq 1; # nonbonded forces every step
>>> fullElectFrequency 1; # PME every step
>>>
>>>
>>> # Constant Temperature Control ONLY DURING EQUILB
>>> reassignFreq 500; # reassignFreq: use this to reassign velocity every 500
>>> steps
>>> reassignTemp $temp;
>>>
>>> # Periodic Boundary conditions. Need this since for a start...
>>> cellBasisVector1 $a 0.0 0.0; # vector to the next image
>>> cellBasisVector2 0.0 $b 0.0;
>>> cellBasisVector3 0.0 0.0 $c;
>>> cellOrigin 0.0 0.0 $zcen; # the *center* of the cell
>>>
>>> wrapWater on; # wrap water to central cell
>>> wrapAll on; # wrap other molecules too
>>> wrapNearest off; # use for non-rectangular cells (wrap to the nearest
>>> image)
>>>
>>> # PME (for full-system periodic electrostatics)
>>> exec python ../checkfft.py $a $b $c > checkfft.str
>>> source checkfft.str
>>>
>>> PME yes;
>>> PMEInterpOrder 6; # interpolation order (spline order 6 in charmm)
>>> PMEGridSizeX $fftx; # should be close to the cell size
>>> PMEGridSizeY $ffty; # corresponds to the charmm input fftx/y/z
>>> PMEGridSizeZ $fftz;
>>>
>>> # Pressure and volume control
>>> useGroupPressure yes; # use a hydrogen-group based pseudo-molecular virial
>>> # to calculate pressure; it has less fluctuation and is needed for rigid
>>> # bonds (rigidBonds/SHAKE)
>>> useFlexibleCell yes; # yes for anisotropic system like membrane
>>> useConstantRatio yes; # keeps the ratio of the unit cell in the x-y plane
>>> constant A=B
>>>
>>> langevin on
>>> langevinDamping 10
>>> langevinTemp $temp
>>> langevinHydrogen no
>>>
>>> # planar restraint
>>> colvars on
>>> exec sed -e "s/Constant \$fc/Constant 5/g" -e "s/\$bb/10.0/g" -e
>>> "s/\$sc/5.0/g" membrane_lipid_restraint.namd.col >
>>> restraints/$outputname.col
>>> colvarsConfig restraints/$outputname.col
>>>
>>> # dihedral restraint
>>> extraBonds yes
>>> exec sed -e "s/\$FC/500/g" restraints/dihe.txt >
>>> restraints/$outputname.dihe
>>> extraBondsFile restraints/$outputname.dihe
>>>
>>> minimize 10000
>>>
>>> numsteps 90000000
>>> run 3000000; # 3ns
>>>
>>>
>>>
>>
>
