Re: AW: AW: CUDA problem?

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Fri Apr 06 2012 - 02:59:59 CDT

Hi:
I wonder whether normal completion of minimization with no-cuda namd
reported by Albert means successful minimization.

I have now also tried Linux-x86_64-multicore (64-bit Intel/AMD single
node), with the same files used before with the cuda 2.8 and 2.9b2
versions. The requested 10,000 steps were completed without error
messages; however, there was no minimization at all, as shown by the
starting and ending log below:

TCL: Minimizing for 10000 steps
ETITLE:      TS           BOND          ANGLE          DIHED
IMPRP               ELECT            VDW       BOUNDARY           MISC
       KINETIC               TOTAL           TEMP      POTENTIAL
  TOTAL3        TEMPAVG            PRESSURE      GPRESSURE
VOLUME       PRESSAVG      GPRESSAVG

ENERGY:       0    131511.9006     15951.4182      1089.1031
80.7094        -208755.0384  69052471.1297         0.0000
0.0000         0.0000       68992349.2227         0.0000
68992349.2227  68992349.2227         0.0000       28546289.9186
28560982.4534    672033.8185  28546289.9186  28560982.4534

MINIMIZER SLOWLY MOVING 192 ATOMS WITH BAD CONTACTS DOWNHILL
ENERGY:       1    131687.0782     15972.6729      1093.8588
82.4127        -208979.1378   4016972.9223         0.0000
0.0000         0.0000        3956829.8071         0.0000
3956829.8071   3956829.8071         0.0000        1640862.7514
1655451.5178    672033.8185   1640862.7514   1655451.5178

MINIMIZER SLOWLY MOVING 103 ATOMS WITH BAD CONTACTS DOWNHILL
ENERGY:       2    131746.9427     15987.1900      1096.4212
85.2803        -209095.4978    451409.8409         0.0000
0.0000         0.0000         391230.1773         0.0000
391230.1773    391230.1773         0.0000         165310.8297
179644.2803    672033.8185    165310.8297    179644.2803
..............................
..............................

LINE MINIMIZER BRACKET: DX 1.88138e-301 3.76275e-301 DU -4.19171e-06
8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
ENERGY: 9996 121119.4585 15410.7498 1102.0830
129.7461 -213732.3772 18984.3934 0.0000
0.0000 0.0000 -56985.9464 0.0000
-56985.9464 -56985.9464 0.0000 -14603.1699
-621.2528 672033.8185 -14603.1699 -621.2528

LINE MINIMIZER BRACKET: DX 1.88138e-302 3.76275e-301 DU -1.14297e-05
8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
ENERGY: 9997 121119.4585 15410.7498 1102.0830
129.7461 -213732.3772 18984.3934 0.0000
0.0000 0.0000 -56985.9464 0.0000
-56985.9464 -56985.9464 0.0000 -14603.1699
-621.2528 672033.8185 -14603.1699 -621.2528

LINE MINIMIZER BRACKET: DX 1.88138e-303 3.76275e-301 DU -4.75305e-05
8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
ENERGY: 9998 121119.4585 15410.7498 1102.0830
129.7461 -213732.3771 18984.3934 0.0000
0.0000 0.0000 -56985.9464 0.0000
-56985.9464 -56985.9464 0.0000 -14603.1699
-621.2528 672033.8185 -14603.1699 -621.2528

LINE MINIMIZER BRACKET: DX 1.88138e-304 3.76275e-301 DU -5.37204e-05
8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
ENERGY: 9999 121119.4585 15410.7498 1102.0830
129.7461 -213732.3772 18984.3934 0.0000
0.0000 0.0000 -56985.9464 0.0000
-56985.9464 -56985.9464 0.0000 -14603.1700
-621.2528 672033.8185 -14603.1700 -621.2528

LINE MINIMIZER BRACKET: DX 1.88138e-305 3.76275e-301 DU -2.59996e-06
8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
TIMING: 10000 CPU: 1566.85, 0.153887/step Wall: 1566.85,
0.153887/step, 0 hours remaining, 514.656250 MB of memory in use.
ETITLE: TS BOND ANGLE DIHED
IMPRP ELECT VDW BOUNDARY MISC
       KINETIC TOTAL TEMP POTENTIAL
  TOTAL3 TEMPAVG PRESSURE GPRESSURE
VOLUME PRESSAVG GPRESSAVG

ENERGY: 10000 121119.4585 15410.7498 1102.0830
129.7461 -213732.3772 18984.3934 0.0000
0.0000 0.0000 -56985.9464 0.0000
-56985.9464 -56985.9464 0.0000 -14603.1699
-621.2528 672033.8185 -14603.1699 -621.2528

WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 10000
WRITING COORDINATES TO RESTART FILE AT STEP 10000
FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 10000
FINISHED WRITING RESTART VELOCITIES
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 10000
WRITING COORDINATES TO OUTPUT FILE AT STEP 10000
WRITING VELOCITIES TO OUTPUT FILE AT STEP 10000
====================================================

WallClock: 1592.629395 CPUTime: 1592.629395 Memory: 514.656250 MB
Program finished.

*************
The gradient:

LINE MINIMIZER REDUCING GRADIENT FROM 4.52147e+08 TO 452147
MINIMIZER RESTARTING CONJUGATE GRADIENT ALGORITHM
LINE MINIMIZER REDUCING GRADIENT FROM 4.54669e+08 TO 454669
.....................
.....................
MINIMIZER RESTARTING CONJUGATE GRADIENT ALGORITHM
LINE MINIMIZER REDUCING GRADIENT FROM 4.54669e+08 TO 454669
LINE MINIMIZER REDUCING GRADIENT FROM 4.54669e+08 TO 454669
LINE MINIMIZER REDUCING GRADIENT FROM 4.54665e+08 TO 454665
LINE MINIMIZER REDUCING GRADIENT FROM 4.54509e+08 TO 454509
LINE MINIMIZER REDUCING GRADIENT FROM 4.54096e+08 TO 454096

scores very badly, i.e., the minimizer was unable to deal with a badly
parameterized system.
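
A minimal Python sketch of how such a stall could be spotted automatically in a log like the one above. It assumes an unwrapped log file in which each ETITLE, ENERGY, and LINE MINIMIZER BRACKET record sits on a single line; the function names and the tolerance are illustrative choices, not part of NAMD. The sample records are copied from the log above.

```python
def parse_namd_energies(log_text):
    """Return one dict per ENERGY record, keyed by the ETITLE column names."""
    titles, rows = None, []
    for line in log_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "ETITLE:":
            titles = parts[1:]
        elif parts[0] == "ENERGY:" and titles is not None:
            rows.append(dict(zip(titles, map(float, parts[1:]))))
    return rows


def bracket_steps(log_text):
    """Return the first DX value of every LINE MINIMIZER BRACKET record."""
    prefix = "LINE MINIMIZER BRACKET: DX"
    return [float(line[len(prefix):].split()[0])
            for line in log_text.splitlines() if line.startswith(prefix)]


def is_stalled(rows, key="TOTAL", window=3, tol=1e-3):
    """True if `key` varied by less than `tol` over the last `window` records.

    `window` and `tol` are arbitrary illustrative values.
    """
    if len(rows) < window:
        return False
    tail = [row[key] for row in rows[-window:]]
    return max(tail) - min(tail) < tol


# Records taken from the log in this message, rejoined onto single lines.
SAMPLE_LOG = """
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
LINE MINIMIZER BRACKET: DX 1.88138e-301 3.76275e-301 DU -4.19171e-06 8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
ENERGY: 9996 121119.4585 15410.7498 1102.0830 129.7461 -213732.3772 18984.3934 0.0000 0.0000 0.0000 -56985.9464 0.0000 -56985.9464 -56985.9464 0.0000 -14603.1699 -621.2528 672033.8185 -14603.1699 -621.2528
LINE MINIMIZER BRACKET: DX 1.88138e-302 3.76275e-301 DU -1.14297e-05 8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
ENERGY: 9997 121119.4585 15410.7498 1102.0830 129.7461 -213732.3772 18984.3934 0.0000 0.0000 0.0000 -56985.9464 0.0000 -56985.9464 -56985.9464 0.0000 -14603.1699 -621.2528 672033.8185 -14603.1699 -621.2528
LINE MINIMIZER BRACKET: DX 1.88138e-303 3.76275e-301 DU -4.75305e-05 8.16787e-06 DUDX 1.22061e+06 1.22061e+06 1.22061e+06
ENERGY: 9998 121119.4585 15410.7498 1102.0830 129.7461 -213732.3771 18984.3934 0.0000 0.0000 0.0000 -56985.9464 0.0000 -56985.9464 -56985.9464 0.0000 -14603.1699 -621.2528 672033.8185 -14603.1699 -621.2528
"""

rows = parse_namd_energies(SAMPLE_LOG)
stalled = is_stalled(rows)            # TOTAL is identical in all three records
dx_values = bracket_steps(SAMPLE_LOG)  # shrinks tenfold from record to record
```

With the records above, the total energy does not move while the bracket step DX keeps shrinking by a factor of ten per step, which is the pattern I read as the minimizer being stuck.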

I wonder whether Albert got the cuda error along a successful minimization.

In my case, the two metal clusters reproduce the crystal data nicely,
and the min-restart-coor file after the attempted 10,000-step minimization
does not show any wrong structural element to the naked eye. The ensemble
is in a water box, which also does not show distortions. I was using a
0.1 fs timestep and, overall, a min.conf that was successful in all previous
cases of metalloproteins parameterized at home.

My question was, and remains, how to get a clue about the atom-atom
interactions that may explain the high (and un-minimizable) VDW and
IMPR. My naive view is that once the input files are adjusted
accordingly, neither no-cuda nor cuda will show problems any more. I regret
being unable to furnish more elements for debugging; however, the
software is not helping me by showing atoms flying out.
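
One way to look for such atom-atom interactions outside NAMD is to list the closest atom pairs in the coordinate file. The following is only a sketch under stated assumptions: it reads standard fixed-column PDB ATOM/HETATM records, knows nothing about bonds, exclusions, or periodic images, and the cutoff is an arbitrary choice that should be kept below the shortest bond length so that bonded neighbours are not reported. The helper `pdb_line` and the four sample atoms are made up for illustration and are not from my system.

```python
import math
from collections import defaultdict
from itertools import product


def read_pdb_atoms(pdb_text):
    """Parse ATOM/HETATM records using the fixed PDB column layout."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),
                "resname": line[17:21].strip(),
                "resid": line[22:26].strip(),
                "xyz": (float(line[30:38]), float(line[38:46]),
                        float(line[46:54])),
            })
    return atoms


def close_contacts(atoms, cutoff=0.9):
    """Return sorted (distance, i, j) for atom pairs closer than `cutoff` (in A).

    Atoms are binned into cubic cells of edge `cutoff`, so only the 27
    neighbouring cells have to be compared for each atom.
    """
    cells = defaultdict(list)
    for index, atom in enumerate(atoms):
        key = tuple(math.floor(c / cutoff) for c in atom["xyz"])
        cells[key].append(index)
    pairs = []
    for key, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):
            neighbour = tuple(k + o for k, o in zip(key, offset))
            for i in members:
                for j in cells.get(neighbour, ()):
                    if i < j:  # report each pair once
                        d = math.dist(atoms[i]["xyz"], atoms[j]["xyz"])
                        if d < cutoff:
                            pairs.append((d, i, j))
    return sorted(pairs)


def pdb_line(serial, name, resname, resid, x, y, z):
    """Hypothetical helper: format one ATOM record with correct columns."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:<4}A{resid:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


# Made-up coordinates, only to exercise the functions.
sample_pdb = "\n".join([
    pdb_line(1, "ZN", "ZN2", 1, 0.0, 0.0, 0.0),
    pdb_line(2, "SG", "CYS", 2, 0.4, 0.3, 0.0),
    pdb_line(3, "OH2", "TIP3", 3, 5.0, 5.0, 5.0),
    pdb_line(4, "N", "HIS", 4, -0.6, 0.0, 0.0),
])
sample_atoms = read_pdb_atoms(sample_pdb)
contacts = close_contacts(sample_atoms, cutoff=1.0)
```

Run against the starting PDB, the shortest pairs in the returned list would be the first candidates to inspect for the large VDW term; pairs that are actually bonded have to be discarded by hand, since the sketch does not read the PSF.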

Thanks for advice

francesco

On Fri, Apr 6, 2012 at 5:37 AM, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>
> This is the real error:
>
> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 101.085352
> s on step 1778
>
> What it means is that NAMD has been waiting 101s for the CUDA event
> indicating that the kernel has completed and NAMD is exiting rather than
> likely hanging indefinitely.  I've noticed that these errors were more
> likely with energy evaluation (hence the connection to minimization),
> certain compiler settings (-ftz), and particular devices on the Forge
> cluster at NCSA that later crashed, suggesting that this is some kind of
> hardware issue (GPU or PCIe bus) or driver/runtime/compiler fault.  The
> alternative is that I've missed a race condition that leads to an infinite
> loop in the kernel.
>
> I'm really hoping someone will find a way to trigger this consistently since
> in my experience it has been too rare to identify a cause.
>
> -Jim
>
>
> On Thu, 5 Apr 2012, Norman Geist wrote:
>
>> I guess the developers will fix this soon; as 2.9b2 is a beta, bugs are
>> expected, and reports are welcome.
>>
>>
>>
>> Norman Geist.
>>
>>
>>
>> From: Albert [mailto:mailmd2011_at_gmail.com]
>> Sent: Thursday, 5 April 2012 08:16
>> To: Norman Geist; namd-l_at_ks.uiuc.edu
>> Subject: Re: AW: namd-l: CUDA problem?
>>
>>
>>
>> Hello:
>>  thank you very much for the kind messages.
>> Is there a solution for this problem?
>>
>> best
>> A
>>
>> On 04/05/2012 08:12 AM, Norman Geist wrote:
>>
>> Hi,
>>
>>
>>
>> there seems to be something wrong within the new gpu accelerated
>> minimization, as Francesco posted the same issue and I answered him a few
>> seconds ago. I first thought this could also be a hardware issue of a single
>> gpu, but two people with a broken gpu is really unlikely. So it’s the
>> developers' turn.
>>
>>
>>
>> Best wishes
>>
>>
>>
>> Norman Geist.
>>
>>
>>
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
>> Of Albert
>> Sent: Wednesday, 4 April 2012 21:03
>> To: namd-l_at_ks.uiuc.edu
>> Subject: namd-l: CUDA problem?
>>
>>
>>
>> Dear:
>>  I've built a membrane system from CHARMM GUI and use the equilibration
>> protocol to relax my system. Everything goes well if I use the default
>> settings, and it finished under CUDA mode. However, there is a ligand in
>> my system and I would like to restrain it during step 6.1 (see the file
>> below). Here is what I did to add the restraint for my ligand:
>>
>> set sel [atomselect top all]
>> $sel set beta 0
>> set fix [atomselect top "protein and backbone or (resname LIG and not hydrogen)"]
>> $fix set beta 1
>> $sel writepdb bb_rmsd.ref
>>
>>
>>
>> After that I tried to run this step 6.1 input with the command:
>>
>> charmrun ++local +p4 namd2 +idlepoll step6.1_equilibration.inp > log
>>
>>
>>
>>
>> A few minutes later, it stopped with the following log:
>>
>>
>> ---------log----------------
>> LINE MINIMIZER BRACKET: DX 7.96611e-05 0.000159322 DU -84.715 50.7203 DUDX
>> -1.52698e+06 -592619 1.21989e+06
>> ENERGY: 1776 5819.8403 10258.1721 9471.5998 94.8591 -182114.5405
>> 16169.6595 0.0000 3.2133 0.0000 -140297.1965 0.0000 -140297.1965
>> -140297.1965 0.0000 3492.2283 3770.7578 593110.5555 3492.2283 3770.7578
>>
>> LINE MINIMIZER BRACKET: DX 5.18225e-05 0.0001075 DU -15.3777 66.098 DUDX
>> -592619 3098.88 1.21989e+06
>> ENERGY: 1777 5817.4042 10259.2760 9467.1949 94.8526 -182109.7937
>> 16170.7783 0.0000 3.2124 0.0000 -140297.0753 0.0000 -140297.0753
>> -140297.0753 0.0000 3495.3068 3772.9724 593110.5555 3495.3068 3772.9724
>>
>> LINE MINIMIZER BRACKET: DX 5.18225e-06 0.0001075 DU -0.121147 66.098 DUDX
>> -56148.6 3098.88 1.21989e+06
>> ------------- Processor 2 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>> 101.085352 s on step 1778
>>
>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>> 101.085352 s on step 1778
>> Charm++ fatal error:
>> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
>> 101.085352 s on step 1778
>>
>>
>> However, if I don't use CUDA mode, everything goes well and the
>> simulation finishes without any error. Would you please give me
>> some advice on this?
>>
>>
>> ----------step 6.1.inp-------------
>> structure ../step5_assembly.xplor_ext.psf
>> coordinates ../step5_assembly.pdb
>>
>> set temp 310;
>> set outputname step6.1_equilibration;
>>
>> # read system values written by CHARMM (need to convert uppercases to
>> lowercases)
>> exec tr "\[:upper:\]" "\[:lower:\]" < ../step5_assembly.str | sed -e "s/ =
>> //g" > step5_assembly.namd.str
>> source step5_assembly.namd.str
>>
>> temperature $temp;
>>
>> outputName step6.1_equilibration_a; # base name for output from this run
>> # NAMD writes two files at the end, final coord and vel
>> # in the format of first-dyn.coor and first-dyn.vel
>> firsttimestep 0; # last step of previous run
>> restartfreq 500; # 500 steps = every 0.5ps (timestep is 1.0 fs)
>> dcdfreq 1000;
>> dcdUnitCell yes; # the file will contain unit cell info in the style of
>> # charmm dcd files. if yes, the dcd files will contain
>> # unit cell information in the style of charmm DCD files.
>> xstFreq 1000; # XSTFreq: control how often the extended system configuration
>> # will be appended to the XST file
>> outputEnergies 125; # 125 steps = every 0.125ps (timestep is 1.0 fs)
>> # The number of timesteps between each energy output of NAMD
>> outputTiming 1000; # The number of timesteps between each timing output; shows
>> # time per step and time to completion
>>
>> # Force-Field Parameters
>> paraTypeCharmm on; # We're using charmm type parameter file(s)
>> # multiple definitions may be used but only one file per definition
>>
>> exec mkdir -p toppar
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all22_prot.prm \
>> > toppar/par_all22_prot.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ../toppar/par_all27_na.prm \
>> > toppar/par_all27_na.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_carb.prm \
>> > toppar/par_all36_carb.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_lipid.prm \
>> > toppar/par_all36_lipid.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" ./toppar/par_all36_cgenff.prm \
>> > toppar/par_all36_cgenff.prm
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
>> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g" \
>> ./toppar/toppar_water_ions.str > toppar/toppar_water_ions.str
>> exec sed -e "s/^ATOM/!&/g" -e "s/^MASS/!&/g" -e "1,/read para/d" \
>> -e "278,296d" -e "s/^BOM/!&/g" -e "s/^WRN/!&/g" \
>> ./toppar/toppar_all36_lipid_cholesterol.str > \
>> toppar/toppar_all36_lipid_cholesterol.str
>>
>> parameters toppar/par_all27_prot_na.prm;
>> parameters toppar/par_all36_lipid.prm;
>> parameters toppar/par_all22_prot.prm;
>> parameters toppar/par_all27_na.prm;
>> parameters toppar/par_all36_carb.prm;
>> parameters toppar/par_all36_cgenff.prm;
>> parameters toppar/par_all35_ethers.prm;
>> parameters toppar/lig.prm;
>>
>>
>> parameters toppar/toppar_water_ions.str;
>> parameters toppar/toppar_all36_lipid_cholesterol.str;
>>
>> # These are specified by CHARMM
>> exclude scaled1-4 # non-bonded exclusion policy to use:
>> # "none,1-2,1-3,1-4,or scaled1-4"
>> # 1-2: all atoms pairs that are bonded are going to be ignored
>> # 1-3: 3 consecutively bonded are excluded
>> # scaled1-4: include all the 1-3, and modified 1-4 interactions
>> # electrostatic scaled by 1-4scaling factor 1.0
>> # vdW special 1-4 parameters in charmm parameter file.
>> 1-4scaling 1.0
>> switching on
>> vdwForceSwitching yes; # New option for force-based switching of vdW
>> # if both switching and vdwForceSwitching are on CHARMM force
>> # switching is used for vdW forces.
>> seed 1333525265 # Specifies a specific seed
>>
>> # You have some freedom choosing the cutoff
>> cutoff 12.0; # may use smaller, maybe 10., with PME
>> switchdist 10.0; # cutoff - 2.
>> # switchdist - where you start to switch
>> # cutoff - where you stop accounting for nonbond interactions.
>> # correspondence in charmm:
>> # (cutnb,ctofnb,ctonnb = pairlistdist,cutoff,switchdist)
>> pairlistdist 16.0; # stores all the pairs within this distance; it should be larger
>> # than cutoff( + 2.)
>> stepspercycle 20; # 20 redo pairlists every ten steps
>> pairlistsPerCycle 2; # 2 is the default
>> # cycle represents the number of steps between atom reassignments
>> # this means every 20/2=10 steps the pairlist will be updated
>>
>> # Integrator Parameters
>> timestep 1.0; # fs/step
>> rigidBonds all; # Bond constraint: all bonds involving H are fixed in length
>> nonbondedFreq 1; # nonbonded forces every step
>> fullElectFrequency 1; # PME every step
>>
>>
>> # Constant Temperature Control ONLY DURING EQUILB
>> reassignFreq 500; # reassignFreq: use this to reassign velocity every 500 steps
>> reassignTemp $temp;
>>
>> # Periodic Boundary conditions. Need this since for a start...
>> cellBasisVector1 $a 0.0 0.0; # vector to the next image
>> cellBasisVector2 0.0 $b 0.0;
>> cellBasisVector3 0.0 0.0 $c;
>> cellOrigin 0.0 0.0 $zcen; # the *center* of the cell
>>
>> wrapWater on; # wrap water to central cell
>> wrapAll on; # wrap other molecules too
>> wrapNearest off; # use for non-rectangular cells (wrap to the nearest image)
>>
>> # PME (for full-system periodic electrostatics)
>> exec python ../checkfft.py $a $b $c > checkfft.str
>> source checkfft.str
>>
>> PME yes;
>> PMEInterpOrder 6; # interpolation order (spline order 6 in charmm)
>> PMEGridSizeX $fftx; # should be close to the cell size
>> PMEGridSizeY $ffty; # corresponds to the charmm input fftx/y/z
>> PMEGridSizeZ $fftz;
>>
>> # Pressure and volume control
>> useGroupPressure yes; # use a hydrogen-group based pseudo-molecular virial
>> # to calculate pressure; it has less fluctuation and is needed for
>> # rigid bonds (rigidBonds/SHAKE)
>> useFlexibleCell yes; # yes for anisotropic system like membrane
>> useConstantRatio yes; # keeps the ratio of the unit cell in the x-y plane
>> # constant A=B
>>
>> langevin on
>> langevinDamping 10
>> langevinTemp $temp
>> langevinHydrogen no
>>
>> # planar restraint
>> colvars on
>> exec sed -e "s/Constant \$fc/Constant 5/g" -e "s/\$bb/10.0/g" \
>> -e "s/\$sc/5.0/g" membrane_lipid_restraint.namd.col > \
>> restraints/$outputname.col
>> colvarsConfig restraints/$outputname.col
>>
>> # dihedral restraint
>> extraBonds yes
>> exec sed -e "s/\$FC/500/g" restraints/dihe.txt > \
>> restraints/$outputname.dihe
>> extraBondsFile restraints/$outputname.dihe
>>
>> minimize 10000
>>
>> numsteps 90000000
>> run 3000000 ; # 3ns
>>
>>
>>
>
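
The step counts in the quoted configuration translate into simulated time through the `timestep` value, and the pairlist update interval follows from `stepspercycle` and `pairlistsPerCycle`. A small Python sketch of that arithmetic, using only the numbers that appear in the quoted file; the function and variable names are illustrative:

```python
# Values copied from the quoted step 6.1 configuration.
TIMESTEP_FS = 1.0          # "timestep 1.0" (fs/step)
STEPS_PER_CYCLE = 20       # "stepspercycle 20"
PAIRLISTS_PER_CYCLE = 2    # "pairlistsPerCycle 2"

STEP_SETTINGS = {
    "restartfreq": 500,
    "outputEnergies": 125,
    "dcdfreq": 1000,
    "run": 3000000,
}


def steps_to_ps(steps, timestep_fs=TIMESTEP_FS):
    """Convert a number of MD steps into picoseconds of simulated time."""
    return steps * timestep_fs / 1000.0


# Interval (or total length, for "run") in picoseconds for each setting.
intervals_ps = {name: steps_to_ps(n) for name, n in STEP_SETTINGS.items()}

# Number of steps between pairlist updates.
pairlist_interval_steps = STEPS_PER_CYCLE // PAIRLISTS_PER_CYCLE
```

With a 1.0 fs timestep this gives 0.5 ps between restart writes, 0.125 ps between energy outputs, and 3000 ps (3 ns) for the run, with the pairlist rebuilt every 10 steps.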

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:21:51 CST