CUDA error (as a misleading error message)

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Wed Nov 07 2012 - 05:33:41 CST

Hello:
System: protein in a water box.

Minimization (namd 2.9 cuda linux) went on regularly up to very low gradient.

Gentle heating crashed with atoms moving too fast. I was unable to
detect clashes; various methods revealed none. Reducing the timestep,
increasing outputEnergies, increasing the margin, and reducing the
heating per step did not help.
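For reference, settings of the kind tried above would look roughly like this in a NAMD configuration file. This is only a sketch with illustrative values, not my exact input (the reassign* values below do match my run; timestep, outputEnergies, and margin are examples of the direction I pushed them):

```tcl
# Gentle-heating sketch (illustrative values, not my actual config)
timestep        0.5      ;# fs; reduced from the usual 1-2 fs
outputEnergies  1        ;# print every step to catch the blow-up early
margin          5        ;# extra patch margin for fast-moving atoms

# Heat in small temperature increments via velocity reassignment
reassignFreq    50       ;# steps between reassignments
reassignTemp    1        ;# starting temperature (K)
reassignIncr    1        ;# K added per reassignment
reassignHold    311      ;# target temperature (K)
```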

I then ran further minimization. Now I get a CUDA error, which is
reproducible; as a check, another system runs fine on CUDA.

francesco_at_gig64:~/MD$ charmrun namd2 heat-01.conf +p6 +idlepoll 2>&1 | tee heat-01.log
Running command: namd2 heat-01.conf +p6 +idlepoll

Charm++: standalone mode (not using charmrun)
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (12-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Info: NAMD CVS-2012-09-26 for Linux-x86_64-multicore-CUDA
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
Info: Built Wed Sep 26 02:25:08 CDT 2012 by jim on lisboa.ks.uiuc.edu
Info: 1 NAMD CVS-2012-09-26 Linux-x86_64-multicore-CUDA 6 gig64 francesco
Info: Running on 6 processors, 1 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.0102639 s
Pe 3 physical rank 3 will use CUDA device of pe 4
Pe 5 physical rank 5 will use CUDA device of pe 4
Pe 1 physical rank 1 will use CUDA device of pe 2
Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX 680' Mem: 2047MB Rev: 3.0
Pe 4 physical rank 4 binding to CUDA device 1 on gig64: 'GeForce GTX 680' Mem: 2047MB Rev: 3.0
Did not find +devices i,j,k,... argument, using all
Pe 0 physical rank 0 will use CUDA device of pe 2
Info: 8.22656 MB of memory in use based on /proc/self/stat
Info: Configuration file is heat-01.conf
Info: Working in the current directory /home/francesco/work_heme-oxygenase/MD
TCL: Suspending until startup complete.
Info: EXTENDED SYSTEM FILE ./min-02.restart.xsc
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 0.01
Info: NUMBER OF STEPS 0
Info: STEPS PER CYCLE 10
Info: PERIODIC CELL BASIS 1 81.8 0 0
Info: PERIODIC CELL BASIS 2 0 78.66 0
Info: PERIODIC CELL BASIS 3 0 0 82.31
Info: PERIODIC CELL CENTER -100.627 -13.9064 -83.667
Info: WRAPPING ALL CLUSTERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
Info: WRAPPING TO IMAGE NEAREST TO PERIODIC CELL CENTER.
Info: LOAD BALANCER Centralized
Info: LOAD BALANCING STRATEGY New Load Balancers -- DEFAULT
Info: LDB PERIOD 2000 steps
Info: FIRST LDB TIMESTEP 50
Info: LAST LDB TIMESTEP -1
Info: LDB BACKGROUND SCALING 1
Info: HOM BACKGROUND SCALING 1
Info: PME BACKGROUND SCALING 1
Info: MIN ATOMS PER PATCH 40
Info: INITIAL TEMPERATURE 0.5
Info: CENTER OF MASS MOVING INITIALLY? NO
Info: DIELECTRIC 1
Info: EXCLUDE SCALED ONE-FOUR
Info: 1-4 ELECTROSTATICS SCALED BY 1
Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
Info: NO DCD TRAJECTORY OUTPUT
Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT
Info: NO VELOCITY DCD OUTPUT
Info: NO FORCE DCD OUTPUT
Info: OUTPUT FILENAME ./heat-01
Info: RESTART FILENAME ./heat-01.restart
Info: RESTART FREQUENCY 100
Info: BINARY RESTART FILES WILL BE USED
Info: SWITCHING ACTIVE
Info: SWITCHING ON 10
Info: SWITCHING OFF 12
Info: PAIRLIST DISTANCE 13.5
Info: PAIRLIST SHRINK RATE 0.01
Info: PAIRLIST GROW RATE 0.01
Info: PAIRLIST TRIGGER 0.3
Info: PAIRLISTS PER CYCLE 2
Info: PAIRLISTS ENABLED
Info: MARGIN 100
Info: HYDROGEN GROUP CUTOFF 2.5
Info: PATCH DIMENSION 116
Info: ENERGY OUTPUT STEPS 1000
Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
Info: TIMING OUTPUT STEPS 10000
Info: LANGEVIN DYNAMICS ACTIVE
Info: LANGEVIN TEMPERATURE 0
Info: LANGEVIN USING BBK INTEGRATOR
Info: LANGEVIN DAMPING COEFFICIENT IS 1 INVERSE PS
Info: LANGEVIN DYNAMICS NOT APPLIED TO HYDROGENS
Info: VELOCITY REASSIGNMENT FREQ 50
Info: VELOCITY REASSIGNMENT TEMP 1
Info: VELOCITY REASSIGNMENT INCR 1
Info: VELOCITY REASSIGNMENT HOLD 311
Info: PARTICLE MESH EWALD (PME) ACTIVE
Info: PME TOLERANCE 1e-06
Info: PME EWALD COEFFICIENT 0.257952
Info: PME INTERPOLATION ORDER 4
Info: PME GRID DIMENSIONS 90 81 90
Info: PME MAXIMUM GRID SPACING 1
Info: Attempting to read FFTW data from FFTW_NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA.txt
Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
Info: Writing FFTW data to FFTW_NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA.txt
Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 5
Info: USING VERLET I (r-RESPA) MTS SCHEME.
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
Info: RIGID BONDS TO HYDROGEN : WATER
Info: ERROR TOLERANCE : 1e-06
Info: MAX ITERATIONS : 100
Info: RIGID WATER USING SETTLE ALGORITHM
Info: RANDOM NUMBER SEED 12347
Info: USE HYDROGEN BONDS? NO
Info: COORDINATE PDB ./WTS_WTBOX_ION.pdb
Info: STRUCTURE FILE ./WTS_WTBOX_ION.psf
Info: PARAMETER file: CHARMM format!
Info: PARAMETERS ./par_all27_prot_lipid.prm
Info: PARAMETERS ./toppar_all22_prot_heme.str
Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
Info: BINARY COORDINATES ./min-02.restart.coor
Info: SKIPPING rtf SECTION IN STREAM FILE
Info: SUMMARY OF PARAMETERS:
Info: 185 BONDS
Info: 467 ANGLES
Info: 601 DIHEDRAL
Info: 47 IMPROPER
Info: 6 CROSSTERM
Info: 121 VDW
Info: 0 VDW_PAIRS
Info: 0 NBTHOLE_PAIRS
Info: TIME FOR READING PSF FILE: 0.239432
Info: TIME FOR READING PDB FILE: 0.0656481
Info:
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 49686 ATOMS
Info: 34287 BONDS
Info: 21815 ANGLES
Info: 9486 DIHEDRALS
Info: 641 IMPROPERS
Info: 211 CROSSTERMS
Info: 0 EXCLUSIONS
Info: 46071 RIGID BONDS
Info: 102987 DEGREES OF FREEDOM
Info: 17232 HYDROGEN GROUPS
Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
Info: 17232 MIGRATION GROUPS
Info: 4 ATOMS IN LARGEST MIGRATION GROUP
Info: TOTAL MASS = 304563 amu
Info: TOTAL CHARGE = 2.58163e-06 e
Info: MASS DENSITY = 0.954941 g/cm^3
Info: ATOM DENSITY = 0.0938154 atoms/A^3
Info: *****************************
Info: Reading from binary file ./min-02.restart.coor
Info:
Info: Entering startup at 0.392675 s, 28.2539 MB of memory in use
Info: Startup phase 0 took 9.60827e-05 s, 28.2656 MB of memory in use
Info: ADDED 65398 IMPLICIT EXCLUSIONS
Info: Startup phase 1 took 0.02037 s, 36.7812 MB of memory in use
Info: Startup phase 2 took 8.70228e-05 s, 36.8711 MB of memory in use
Info: Startup phase 3 took 3.98159e-05 s, 36.8711 MB of memory in use
Info: Startup phase 4 took 0.000478029 s, 39.2539 MB of memory in use
Info: Startup phase 5 took 4.69685e-05 s, 39.4023 MB of memory in use
Info: PATCH GRID IS 2 (PERIODIC) BY 2 (PERIODIC) BY 2 (PERIODIC)
Info: PATCH GRID IS 2-AWAY BY 2-AWAY BY 2-AWAY
Info: REMOVING COM VELOCITY 0.000933667 0.00123997 -0.000505543
Info: LARGEST PATCH (4) HAS 6331 ATOMS
Info: Startup phase 6 took 0.011183 s, 48.918 MB of memory in use
Info: PME using 6 and 6 processors for FFT and reciprocal sum.
Info: PME USING 1 GRID NODES AND 1 TRANS NODES
Info: PME GRID LOCATIONS: 0 1 2 3 4 5
Info: PME TRANS LOCATIONS: 0 1 2 3 4 5
Info: Optimizing 4 FFT steps. 1... 2... 3... 4... Done.
Info: Startup phase 7 took 0.000873089 s, 51.1914 MB of memory in use
Info: Startup phase 8 took 0.000119925 s, 51.1914 MB of memory in use
LDB: Central LB being created...
Info: Startup phase 9 took 0.000138044 s, 51.4766 MB of memory in use
Info: CREATING 602 COMPUTE OBJECTS
Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
Info: NONBONDED TABLE SIZE: 769 POINTS
Info: INCONSISTENCY IN FAST TABLE ENERGY VS FORCE: 0.000325096 AT 11.9556
Info: INCONSISTENCY IN SCOR TABLE ENERGY VS FORCE: 0.000324844 AT 11.9556
Info: INCONSISTENCY IN VDWA TABLE ENERGY VS FORCE: 0.0040507 AT 0.251946
Info: INCONSISTENCY IN VDWB TABLE ENERGY VS FORCE: 0.00150189 AT 0.251946
Pe 4 hosts 0 local and 1 remote patches for pe 4
Pe 5 hosts 1 local and 0 remote patches for pe 4
Pe 0 hosts 0 local and 1 remote patches for pe 4
Pe 3 hosts 1 local and 1 remote patches for pe 4
Pe 1 hosts 1 local and 0 remote patches for pe 4
Pe 0 hosts 0 local and 1 remote patches for pe 2
Pe 3 hosts 1 local and 1 remote patches for pe 2
Pe 4 hosts 0 local and 1 remote patches for pe 2
Pe 1 hosts 1 local and 0 remote patches for pe 2
Pe 2 hosts 1 local and 1 remote patches for pe 2
Pe 5 hosts 1 local and 0 remote patches for pe 2
Pe 2 hosts 1 local and 1 remote patches for pe 4
Info: useSync: 1 useProxySync: 0
Info: Startup phase 10 took 0.107934 s, 93.6289 MB of memory in use
Info: Startup phase 11 took 7.60555e-05 s, 93.6797 MB of memory in use
Info: Startup phase 12 took 5.88894e-05 s, 93.6797 MB of memory in use
Info: Finished startup at 0.534176 s, 93.6914 MB of memory in use

TCL: Running for 20000 steps
REASSIGNING VELOCITIES AT STEP 0 TO 1 KELVIN.
Pe 2 has 4 local and 4 remote patches and 250 local and 250 remote computes.
Pe 4 has 4 local and 4 remote patches and 250 local and 250 remote computes.
ETITLE:    TS  BOND  ANGLE  DIHED  IMPRP  ELECT  VDW  BOUNDARY  MISC  KINETIC  TOTAL  TEMP  POTENTIAL  TOTAL3  TEMPAVG  PRESSURE  GPRESSURE  VOLUME  PRESSAVG  GPRESSAVG

ENERGY:    0  219.4599  632.6613  597.7338  27.3461  -231987.2055  28848.8514  0.0000  0.0000  102.3071  -201558.8457  0.9998  -201661.1529  -201558.8439  0.9998  104.1807  297.1944  529614.4763  104.1807  297.1944

LDB: ============= START OF LOAD BALANCING ============== 2.65752
LDB: ============== END OF LOAD BALANCING =============== 2.65758
Info: useSync: 1 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 2.65797
FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2 (gig64 device 0): unspecified launch failure
------------- Processor 2 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2 (gig64 device 0): unspecified launch failure

Charm++ fatal error:
FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2 (gig64 device 0): unspecified launch failure.
***************

Being short of ideas at this point, I hope someone can think of something better.
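One further check that might separate a GPU problem from a bad system: rerun the identical restart with a non-CUDA NAMD build. This is only a sketch; the binary name `namd2-multicore` is illustrative and stands for whatever plain multicore build is installed:

```shell
# Isolation test (illustrative): same config, non-CUDA binary.
# If the CPU run also blows up, the instability is in the system/restart,
# not in the CUDA code; if it runs cleanly, the GPU path is suspect.
./namd2-multicore heat-01.conf +p6 +idlepoll 2>&1 | tee heat-01-cpu.log
grep -E "ERROR|atoms moving too fast" heat-01-cpu.log
```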

Thanks

francesco pietra

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:43 CST