Re: CUDA error (as a misleading error message)

From: Aron Broom (broomsday_at_gmail.com)
Date: Wed Nov 07 2012 - 10:01:51 CST

It almost seems as though you don't have proper access to that
system. I see you are using charmrun; is it possible that your
permissions are somehow getting lost? I'm really just grasping at straws;
I've never seen that error myself.
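
One quick way to rule out the permissions idea is to check whether the user running namd2 can read and write the GPU device nodes. The sketch below assumes the usual Linux layout, where the NVIDIA driver exposes nodes such as /dev/nvidia0 and /dev/nvidiactl; the function name check_device_access is made up for illustration and is not part of NAMD or CUDA.

```python
import glob
import os


def check_device_access(pattern="/dev/nvidia*"):
    """Map each path matching `pattern` to True if the current user
    has both read and write access to it, else False."""
    results = {}
    for path in sorted(glob.glob(pattern)):
        results[path] = os.access(path, os.R_OK | os.W_OK)
    return results


if __name__ == "__main__":
    access = check_device_access()
    if not access:
        # Assumption: no matching nodes usually means the driver is not loaded.
        print("no device nodes matched; is the NVIDIA driver loaded?")
    for path, ok in access.items():
        print(f"{path}: {'ok' if ok else 'NO read/write access'}")
```

If every node reports ok for the same user and shell that launches namd2, lost permissions are probably not the explanation.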

~Aron

On Wed, Nov 7, 2012 at 6:33 AM, Francesco Pietra <chiendarret_at_gmail.com> wrote:

> Hello:
> System: protein in a water box.
>
> Minimization (namd 2.9 cuda linux) went on regularly up to very low
> gradient.
>
> Gentle heating crashed because of atoms moving too fast. I was unable
> to detect clashes; various methods revealed none. Reducing the timestep,
> increasing outputenergy, increasing margin, and diminishing the heating
> per step did not help.
>
> After further minimization, I now get a CUDA error (reproducible; a
> check with another system showed CUDA working OK):
>
> francesco_at_gig64:~/MD$ charmrun namd2 heat-01.conf +p6 +idlepoll 2>&1 |
> tee heat-01.log
> Running command: namd2 heat-01.conf +p6 +idlepoll
>
> Charm++: standalone mode (not using charmrun)
> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
> CharmLB> Load balancer assumes all CPUs are same.
> Charm++> Running on 1 unique compute nodes (12-way SMP).
> Charm++> cpu topology info is gathered in 0.001 seconds.
> Info: NAMD CVS-2012-09-26 for Linux-x86_64-multicore-CUDA
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
> Info: Built Wed Sep 26 02:25:08 CDT 2012 by jim on lisboa.ks.uiuc.edu
> Info: 1 NAMD CVS-2012-09-26 Linux-x86_64-multicore-CUDA 6 gig64
> francesco
> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.0102639 s
> Pe 3 physical rank 3 will use CUDA device of pe 4
> Pe 5 physical rank 5 will use CUDA device of pe 4
> Pe 1 physical rank 1 will use CUDA device of pe 2
> Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX
> 680' Mem: 2047MB Rev: 3.0
> Pe 4 physical rank 4 binding to CUDA device 1 on gig64: 'GeForce GTX
> 680' Mem: 2047MB Rev: 3.0
> Did not find +devices i,j,k,... argument, using all
> Pe 0 physical rank 0 will use CUDA device of pe 2
> Info: 8.22656 MB of memory in use based on /proc/self/stat
> Info: Configuration file is heat-01.conf
> Info: Working in the current directory
> /home/francesco/work_heme-oxygenase/MD
> TCL: Suspending until startup complete.
> Info: EXTENDED SYSTEM FILE ./min-02.restart.xsc
> Info: SIMULATION PARAMETERS:
> Info: TIMESTEP 0.01
> Info: NUMBER OF STEPS 0
> Info: STEPS PER CYCLE 10
> Info: PERIODIC CELL BASIS 1 81.8 0 0
> Info: PERIODIC CELL BASIS 2 0 78.66 0
> Info: PERIODIC CELL BASIS 3 0 0 82.31
> Info: PERIODIC CELL CENTER -100.627 -13.9064 -83.667
> Info: WRAPPING ALL CLUSTERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
> Info: WRAPPING TO IMAGE NEAREST TO PERIODIC CELL CENTER.
> Info: LOAD BALANCER Centralized
> Info: LOAD BALANCING STRATEGY New Load Balancers -- DEFAULT
> Info: LDB PERIOD 2000 steps
> Info: FIRST LDB TIMESTEP 50
> Info: LAST LDB TIMESTEP -1
> Info: LDB BACKGROUND SCALING 1
> Info: HOM BACKGROUND SCALING 1
> Info: PME BACKGROUND SCALING 1
> Info: MIN ATOMS PER PATCH 40
> Info: INITIAL TEMPERATURE 0.5
> Info: CENTER OF MASS MOVING INITIALLY? NO
> Info: DIELECTRIC 1
> Info: EXCLUDE SCALED ONE-FOUR
> Info: 1-4 ELECTROSTATICS SCALED BY 1
> Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
> Info: NO DCD TRAJECTORY OUTPUT
> Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT
> Info: NO VELOCITY DCD OUTPUT
> Info: NO FORCE DCD OUTPUT
> Info: OUTPUT FILENAME ./heat-01
> Info: RESTART FILENAME ./heat-01.restart
> Info: RESTART FREQUENCY 100
> Info: BINARY RESTART FILES WILL BE USED
> Info: SWITCHING ACTIVE
> Info: SWITCHING ON 10
> Info: SWITCHING OFF 12
> Info: PAIRLIST DISTANCE 13.5
> Info: PAIRLIST SHRINK RATE 0.01
> Info: PAIRLIST GROW RATE 0.01
> Info: PAIRLIST TRIGGER 0.3
> Info: PAIRLISTS PER CYCLE 2
> Info: PAIRLISTS ENABLED
> Info: MARGIN 100
> Info: HYDROGEN GROUP CUTOFF 2.5
> Info: PATCH DIMENSION 116
> Info: ENERGY OUTPUT STEPS 1000
> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
> Info: TIMING OUTPUT STEPS 10000
> Info: LANGEVIN DYNAMICS ACTIVE
> Info: LANGEVIN TEMPERATURE 0
> Info: LANGEVIN USING BBK INTEGRATOR
> Info: LANGEVIN DAMPING COEFFICIENT IS 1 INVERSE PS
> Info: LANGEVIN DYNAMICS NOT APPLIED TO HYDROGENS
> Info: VELOCITY REASSIGNMENT FREQ 50
> Info: VELOCITY REASSIGNMENT TEMP 1
> Info: VELOCITY REASSIGNMENT INCR 1
> Info: VELOCITY REASSIGNMENT HOLD 311
> Info: PARTICLE MESH EWALD (PME) ACTIVE
> Info: PME TOLERANCE 1e-06
> Info: PME EWALD COEFFICIENT 0.257952
> Info: PME INTERPOLATION ORDER 4
> Info: PME GRID DIMENSIONS 90 81 90
> Info: PME MAXIMUM GRID SPACING 1
> Info: Attempting to read FFTW data from
> FFTW_NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA.txt
> Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
> Info: Writing FFTW data to
> FFTW_NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA.txt
> Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 5
> Info: USING VERLET I (r-RESPA) MTS SCHEME.
> Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
> Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
> Info: RIGID BONDS TO HYDROGEN : WATER
> Info: ERROR TOLERANCE : 1e-06
> Info: MAX ITERATIONS : 100
> Info: RIGID WATER USING SETTLE ALGORITHM
> Info: RANDOM NUMBER SEED 12347
> Info: USE HYDROGEN BONDS? NO
> Info: COORDINATE PDB ./WTS_WTBOX_ION.pdb
> Info: STRUCTURE FILE ./WTS_WTBOX_ION.psf
> Info: PARAMETER file: CHARMM format!
> Info: PARAMETERS ./par_all27_prot_lipid.prm
> Info: PARAMETERS ./toppar_all22_prot_heme.str
> Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
> Info: BINARY COORDINATES ./min-02.restart.coor
> Info: SKIPPING rtf SECTION IN STREAM FILE
> Info: SUMMARY OF PARAMETERS:
> Info: 185 BONDS
> Info: 467 ANGLES
> Info: 601 DIHEDRAL
> Info: 47 IMPROPER
> Info: 6 CROSSTERM
> Info: 121 VDW
> Info: 0 VDW_PAIRS
> Info: 0 NBTHOLE_PAIRS
> Info: TIME FOR READING PSF FILE: 0.239432
> Info: TIME FOR READING PDB FILE: 0.0656481
> Info:
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 49686 ATOMS
> Info: 34287 BONDS
> Info: 21815 ANGLES
> Info: 9486 DIHEDRALS
> Info: 641 IMPROPERS
> Info: 211 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 46071 RIGID BONDS
> Info: 102987 DEGREES OF FREEDOM
> Info: 17232 HYDROGEN GROUPS
> Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
> Info: 17232 MIGRATION GROUPS
> Info: 4 ATOMS IN LARGEST MIGRATION GROUP
> Info: TOTAL MASS = 304563 amu
> Info: TOTAL CHARGE = 2.58163e-06 e
> Info: MASS DENSITY = 0.954941 g/cm^3
> Info: ATOM DENSITY = 0.0938154 atoms/A^3
> Info: *****************************
> Info: Reading from binary file ./min-02.restart.coor
> Info:
> Info: Entering startup at 0.392675 s, 28.2539 MB of memory in use
> Info: Startup phase 0 took 9.60827e-05 s, 28.2656 MB of memory in use
> Info: ADDED 65398 IMPLICIT EXCLUSIONS
> Info: Startup phase 1 took 0.02037 s, 36.7812 MB of memory in use
> Info: Startup phase 2 took 8.70228e-05 s, 36.8711 MB of memory in use
> Info: Startup phase 3 took 3.98159e-05 s, 36.8711 MB of memory in use
> Info: Startup phase 4 took 0.000478029 s, 39.2539 MB of memory in use
> Info: Startup phase 5 took 4.69685e-05 s, 39.4023 MB of memory in use
> Info: PATCH GRID IS 2 (PERIODIC) BY 2 (PERIODIC) BY 2 (PERIODIC)
> Info: PATCH GRID IS 2-AWAY BY 2-AWAY BY 2-AWAY
> Info: REMOVING COM VELOCITY 0.000933667 0.00123997 -0.000505543
> Info: LARGEST PATCH (4) HAS 6331 ATOMS
> Info: Startup phase 6 took 0.011183 s, 48.918 MB of memory in use
> Info: PME using 6 and 6 processors for FFT and reciprocal sum.
> Info: PME USING 1 GRID NODES AND 1 TRANS NODES
> Info: PME GRID LOCATIONS: 0 1 2 3 4 5
> Info: PME TRANS LOCATIONS: 0 1 2 3 4 5
> Info: Optimizing 4 FFT steps. 1... 2... 3... 4... Done.
> Info: Startup phase 7 took 0.000873089 s, 51.1914 MB of memory in use
> Info: Startup phase 8 took 0.000119925 s, 51.1914 MB of memory in use
> LDB: Central LB being created...
> Info: Startup phase 9 took 0.000138044 s, 51.4766 MB of memory in use
> Info: CREATING 602 COMPUTE OBJECTS
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 769 POINTS
> Info: INCONSISTENCY IN FAST TABLE ENERGY VS FORCE: 0.000325096 AT 11.9556
> Info: INCONSISTENCY IN SCOR TABLE ENERGY VS FORCE: 0.000324844 AT 11.9556
> Info: INCONSISTENCY IN VDWA TABLE ENERGY VS FORCE: 0.0040507 AT 0.251946
> Info: INCONSISTENCY IN VDWB TABLE ENERGY VS FORCE: 0.00150189 AT 0.251946
> Pe 4 hosts 0 local and 1 remote patches for pe 4
> Pe 5 hosts 1 local and 0 remote patches for pe 4
> Pe 0 hosts 0 local and 1 remote patches for pe 4
> Pe 3 hosts 1 local and 1 remote patches for pe 4
> Pe 1 hosts 1 local and 0 remote patches for pe 4
> Pe 0 hosts 0 local and 1 remote patches for pe 2
> Pe 3 hosts 1 local and 1 remote patches for pe 2
> Pe 4 hosts 0 local and 1 remote patches for pe 2
> Pe 1 hosts 1 local and 0 remote patches for pe 2
> Pe 2 hosts 1 local and 1 remote patches for pe 2
> Pe 5 hosts 1 local and 0 remote patches for pe 2
> Pe 2 hosts 1 local and 1 remote patches for pe 4
> Info: useSync: 1 useProxySync: 0
> Info: Startup phase 10 took 0.107934 s, 93.6289 MB of memory in use
> Info: Startup phase 11 took 7.60555e-05 s, 93.6797 MB of memory in use
> Info: Startup phase 12 took 5.88894e-05 s, 93.6797 MB of memory in use
> Info: Finished startup at 0.534176 s, 93.6914 MB of memory in use
>
> TCL: Running for 20000 steps
> REASSIGNING VELOCITIES AT STEP 0 TO 1 KELVIN.
> Pe 2 has 4 local and 4 remote patches and 250 local and 250 remote
> computes.
> Pe 4 has 4 local and 4 remote patches and 250 local and 250 remote
> computes.
> ETITLE: TS BOND ANGLE DIHED
> IMPRP ELECT VDW BOUNDARY MISC
> KINETIC TOTAL TEMP POTENTIAL
> TOTAL3 TEMPAVG PRESSURE GPRESSURE
> VOLUME PRESSAVG GPRESSAVG
>
> ENERGY: 0 219.4599 632.6613 597.7338
> 27.3461 -231987.2055 28848.8514 0.0000
> 0.0000 102.3071 -201558.8457 0.9998
> -201661.1529 -201558.8439 0.9998 104.1807
> 297.1944 529614.4763 104.1807 297.1944
>
> LDB: ============= START OF LOAD BALANCING ============== 2.65752
> LDB: ============== END OF LOAD BALANCING =============== 2.65758
> Info: useSync: 1 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 2.65797
> FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2 (gig64
> device 0): unspecified launch failure
> ------------- Processor 2 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2
> (gig64 device 0): unspecified launch failure
>
> Charm++ fatal error:
> FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2 (gig64
> device 0): unspecified launch failure.
> ***************
>
> Being short of ideas at this point, I hope someone can think of something better.
>
> Thanks
>
> francesco pietra
>
>
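
With a log this long it can help to pull out just the fatal error and a few of the run parameters echoed in the "Info:" lines. The following is only a rough sketch: it assumes the unwrapped log file written by tee (without the "> " quote prefix), the function name summarize_log is made up, and the choice of TIMESTEP, MARGIN, PATCH DIMENSION and PAIRLIST DISTANCE as the parameters to report is arbitrary.

```python
import re

# Matches e.g. "Info: MARGIN 100" but not "Info: FIRST LDB TIMESTEP 50",
# because the key must come directly after "Info:".
INFO_RE = re.compile(
    r"^Info:\s+(TIMESTEP|MARGIN|PATCH DIMENSION|PAIRLIST DISTANCE)\s+"
    r"(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def summarize_log(lines):
    """Return (params, fatal): selected numeric parameters and a list of
    lines that begin with 'FATAL ERROR:'."""
    params = {}
    fatal = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = INFO_RE.match(line)
        if match:
            params[match.group(1)] = float(match.group(2))
        elif line.startswith("FATAL ERROR:"):
            fatal.append(line)
    return params, fatal


if __name__ == "__main__":
    # A few lines copied from the log above, unwrapped.
    sample = [
        "Info: TIMESTEP 0.01",
        "Info: FIRST LDB TIMESTEP 50",
        "Info: MARGIN 100",
        "Info: PATCH DIMENSION 116",
        "FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2 "
        "(gig64 device 0): unspecified launch failure",
    ]
    params, fatal = summarize_log(sample)
    for key, value in params.items():
        print(f"{key}: {value}")
    for line in fatal:
        print(line)
```

To use it on a real run, pass the open log file (for example heat-01.log) as the lines argument, since a file object iterates line by line.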

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
