RE: NAMD 2.9b1 crashes during minimization

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Mar 23 2012 - 01:58:15 CDT

Hi Francesco,

I don't know the code as well as the NAMD people do, but I'll try to point out what may have happened.

NAMD uses ++idlepoll on the GPU, which means it keeps polling the GPU for results all the time. Here NAMD seems to have missed such an answer from the GPU and then timed out after waiting for the results forever. This can IMHO happen for a few reasons (a rough sketch of the polling pattern follows the list):

1. A bug in NAMD 2.9b1 (the new minimization on the GPU? Or the shared-memory CUDA build? <- also new, right?)
2. An error made by a GPU (memory error etc.)
3. A heavily (over)loaded machine
4. Happenstance
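
To make the idle-polling picture a bit more concrete, here is a minimal, hypothetical C++ sketch (not the actual NAMD source; the flag name, function name, and poll limit are just stand-ins) of a host loop that spins, counts its polls, and aborts once a limit is hit, which is essentially the shape of the "cuda_check_remote_progress polled 1000000 times" failure in your log:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>

// In real code this flag would be set once the GPU has posted its results
// (e.g. after a successful stream query); in this sketch it never arrives.
std::atomic<bool> gpu_result_ready{false};

void poll_for_gpu_results(int step) {
    const long max_polls = 1000000;   // same order as the limit in the error message
    const auto start = std::chrono::steady_clock::now();
    long polls = 0;
    while (!gpu_result_ready.load()) {
        if (++polls >= max_polls) {
            const double secs = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - start).count();
            std::fprintf(stderr,
                         "FATAL ERROR: polled %ld times over %f s on step %d\n",
                         polls, secs, step);
            std::exit(1);             // NAMD calls CmiAbort at this point instead
        }
        std::this_thread::yield();    // busy-wait between checks: the "idle poll"
    }
}

int main() {
    // The result never becomes ready, so the poller gives up, as in the log.
    poll_for_gpu_results(58);
}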

If the error is still there when you restart, and is not there with NAMD 2.8, it may be a bug in the newly implemented minimization on the GPU, or in one of your GPUs -> maybe try them separately.
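
For example, something along these lines would pin the run to one card at a time (just a sketch; +devices is the option your log already mentions, while the thread count and output names are only suggestions to adapt to your setup):

namd2 +p3 +devices 0 min-01.conf > min-01_gpu0.log
namd2 +p3 +devices 1 min-01.conf > min-01_gpu1.log

If the timeout shows up with only one of the two cards, that points at the hardware rather than at the new code.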

Good luck
Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Francesco Pietra
> Sent: Thursday, March 22, 2012 16:28
> To: NAMD
> Subject: namd-l: NAMD 2.9b1 crashes during minimization
>
> Hello:
> I started minimization of a protein in a water box with 2.9b1
> linux-cuda, shared mem, two GTX580, one AMD Phenom(tm) II X6 1075T
> Processor (6 CPU cores) (version 2.20.00). It started much faster
> than with 2.8, but crashed at step 57 out of the planned 10,000 (with
> version 2.8 there was no problem). Below is the log file, as it begins
> and as it crashes:
>
>
> Charm++: standalone mode (not using charmrun)
> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
> CharmLB> Load balancer assumes all CPUs are same.
> Charm++> Running on 1 unique compute nodes (6-way SMP).
> Charm++> cpu topology info is gathered in 0.001 seconds.
> Info: NAMD 2.9b1 for Linux-x86_64-multicore-CUDA
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
> Info: Built Mon Mar 19 13:06:58 CDT 2012 by jim on naiad.ks.uiuc.edu
> Info: 1 NAMD 2.9b1 Linux-x86_64-multicore-CUDA 6 gig64 francesco
> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.019042 s
> Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX
> 580' Mem: 1535MB Rev: 2.0
> Pe 1 physical rank 1 will use CUDA device of pe 2
> Pe 5 physical rank 5 will use CUDA device of pe 4
> Pe 3 physical rank 3 will use CUDA device of pe 4
> Pe 4 physical rank 4 binding to CUDA device 1 on gig64: 'GeForce GTX
> 580' Mem: 1535MB Rev: 2.0
> Did not find +devices i,j,k,... argument, using all
> Pe 0 physical rank 0 will use CUDA device of pe 2
> Info: 8.0625 MB of memory in use based on /proc/self/stat
> Info: Configuration file is min-01.conf
> Info: Working in the current directory
> /home/francesco/...........................
> TCL: Suspending until startup complete.
> Info: SIMULATION PARAMETERS:
> Info: TIMESTEP 1
> Info: NUMBER OF STEPS 0
> Info: STEPS PER CYCLE 10
> Info: PERIODIC CELL BASIS 1 86.21 0 0
> Info: PERIODIC CELL BASIS 2 0 76.19 0
> Info: PERIODIC CELL BASIS 3 0 0 86.12
> Info: PERIODIC CELL CENTER 44.4029 39.8852 44.57
> Info: LOAD BALANCER Centralized
> Info: LOAD BALANCING STRATEGY New Load Balancers -- DEFAULT
> Info: LDB PERIOD 2000 steps
> Info: FIRST LDB TIMESTEP 50
> Info: LAST LDB TIMESTEP -1
> Info: LDB BACKGROUND SCALING 1
> Info: HOM BACKGROUND SCALING 1
> Info: PME BACKGROUND SCALING 1
> Info: MIN ATOMS PER PATCH 40
> Info: INITIAL TEMPERATURE 0
> Info: CENTER OF MASS MOVING INITIALLY? NO
> Info: DIELECTRIC 1
> Info: EXCLUDE SCALED ONE-FOUR
> Info: 1-4 ELECTROSTATICS SCALED BY 0.833333
> Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
> Info: DCD FILENAME ./min-01.dcd
> Info: DCD FREQUENCY 100
> Info: DCD FIRST STEP 100
> Info: DCD FILE WILL CONTAIN UNIT CELL DATA
> Info: XST FILENAME ./min-01.xst
> Info: XST FREQUENCY 100
> Info: NO VELOCITY DCD OUTPUT
> Info: NO FORCE DCD OUTPUT
> Info: OUTPUT FILENAME ./min-01
> Info: RESTART FILENAME ./min-01.rst
> Info: RESTART FREQUENCY 100
> Info: BINARY RESTART FILES WILL BE USED
> Info: SWITCHING ACTIVE
> Info: SWITCHING ON 6
> Info: SWITCHING OFF 9
> Info: PAIRLIST DISTANCE 11
> Info: PAIRLIST SHRINK RATE 0.01
> Info: PAIRLIST GROW RATE 0.01
> Info: PAIRLIST TRIGGER 0.3
> Info: PAIRLISTS PER CYCLE 2
> Info: PAIRLIST OUTPUT STEPS 1000
> Info: PAIRLISTS ENABLED
> Info: MARGIN 5
> Info: HYDROGEN GROUP CUTOFF 2.5
> Info: PATCH DIMENSION 18.5
> Info: ENERGY OUTPUT STEPS 100
> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
> Info: TIMING OUTPUT STEPS 1000
> Info: PARTICLE MESH EWALD (PME) ACTIVE
> Info: PME TOLERANCE 1e-06
> Info: PME EWALD COEFFICIENT 0.348832
> Info: PME INTERPOLATION ORDER 4
> Info: PME GRID DIMENSIONS 90 90 90
> Info: PME MAXIMUM GRID SPACING 1
> Info: Attempting to read FFTW data from
> FFTW_NAMD_2.9b1_Linux-x86_64-multicore-CUDA.txt
> Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
> Info: Writing FFTW data to FFTW_NAMD_2.9b1_Linux-x86_64-multicore-
> CUDA.txt
> Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 1
> Info: USING VERLET I (r-RESPA) MTS SCHEME.
> Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
> Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
> Info: RANDOM NUMBER SEED 1332428639
> Info: USE HYDROGEN BONDS? NO
> Info: Using AMBER format force field!
> Info: AMBER PARM FILE ./PROT_box.prmtop
> Info: AMBER COORDINATE FILE ./PROT_box.inpcrd
> Info: Exclusions in PARM file will be ignored!
> Info: SCNB (VDW SCALING) 2
> Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
> Reading parm file (./PROT_box.prmtop) ...
> PARM file in AMBER 7 format
> Warning: Encounter 10-12 H-bond term
> Warning: Found 15415 H-H bonds.
> ERROR
> Info: SUMMARY OF PARAMETERS:
> Info: 51 BONDS
> Info: 110 ANGLES
> Info: 110 HARMONIC
> Info: 0 COSINE-BASED
> Info: 44 DIHEDRAL
> Info: 0 IMPROPER
> Info: 0 CROSSTERM
> Info: 0 VDW
> Info: 190 VDW_PAIRS
> Info: 0 NBTHOLE_PAIRS
> Info: TIME FOR READING PDB FILE: 1.90735e-06
> Info:
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 51884 ATOMS
> Info: 51952 BONDS
> Info: 10313 ANGLES
> Info: 21952 DIHEDRALS
> Info: 0 IMPROPERS
> Info: 0 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 155649 DEGREES OF FREEDOM
> Info: 18257 HYDROGEN GROUPS
> Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
> Info: 18257 MIGRATION GROUPS
> Info: 4 ATOMS IN LARGEST MIGRATION GROUP
> Info: TOTAL MASS = 318261 amu
> Info: TOTAL CHARGE = -1.20961e-05 e
> Info: MASS DENSITY = 0.934294 g/cm^3
> Info: ATOM DENSITY = 0.0917221 atoms/A^3
> Info: *****************************
> Info:
> Info: Entering startup at 22.9331 s, 30.2266 MB of memory in use
> Info: Startup phase 0 took 0.000324965 s, 30.25 MB of memory in use
> Info: Startup phase 1 took 0.0376842 s, 39.793 MB of memory in use
> Info: Startup phase 2 took 0.000640869 s, 41.7344 MB of memory in use
> Info: Startup phase 3 took 0.00019002 s, 41.75 MB of memory in use
> Info: PATCH GRID IS 4 (PERIODIC) BY 4 (PERIODIC) BY 4 (PERIODIC)
> Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
> Info: REMOVING COM VELOCITY 0 0 0
> Info: LARGEST PATCH (25) HAS 880 ATOMS
> Info: Startup phase 4 took 0.020824 s, 50.7656 MB of memory in use
> Info: PME using 6 and 6 processors for FFT and reciprocal sum.
> Info: PME USING 1 GRID NODES AND 1 TRANS NODES
> Info: PME GRID LOCATIONS: 0 1 2 3 4 5
> Info: PME TRANS LOCATIONS: 0 1 2 3 4 5
> Info: Optimizing 4 FFT steps. 1... 2... 3... 4... Done.
> Info: Startup phase 5 took 0.00726509 s, 52.3047 MB of memory in use
> Info: Startup phase 6 took 0.000536919 s, 52.5469 MB of memory in use
> LDB: Central LB being created...
> Info: Startup phase 7 took 0.000446081 s, 52.8203 MB of memory in use
> Info: CREATING 1328 COMPUTE OBJECTS
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 705 POINTS
> Pe 2 hosts 5 local and 5 remote patches for pe 2
> Pe 5 hosts 0 local and 1 remote patches for pe 2
> Pe 3 hosts 5 local and 6 remote patches for pe 2
> Pe 1 hosts 6 local and 5 remote patches for pe 2
> Pe 0 hosts 5 local and 5 remote patches for pe 2
> Pe 3 hosts 5 local and 6 remote patches for pe 4
> Pe 1 hosts 3 local and 2 remote patches for pe 4
> Pe 4 hosts 6 local and 5 remote patches for pe 4
> Pe 0 hosts 5 local and 5 remote patches for pe 4
> Pe 5 hosts 5 local and 6 remote patches for pe 4
> Pe 2 hosts 2 local and 2 remote patches for pe 4
> Pe 4 hosts 3 local and 2 remote patches for pe 2
> Info: useSync: 1 useProxySync: 0
> Info: Startup phase 8 took 0.155836 s, 91.125 MB of memory in use
> Info: Startup phase 9 took 0.00121212 s, 91.1758 MB of memory in use
> Info: Startup phase 10 took 0.000585079 s, 91.4102 MB of memory in use
> Info: Finished startup at 23.1586 s, 91.4688 MB of memory in use
>
> TCL: Minimizing for 10000 steps
> Pe 4 has 26 local and 26 remote patches and 445 local and 446 remote
> computes.
> Pe 2 has 24 local and 24 remote patches and 419 local and 418 remote
> computes.
> ETITLE: TS BOND ANGLE DIHED
> IMPRP ELECT VDW BOUNDARY MISC
> KINETIC TOTAL TEMP POTENTIAL
> TOTAL3 TEMPAVG PRESSURE GPRESSURE
> VOLUME PRESSAVG GPRESSAVG
>
> ENERGY: 0 608.1068 1052.6951 3485.9453
> 0.0000 -156089.2177 21375000.4990 0.0000
> 0.0000 0.0000 21224058.0287 0.0000
> 21224058.0287 21224058.0287 0.0000 10520317.6506
> 10505666.4985 565665.4322 10520317.6506 10505666.4985
>
> ...................................
> ..........................................
> ENERGY: 56 12508.5664 1082.1368 3569.0850
> 0.0000 -193924.2577 17115.4831 0.0000
> 0.0000 0.0000 -159648.9862 0.0000
> -159648.9862 -159648.9862 0.0000 -7122.9231
> -6474.8098 565665.4322 -7122.9231 -6474.8098
>
> ENERGY: 57 21230.3212 1650.3913 3684.5584
> 0.0000 -195555.6143 19427.4344 0.0000
> 0.0000 0.0000 -149562.9090 0.0000
> -149562.9090 -149562.9090 0.0000 -16730.6831
> -5674.2667 565665.4322 -16730.6831 -5674.2667
>
> LINE MINIMIZER BRACKET: DX 0.000804502 0.001609 DU -943.587 10086.1
> DUDX -3.34677e+06 1.05924e+06 1.22224e+07
> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
> 101.946588 s on step 58
> ------------- Processor 4 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times
> over 101.946588 s on step 58
>
> Charm++ fatal error:
> FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
> 101.946588 s on step 58
>
> [4] Stack Traceback:
> [4:0] CmiAbort+0x95 [0xbb5585]
> [4:1] _Z8NAMD_diePKc+0x62 [0x5821aa]
>
>
> The cards were activated with
>
> nvidia-smi -L
> nvidia-smi -pm 1
>
> and about cuda:
> root_at_gig64:...........# modinfo nvidia
> filename: /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
> alias: char-major-195-*
> version: 295.20
> supported: external
> license: NVIDIA
> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
> alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
> depends: i2c-core
> vermagic: 2.6.38-2-amd64 SMP mod_unload modversions
> parm: NVreg_EnableVia4x:int
> parm: NVreg_EnableALiAGP:int
> parm: NVreg_ReqAGPRate:int
> parm: NVreg_EnableAGPSBA:int
> parm: NVreg_EnableAGPFW:int
> parm: NVreg_Mobile:int
> parm: NVreg_ResmanDebugLevel:int
> parm: NVreg_RmLogonRC:int
> parm: NVreg_ModifyDeviceFiles:int
> parm: NVreg_DeviceFileUID:int
> parm: NVreg_DeviceFileGID:int
> parm: NVreg_DeviceFileMode:int
> parm: NVreg_RemapLimit:int
> parm: NVreg_UpdateMemoryTypes:int
> parm: NVreg_InitializeSystemMemoryAllocations:int
> parm: NVreg_UseVBios:int
> parm: NVreg_RMEdgeIntrCheck:int
> parm: NVreg_UsePageAttributeTable:int
> parm: NVreg_EnableMSI:int
> parm: NVreg_MapRegistersEarly:int
> parm: NVreg_RegisterForACPIEvents:int
> parm: NVreg_RegistryDwords:charp
> parm: NVreg_RmMsg:charp
> parm: NVreg_NvAGP:int
> root_at_gig64:/home/francesco/.............
>
> I am using the updated driver as provided on Debian amd64 wheezy
>
>
> Thanks for advice
>
> francesco pietra
