NAMD 2.9b1 crashes during minimization

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Thu Mar 22 2012 - 10:28:04 CDT

Hello:
I started minimizing a protein in a water box with NAMD 2.9b1
linux-cuda (shared memory) on two GTX 580 cards and one AMD Phenom(tm)
II X6 1075T processor (6 CPU cores) (version 2.20.00). It started much
faster than with 2.8, but crashed after step 57 of the planned 10,000
(with version 2.8 there was no problem). Below is the log file, as it
begins and as it crashes:

Charm++: standalone mode (not using charmrun)
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (6-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Info: NAMD 2.9b1 for Linux-x86_64-multicore-CUDA
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
Info: Built Mon Mar 19 13:06:58 CDT 2012 by jim on naiad.ks.uiuc.edu
Info: 1 NAMD 2.9b1 Linux-x86_64-multicore-CUDA 6 gig64 francesco
Info: Running on 6 processors, 1 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.019042 s
Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX
580' Mem: 1535MB Rev: 2.0
Pe 1 physical rank 1 will use CUDA device of pe 2
Pe 5 physical rank 5 will use CUDA device of pe 4
Pe 3 physical rank 3 will use CUDA device of pe 4
Pe 4 physical rank 4 binding to CUDA device 1 on gig64: 'GeForce GTX
580' Mem: 1535MB Rev: 2.0
Did not find +devices i,j,k,... argument, using all
Pe 0 physical rank 0 will use CUDA device of pe 2
Info: 8.0625 MB of memory in use based on /proc/self/stat
Info: Configuration file is min-01.conf
Info: Working in the current directory
/home/francesco/...........................
TCL: Suspending until startup complete.
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 1
Info: NUMBER OF STEPS 0
Info: STEPS PER CYCLE 10
Info: PERIODIC CELL BASIS 1 86.21 0 0
Info: PERIODIC CELL BASIS 2 0 76.19 0
Info: PERIODIC CELL BASIS 3 0 0 86.12
Info: PERIODIC CELL CENTER 44.4029 39.8852 44.57
Info: LOAD BALANCER Centralized
Info: LOAD BALANCING STRATEGY New Load Balancers -- DEFAULT
Info: LDB PERIOD 2000 steps
Info: FIRST LDB TIMESTEP 50
Info: LAST LDB TIMESTEP -1
Info: LDB BACKGROUND SCALING 1
Info: HOM BACKGROUND SCALING 1
Info: PME BACKGROUND SCALING 1
Info: MIN ATOMS PER PATCH 40
Info: INITIAL TEMPERATURE 0
Info: CENTER OF MASS MOVING INITIALLY? NO
Info: DIELECTRIC 1
Info: EXCLUDE SCALED ONE-FOUR
Info: 1-4 ELECTROSTATICS SCALED BY 0.833333
Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
Info: DCD FILENAME ./min-01.dcd
Info: DCD FREQUENCY 100
Info: DCD FIRST STEP 100
Info: DCD FILE WILL CONTAIN UNIT CELL DATA
Info: XST FILENAME ./min-01.xst
Info: XST FREQUENCY 100
Info: NO VELOCITY DCD OUTPUT
Info: NO FORCE DCD OUTPUT
Info: OUTPUT FILENAME ./min-01
Info: RESTART FILENAME ./min-01.rst
Info: RESTART FREQUENCY 100
Info: BINARY RESTART FILES WILL BE USED
Info: SWITCHING ACTIVE
Info: SWITCHING ON 6
Info: SWITCHING OFF 9
Info: PAIRLIST DISTANCE 11
Info: PAIRLIST SHRINK RATE 0.01
Info: PAIRLIST GROW RATE 0.01
Info: PAIRLIST TRIGGER 0.3
Info: PAIRLISTS PER CYCLE 2
Info: PAIRLIST OUTPUT STEPS 1000
Info: PAIRLISTS ENABLED
Info: MARGIN 5
Info: HYDROGEN GROUP CUTOFF 2.5
Info: PATCH DIMENSION 18.5
Info: ENERGY OUTPUT STEPS 100
Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
Info: TIMING OUTPUT STEPS 1000
Info: PARTICLE MESH EWALD (PME) ACTIVE
Info: PME TOLERANCE 1e-06
Info: PME EWALD COEFFICIENT 0.348832
Info: PME INTERPOLATION ORDER 4
Info: PME GRID DIMENSIONS 90 90 90
Info: PME MAXIMUM GRID SPACING 1
Info: Attempting to read FFTW data from
FFTW_NAMD_2.9b1_Linux-x86_64-multicore-CUDA.txt
Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
Info: Writing FFTW data to FFTW_NAMD_2.9b1_Linux-x86_64-multicore-CUDA.txt
Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 1
Info: USING VERLET I (r-RESPA) MTS SCHEME.
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
Info: RANDOM NUMBER SEED 1332428639
Info: USE HYDROGEN BONDS? NO
Info: Using AMBER format force field!
Info: AMBER PARM FILE ./PROT_box.prmtop
Info: AMBER COORDINATE FILE ./PROT_box.inpcrd
Info: Exclusions in PARM file will be ignored!
Info: SCNB (VDW SCALING) 2
Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
Reading parm file (./PROT_box.prmtop) ...
PARM file in AMBER 7 format
Warning: Encounter 10-12 H-bond term
Warning: Found 15415 H-H bonds.
ERROR
Info: SUMMARY OF PARAMETERS:
Info: 51 BONDS
Info: 110 ANGLES
Info: 110 HARMONIC
Info: 0 COSINE-BASED
Info: 44 DIHEDRAL
Info: 0 IMPROPER
Info: 0 CROSSTERM
Info: 0 VDW
Info: 190 VDW_PAIRS
Info: 0 NBTHOLE_PAIRS
Info: TIME FOR READING PDB FILE: 1.90735e-06
Info:
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 51884 ATOMS
Info: 51952 BONDS
Info: 10313 ANGLES
Info: 21952 DIHEDRALS
Info: 0 IMPROPERS
Info: 0 CROSSTERMS
Info: 0 EXCLUSIONS
Info: 155649 DEGREES OF FREEDOM
Info: 18257 HYDROGEN GROUPS
Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
Info: 18257 MIGRATION GROUPS
Info: 4 ATOMS IN LARGEST MIGRATION GROUP
Info: TOTAL MASS = 318261 amu
Info: TOTAL CHARGE = -1.20961e-05 e
Info: MASS DENSITY = 0.934294 g/cm^3
Info: ATOM DENSITY = 0.0917221 atoms/A^3
Info: *****************************
Info:
Info: Entering startup at 22.9331 s, 30.2266 MB of memory in use
Info: Startup phase 0 took 0.000324965 s, 30.25 MB of memory in use
Info: Startup phase 1 took 0.0376842 s, 39.793 MB of memory in use
Info: Startup phase 2 took 0.000640869 s, 41.7344 MB of memory in use
Info: Startup phase 3 took 0.00019002 s, 41.75 MB of memory in use
Info: PATCH GRID IS 4 (PERIODIC) BY 4 (PERIODIC) BY 4 (PERIODIC)
Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
Info: REMOVING COM VELOCITY 0 0 0
Info: LARGEST PATCH (25) HAS 880 ATOMS
Info: Startup phase 4 took 0.020824 s, 50.7656 MB of memory in use
Info: PME using 6 and 6 processors for FFT and reciprocal sum.
Info: PME USING 1 GRID NODES AND 1 TRANS NODES
Info: PME GRID LOCATIONS: 0 1 2 3 4 5
Info: PME TRANS LOCATIONS: 0 1 2 3 4 5
Info: Optimizing 4 FFT steps. 1... 2... 3... 4... Done.
Info: Startup phase 5 took 0.00726509 s, 52.3047 MB of memory in use
Info: Startup phase 6 took 0.000536919 s, 52.5469 MB of memory in use
LDB: Central LB being created...
Info: Startup phase 7 took 0.000446081 s, 52.8203 MB of memory in use
Info: CREATING 1328 COMPUTE OBJECTS
Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
Info: NONBONDED TABLE SIZE: 705 POINTS
Pe 2 hosts 5 local and 5 remote patches for pe 2
Pe 5 hosts 0 local and 1 remote patches for pe 2
Pe 3 hosts 5 local and 6 remote patches for pe 2
Pe 1 hosts 6 local and 5 remote patches for pe 2
Pe 0 hosts 5 local and 5 remote patches for pe 2
Pe 3 hosts 5 local and 6 remote patches for pe 4
Pe 1 hosts 3 local and 2 remote patches for pe 4
Pe 4 hosts 6 local and 5 remote patches for pe 4
Pe 0 hosts 5 local and 5 remote patches for pe 4
Pe 5 hosts 5 local and 6 remote patches for pe 4
Pe 2 hosts 2 local and 2 remote patches for pe 4
Pe 4 hosts 3 local and 2 remote patches for pe 2
Info: useSync: 1 useProxySync: 0
Info: Startup phase 8 took 0.155836 s, 91.125 MB of memory in use
Info: Startup phase 9 took 0.00121212 s, 91.1758 MB of memory in use
Info: Startup phase 10 took 0.000585079 s, 91.4102 MB of memory in use
Info: Finished startup at 23.1586 s, 91.4688 MB of memory in use

TCL: Minimizing for 10000 steps
Pe 4 has 26 local and 26 remote patches and 445 local and 446 remote computes.
Pe 2 has 24 local and 24 remote patches and 419 local and 418 remote computes.
ETITLE: TS BOND ANGLE DIHED
IMPRP ELECT VDW BOUNDARY MISC
       KINETIC TOTAL TEMP POTENTIAL
  TOTAL3 TEMPAVG PRESSURE GPRESSURE
VOLUME PRESSAVG GPRESSAVG

ENERGY: 0 608.1068 1052.6951 3485.9453
0.0000 -156089.2177 21375000.4990 0.0000
0.0000 0.0000 21224058.0287 0.0000
21224058.0287 21224058.0287 0.0000 10520317.6506
10505666.4985 565665.4322 10520317.6506 10505666.4985

...................................
..........................................
ENERGY: 56 12508.5664 1082.1368 3569.0850
0.0000 -193924.2577 17115.4831 0.0000
0.0000 0.0000 -159648.9862 0.0000
-159648.9862 -159648.9862 0.0000 -7122.9231
-6474.8098 565665.4322 -7122.9231 -6474.8098

ENERGY: 57 21230.3212 1650.3913 3684.5584
0.0000 -195555.6143 19427.4344 0.0000
0.0000 0.0000 -149562.9090 0.0000
-149562.9090 -149562.9090 0.0000 -16730.6831
-5674.2667 565665.4322 -16730.6831 -5674.2667

LINE MINIMIZER BRACKET: DX 0.000804502 0.001609 DU -943.587 10086.1
DUDX -3.34677e+06 1.05924e+06 1.22224e+07
FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
101.946588 s on step 58
------------- Processor 4 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times
over 101.946588 s on step 58

Charm++ fatal error:
FATAL ERROR: cuda_check_remote_progress polled 1000000 times over
101.946588 s on step 58

[4] Stack Traceback:
  [4:0] CmiAbort+0x95 [0xbb5585]
  [4:1] _Z8NAMD_diePKc+0x62 [0x5821aa]
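
To compare the energy terms of steps 56 and 57 in the log above, the mail-wrapped ETITLE/ENERGY records can be paired up with a short script. This is only a minimal sketch: it assumes records are separated by blank lines, as they are in this message, and the helper name `parse_energy_records` is hypothetical, not part of NAMD.

```python
import re

# Steps 56 and 57 copied from the log above, with the mail line wrapping kept.
LOG = """\
ETITLE: TS BOND ANGLE DIHED
IMPRP ELECT VDW BOUNDARY MISC
       KINETIC TOTAL TEMP POTENTIAL
  TOTAL3 TEMPAVG PRESSURE GPRESSURE
VOLUME PRESSAVG GPRESSAVG

ENERGY: 56 12508.5664 1082.1368 3569.0850
0.0000 -193924.2577 17115.4831 0.0000
0.0000 0.0000 -159648.9862 0.0000
-159648.9862 -159648.9862 0.0000 -7122.9231
-6474.8098 565665.4322 -7122.9231 -6474.8098

ENERGY: 57 21230.3212 1650.3913 3684.5584
0.0000 -195555.6143 19427.4344 0.0000
0.0000 0.0000 -149562.9090 0.0000
-149562.9090 -149562.9090 0.0000 -16730.6831
-5674.2667 565665.4322 -16730.6831 -5674.2667
"""


def parse_energy_records(text):
    """Hypothetical helper: pair each ENERGY record with the ETITLE names."""
    titles, records = [], []
    # Records are separated by blank lines; one record may span several lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        tokens = block.split()
        if not tokens:
            continue
        if tokens[0] == "ETITLE:":
            titles = tokens[1:]
        elif tokens[0] == "ENERGY:" and titles:
            # Keep only as many values as there are column titles.
            values = [float(t) for t in tokens[1:1 + len(titles)]]
            records.append(dict(zip(titles, values)))
    return records


records = parse_energy_records(LOG)
prev, last = records[-2], records[-1]
# Change of selected energy terms between the last two recorded steps.
terms = ("BOND", "ANGLE", "DIHED", "ELECT", "VDW", "TOTAL")
delta = {name: last[name] - prev[name] for name in terms}
for name, change in delta.items():
    print(f"{name:>6} {change:+12.4f}")
```

For the two records above, TOTAL rises by about 10086, which agrees with the second DU value on the LINE MINIMIZER BRACKET line, and the largest single contribution is the BOND term (about +8722).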

The cards were activated with

nvidia-smi -L
nvidia-smi -pm 1

As for CUDA, modinfo reports the following for the driver module:
root_at_gig64:...........# modinfo nvidia
filename: /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
alias: char-major-195-*
version: 295.20
supported: external
license: NVIDIA
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: i2c-core
vermagic: 2.6.38-2-amd64 SMP mod_unload modversions
parm: NVreg_EnableVia4x:int
parm: NVreg_EnableALiAGP:int
parm: NVreg_ReqAGPRate:int
parm: NVreg_EnableAGPSBA:int
parm: NVreg_EnableAGPFW:int
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_RemapLimit:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UseVBios:int
parm: NVreg_RMEdgeIntrCheck:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_EnableMSI:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RmMsg:charp
parm: NVreg_NvAGP:int
root_at_gig64:/home/francesco/.............

I am using the up-to-date driver as packaged for Debian amd64 wheezy.

Thanks for any advice.

francesco pietra
