RE: CUDA error in cuda_check_local_progress

From: Abhishek TYAGI (atyagiaa_at_connect.ust.hk)
Date: Thu Apr 17 2014 - 04:23:44 CDT

Hi,

I tried running with +p1, +p2, +p3, and +p4 separately as well. With +p1 it runs for a few minutes and then the same output appears, yet when I run nvidia-smi I see that namd is still running. The final output from the log file is as follows:


WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 10000
WRITING COORDINATES TO DCD FILE AT STEP 10000
WRITING COORDINATES TO RESTART FILE AT STEP 10000
FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 10000
FINISHED WRITING RESTART VELOCITIES
REINITIALIZING VELOCITIES AT STEP 10000 TO 288 KELVIN.
TCL: Running for 5000 steps
ERROR: Constraint failure in RATTLE algorithm for atom 4170!
ERROR: Constraint failure; simulation has become unstable.
ERROR: Constraint failure in RATTLE algorithm for atom 4236!
ERROR: Constraint failure; simulation has become unstable.
ERROR: Exiting prematurely; see error messages above.
====================================================

WallClock: 50792.585938 CPUTime: 50638.324219 Memory: 1220.495789 MB
Program finished.

Can you suggest some more ways to resolve this?

regards

Abhi

________________________________
From: Norman Geist <norman.geist_at_uni-greifswald.de>
Sent: Thursday, April 17, 2014 4:21 PM
To: Abhishek TYAGI
Cc: Namd Mailing List
Subject: RE: namd-l: CUDA error in cuda_check_local_progress

What GPUs are those? This error can occur, for example, if your cutoff, pairlistdist, etc. are too large to fit into the GPU's memory. What is the output of "nvidia-smi -q"? Maybe there are multiple GPUs and one of them is display-only and therefore doesn't have enough memory. Try setting +devices to select the GPU ids manually and see if it works with one GPU on its own.
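For example, something along these lines (a sketch only; the device ids below are placeholders, pick them from what nvidia-smi actually reports on your node):

```shell
# Inspect all GPUs and their memory; a display-only card will show
# far less total FB memory than the compute cards
nvidia-smi -q

# Suppose nvidia-smi shows device 0 is the small display GPU and
# device 1 is the compute card: pin NAMD to device 1 only
charmrun namd2 +idlepoll +p4 +devices 1 eq1.namd > eq1.log &

# Or test each device on its own to find the one that fails
charmrun namd2 +idlepoll +p1 +devices 0 eq1.namd > eq1_dev0.log
```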

Norman Geist.

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Abhishek TYAGI
Sent: Thursday, April 17, 2014 09:41
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: CUDA error in cuda_check_local_progress

Hi,

I am running a simulation of a graphene and DNA system. It runs without error on my CPU, but on a GPU cluster (NVIDIA, CUDA), using the NAMD build from the website (NAMD_2.9_Linux-x86_64-multicore-CUDA.tar.gz), the following error appears every time. I have tried changing the timestep, output frequencies, and other settings, but I really don't understand what to do in this case.

I run the minimization with the following command, but it fails every time:

% charmrun namd2 +idlepoll +p4 eq1.namd > eq1.log &

------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error in cuda_check_local_progress on Pe 0 (gpu10 device 0): unspecified launch failure

Charm++ fatal error:
FATAL ERROR: CUDA error in cuda_check_local_progress on Pe 0 (gpu10 device 0): unspecified launch failure

The eq1.namd conf file is as follows:

#############################################################
## JOB DESCRIPTION ##
#############################################################

# Minimization and Equilibration of
# COMMENT ON YOUR SYSTEM HERE

#############################################################
## ADJUSTABLE PARAMETERS ##
#############################################################

structure ionized.psf
coordinates ionized.pdb

set temperature 298
set outputname eq1

firsttimestep 0

#############################################################
## SIMULATION PARAMETERS ##
#############################################################

# Input
paraTypeCharmm on
parameters par_all27_na.prm
parameters par_graphene.prm
temperature $temperature

# Force-Field Parameters
exclude scaled1-4
1-4scaling 1.0
cutoff 12.
switching on
switchdist 10.
pairlistdist 13.5

# Integrator Parameters
timestep 0.5
rigidBonds all
nonbondedFreq 2
fullElectFrequency 4
stepspercycle 10

# Constant Temperature Control
langevin off
langevinDamping 5
langevinTemp $temperature
langevinHydrogen off

# Output
outputName $outputname

restartfreq 500 ;# 500 steps = every 0.25 ps at the 0.5 fs timestep
dcdfreq 300
outputEnergies 100
outputPressure 100

#############################################################
## PBC PARAMETERS ##
#############################################################

# Periodic Boundary Conditions
cellBasisVector1 40.0 0.0 0.0
cellBasisVector2 0.0 40.0 0.0
cellBasisVector3 0.0 0.0 30.0
cellOrigin 0.0 0.0 0.0

#############################################################
## EXECUTION SCRIPT ##
#############################################################

# Minimization
minimize 100000
reinitvels $temperature

run 50000

Please suggest how I can resolve this issue.

Thanks in advance

Abhishek


This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:22:21 CST