CUDA error cudaStreamSynchronize(stream) and CUDA error in ComputeBondedCUDA

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Thu Nov 17 2022 - 10:19:43 CST

Hello,
My computer (GA-X79-UD3 with two 680 GPUs) runs Debian 10 Linux:
$ uname -r
5.10.0-19-amd64

With NAMD_Git-2022-07-21_Linux-x86_64-multicore-CUDA and
Driver Version: 470.141.03 CUDA Version: 11.4
it can no longer run NAMD with CUDA.

Each run was preceded by:
nvidia-smi -pm 1

Error with both devices:
namd2 +idlepoll +p12 +devices 0,1 min.conf

TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function buildTileLists, line 1136
 on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was
encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal
memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function buildTileLists, line 1136
 on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was
encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal
memory access was encountered
[Partition 0][Node 0] End of program

Error with device 0:
namd2 +idlepoll +p12 +devices 0 min.conf

TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function sortTileLists, line 1577
 on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was
encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function sortTileLists, line 1577
 on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was
encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
673 times over 0.077770 s on Pe 8 (gig64 device 0 pci 0:2:0): an illegal
memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
673 times over 0.077770 s on Pe 8 (gig64 device 0 pci 0:2:0): an illegal
memory access was encountered

Error with device 1:
namd2 +idlepoll +p12 +devices 1 min.conf

TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function sortTileLists, line 1577
 on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was
encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function sortTileLists, line 1577
 on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was
encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
671 times over 0.077836 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal
memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
671 times over 0.077836 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal
memory access was encountered
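
To compare the three runs at a glance, the failing kernel functions can be pulled out of saved output. This is only a minimal sketch, assuming each run's output was redirected to a log file; the helper name summarize_cuda_errors and the log file name are hypothetical and not part of NAMD. It only catches the "function NAME, line N" messages, not the ComputeBondedCUDA::forceDoneCheck ones.

```shell
# Print each distinct "function NAME, line N" found in a saved NAMD log,
# preceded by the number of times it occurs.
summarize_cuda_errors() {
    grep -o 'function [A-Za-z0-9_]*, line [0-9]*' "$1" | sort | uniq -c
}
```

Usage, with a hypothetical log name: summarize_cuda_errors namd_dev0.log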

This error first appeared months ago, with earlier versions of the CUDA
driver and Linux kernel, and it persists with the new driver and kernel.

My question here is whether these errors may arise from incorrect usage of
NAMD. (I am using the same commands that used to work long ago.)

Computer engineers say that these can't be hardware errors. Indeed, since
the namd commands above fail in the same way whether they select one GPU
or the other, a memory failure on a single card seems unlikely.
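
The per-device comparison can be repeated in one pass, keeping one log per GPU. This is a sketch under the assumption that namd2 is on the PATH and min.conf is in the current directory; run_per_device, the NAMD variable, and the log file names are hypothetical, while the namd2 options are the ones used in the commands above.

```shell
# Binary to run; override by setting NAMD before sourcing this snippet.
NAMD=${NAMD:-namd2}

# Run the same configuration on each listed device separately,
# writing stdout and stderr to namd_dev<N>.log and reporting the exit status.
run_per_device() {
    conf=$1
    shift
    for dev in "$@"; do
        "$NAMD" +idlepoll +p12 +devices "$dev" "$conf" > "namd_dev${dev}.log" 2>&1
        echo "device $dev exit status $?"
    done
}
```

Usage: run_per_device min.conf 0 1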

Thanks for any advice,
francesco pietra

This archive was generated by hypermail 2.1.6 : Tue Dec 13 2022 - 14:32:44 CST