Re: Multi node run causes "CUDA error cudaStreamCreate"

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Fri Apr 01 2011 - 15:02:18 CDT

I just had this kind of error myself.

Check your GPUs with: nvidia-smi -a
It could be that one of them has ECC errors, and in that case NAMD
(rightfully so) refuses to use the device.
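A minimal sketch of that check, run here against a canned dump because the exact nvidia-smi -a field layout varies by driver version; on the cluster you would pipe the live output of nvidia-smi -a from each node (n0, n1, n2 in the log below) through the same filter. The field names are illustrative:

```shell
# Stand-in for real "nvidia-smi -a" output; on the cluster, capture the
# live output per node instead (e.g. via ssh to n0, n1, n2).
cat > smi_dump.txt <<'EOF'
GPU 0:
    Ecc Mode              : Enabled
    Double Bit ECC Errors : 0
GPU 1:
    Ecc Mode              : Enabled
    Double Bit ECC Errors : 3
EOF

# Print only GPUs reporting a nonzero double-bit ECC count; a nonzero
# count is the condition that can make the driver refuse the device.
grep -i 'double bit ecc errors' smi_dump.txt | grep -v ': 0'
```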

axel

On Fri, Apr 1, 2011 at 1:32 PM, Michael S. Sellers (Cont, ARL/WMRD)
<michael.s.sellers.ctr_at_us.army.mil> wrote:
> All,
>
> I am receiving a "FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1
> device 1): no CUDA-capable device is available" when NAMD starts up and is
> optimizing FFT steps, for a job running on 3 nodes, 4ppn, 2 Teslas per
> node.
>
> The command I'm executing within a PBS script is:
> ~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll sim1.conf  >
> $PBS_JOBNAME.out
>
> NAMD CUDA does not give this error on 1 node, 8ppn, 2 Teslas.  Please see
> output below.
>
> Might this be a situation where I need to use the +devices flag?  It seems
> as though the PEs are binding to CUDA devices on other nodes.
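> For reference, if explicit pinning does turn out to be needed: NAMD's
> +devices flag takes a comma-separated list of CUDA device indices,
> applied per physical node, so naming the two local boards on every host
> would look like the sketch below (paths and the PBS variable taken from
> the script above; whether this actually fixes the binding is the open
> question):
>
> ```shell
> # Sketch: pin NAMD to devices 0 and 1 on each physical node. PEs on a
> # node are assigned to the listed devices round-robin, so "0,1" keeps
> # every PE on one of its own node's two Teslas.
> ~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll \
>     +devices 0,1 sim1.conf > "$PBS_JOBNAME.out"
> ```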
>
> Thanks,
>
> Mike
>
>
> Charm++> Running on 3 unique compute nodes (8-way SMP).
> Charm++> cpu topology info is gathered in 0.203 seconds.
> Info: NAMD CVS-2011-03-22 for Linux-x86_64-MPI-CUDA
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64
> Info: 1 NAMD  CVS-2011-03-22  Linux-x86_64-MPI-CUDA
> Info: Running on 12 processors, 12 nodes, 3 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.204571 s
> Pe 2 sharing CUDA device 0 first 0 next 0
> Did not find +devices i,j,k,... argument, using all
> Pe 2 physical rank 2 binding to CUDA device 0 on n2: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 3 sharing CUDA device 1 first 1 next 1
> Pe 3 physical rank 3 binding to CUDA device 1 on n2: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 0 sharing CUDA device 0 first 0 next 2
> Pe 0 physical rank 0 binding to CUDA device 0 on n2: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 9 sharing CUDA device 1 first 9 next 11
> Pe 7 sharing CUDA device 1 first 5 next 5
> Pe 5 sharing CUDA device 1 first 5 next 7
> Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 5 physical rank 1 binding to CUDA device 1 on n1: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 10 sharing CUDA device 0 first 8 next 8
> Pe 11 sharing CUDA device 1 first 9 next 9
> Pe 8 sharing CUDA device 0 first 8 next 10
> Pe 11 physical rank 3 binding to CUDA device 1 on n0: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 10 physical rank 2 binding to CUDA device 0 on n0: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 8 physical rank 0 binding to CUDA device 0 on n0: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 6 sharing CUDA device 0 first 4 next 4
> Pe 6 physical rank 2 binding to CUDA device 0 on n1: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 1 sharing CUDA device 1 first 1 next 3
> Pe 1 physical rank 1 binding to CUDA device 1 on n2: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Pe 4 sharing CUDA device 0 first 4 next 6
> Pe 4 physical rank 0 binding to CUDA device 0 on n1: 'Tesla T10 Processor'
>  Mem: 4095MB  Rev: 1.3
> Info: 51.4492 MB of memory in use based on /proc/self/stat
> ...
> ...
> Info: PME MAXIMUM GRID SPACING    1.5
> Info: Attempting to read FFTW data from
> FFTW_NAMD_CVS-2011-03-22_Linux-x86_64-MPI-CUDA.txt
> Info: Optimizing 6 FFT steps.  1...FATAL ERROR: CUDA error cudaStreamCreate
> on Pe 7 (n1 device 1): no CUDA-capable device is available
> ------------- Processor 7 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1): no
> CUDA-capable device is available
>
> [7] Stack Traceback:
>  [7:0] CmiAbort+0x59  [0x907f64]
>  [7:1] _Z8NAMD_diePKc+0x4a  [0x4fa7ba]
>  [7:2] _Z13cuda_errcheckPKc+0xdf  [0x624b5f]
>  [7:3] _Z15cuda_initializev+0x2a7  [0x624e27]
>  [7:4] _Z11master_initiPPc+0x1a1  [0x500a11]
>  [7:5] main+0x19  [0x4fd489]
>  [7:6] __libc_start_main+0xf4  [0x32ca41d994]
>  [7:7] cos+0x1d1  [0x4f9d99]
> FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no
> CUDA-capable device is available
> ------------- Processor 9 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no
> CUDA-capable device is available
>
> [9] Stack Traceback:
>  [9:0] CmiAbort+0x59  [0x907f64]
>  [9:1] _Z8NAMD_diePKc+0x4a  [0x4fa7ba]
>  [9:2] _Z13cuda_errcheckPKc+0xdf  [0x624b5f]
>  [9:3] _Z15cuda_initializev+0x2a7  [0x624e27]
>  [9:4] _Z11master_initiPPc+0x1a1  [0x500a11]
>  [9:5] main+0x19  [0x4fd489]
>  [9:6] __libc_start_main+0xf4  [0x32ca41d994]
>  [9:7] cos+0x1d1  [0x4f9d99]
> FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no
> CUDA-capable device is available
> ------------- Processor 5 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no
> CUDA-capable device is available
> ..
> ..
> ..
>
>

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:56:54 CST