Re: Multi node run causes "CUDA error cudaStreamCreate"

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Wed Apr 06 2011 - 16:17:02 CDT

Hi Mike,

I see that this is an MPI build. When you say it works on one node, is
that the same binary launched the same way with mpiexec? Did you run that
single-node test on all three of the nodes you're trying to run on? The
same goes for the nvidia-smi tests Axel suggested - you need to check
every node, not just one.
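
For example, from inside the PBS job something like this would show what
each node's driver actually reports (just a sketch - it assumes
$PBS_NODEFILE lists your allocated hosts and that ssh between the nodes
works):

   for host in $(sort -u $PBS_NODEFILE); do
       echo "=== $host ==="
       ssh $host nvidia-smi      # should report both Teslas on every node
   done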

Since you're getting errors from multiple nodes, it's also possible that
LD_LIBRARY_PATH isn't being set or passed through mpiexec correctly, so
some processes pick up a different CUDA runtime library than the one the
binary was built against.
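
A quick way to check is to run something like the following through your
MPI launcher (sketch only - environment forwarding and process placement
differ between MPI implementations, so use whatever options normally give
you one rank per node):

   mpiexec -np 3 sh -c 'echo $(hostname): $LD_LIBRARY_PATH'
   mpiexec -np 3 sh -c 'ldd ~/software/bin/namd2 | grep cudart'

If libcudart shows up as "not found", or resolves to a different version
on different hosts, that would explain what you're seeing.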

It looks like you're using charmrun to launch an MPI binary. I'm going to
assume charmrun is a script that calls mpiexec more or less correctly,
since the job does start, but you might want to try using the mpiexec
command directly, as you would for any other MPI program on your cluster.
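
Something along these lines in your PBS script would be the rough
equivalent of your charmrun line (exact mpiexec options depend on your
MPI and scheduler integration):

   mpiexec -np 12 ~/software/bin/namd2 +idlepoll sim1.conf > $PBS_JOBNAME.out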

Since the call that's triggering the error is actually the first CUDA
library call from a .cu file rather than a .C file, it's also possible
that your nvcc, -I, and -L options are mismatched (e.g., pointing at
different CUDA toolkit installations). This could happen if, for example,
you did a partial build, edited the arch/Linux-x86_64.cuda file, and then
finished the build without doing a make clean.
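
If that might be what happened, the safe fix is a clean rebuild from your
NAMD build directory (the directory name below is just an example - use
whatever your config step created):

   cd Linux-x86_64-MPI-CUDA   # your build directory
   make clean
   make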

You might want to try one of the released ibverbs-CUDA binaries (charmrun
can use mpiexec to launch non-MPI binaries now). If that works, then the
problem is somewhere in your own build.
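
For example (the unpack location and version below are just placeholders;
++mpiexec tells charmrun to start the node programs via mpiexec):

   cd ~/NAMD_2.8b1_Linux-x86_64-ibverbs-CUDA   # wherever you unpack the release
   ./charmrun ++mpiexec +p12 ./namd2 +idlepoll sim1.conf > $PBS_JOBNAME.out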

-Jim

On Fri, 1 Apr 2011, Michael S. Sellers (Cont, ARL/WMRD) wrote:

>>>>> All,
>>>>>
>>>>> I am receiving a "FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1
>>>>> device 1): no CUDA-capable device is available" when NAMD starts up and
>>>>> is optimizing FFT steps, for a job running on 3 nodes, 4ppn, 2 Teslas
>>>>> per node.
>>>>>
>>>>> The command I'm executing within a PBS script is:
>>>>> ~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll sim1.conf >
>>>>> $PBS_JOBNAME.out
>>>>>
>>>>> NAMD CUDA does not give this error on 1 node, 8ppn, 2 Teslas. Please
>>>>> see output below.
>>>>>
>>>>> Might this be a situation where I need to use the +devices flag? It
>>>>> seems as though the PEs are binding to CUDA devices on other nodes.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mike
>>>>>
>>>>>
>>>>> Charm++> Running on 3 unique compute nodes (8-way SMP).
>>>>> Charm++> cpu topology info is gathered in 0.203 seconds.
>>>>> Info: NAMD CVS-2011-03-22 for Linux-x86_64-MPI-CUDA
>>>>> Info:
>>>>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>>>>> Info: for updates, documentation, and support information.
>>>>> Info:
>>>>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>>>>> Info: in all publications reporting results obtained with NAMD.
>>>>> Info:
>>>>> Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64
>>>>> Info: 1 NAMD CVS-2011-03-22 Linux-x86_64-MPI-CUDA
>>>>> Info: Running on 12 processors, 12 nodes, 3 physical nodes.
>>>>> Info: CPU topology information available.
>>>>> Info: Charm++/Converse parallel runtime startup completed at 0.204571 s
>>>>> Pe 2 sharing CUDA device 0 first 0 next 0
>>>>> Did not find +devices i,j,k,... argument, using all
>>>>> Pe 2 physical rank 2 binding to CUDA device 0 on n2: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 3 sharing CUDA device 1 first 1 next 1
>>>>> Pe 3 physical rank 3 binding to CUDA device 1 on n2: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 0 sharing CUDA device 0 first 0 next 2
>>>>> Pe 0 physical rank 0 binding to CUDA device 0 on n2: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 9 sharing CUDA device 1 first 9 next 11
>>>>> Pe 7 sharing CUDA device 1 first 5 next 5
>>>>> Pe 5 sharing CUDA device 1 first 5 next 7
>>>>> Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 5 physical rank 1 binding to CUDA device 1 on n1: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 10 sharing CUDA device 0 first 8 next 8
>>>>> Pe 11 sharing CUDA device 1 first 9 next 9
>>>>> Pe 8 sharing CUDA device 0 first 8 next 10
>>>>> Pe 11 physical rank 3 binding to CUDA device 1 on n0: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 10 physical rank 2 binding to CUDA device 0 on n0: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 8 physical rank 0 binding to CUDA device 0 on n0: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 6 sharing CUDA device 0 first 4 next 4
>>>>> Pe 6 physical rank 2 binding to CUDA device 0 on n1: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 1 sharing CUDA device 1 first 1 next 3
>>>>> Pe 1 physical rank 1 binding to CUDA device 1 on n2: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Pe 4 sharing CUDA device 0 first 4 next 6
>>>>> Pe 4 physical rank 0 binding to CUDA device 0 on n1: 'Tesla T10
>>>>> Processor'
>>>>> Mem: 4095MB Rev: 1.3
>>>>> Info: 51.4492 MB of memory in use based on /proc/self/stat
>>>>> ...
>>>>> ...
>>>>> Info: PME MAXIMUM GRID SPACING 1.5
>>>>> Info: Attempting to read FFTW data from
>>>>> FFTW_NAMD_CVS-2011-03-22_Linux-x86_64-MPI-CUDA.txt
>>>>> Info: Optimizing 6 FFT steps. 1...FATAL ERROR: CUDA error
>>>>> cudaStreamCreate
>>>>> on Pe 7 (n1 device 1): no CUDA-capable device is available
>>>>> ------------- Processor 7 Exiting: Called CmiAbort ------------
>>>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1):
>>>>> no
>>>>> CUDA-capable device is available
>>>>>
>>>>> [7] Stack Traceback:
>>>>> [7:0] CmiAbort+0x59 [0x907f64]
>>>>> [7:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba]
>>>>> [7:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f]
>>>>> [7:3] _Z15cuda_initializev+0x2a7 [0x624e27]
>>>>> [7:4] _Z11master_initiPPc+0x1a1 [0x500a11]
>>>>> [7:5] main+0x19 [0x4fd489]
>>>>> [7:6] __libc_start_main+0xf4 [0x32ca41d994]
>>>>> [7:7] cos+0x1d1 [0x4f9d99]
>>>>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no
>>>>> CUDA-capable device is available
>>>>> ------------- Processor 9 Exiting: Called CmiAbort ------------
>>>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1):
>>>>> no
>>>>> CUDA-capable device is available
>>>>>
>>>>> [9] Stack Traceback:
>>>>> [9:0] CmiAbort+0x59 [0x907f64]
>>>>> [9:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba]
>>>>> [9:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f]
>>>>> [9:3] _Z15cuda_initializev+0x2a7 [0x624e27]
>>>>> [9:4] _Z11master_initiPPc+0x1a1 [0x500a11]
>>>>> [9:5] main+0x19 [0x4fd489]
>>>>> [9:6] __libc_start_main+0xf4 [0x32ca41d994]
>>>>> [9:7] cos+0x1d1 [0x4f9d99]
>>>>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no
>>>>> CUDA-capable device is available
>>>>> ------------- Processor 5 Exiting: Called CmiAbort ------------
>>>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1):
>>>>> no
>>>>> CUDA-capable device is available
>>>>> ..
>>>>> ..
>>>>> ..
