Re: Multi node run causes "CUDA error cudaStreamCreate"

From: Michael S. Sellers (Cont, ARL/WMRD) (michael.s.sellers.ctr_at_us.army.mil)
Date: Fri Apr 01 2011 - 16:27:08 CDT

Axel,

Here's the check for exclusive compute mode.

from 'nvidia-smi -s':
COMPUTE mode rules for GPU 0: 0
COMPUTE mode rules for GPU 1: 0

So it looks like both GPUs are in normal (non-exclusive) compute mode.
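
For reference, the same check can be made from the CUDA runtime; this is just a minimal sketch for illustration (not part of NAMD), compiled with nvcc:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print the compute mode of every visible CUDA device.
    // The integer matches the cudaComputeMode enum and the numbers
    // nvidia-smi reports above: 0 = default (shared), 1 = exclusive,
    // 2 = prohibited.
    int main() {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess) return 1;
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("GPU %d (%s): computeMode = %d\n", dev, prop.name, prop.computeMode);
        }
        return 0;
    }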

-Mike

Axel Kohlmeyer wrote:
> On Fri, Apr 1, 2011 at 4:46 PM, Michael S. Sellers (Cont, ARL/WMRD)
> <michael.s.sellers.ctr_at_us.army.mil> wrote:
>
>> Axel,
>>
>> Thanks for the help. See below for the output of your suggestion.
>>
>> The following, from a multi-node NAMD startup, does not seem right:
>>
>> Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>>
>> Should Pe 9 bind to 'CUDA device 0 on n2'? Here the pool is nodes {0-2},
>> Pes {0-11}, and CUDA devices {0,1} per node.
>>
>
> no. the sharing seems right.
> with two devices per node, all even PEs should get device 0
> and the odd PEs get device 1.
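
A minimal sketch of that round-robin binding, assuming the per-node ("physical") rank is already known; NAMD's actual device selection lives in its CUDA initialization code and is more involved than this:

    #include <cuda_runtime.h>

    // Bind the calling process to a device by its rank within the node.
    // With two devices per node, even local ranks land on device 0 and
    // odd local ranks on device 1, matching the pattern described above.
    cudaError_t bind_by_local_rank(int localRank) {
        int deviceCount = 0;
        cudaError_t err = cudaGetDeviceCount(&deviceCount);
        if (err != cudaSuccess) return err;
        if (deviceCount == 0) return cudaErrorNoDevice;
        return cudaSetDevice(localRank % deviceCount);
    }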
>
> ecc also is not an issue, since you have a G200 device without ECC support.
>
> the next possibility is that the GPUs are configured for
> "compute-exclusive" mode, i.e. only one process at a
> time can use a GPU. some admins set this up when the
> batch system allows multiple jobs onto a node, to reduce
> the risk of two jobs accessing the same GPU while
> other GPUs sit idle.
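
A sketch of how that failure would surface, assuming another process already holds device 1 in exclusive mode; in NAMD the first context-creating call happens to be cudaStreamCreate, which is where the error in this thread is reported:

    #include <cstdio>
    #include <cuda_runtime.h>

    // In exclusive compute mode the second process touching a device
    // fails at its first context-creating call.
    int main() {
        cudaSetDevice(1);   // hypothetical: device 1 already in use elsewhere
        cudaStream_t stream;
        cudaError_t err = cudaStreamCreate(&stream);
        if (err != cudaSuccess) {
            // e.g. "no CUDA-capable device is available", as seen below
            fprintf(stderr, "cudaStreamCreate: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaStreamDestroy(stream);
        return 0;
    }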
>
> cheers,
> axel.
>
>
>> -Mike
>>
>> ________________________________________________________________________
>>
>> Output of 'nvidia-smi -r' for several nodes:
>>
>> ECC is not supported by GPU 0
>> ECC is not supported by GPU 1
>>
>>
>> Output of 'nvidia-smi -a':
>>
>> ==============NVSMI LOG==============
>>
>>
>> Timestamp :
>> Unit 0:
>> Product Name : NVIDIA Tesla S1070 -500
>> Product ID :
>> Serial Number :
>> Firmware Ver : 3.6
>> Intake Temperature : 15 C
>> GPU 0:
>> Product Name : Tesla T10 Processor
>> Serial : Not available
>> PCI ID : 5e710de
>> Bridge Port : 0
>> Temperature : 31 C
>> GPU 1:
>> Product Name : Tesla T10 Processor
>> Serial : Not available
>> PCI ID : 5e710de
>> Bridge Port : 2
>> Temperature : 29 C
>> Fan Tachs:
>> #00: 3636 Status: NORMAL
>> #01: 3462 Status: NORMAL
>> #02: 3664 Status: NORMAL
>> #03: 3376 Status: NORMAL
>> #04: 3598 Status: NORMAL
>> #05: 3582 Status: NORMAL
>> #06: 3688 Status: NORMAL
>> #07: 3474 Status: NORMAL
>> #08: 3664 Status: NORMAL
>> #09: 3488 Status: NORMAL
>> #10: 3658 Status: NORMAL
>> #11: 3412 Status: NORMAL
>> #12: 3682 Status: NORMAL
>> #13: 3578 Status: NORMAL
>> PSU:
>> Voltage : 11.99 V
>> Current : 15.64 A
>> State : Normal
>> LED:
>> State : GREEN
>>
>>
>>
>>
>>
>>
>> Axel Kohlmeyer wrote:
>>
>>> i just had this kind of error myself.
>>>
>>> check your GPUs with: nvidia-smi -a
>>> it could be that one of them has ECC errors, in which case NAMD
>>> (rightfully so) refuses to use the device.
>>>
>>> axel
>>>
>>> On Fri, Apr 1, 2011 at 1:32 PM, Michael S. Sellers (Cont, ARL/WMRD)
>>> <michael.s.sellers.ctr_at_us.army.mil> wrote:
>>>
>>>
>>>> All,
>>>>
>>>> I am receiving a "FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1
>>>> device 1): no CUDA-capable device is available" when NAMD starts up and
>>>> is optimizing FFT steps, for a job running on 3 nodes, 4 ppn, 2 Teslas
>>>> per node.
>>>>
>>>> The command I'm executing within a PBS script is:
>>>> ~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll sim1.conf >
>>>> $PBS_JOBNAME.out
>>>>
>>>> NAMD CUDA does not give this error on 1 node, 8ppn, 2 Teslas. Please see
>>>> output below.
>>>>
>>>> Might this be a situation where I need to use the +devices flag? It seems
>>>> as though the PEs are binding to CUDA devices on other nodes.
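
If it comes to that, +devices takes a comma-separated list of per-node device indices; something along the lines of the following, reusing the command from the original message (paths and config name as given there):

    ~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll +devices 0,1 sim1.conf > $PBS_JOBNAME.out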
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>>
>>>> Charm++> Running on 3 unique compute nodes (8-way SMP).
>>>> Charm++> cpu topology info is gathered in 0.203 seconds.
>>>> Info: NAMD CVS-2011-03-22 for Linux-x86_64-MPI-CUDA
>>>> Info:
>>>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>>>> Info: for updates, documentation, and support information.
>>>> Info:
>>>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>>>> Info: in all publications reporting results obtained with NAMD.
>>>> Info:
>>>> Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64
>>>> Info: 1 NAMD CVS-2011-03-22 Linux-x86_64-MPI-CUDA
>>>> Info: Running on 12 processors, 12 nodes, 3 physical nodes.
>>>> Info: CPU topology information available.
>>>> Info: Charm++/Converse parallel runtime startup completed at 0.204571 s
>>>> Pe 2 sharing CUDA device 0 first 0 next 0
>>>> Did not find +devices i,j,k,... argument, using all
>>>> Pe 2 physical rank 2 binding to CUDA device 0 on n2: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 3 sharing CUDA device 1 first 1 next 1
>>>> Pe 3 physical rank 3 binding to CUDA device 1 on n2: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 0 sharing CUDA device 0 first 0 next 2
>>>> Pe 0 physical rank 0 binding to CUDA device 0 on n2: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 9 sharing CUDA device 1 first 9 next 11
>>>> Pe 7 sharing CUDA device 1 first 5 next 5
>>>> Pe 5 sharing CUDA device 1 first 5 next 7
>>>> Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 5 physical rank 1 binding to CUDA device 1 on n1: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 10 sharing CUDA device 0 first 8 next 8
>>>> Pe 11 sharing CUDA device 1 first 9 next 9
>>>> Pe 8 sharing CUDA device 0 first 8 next 10
>>>> Pe 11 physical rank 3 binding to CUDA device 1 on n0: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 10 physical rank 2 binding to CUDA device 0 on n0: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 8 physical rank 0 binding to CUDA device 0 on n0: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 6 sharing CUDA device 0 first 4 next 4
>>>> Pe 6 physical rank 2 binding to CUDA device 0 on n1: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 1 sharing CUDA device 1 first 1 next 3
>>>> Pe 1 physical rank 1 binding to CUDA device 1 on n2: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Pe 4 sharing CUDA device 0 first 4 next 6
>>>> Pe 4 physical rank 0 binding to CUDA device 0 on n1: 'Tesla T10
>>>> Processor'
>>>> Mem: 4095MB Rev: 1.3
>>>> Info: 51.4492 MB of memory in use based on /proc/self/stat
>>>> ...
>>>> ...
>>>> Info: PME MAXIMUM GRID SPACING 1.5
>>>> Info: Attempting to read FFTW data from
>>>> FFTW_NAMD_CVS-2011-03-22_Linux-x86_64-MPI-CUDA.txt
>>>> Info: Optimizing 6 FFT steps. 1...FATAL ERROR: CUDA error
>>>> cudaStreamCreate
>>>> on Pe 7 (n1 device 1): no CUDA-capable device is available
>>>> ------------- Processor 7 Exiting: Called CmiAbort ------------
>>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1):
>>>> no
>>>> CUDA-capable device is available
>>>>
>>>> [7] Stack Traceback:
>>>> [7:0] CmiAbort+0x59 [0x907f64]
>>>> [7:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba]
>>>> [7:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f]
>>>> [7:3] _Z15cuda_initializev+0x2a7 [0x624e27]
>>>> [7:4] _Z11master_initiPPc+0x1a1 [0x500a11]
>>>> [7:5] main+0x19 [0x4fd489]
>>>> [7:6] __libc_start_main+0xf4 [0x32ca41d994]
>>>> [7:7] cos+0x1d1 [0x4f9d99]
>>>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no
>>>> CUDA-capable device is available
>>>> ------------- Processor 9 Exiting: Called CmiAbort ------------
>>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1):
>>>> no
>>>> CUDA-capable device is available
>>>>
>>>> [9] Stack Traceback:
>>>> [9:0] CmiAbort+0x59 [0x907f64]
>>>> [9:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba]
>>>> [9:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f]
>>>> [9:3] _Z15cuda_initializev+0x2a7 [0x624e27]
>>>> [9:4] _Z11master_initiPPc+0x1a1 [0x500a11]
>>>> [9:5] main+0x19 [0x4fd489]
>>>> [9:6] __libc_start_main+0xf4 [0x32ca41d994]
>>>> [9:7] cos+0x1d1 [0x4f9d99]
>>>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no
>>>> CUDA-capable device is available
>>>> ------------- Processor 5 Exiting: Called CmiAbort ------------
>>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1):
>>>> no
>>>> CUDA-capable device is available
>>>> ..
>>>> ..
>>>> ..
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>
>
>
>
