Re: Multi node run causes "CUDA error cudaStreamCreate"

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Fri Apr 01 2011 - 16:10:00 CDT

On Fri, Apr 1, 2011 at 4:46 PM, Michael S. Sellers (Cont, ARL/WMRD)
<michael.s.sellers.ctr_at_us.army.mil> wrote:
> Axel,
>
> Thanks for the help.  See below for the output of your suggestion.
>
> The following, from a multi node NAMD startup does not seem right:
>
> Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10 Processor'
> Mem: 4095MB  Rev: 1.3
> Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10 Processor'
> Mem: 4095MB  Rev: 1.3
>
> Should Pe 9 bind to 'CUDA device 0 on n2' ?  Where the pool is node{0-2},
> Pe{0-11}, CUDA device{0,1}/node.

no. the sharing seems right.
with two devices per node, all even PEs should get device 0
and the odd PEs get device 1.
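
as a rough sketch of that assignment (an assumption read off the startup log, not NAMD's actual code), each process takes its physical rank on the node modulo the number of devices. the function name device_for_rank is made up for illustration:

```python
def device_for_rank(physical_rank: int, devices_per_node: int = 2) -> int:
    """Round-robin sketch: rank on the node modulo the device count.

    This mirrors what the log shows; it is not taken from the NAMD source.
    """
    return physical_rank % devices_per_node


# (Pe, physical rank) pairs copied from the startup log for node n0
for pe, rank in [(8, 0), (9, 1), (10, 2), (11, 3)]:
    print(f"Pe {pe} (physical rank {rank}) -> device {device_for_rank(rank)}")
```

with 4 processes per node this gives devices 0, 1, 0, 1, which matches the "binding to CUDA device" lines in the log.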

ecc is not an issue either, since you have a GT200 device without ECC support.

the next possibility is that the GPUs are configured for
"compute-exclusive" mode, i.e. only one process at a
time can use a GPU. some admins set this when the batch
system allows multiple jobs onto a node, to reduce the risk
of two jobs accessing the same GPU while others sit idle.
with 4 processes sharing 2 GPUs per node, any process after
the first on a given GPU would then be refused.
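
to see which devices would be affected, here is a small sketch that counts how many processes land on each device. it assumes the same round-robin mapping seen in the log (rank on the node modulo device count); the function names are made up for illustration, and nothing here queries the real driver:

```python
from collections import Counter


def processes_per_device(ppn: int, devices_per_node: int) -> Counter:
    """Count processes per device, assuming rank % devices_per_node."""
    return Counter(rank % devices_per_node for rank in range(ppn))


def shared_devices(ppn: int, devices_per_node: int) -> list:
    """Devices that more than one process would try to open."""
    counts = processes_per_device(ppn, devices_per_node)
    return sorted(dev for dev, n in counts.items() if n > 1)


print(shared_devices(4, 2))  # the 4 ppn, 2 GPU layout from this thread
print(shared_devices(2, 2))  # one process per GPU
```

any device listed by shared_devices would be a problem if compute-exclusive mode is on. running one process per GPU leaves the list empty.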

cheers,
    axel.

>
> -Mike
>
> ________________________________________________________________________
>
> Output of 'nvidia-smi -r' for several nodes:
>
> ECC is not supported by GPU 0
> ECC is not supported by GPU 1
>
>
> Output of 'nvidia-smi -a':
>
> ==============NVSMI LOG==============
>
>
> Timestamp                       :
> Unit 0:
>       Product Name            : NVIDIA Tesla S1070 -500
>       Product ID              :
>       Serial Number           :
>       Firmware Ver            : 3.6
>       Intake Temperature      : 15 C
>       GPU 0:
>               Product Name    : Tesla T10 Processor
>               Serial          : Not available
>               PCI ID          : 5e710de
>               Bridge Port     : 0
>               Temperature     : 31 C
>       GPU 1:
>               Product Name    : Tesla T10 Processor
>               Serial          : Not available
>               PCI ID          : 5e710de
>               Bridge Port     : 2
>               Temperature     : 29 C
>       Fan Tachs:
>               #00: 3636 Status: NORMAL
>               #01: 3462 Status: NORMAL
>               #02: 3664 Status: NORMAL
>               #03: 3376 Status: NORMAL
>               #04: 3598 Status: NORMAL
>               #05: 3582 Status: NORMAL
>               #06: 3688 Status: NORMAL
>               #07: 3474 Status: NORMAL
>               #08: 3664 Status: NORMAL
>               #09: 3488 Status: NORMAL
>               #10: 3658 Status: NORMAL
>               #11: 3412 Status: NORMAL
>               #12: 3682 Status: NORMAL
>               #13: 3578 Status: NORMAL
>       PSU:
>               Voltage         :   11.99 V
>               Current         :   15.64 A
>               State           : Normal
>       LED:
>               State           : GREEN
>
>
>
>
>
>
> Axel Kohlmeyer wrote:
>>
>> i just had this kind of error myself.
>>
>> check your GPUs with: nvidia-smi -a
>> could be that one of them has ECC errors and then NAMD
>> (rightfully so) refuses to use the device.
>>
>> axel
>>
>> On Fri, Apr 1, 2011 at 1:32 PM, Michael S. Sellers (Cont, ARL/WMRD)
>> <michael.s.sellers.ctr_at_us.army.mil> wrote:
>>
>>>
>>> All,
>>>
>>> I am receiving a "FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1
>>> device 1): no CUDA-capable device is available" when NAMD starts up and is
>>> optimizing FFT steps, for a job running on 3 nodes, 4ppn, 2 Teslas per
>>> node.
>>>
>>> The command I'm executing within a PBS script is:
>>> ~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll sim1.conf > $PBS_JOBNAME.out
>>>
>>> NAMD CUDA does not give this error on 1 node, 8ppn, 2 Teslas.  Please see
>>> output below.
>>>
>>> Might this be a situation where I need to use the +devices flag?  It seems
>>> as though the PEs are binding to CUDA devices on other nodes.
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>>
>>> Charm++> Running on 3 unique compute nodes (8-way SMP).
>>> Charm++> cpu topology info is gathered in 0.203 seconds.
>>> Info: NAMD CVS-2011-03-22 for Linux-x86_64-MPI-CUDA
>>> Info:
>>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>>> Info: for updates, documentation, and support information.
>>> Info:
>>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>>> Info: in all publications reporting results obtained with NAMD.
>>> Info:
>>> Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64
>>> Info: 1 NAMD  CVS-2011-03-22  Linux-x86_64-MPI-CUDA
>>> Info: Running on 12 processors, 12 nodes, 3 physical nodes.
>>> Info: CPU topology information available.
>>> Info: Charm++/Converse parallel runtime startup completed at 0.204571 s
>>> Pe 2 sharing CUDA device 0 first 0 next 0
>>> Did not find +devices i,j,k,... argument, using all
>>> Pe 2 physical rank 2 binding to CUDA device 0 on n2: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 3 sharing CUDA device 1 first 1 next 1
>>> Pe 3 physical rank 3 binding to CUDA device 1 on n2: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 0 sharing CUDA device 0 first 0 next 2
>>> Pe 0 physical rank 0 binding to CUDA device 0 on n2: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 9 sharing CUDA device 1 first 9 next 11
>>> Pe 7 sharing CUDA device 1 first 5 next 5
>>> Pe 5 sharing CUDA device 1 first 5 next 7
>>> Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 5 physical rank 1 binding to CUDA device 1 on n1: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 10 sharing CUDA device 0 first 8 next 8
>>> Pe 11 sharing CUDA device 1 first 9 next 9
>>> Pe 8 sharing CUDA device 0 first 8 next 10
>>> Pe 11 physical rank 3 binding to CUDA device 1 on n0: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 10 physical rank 2 binding to CUDA device 0 on n0: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 8 physical rank 0 binding to CUDA device 0 on n0: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 6 sharing CUDA device 0 first 4 next 4
>>> Pe 6 physical rank 2 binding to CUDA device 0 on n1: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 1 sharing CUDA device 1 first 1 next 3
>>> Pe 1 physical rank 1 binding to CUDA device 1 on n2: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Pe 4 sharing CUDA device 0 first 4 next 6
>>> Pe 4 physical rank 0 binding to CUDA device 0 on n1: 'Tesla T10 Processor'
>>>  Mem: 4095MB  Rev: 1.3
>>> Info: 51.4492 MB of memory in use based on /proc/self/stat
>>> ...
>>> ...
>>> Info: PME MAXIMUM GRID SPACING    1.5
>>> Info: Attempting to read FFTW data from FFTW_NAMD_CVS-2011-03-22_Linux-x86_64-MPI-CUDA.txt
>>> Info: Optimizing 6 FFT steps.  1...FATAL ERROR: CUDA error cudaStreamCreate
>>> on Pe 7 (n1 device 1): no CUDA-capable device is available
>>> ------------- Processor 7 Exiting: Called CmiAbort ------------
>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1):
>>> no CUDA-capable device is available
>>>
>>> [7] Stack Traceback:
>>>  [7:0] CmiAbort+0x59  [0x907f64]
>>>  [7:1] _Z8NAMD_diePKc+0x4a  [0x4fa7ba]
>>>  [7:2] _Z13cuda_errcheckPKc+0xdf  [0x624b5f]
>>>  [7:3] _Z15cuda_initializev+0x2a7  [0x624e27]
>>>  [7:4] _Z11master_initiPPc+0x1a1  [0x500a11]
>>>  [7:5] main+0x19  [0x4fd489]
>>>  [7:6] __libc_start_main+0xf4  [0x32ca41d994]
>>>  [7:7] cos+0x1d1  [0x4f9d99]
>>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no
>>> CUDA-capable device is available
>>> ------------- Processor 9 Exiting: Called CmiAbort ------------
>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1):
>>> no CUDA-capable device is available
>>>
>>> [9] Stack Traceback:
>>>  [9:0] CmiAbort+0x59  [0x907f64]
>>>  [9:1] _Z8NAMD_diePKc+0x4a  [0x4fa7ba]
>>>  [9:2] _Z13cuda_errcheckPKc+0xdf  [0x624b5f]
>>>  [9:3] _Z15cuda_initializev+0x2a7  [0x624e27]
>>>  [9:4] _Z11master_initiPPc+0x1a1  [0x500a11]
>>>  [9:5] main+0x19  [0x4fd489]
>>>  [9:6] __libc_start_main+0xf4  [0x32ca41d994]
>>>  [9:7] cos+0x1d1  [0x4f9d99]
>>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no
>>> CUDA-capable device is available
>>> ------------- Processor 5 Exiting: Called CmiAbort ------------
>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1):
>>> no CUDA-capable device is available
>>> ..
>>> ..
>>> ..
>>>
>>>
>>>
>>
>>
>>
>>
>

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
