Re: Multi node run causes "CUDA error cudaStreamCreate"

From: Michael S. Sellers (Cont, ARL/WMRD) (michael.s.sellers.ctr_at_us.army.mil)
Date: Fri Apr 01 2011 - 15:46:21 CDT

Axel,

Thanks for the help. The output of the commands you suggested is below.

The following lines from a multi-node NAMD startup do not seem right:

Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3

Should Pe 9 bind to 'CUDA device 0 on n2'? The pool is
node{0-2}, Pe{0-11}, CUDA device{0,1} per node.
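
One way to check the binding question is to tabulate the "binding to CUDA device" lines from the startup log quoted further down. The Python sketch below is only an illustration. The names LOG, parse_bindings, group_by_node_device and DEVICES_PER_NODE are made up for this example, the rule "device = physical rank mod 2" is a guess read off the log rather than taken from the NAMD source, and the wrapped "Mem: ... Rev: ..." tails are dropped.

```python
import re

# Binding lines copied from the NAMD startup log quoted in this thread
# (the wrapped "Mem: ... Rev: ..." tails are omitted).
LOG = """\
Pe 2 physical rank 2 binding to CUDA device 0 on n2
Pe 3 physical rank 3 binding to CUDA device 1 on n2
Pe 0 physical rank 0 binding to CUDA device 0 on n2
Pe 9 physical rank 1 binding to CUDA device 1 on n0
Pe 7 physical rank 3 binding to CUDA device 1 on n1
Pe 5 physical rank 1 binding to CUDA device 1 on n1
Pe 11 physical rank 3 binding to CUDA device 1 on n0
Pe 10 physical rank 2 binding to CUDA device 0 on n0
Pe 8 physical rank 0 binding to CUDA device 0 on n0
Pe 6 physical rank 2 binding to CUDA device 0 on n1
Pe 1 physical rank 1 binding to CUDA device 1 on n2
Pe 4 physical rank 0 binding to CUDA device 0 on n1
"""

BINDING = re.compile(
    r"Pe (\d+) physical rank (\d+) binding to CUDA device (\d+) on (\w+)"
)

# Assumption: two Teslas per node, as stated in the original post.
DEVICES_PER_NODE = 2


def parse_bindings(text):
    """Return (pe, physical_rank, device, node) tuples found in the log text."""
    return [
        (int(pe), int(rank), int(dev), node)
        for pe, rank, dev, node in BINDING.findall(text)
    ]


def group_by_node_device(bindings):
    """Map node -> device -> sorted list of Pe numbers bound to that device."""
    grouped = {}
    for pe, _rank, dev, node in bindings:
        grouped.setdefault(node, {}).setdefault(dev, []).append(pe)
    for devices in grouped.values():
        for pes in devices.values():
            pes.sort()
    return grouped


bindings = parse_bindings(LOG)

# Hypothesis (not confirmed against NAMD): device = physical rank mod devices.
mismatches = [b for b in bindings if b[2] != b[1] % DEVICES_PER_NODE]

by_node = group_by_node_device(bindings)
for node in sorted(by_node):
    for dev in sorted(by_node[node]):
        print(f"{node} device {dev}: Pe {by_node[node][dev]}")
print("bindings that break the rank-mod-2 guess:", mismatches)
```

On this log the grouping puts Pe 8-11 on n0, Pe 4-7 on n1 and Pe 0-3 on n2, and no binding breaks the rank-mod-2 guess. If that reading is right, the node labels simply do not follow Pe order, each Pe binds to a device on its own node, and the cudaStreamCreate failure would have some other cause.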

-Mike

________________________________________________________________________

Output of 'nvidia-smi -r' for several nodes:

ECC is not supported by GPU 0
ECC is not supported by GPU 1

Output of 'nvidia-smi -a':

==============NVSMI LOG==============

Timestamp :
Unit 0:
        Product Name : NVIDIA Tesla S1070 -500
        Product ID :
        Serial Number :
        Firmware Ver : 3.6
        Intake Temperature : 15 C
        GPU 0:
                Product Name : Tesla T10 Processor
                Serial : Not available
                PCI ID : 5e710de
                Bridge Port : 0
                Temperature : 31 C
        GPU 1:
                Product Name : Tesla T10 Processor
                Serial : Not available
                PCI ID : 5e710de
                Bridge Port : 2
                Temperature : 29 C
        Fan Tachs:
                #00: 3636 Status: NORMAL
                #01: 3462 Status: NORMAL
                #02: 3664 Status: NORMAL
                #03: 3376 Status: NORMAL
                #04: 3598 Status: NORMAL
                #05: 3582 Status: NORMAL
                #06: 3688 Status: NORMAL
                #07: 3474 Status: NORMAL
                #08: 3664 Status: NORMAL
                #09: 3488 Status: NORMAL
                #10: 3658 Status: NORMAL
                #11: 3412 Status: NORMAL
                #12: 3682 Status: NORMAL
                #13: 3578 Status: NORMAL
        PSU:
                Voltage : 11.99 V
                Current : 15.64 A
                State : Normal
        LED:
                State : GREEN

Axel Kohlmeyer wrote:
> i just had this kind of error myself.
>
> check your GPUs with: nvidia-smi -a
> could be that one of them has ECC errors and then NAMD
> (rightfully so) refuses to use the device.
>
> axel
>
> On Fri, Apr 1, 2011 at 1:32 PM, Michael S. Sellers (Cont, ARL/WMRD)
> <michael.s.sellers.ctr_at_us.army.mil> wrote:
>
>> All,
>>
>> I am receiving a "FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1
>> device 1): no CUDA-capable device is available" when NAMD starts up and is
>> optimizing FFT steps, for a job running on 3 nodes, 4ppn, 2 Teslas per
>> node.
>>
>> The command I'm executing within a PBS script is:
>> ~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll sim1.conf >
>> $PBS_JOBNAME.out
>>
>> NAMD CUDA does not give this error on 1 node, 8ppn, 2 Teslas. Please see
>> output below.
>>
>> Might this be a situation where I need to use the +devices flag? It seems
>> as though the PEs are binding to CUDA devices on other nodes.
>>
>> Thanks,
>>
>> Mike
>>
>>
>> Charm++> Running on 3 unique compute nodes (8-way SMP).
>> Charm++> cpu topology info is gathered in 0.203 seconds.
>> Info: NAMD CVS-2011-03-22 for Linux-x86_64-MPI-CUDA
>> Info:
>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>> Info: for updates, documentation, and support information.
>> Info:
>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>> Info: in all publications reporting results obtained with NAMD.
>> Info:
>> Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64
>> Info: 1 NAMD CVS-2011-03-22 Linux-x86_64-MPI-CUDA
>> Info: Running on 12 processors, 12 nodes, 3 physical nodes.
>> Info: CPU topology information available.
>> Info: Charm++/Converse parallel runtime startup completed at 0.204571 s
>> Pe 2 sharing CUDA device 0 first 0 next 0
>> Did not find +devices i,j,k,... argument, using all
>> Pe 2 physical rank 2 binding to CUDA device 0 on n2: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 3 sharing CUDA device 1 first 1 next 1
>> Pe 3 physical rank 3 binding to CUDA device 1 on n2: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 0 sharing CUDA device 0 first 0 next 2
>> Pe 0 physical rank 0 binding to CUDA device 0 on n2: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 9 sharing CUDA device 1 first 9 next 11
>> Pe 7 sharing CUDA device 1 first 5 next 5
>> Pe 5 sharing CUDA device 1 first 5 next 7
>> Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 5 physical rank 1 binding to CUDA device 1 on n1: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 10 sharing CUDA device 0 first 8 next 8
>> Pe 11 sharing CUDA device 1 first 9 next 9
>> Pe 8 sharing CUDA device 0 first 8 next 10
>> Pe 11 physical rank 3 binding to CUDA device 1 on n0: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 10 physical rank 2 binding to CUDA device 0 on n0: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 8 physical rank 0 binding to CUDA device 0 on n0: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 6 sharing CUDA device 0 first 4 next 4
>> Pe 6 physical rank 2 binding to CUDA device 0 on n1: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 1 sharing CUDA device 1 first 1 next 3
>> Pe 1 physical rank 1 binding to CUDA device 1 on n2: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 4 sharing CUDA device 0 first 4 next 6
>> Pe 4 physical rank 0 binding to CUDA device 0 on n1: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Info: 51.4492 MB of memory in use based on /proc/self/stat
>> ...
>> ...
>> Info: PME MAXIMUM GRID SPACING 1.5
>> Info: Attempting to read FFTW data from
>> FFTW_NAMD_CVS-2011-03-22_Linux-x86_64-MPI-CUDA.txt
>> Info: Optimizing 6 FFT steps. 1...FATAL ERROR: CUDA error cudaStreamCreate
>> on Pe 7 (n1 device 1): no CUDA-capable device is available
>> ------------- Processor 7 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1): no
>> CUDA-capable device is available
>>
>> [7] Stack Traceback:
>> [7:0] CmiAbort+0x59 [0x907f64]
>> [7:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba]
>> [7:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f]
>> [7:3] _Z15cuda_initializev+0x2a7 [0x624e27]
>> [7:4] _Z11master_initiPPc+0x1a1 [0x500a11]
>> [7:5] main+0x19 [0x4fd489]
>> [7:6] __libc_start_main+0xf4 [0x32ca41d994]
>> [7:7] cos+0x1d1 [0x4f9d99]
>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no
>> CUDA-capable device is available
>> ------------- Processor 9 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no
>> CUDA-capable device is available
>>
>> [9] Stack Traceback:
>> [9:0] CmiAbort+0x59 [0x907f64]
>> [9:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba]
>> [9:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f]
>> [9:3] _Z15cuda_initializev+0x2a7 [0x624e27]
>> [9:4] _Z11master_initiPPc+0x1a1 [0x500a11]
>> [9:5] main+0x19 [0x4fd489]
>> [9:6] __libc_start_main+0xf4 [0x32ca41d994]
>> [9:7] cos+0x1d1 [0x4f9d99]
>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no
>> CUDA-capable device is available
>> ------------- Processor 5 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no
>> CUDA-capable device is available
>> ..
>> ..
>> ..
>>
>>
>>
>
>
>
>
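
The fatal errors in the quoted log above can also be tallied by node and device, to see whether the failures cluster on particular GPUs. The Python sketch below is an illustration only. ERRORS holds the error lines from the log with the mail wrapping undone, the names FATAL and failing_devices are made up for this example, and the log is truncated ("..") so the tally may be incomplete.

```python
import re
from collections import Counter

# Error lines copied from the quoted log, unwrapped. The log prints each
# failure twice (once bare, once after "Reason:"); one repeat is kept here
# to show that duplicates are not double counted.
ERRORS = """\
FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1): no CUDA-capable device is available
Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1): no CUDA-capable device is available
FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no CUDA-capable device is available
FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no CUDA-capable device is available
"""

FATAL = re.compile(
    r"FATAL ERROR: CUDA error (\w+) on Pe (\d+) \((\w+) device (\d+)\)"
)


def failing_devices(text):
    """Count distinct failing Pes per (node, device) pair."""
    seen = set()
    counts = Counter()
    for _call, pe, node, dev in FATAL.findall(text):
        key = (int(pe), node, int(dev))
        if key in seen:
            continue  # same failure repeated on the "Reason:" line
        seen.add(key)
        counts[(node, int(dev))] += 1
    return counts


for (node, dev), n in sorted(failing_devices(ERRORS).items()):
    print(f"{node} device {dev}: {n} failing Pe(s)")
```

In the part of the log that was posted, every failure is on device 1 (two Pes on n1, one on n0). That may point at a problem with the second GPU on those nodes, or with several processes sharing one device, rather than at the Pe-to-node mapping, but the truncated log does not settle it.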

