From: Michael S. Sellers (Cont, ARL/WMRD) (michael.s.sellers.ctr_at_us.army.mil)
Date: Fri Apr 01 2011 - 15:46:21 CDT
Axel,
Thanks for the help. See below for the output of your suggestion.
The following output, from a multi-node NAMD startup, does not seem right:
Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Should Pe 9 bind to 'CUDA device 0 on n2', given that the pool is
nodes {0-2}, PEs {0-11}, and CUDA devices {0,1} per node?
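For reference, this is how I'd tabulate the bindings reported in the startup
output (a rough sketch only; it assumes the PEs landed 4 per physical node as
in this run, and that without +devices the pick is device = physical rank mod
2; which host ends up labeled n0/n1/n2 presumably just depends on how the MPI
ranks were placed):

for pe in $(seq 0 11); do
    rank=$((pe % 4))   # physical rank of the PE within its node (4 ppn, block layout)
    dev=$((rank % 2))  # apparent default: CUDA device = physical rank mod 2
    echo "Pe $pe -> physical rank $rank -> CUDA device $dev"
done

At least for the 'physical rank N binding to CUDA device M' lines in the full
output below, that tabulation seems to match.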
-Mike
________________________________________________________________________
Output of 'nvidia-smi -r' for several nodes:
ECC is not supported by GPU 0
ECC is not supported by GPU 1
Output of 'nvidia-smi -a':
==============NVSMI LOG==============
Timestamp :
Unit 0:
Product Name : NVIDIA Tesla S1070-500
Product ID :
Serial Number :
Firmware Ver : 3.6
Intake Temperature : 15 C
GPU 0:
Product Name : Tesla T10 Processor
Serial : Not available
PCI ID : 5e710de
Bridge Port : 0
Temperature : 31 C
GPU 1:
Product Name : Tesla T10 Processor
Serial : Not available
PCI ID : 5e710de
Bridge Port : 2
Temperature : 29 C
Fan Tachs:
#00: 3636 Status: NORMAL
#01: 3462 Status: NORMAL
#02: 3664 Status: NORMAL
#03: 3376 Status: NORMAL
#04: 3598 Status: NORMAL
#05: 3582 Status: NORMAL
#06: 3688 Status: NORMAL
#07: 3474 Status: NORMAL
#08: 3664 Status: NORMAL
#09: 3488 Status: NORMAL
#10: 3658 Status: NORMAL
#11: 3412 Status: NORMAL
#12: 3682 Status: NORMAL
#13: 3578 Status: NORMAL
PSU:
Voltage : 11.99 V
Current : 15.64 A
State : Normal
LED:
State : GREEN
Axel Kohlmeyer wrote:
> i just had this kind of error myself.
>
> check your GPUs with: nvidia-smi -a
> could be that one of them has ECC errors and then NAMD
> (rightfully so) refuses to use the device.
>
> axel
>
> On Fri, Apr 1, 2011 at 1:32 PM, Michael S. Sellers (Cont, ARL/WMRD)
> <michael.s.sellers.ctr_at_us.army.mil> wrote:
>
>> All,
>>
>> I am receiving a "FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1
>> device 1): no CUDA-capable device is available" when NAMD starts up and is
>> optimizing FFT steps, for a job running on 3 nodes, 4 ppn, and 2 Teslas per
>> node.
>>
>> The command I'm executing within a PBS script is:
>> ~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll sim1.conf >
>> $PBS_JOBNAME.out
>>
>> NAMD CUDA does not give this error on 1 node, 8ppn, 2 Teslas. Please see
>> output below.
>>
>> Might this be a situation where I need to use the +devices flag? It seems
>> as though the PEs are binding to CUDA devices on other nodes.
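>> (If so, I assume it would look something along the lines of
>>
>> ~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll +devices 0,1
>> sim1.conf > $PBS_JOBNAME.out
>>
>> with the 0,1 list naming the two Teslas in each node, though I have not
>> tried that yet.)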
>>
>> Thanks,
>>
>> Mike
>>
>>
>> Charm++> Running on 3 unique compute nodes (8-way SMP).
>> Charm++> cpu topology info is gathered in 0.203 seconds.
>> Info: NAMD CVS-2011-03-22 for Linux-x86_64-MPI-CUDA
>> Info:
>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>> Info: for updates, documentation, and support information.
>> Info:
>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>> Info: in all publications reporting results obtained with NAMD.
>> Info:
>> Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64
>> Info: 1 NAMD CVS-2011-03-22 Linux-x86_64-MPI-CUDA
>> Info: Running on 12 processors, 12 nodes, 3 physical nodes.
>> Info: CPU topology information available.
>> Info: Charm++/Converse parallel runtime startup completed at 0.204571 s
>> Pe 2 sharing CUDA device 0 first 0 next 0
>> Did not find +devices i,j,k,... argument, using all
>> Pe 2 physical rank 2 binding to CUDA device 0 on n2: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 3 sharing CUDA device 1 first 1 next 1
>> Pe 3 physical rank 3 binding to CUDA device 1 on n2: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 0 sharing CUDA device 0 first 0 next 2
>> Pe 0 physical rank 0 binding to CUDA device 0 on n2: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 9 sharing CUDA device 1 first 9 next 11
>> Pe 7 sharing CUDA device 1 first 5 next 5
>> Pe 5 sharing CUDA device 1 first 5 next 7
>> Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 5 physical rank 1 binding to CUDA device 1 on n1: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 10 sharing CUDA device 0 first 8 next 8
>> Pe 11 sharing CUDA device 1 first 9 next 9
>> Pe 8 sharing CUDA device 0 first 8 next 10
>> Pe 11 physical rank 3 binding to CUDA device 1 on n0: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 10 physical rank 2 binding to CUDA device 0 on n0: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 8 physical rank 0 binding to CUDA device 0 on n0: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 6 sharing CUDA device 0 first 4 next 4
>> Pe 6 physical rank 2 binding to CUDA device 0 on n1: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 1 sharing CUDA device 1 first 1 next 3
>> Pe 1 physical rank 1 binding to CUDA device 1 on n2: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Pe 4 sharing CUDA device 0 first 4 next 6
>> Pe 4 physical rank 0 binding to CUDA device 0 on n1: 'Tesla T10 Processor'
>> Mem: 4095MB Rev: 1.3
>> Info: 51.4492 MB of memory in use based on /proc/self/stat
>> ...
>> ...
>> Info: PME MAXIMUM GRID SPACING 1.5
>> Info: Attempting to read FFTW data from
>> FFTW_NAMD_CVS-2011-03-22_Linux-x86_64-MPI-CUDA.txt
>> Info: Optimizing 6 FFT steps. 1...FATAL ERROR: CUDA error cudaStreamCreate
>> on Pe 7 (n1 device 1): no CUDA-capable device is available
>> ------------- Processor 7 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1): no
>> CUDA-capable device is available
>>
>> [7] Stack Traceback:
>> [7:0] CmiAbort+0x59 [0x907f64]
>> [7:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba]
>> [7:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f]
>> [7:3] _Z15cuda_initializev+0x2a7 [0x624e27]
>> [7:4] _Z11master_initiPPc+0x1a1 [0x500a11]
>> [7:5] main+0x19 [0x4fd489]
>> [7:6] __libc_start_main+0xf4 [0x32ca41d994]
>> [7:7] cos+0x1d1 [0x4f9d99]
>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no
>> CUDA-capable device is available
>> ------------- Processor 9 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no
>> CUDA-capable device is available
>>
>> [9] Stack Traceback:
>> [9:0] CmiAbort+0x59 [0x907f64]
>> [9:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba]
>> [9:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f]
>> [9:3] _Z15cuda_initializev+0x2a7 [0x624e27]
>> [9:4] _Z11master_initiPPc+0x1a1 [0x500a11]
>> [9:5] main+0x19 [0x4fd489]
>> [9:6] __libc_start_main+0xf4 [0x32ca41d994]
>> [9:7] cos+0x1d1 [0x4f9d99]
>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no
>> CUDA-capable device is available
>> ------------- Processor 5 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no
>> CUDA-capable device is available
>> ..
>> ..
>> ..
>>
>>
>>
>
>
>
>