Multi node run causes "CUDA error cudaStreamCreate"

From: Michael S. Sellers (Cont, ARL/WMRD) (michael.s.sellers.ctr_at_us.army.mil)
Date: Fri Apr 01 2011 - 12:32:35 CDT

All,

I am receiving a "FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1
device 1): no CUDA-capable device is available" when NAMD starts up and
is optimizing FFT steps, for a job running on 3 nodes, 4ppn, 2 Tesla's
per node.

The command I'm executing within a PBS script is:
~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll sim1.conf > $PBS_JOBNAME.out
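
For context, the PBS script is essentially a minimal wrapper around that
command, along the lines of the sketch below (the job name and resource
request here are representative, not copied verbatim from my script):

#!/bin/bash
#PBS -N sim1
#PBS -l nodes=3:ppn=4
cd $PBS_O_WORKDIR
~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll sim1.conf > $PBS_JOBNAME.out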

NAMD CUDA does not give this error on 1 node, 8ppn, 2 Teslas. Please
see output below.

Might this be a situation where I need to use the +devices flag? It
seems as though the PEs are binding to CUDA devices on other nodes.
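If so, I would guess the invocation becomes something like the line below,
listing the local device indices explicitly (syntax inferred from the
"Did not find +devices i,j,k,... argument" message in the startup output,
so treat it as a guess):

~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll +devices 0,1 sim1.conf > $PBS_JOBNAME.out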

Thanks,

Mike

Charm++> Running on 3 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.203 seconds.
Info: NAMD CVS-2011-03-22 for Linux-x86_64-MPI-CUDA
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64
Info: 1 NAMD CVS-2011-03-22 Linux-x86_64-MPI-CUDA
Info: Running on 12 processors, 12 nodes, 3 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.204571 s
Pe 2 sharing CUDA device 0 first 0 next 0
Did not find +devices i,j,k,... argument, using all
Pe 2 physical rank 2 binding to CUDA device 0 on n2: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 3 sharing CUDA device 1 first 1 next 1
Pe 3 physical rank 3 binding to CUDA device 1 on n2: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 0 sharing CUDA device 0 first 0 next 2
Pe 0 physical rank 0 binding to CUDA device 0 on n2: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 9 sharing CUDA device 1 first 9 next 11
Pe 7 sharing CUDA device 1 first 5 next 5
Pe 5 sharing CUDA device 1 first 5 next 7
Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 5 physical rank 1 binding to CUDA device 1 on n1: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 10 sharing CUDA device 0 first 8 next 8
Pe 11 sharing CUDA device 1 first 9 next 9
Pe 8 sharing CUDA device 0 first 8 next 10
Pe 11 physical rank 3 binding to CUDA device 1 on n0: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 10 physical rank 2 binding to CUDA device 0 on n0: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 8 physical rank 0 binding to CUDA device 0 on n0: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 6 sharing CUDA device 0 first 4 next 4
Pe 6 physical rank 2 binding to CUDA device 0 on n1: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 1 sharing CUDA device 1 first 1 next 3
Pe 1 physical rank 1 binding to CUDA device 1 on n2: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Pe 4 sharing CUDA device 0 first 4 next 6
Pe 4 physical rank 0 binding to CUDA device 0 on n1: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3
Info: 51.4492 MB of memory in use based on /proc/self/stat
...
...
Info: PME MAXIMUM GRID SPACING 1.5
Info: Attempting to read FFTW data from FFTW_NAMD_CVS-2011-03-22_Linux-x86_64-MPI-CUDA.txt
Info: Optimizing 6 FFT steps. 1...FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1): no CUDA-capable device is available
------------- Processor 7 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1): no CUDA-capable device is available

[7] Stack Traceback:
  [7:0] CmiAbort+0x59 [0x907f64]
  [7:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba]
  [7:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f]
  [7:3] _Z15cuda_initializev+0x2a7 [0x624e27]
  [7:4] _Z11master_initiPPc+0x1a1 [0x500a11]
  [7:5] main+0x19 [0x4fd489]
  [7:6] __libc_start_main+0xf4 [0x32ca41d994]
  [7:7] cos+0x1d1 [0x4f9d99]
FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no CUDA-capable device is available
------------- Processor 9 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no CUDA-capable device is available

[9] Stack Traceback:
  [9:0] CmiAbort+0x59 [0x907f64]
  [9:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba]
  [9:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f]
  [9:3] _Z15cuda_initializev+0x2a7 [0x624e27]
  [9:4] _Z11master_initiPPc+0x1a1 [0x500a11]
  [9:5] main+0x19 [0x4fd489]
  [9:6] __libc_start_main+0xf4 [0x32ca41d994]
  [9:7] cos+0x1d1 [0x4f9d99]
FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no CUDA-capable device is available
------------- Processor 5 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no CUDA-capable device is available
..
..
..
