Fwd: cuda error cudastreamcreate. SOLVED (probably)

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Wed Jun 15 2011 - 09:43:28 CDT

The earlier "probably" was too optimistic: the problems are not solved.
As far as I can understand, the graphics cards are sometimes seen and
sometimes not.

The simulation (pressure equilibration) completed successfully. The
next run (just a continuation of the previous pressure equilibration)
failed, again with 'Device Emulation (CPU)'; see the log file below. A
further attempt gave the same error.

# modinfo nvidia
filename: /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
alias: char-major-195-*
supported: external
license: NVIDIA
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: i2c-core
vermagic: 2.6.38-2-amd64 SMP mod_unload modversions
parm: NVreg_EnableVia4x:int
parm: NVreg_EnableALiAGP:int
parm: NVreg_ReqAGPRate:int
parm: NVreg_EnableAGPSBA:int
parm: NVreg_EnableAGPFW:int
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_RemapLimit:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UseVBios:int
parm: NVreg_RMEdgeIntrCheck:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_EnableMSI:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RmMsg:charp
parm: NVreg_NvAGP:int

However:

$ nvidia-smi -L
Could not open device /dev/nvidia1 (no such file)
Failed to initialize NVML: unknown error.

I am unable to draw technical conclusions from this 'unknown error'. I
wonder whether other information can be extracted to fix the problems.
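
One way to extract a little more information is to check which of the
device files the driver expects are actually present. The sketch below
is only an illustration: it assumes the conventional node names
/dev/nvidia0, /dev/nvidia1, ... and /dev/nvidiactl, and the function name
missing_nvidia_nodes is hypothetical, not part of any NVIDIA or NAMD tool.

```shell
#!/bin/sh
# List the NVIDIA device nodes that are expected but absent.
# Usage: missing_nvidia_nodes NUM_GPUS [DEV_DIR]
# DEV_DIR defaults to /dev; it is a parameter only so that the function
# can be tried against a scratch directory (an assumption of this sketch).
missing_nvidia_nodes() {
    num_gpus=$1
    dev_dir=${2:-/dev}
    i=0
    # One node per GPU: nvidia0, nvidia1, ...
    while [ "$i" -lt "$num_gpus" ]; do
        [ -e "$dev_dir/nvidia$i" ] || echo "nvidia$i"
        i=$((i + 1))
    done
    # Plus the single control node shared by all GPUs.
    [ -e "$dev_dir/nvidiactl" ] || echo "nvidiactl"
}
```

Calling "missing_nvidia_nodes 2" just before charmrun prints nothing when
all three nodes exist. If nodes turn out to be missing on a machine where
no X server has been started, one thing that may be worth trying (as
root) is "modprobe nvidia" followed by creating the nodes by hand, for
example "mknod -m 666 /dev/nvidia1 c 195 1" and
"mknod -m 666 /dev/nvidiactl c 195 255" (major 195 as in the
char-major-195-* alias reported by modinfo). Whether that is the cause
here is not established.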

Thanks for advice (and your patience in following this thread).

francesco

---------- Forwarded message ----------
From: Francesco Pietra <chiendarret_at_gmail.com>
Date: Wed, Jun 15, 2011 at 9:45 AM
Subject: Fwd: namd-l: cuda error cudastreamcreate. SOLVED (probably)
To: Jim Phillips <jim_at_ks.uiuc.edu>, Ajasja Ljubetič
<ajasja.ljubetic_at_gmail.com>, NAMD <namd-l_at_ks.uiuc.edu>

IT MAY BE OF INTEREST TO NAMD/DEBIAN USERS. HOWEVER, THE END QUESTION
BELOW (IN UPPERCASE) IS DIRECTED SPECIFICALLY TO NAMD

As suggested by Lennart Sorensen at "amd64_at_lists.debian.org", my
problem was the presence of an nvidia driver at
/lib/modules/2.6.38-2-amd64/updates/dkms/, which prevented rebuilding.
With the two commands below, the correct driver, dated 15 June 2011, was
built for my Linux headers.

apt-get remove nvidia-kernel-dkms (which also removes nvidia.ko)

apt-get install nvidia-kernel-dkms
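
To confirm that the rebuilt module matches the running kernel, the
vermagic string reported by modinfo can be compared with the output of
uname -r. A minimal sketch: the function name vermagic_matches is
hypothetical, and it assumes that the kernel release is the first
space-separated field of vermagic, as in the modinfo output in this
message.

```shell
#!/bin/sh
# Succeed if the first field of a vermagic string equals a kernel release.
# Usage: vermagic_matches "VERMAGIC STRING" "KERNEL_RELEASE"
vermagic_matches() {
    # Drop everything from the first space onward, keeping the release.
    module_release=${1%% *}
    [ "$module_release" = "$2" ]
}
```

For example:

vermagic_matches "$(modinfo -F vermagic nvidia)" "$(uname -r)" \
    || echo "nvidia.ko was built for a different kernel"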

Debian amd64 wheezy packages installed were:

gcc-4.4, 4.5, 4.6
libcuda1 270.41.19-1
libgl1-nvidia-glx 270.41.19-1
libnvidia-ml1 270.41.19-1
linux-headers-2.6-amd64  (2.6.38+34)
linux-headers-2.6.38-2-amd64  (2.6.38-5)
linux-headers-2.6.38-2-common (2.6.38-5)
linux-image-2.6-amd64 (2.6.38+34)
linux-image-2.6.38-2-amd64 (2.6.38-5)
linux-kbuild-2.6.38 (2.6.38-1)
nvidia-cuda-dev 3.2.16-2
nvidia-cuda-toolkit 3.2.16-2
nvidia-glx 270.41.19-1
nvidia-installer-cleanup 20110515+1
nvidia-kernel-common 20110515+1
nvidia-kernel-dkms 270.41.19-1
nvidia-smi 270.41.19-1

Now:

$ nvidia-smi -L
GPU 0: GeForce GTX 470 (UUID: N/A)
GPU 1: GeForce GTX 470 (UUID: N/A)

# modinfo nvidia
filename:       /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
alias:          char-major-195-*
supported:      external
license:        NVIDIA
alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias:          pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        i2c-core
vermagic:       2.6.38-2-amd64 SMP mod_unload modversions
parm:           NVreg_EnableVia4x:int
parm:           NVreg_EnableALiAGP:int
parm:           NVreg_ReqAGPRate:int
parm:           NVreg_EnableAGPSBA:int
parm:           NVreg_EnableAGPFW:int
parm:           NVreg_Mobile:int
parm:           NVreg_ResmanDebugLevel:int
parm:           NVreg_RmLogonRC:int
parm:           NVreg_ModifyDeviceFiles:int
parm:           NVreg_DeviceFileUID:int
parm:           NVreg_DeviceFileGID:int
parm:           NVreg_DeviceFileMode:int
parm:           NVreg_RemapLimit:int
parm:           NVreg_UpdateMemoryTypes:int
parm:           NVreg_InitializeSystemMemoryAllocations:int
parm:           NVreg_UseVBios:int
parm:           NVreg_RMEdgeIntrCheck:int
parm:           NVreg_UsePageAttributeTable:int
parm:           NVreg_EnableMSI:int
parm:           NVreg_MapRegistersEarly:int
parm:           NVreg_RegisterForACPIEvents:int
parm:           NVreg_RegistryDwords:charp
parm:           NVreg_RmMsg:charp
parm:           NVreg_NvAGP:int

With these settings, the NAMD simulation

charmrun $NAMD_HOME/bin/namd2 ++local +p6 +idlepoll ++verbose
filename.conf 2>&1 | tee filename.log

(NAMD_CVS-2011-06-04_Linux-x86_64-CUDA.tar.gz) started correctly,
using both GTX 470 cards, and ran overnight.

This morning, a second run to continue the previous pressure
equilibration (using commands from the console history; the machine has
only an X server, no desktop, and the X server had not been started)
failed to start, with log:

Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
Info: 1 NAMD  CVS-2011-06-04  Linux-x86_64-CUDA  6    gig64  francesco
Info: Running on 6 processors, 6 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00989103 s
Pe 2 sharing CUDA device 0 first 0 next 3
Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'Device
Emulation (CPU)'  Mem: 0MB  Rev: 9999.9999
FATAL ERROR: CUDA error cudaStreamCreate on Pe 2 (gig64 device 0): no
CUDA-capable device is available

where 'Device Emulation (CPU)', instead of GTX 470, indicates the
failure. After some info commands, as above, a second attempt at the
NAMD simulation started normally:

Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
Info: 1 NAMD  CVS-2011-06-04  Linux-x86_64-CUDA  6    gig64  francesco
Info: Running on 6 processors, 6 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00345588 s
Did not find +devices i,j,k,... argument, using all
Pe 0 sharing CUDA device 0 first 0 next 2
Pe 1 sharing CUDA device 1 first 1 next 3
Pe 1 physical rank 1 binding to CUDA device 1 on gig64: 'GeForce GTX
470'  Mem: 1279MB  Rev: 2.0
Pe 0 physical rank 0 binding to CUDA device 0 on gig64: 'GeForce GTX
470'  Mem: 1279MB  Rev: 2.0
Pe 3 sharing CUDA device 1 first 1 next 5
Pe 2 sharing CUDA device 0 first 0 next 4
Pe 3 physical rank 3 binding to CUDA device 1 on gig64: 'GeForce GTX
470'  Mem: 1279MB  Rev: 2.0
Pe 5 sharing CUDA device 1 first 1 next 1
Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX
470'  Mem: 1279MB  Rev: 2.0
Pe 5 physical rank 5 binding to CUDA device 1 on gig64: 'GeForce GTX
470'  Mem: 1279MB  Rev: 2.0
Pe 4 sharing CUDA device 0 first 0 next 0
Pe 4 physical rank 4 binding to CUDA device 0 on gig64: 'GeForce GTX
470'  Mem: 1279MB  Rev: 2.0
Info: 1.64104 MB of memory in use based on CmiMemoryUsage
Info: Configuration file is press-04.conf
Info: Working in the current directory
/home/francesco/3b.complex_press04_NAF++/mod1.4
TCL: Suspending until startup complete.

QUESTION TO NAMD:
What does 'Device Emulation (CPU)' mean in the log output "Pe 2 physical
rank 2 binding to CUDA device 0 on gig64: 'Device Emulation (CPU)'  Mem:
0MB  Rev: 9999.9999"? I don't understand what is going wrong there.
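
Whatever the message turns out to mean, a failed start of this kind can
at least be detected from the log, so that a run is not left unattended
in a broken state. A sketch, assuming the log contains the device name
exactly as quoted above; the function name log_shows_emulation is
hypothetical.

```shell
#!/bin/sh
# Succeed if a NAMD log shows a Pe bound to the CUDA emulation device
# instead of a real GPU.
# Usage: log_shows_emulation LOGFILE
log_shows_emulation() {
    # -F: match the text literally; -q: report only through exit status.
    # Only "Emulation (CPU)" is matched, so a log whose lines were
    # wrapped after "Device" (as in this mail) is still recognised.
    grep -F -q "Emulation (CPU)" "$1"
}
```

For example:

log_shows_emulation filename.log \
    && echo "no real GPU was bound; check /dev/nvidia* before retrying"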

Thanks a lot
francesco pietra

---------- Forwarded message ----------
From: Francesco Pietra <chiendarret_at_gmail.com>
Date: Tue, Jun 14, 2011 at 6:45 PM
Subject: Re: namd-l: cuda error cudastreamcreate
To: Jim Phillips <jim_at_ks.uiuc.edu>

On Tue, Jun 14, 2011 at 6:02 PM, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
> On Tue, 14 Jun 2011, Francesco Pietra wrote:
>
>> nvidia-smi -r (or nvidia-smi -a)
>> NVIDIA: could not open the device file /dev/nvidia1 (no such file)
>> Failed to initialize NVML: unknown error.
>>
>> If "nvidia-smi" is for Tesla only, how to check GTX 470?
>
> It's not Tesla-only (see tests below).  -Jim
>
> jim_at_lisboa>nvidia-smi -L
> GPU 0: GeForce GTX 285 (UUID: N/A)
>
> jim_at_aberdeen>nvidia-smi -L
> GPU 0: Tesla C870 (UUID:
> GPU-798dee8502c5e13c-7dd72cfe-6069e259-8fd36a96-5163bf00fbbcb8e9f61eda54)
> GPU 1: Tesla C870 (UUID:
> GPU-ed96e9c4afb70d35-694f6869-981de52a-23e64327-917becef3aa20bfd0d66432c)
> GPU 2: GeForce 9800 GTX/9800 GTX+ (UUID: N/A)

It does not work with my installation:

$ which nvidia-smi
/usr/bin/nvidia-smi

$ nvidia-smi -L (or any other option of this command)
could not open device file /dev/nvidiactl (no such device or address).

I am using the Debian installation of nvidia.ko. I wonder whether it
would be better for me to switch to the NVIDIA instructions suggested by
Ajasja. However, Debian is not mentioned there. Ubuntu is similar to
Debian, but only in its commands.

Well, it is becoming painful.

francesco
