Re: cuda error cudastreamcreate. SOLVED (probably)

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Thu Jun 16 2011 - 04:40:03 CDT

Typical response of a person who is completely clueless or believes
in FUD. Where are the OpenCL applications? Why would expanding the
tablet business mean cutting back on CUDA? I would expect the exact
opposite.

Axel

  --
Axel Kohlmeyer
akohlmey_at_gmail.com
http://goo.gl/1wk0

On Jun 16, 2011, at 1:57, Francesco Pietra <chiendarret_at_gmail.com>
wrote:

> From amd64_at_lists.debian.org, where i was asking about my alleged
> problems with cuda/debian-os, an external user intervened observing:
>
> Why are you using CUDA rather than OpenCL? NVIDIA has said they are
> cutting back on their GPU business and moving into CPUs for tablets,
> which are now appearing on the market. If you have to move to AMD/ATI
> in the future, OpenCL will still work, but CUDA will not.
>
> May I ask the opinion of namd users about that warning?
>
> thanks
>
> francesco
>
> On Wed, Jun 15, 2011 at 4:51 PM, Axel Kohlmeyer <akohlmey_at_gmail.com>
> wrote:
>> francesco,
>>
>> this is trivial unix stuff.
>>
>> you need "device" files (usually called "nodes") in
>> the /dev/ directory to communicate with devices.
>>
>> my guess is that those are "vanishing" every time
>> you reboot and that they'll show up magically as soon
>> as you run "nvidia-smi" as root.
>>
>> this is due to the udev service that manages device
>> nodes: it recreates them at boot time and sets
>> permissions (unlike in the old times, when one would
>> just create them once for everybody and everything).
>> this is needed to support removable devices and lots
>> of other convenient gimmicks.
>>
>> the easiest way to handle this would be to call
>> nvidia-smi once from /etc/rc.d/rc.local.
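>>
>> for example, something along these lines (an untested sketch; the
>> exact location of rc.local differs between distributions):
>>
>> # in /etc/rc.d/rc.local (on debian: /etc/rc.local), before "exit 0":
>> # running nvidia-smi once as root loads the driver and recreates
>> # the /dev/nvidia* device nodes after each reboot
>> /usr/bin/nvidia-smi > /dev/null 2>&1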
>>
>> cheers,
>> axel.
>>
>> On Wed, Jun 15, 2011 at 10:43 AM, Francesco Pietra
>> <chiendarret_at_gmail.com> wrote:
>>> Previous "probably" was understatement. Problems were not solved. As
>>> far as I can understand, the graphic cards are sometimes seen,
>>> sometimes not.
>>>
>>> The simulation (pressure equilibration) was completed successfully.
>>> Next run (just a continuation of previous pressure equilibration)
>>> failed, again 'Device Emulation (CPU' , see log file below.
>>> Attempted
>>> again, same error.
>>>
>>> # modinfo nvidia
>>> filename: /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
>>> alias: char-major-195-*
>>> supported: external
>>> license: NVIDIA
>>> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
>>> alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
>>> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
>>> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
>>> depends: i2c-core
>>> vermagic: 2.6.38-2-amd64 SMP mod_unload modversions
>>> parm: NVreg_EnableVia4x:int
>>> parm: NVreg_EnableALiAGP:int
>>> parm: NVreg_ReqAGPRate:int
>>> parm: NVreg_EnableAGPSBA:int
>>> parm: NVreg_EnableAGPFW:int
>>> parm: NVreg_Mobile:int
>>> parm: NVreg_ResmanDebugLevel:int
>>> parm: NVreg_RmLogonRC:int
>>> parm: NVreg_ModifyDeviceFiles:int
>>> parm: NVreg_DeviceFileUID:int
>>> parm: NVreg_DeviceFileGID:int
>>> parm: NVreg_DeviceFileMode:int
>>> parm: NVreg_RemapLimit:int
>>> parm: NVreg_UpdateMemoryTypes:int
>>> parm: NVreg_InitializeSystemMemoryAllocations:int
>>> parm: NVreg_UseVBios:int
>>> parm: NVreg_RMEdgeIntrCheck:int
>>> parm: NVreg_UsePageAttributeTable:int
>>> parm: NVreg_EnableMSI:int
>>> parm: NVreg_MapRegistersEarly:int
>>> parm: NVreg_RegisterForACPIEvents:int
>>> parm: NVreg_RegistryDwords:charp
>>> parm: NVreg_RmMsg:charp
>>> parm: NVreg_NvAGP:int
>>>
>>> However:
>>>
>>> $ nvidia-smi -L
>>> Could not open device /dev/nvidia1 (no such file)
>>> Failed to initialize NVML: unknown error.
>>>
>>>
>>> I am unable to draw technical conclusions from this 'unknown
>>> error'. I
>>> wonder whether other information can be extracted to fix the
>>> problems.
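>>> For instance, checks along these lines (just a sketch) might show
>>> whether the kernel module is loaded and the device nodes exist:
>>>
>>> lsmod | grep nvidia      # is the nvidia module loaded?
>>> ls -l /dev/nvidia*       # /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl?
>>> dmesg | grep -i nvidia   # any driver errors at module load time?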
>>>
>>> Thanks for advice (and your patience in following this thread).
>>>
>>> francesco
>>>
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Francesco Pietra <chiendarret_at_gmail.com>
>>> Date: Wed, Jun 15, 2011 at 9:45 AM
>>> Subject: Fwd: namd-l: cuda error cudastreamcreate. SOLVED (probably)
>>> To: Jim Phillips <jim_at_ks.uiuc.edu>, Ajasja Ljubetič
>>> <ajasja.ljubetic_at_gmail.com>, NAMD <namd-l_at_ks.uiuc.edu>
>>>
>>>
>>> IT MAY BE OF INTEREST TO NAMD/DEBIAN USERS. HOWEVER, THE END
>>> QUESTION
>>> BELOW (IN UPPERCASE) IS DIRECTED SPECIFICALLY TO NAMD
>>>
>>> Following suggestions by Lennart Sorensen at
>>> "amd64_at_lists.debian.org", my problem was the presence of an nvidia
>>> driver at /lib/modules/2.6.38-2-amd64/updates/dkms/, which prevented
>>> rebuilding. With the two commands below, the correct driver, dated
>>> 15 June 2011, was built for my Linux headers.
>>>
>>> apt-get remove nvidia-kernel-dkms (which also removes nvidia.ko)
>>>
>>> apt-get install nvidia-kernel-dkms
>>>
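>>> To check that the rebuilt module matches the running kernel, commands
>>> like these can be used (a sketch; the exact output format may differ):
>>>
>>> uname -r                        # running kernel, here 2.6.38-2-amd64
>>> dkms status | grep nvidia       # module built and installed for it?
>>> modinfo nvidia | grep vermagic  # must match the running kernel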
>>>
>>> Debian amd64 wheezy packages installed were:
>>>
>>> gcc-4.4, 4.5, 4-6
>>> libcuda1 270.41.19-1
>>> libgl1-nvidia-glx 270.41.19-1
>>> libnvidia-ml1 270.41.19-1
>>> linux-headers-2.6-amd64 (2.6.38+34)
>>> linux-headers-2.6.38-2-amd64 (2.6.38-5)
>>> linux-headers-2.6.38-2-common (2.6.38-5)
>>> linux-image-2.6-amd64 (2.6.38+34)
>>> linux-image-2.6.38-2-amd64 (2.6.38-5)
>>> linux-kbuild-2.6.38 (2.6.38-1)
>>> nvidia-cuda-dev 3.2.16-2
>>> nvidia-cuda-toolkit 3.2.16-2
>>> nvidia-glx 270.41.19-1
>>> nvidia-installer-cleanup 20110515+1
>>> nvidia-kernel-common 20110515+1
>>> nvidia-kernel-dkms 270.41.19-1
>>> nvidia-smi 270.41.19-1
>>>
>>> Now:
>>>
>>> $ nvidia-smi -L
>>> GPU 0: GeForce GTX 470 (UUID: N/A)
>>> GPU 1: GeForce GTX 470 (UUID: N/A)
>>>
>>> # modinfo nvidia
>>> filename: /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
>>> alias: char-major-195-*
>>> supported: external
>>> license: NVIDIA
>>> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
>>> alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
>>> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
>>> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
>>> depends: i2c-core
>>> vermagic: 2.6.38-2-amd64 SMP mod_unload modversions
>>> parm: NVreg_EnableVia4x:int
>>> parm: NVreg_EnableALiAGP:int
>>> parm: NVreg_ReqAGPRate:int
>>> parm: NVreg_EnableAGPSBA:int
>>> parm: NVreg_EnableAGPFW:int
>>> parm: NVreg_Mobile:int
>>> parm: NVreg_ResmanDebugLevel:int
>>> parm: NVreg_RmLogonRC:int
>>> parm: NVreg_ModifyDeviceFiles:int
>>> parm: NVreg_DeviceFileUID:int
>>> parm: NVreg_DeviceFileGID:int
>>> parm: NVreg_DeviceFileMode:int
>>> parm: NVreg_RemapLimit:int
>>> parm: NVreg_UpdateMemoryTypes:int
>>> parm: NVreg_InitializeSystemMemoryAllocations:int
>>> parm: NVreg_UseVBios:int
>>> parm: NVreg_RMEdgeIntrCheck:int
>>> parm: NVreg_UsePageAttributeTable:int
>>> parm: NVreg_EnableMSI:int
>>> parm: NVreg_MapRegistersEarly:int
>>> parm: NVreg_RegisterForACPIEvents:int
>>> parm: NVreg_RegistryDwords:charp
>>> parm: NVreg_RmMsg:charp
>>> parm: NVreg_NvAGP:int
>>>
>>> With these settings, the NAMD simulation
>>>
>>> charmrun $NAMD_HOME/bin/namd2 ++local +p6 +idlepoll ++verbose
>>> filename.conf 2>&1 | tee filename.log
>>>
>>> (NAMD_CVS-2011-06-04_Linux-x86_64-CUDA.tar.gz) started correctly,
>>> using both GTX 470 cards, and ran overnight.
>>>
>>> This morning, a second run to continue the previous pressure
>>> equilibration (using commands from the console history; there is
>>> only an X server, no desktop, and the X server had not been started)
>>> failed to start, with this log:
>>>
>>> Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
>>> Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
>>> Info: 1 NAMD CVS-2011-06-04 Linux-x86_64-CUDA 6 gig64
>>> francesco
>>> Info: Running on 6 processors, 6 nodes, 1 physical nodes.
>>> Info: CPU topology information available.
>>> Info: Charm++/Converse parallel runtime startup completed at
>>> 0.00989103 s
>>> Pe 2 sharing CUDA device 0 first 0 next 3
>>> Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'Device
>>> Emulation (CPU)' Mem: 0MB Rev: 9999.9999
>>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 2 (gig64 device 0):
>>> no
>>> CUDA-capable device is available
>>>
>>> where 'Device Emulation (CPU)', instead of GTX 470, is indicative of
>>> failure. After some info commands, as above, on a second attempt the
>>> NAMD simulation started normally:
>>>
>>> Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
>>> Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
>>> Info: 1 NAMD CVS-2011-06-04 Linux-x86_64-CUDA 6 gig64
>>> francesco
>>> Info: Running on 6 processors, 6 nodes, 1 physical nodes.
>>> Info: CPU topology information available.
>>> Info: Charm++/Converse parallel runtime startup completed at
>>> 0.00345588 s
>>> Did not find +devices i,j,k,... argument, using all
>>> Pe 0 sharing CUDA device 0 first 0 next 2
>>> Pe 1 sharing CUDA device 1 first 1 next 3
>>> Pe 1 physical rank 1 binding to CUDA device 1 on gig64: 'GeForce GTX
>>> 470' Mem: 1279MB Rev: 2.0
>>> Pe 0 physical rank 0 binding to CUDA device 0 on gig64: 'GeForce GTX
>>> 470' Mem: 1279MB Rev: 2.0
>>> Pe 3 sharing CUDA device 1 first 1 next 5
>>> Pe 2 sharing CUDA device 0 first 0 next 4
>>> Pe 3 physical rank 3 binding to CUDA device 1 on gig64: 'GeForce GTX
>>> 470' Mem: 1279MB Rev: 2.0
>>> Pe 5 sharing CUDA device 1 first 1 next 1
>>> Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX
>>> 470' Mem: 1279MB Rev: 2.0
>>> Pe 5 physical rank 5 binding to CUDA device 1 on gig64: 'GeForce GTX
>>> 470' Mem: 1279MB Rev: 2.0
>>> Pe 4 sharing CUDA device 0 first 0 next 0
>>> Pe 4 physical rank 4 binding to CUDA device 0 on gig64: 'GeForce GTX
>>> 470' Mem: 1279MB Rev: 2.0
>>> Info: 1.64104 MB of memory in use based on CmiMemoryUsage
>>> Info: Configuration file is press-04.conf
>>> Info: Working in the current directory
>>> /home/francesco/3b.complex_press04_NAF++/mod1.4
>>> TCL: Suspending until startup complete.
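>>>
>>> The "Did not find +devices i,j,k,... argument, using all" line above
>>> suggests the GPUs can also be selected explicitly; a sketch, assuming
>>> the two cards are devices 0 and 1 as reported by "nvidia-smi -L":
>>>
>>> charmrun $NAMD_HOME/bin/namd2 ++local +p6 +idlepoll +devices 0,1 \
>>>     ++verbose filename.conf 2>&1 | tee filename.log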
>>>
>>>
>>> QUESTION TO NAMD:
>>> what does 'Device Emulation (CPU)' in the log output "Pe 2 physical
>>> rank 2 binding to CUDA device 0 on gig64: 'Device Emulation (CPU)'
>>> Mem: 0MB Rev: 9999.9999" mean? I don't understand what is going
>>> wrong there.
>>>
>>> Thanks a lot
>>> francesco pietra
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Francesco Pietra <chiendarret_at_gmail.com>
>>> Date: Tue, Jun 14, 2011 at 6:45 PM
>>> Subject: Re: namd-l: cuda error cudastreamcreate
>>> To: Jim Phillips <jim_at_ks.uiuc.edu>
>>>
>>>
>>> On Tue, Jun 14, 2011 at 6:02 PM, Jim Phillips <jim_at_ks.uiuc.edu>
>>> wrote:
>>>> On Tue, 14 Jun 2011, Francesco Pietra wrote:
>>>>
>>>>> nvidia-smi -r (or nvidia-smi -a)
>>>>> NVIDIA: could not open the device file /dev/nvidia1 (no such file)
>>>>> Failed to initialize NVML: unknown error.
>>>>>
>>>>> If "nvidia-smi" is for Tesla only, how to check GTX 470?
>>>>
>>>> It's not Tesla-only (see tests below). -Jim
>>>>
>>>> jim_at_lisboa>nvidia-smi -L
>>>> GPU 0: GeForce GTX 285 (UUID: N/A)
>>>>
>>>> jim_at_aberdeen>nvidia-smi -L
>>>> GPU 0: Tesla C870 (UUID:
>>>> GPU-
>>>> 798dee8502c5e13c-
>>>> 7dd72cfe-6069e259-8fd36a96-5163bf00fbbcb8e9f61eda54)
>>>> GPU 1: Tesla C870 (UUID:
>>>> GPU-
>>>> ed96e9c4afb70d35-
>>>> 694f6869-981de52a-23e64327-917becef3aa20bfd0d66432c)
>>>> GPU 2: GeForce 9800 GTX/9800 GTX+ (UUID: N/A)
>>>
>>> It does not work with my installation:
>>>
>>> $ which nvidia-smi
>>> /usr/bin/nvidia-smi
>>>
>>> $ nvidia-smi -L (or any other option of this command)
>>> could not open device file /dev/nvidiactl (no such device or
>>> address).
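>>>
>>> I wonder whether the missing device nodes could simply be created by
>>> hand, something like this (a sketch; 195 is the NVIDIA character
>>> device major number and 255 the usual minor for nvidiactl; normally
>>> udev creates these automatically at boot):
>>>
>>> mknod -m 666 /dev/nvidiactl c 195 255
>>> mknod -m 666 /dev/nvidia0 c 195 0
>>> mknod -m 666 /dev/nvidia1 c 195 1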
>>>
>>> I am using the Debian installation of nvidia.ko. I wonder whether it
>>> would be better for me to switch to the NVIDIA instructions suggested
>>> by Ajasja. However, Debian Linux is not mentioned there; Ubuntu is
>>> similar, but only as far as the commands go.
>>>
>>> Well, it is becoming painful.
>>>
>>> francesco
>>>
>>>
>>
>>
>>
>> --
>> Dr. Axel Kohlmeyer
>> akohlmey_at_gmail.com http://goo.gl/1wk0
>>
>> Institute for Computational Molecular Science
>> Temple University, Philadelphia PA, USA.
>>
