Re: cuda error cudastreamcreate. SOLVED (probably)

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Wed Jun 15 2011 - 09:51:33 CDT

francesco,

this is trivial unix stuff.

you need "device" files (usually called "nodes") in
the /dev/ directory to communicate with devices.

my guess is that those are "vanishing" every time
you reboot and they'll show up magically as soon
as you run "nvidia-smi" as root.
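
a quick way to see which nodes are missing could look like the sketch
below. it assumes the usual names (nvidiactl plus one nvidiaN per card)
and two cards, as in your machine; the function name list_missing_nodes
is made up for illustration and is not part of any nvidia tool.

```shell
#!/bin/sh
# list_missing_nodes DIR NGPUS: print each expected NVIDIA node absent from DIR.
# Hypothetical helper; it only tests for existence and creates nothing.
list_missing_nodes() {
    dir="$1"
    ngpus="$2"
    # The control node is shared by all cards.
    [ -e "$dir/nvidiactl" ] || echo "$dir/nvidiactl"
    # One node per card: nvidia0, nvidia1, ...
    i=0
    while [ "$i" -lt "$ngpus" ]; do
        [ -e "$dir/nvidia$i" ] || echo "$dir/nvidia$i"
        i=$((i + 1))
    done
}

# Two GTX 470 cards are assumed here.
list_missing_nodes /dev 2
```

if this prints nothing, all expected nodes exist and the problem is elsewhere.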

this is due to the udev service that manages device
nodes, recreates them at boot time and sets
permissions (unlike in the old times, where one would just
create those for everybody and everything). this is
needed to support removable devices and lots of other
convenient gimmicks.

the easiest way to handle this would be to call
nvidia-smi once from /etc/rc.d/rc.local
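
a minimal sketch of such an rc.local fragment follows. the path
/usr/bin/nvidia-smi is taken from your "which" output; the NVIDIA_SMI
variable and the warm_up_nvidia function are illustrative names only.
rc.local runs as root at boot, which is what is needed here.

```shell
#!/bin/sh
# Sketch of a fragment for /etc/rc.d/rc.local (executed as root at boot).
# NVIDIA_SMI is an illustrative variable; adjust the path for your system.
NVIDIA_SMI="${NVIDIA_SMI:-/usr/bin/nvidia-smi}"

warm_up_nvidia() {
    if [ -x "$NVIDIA_SMI" ]; then
        # Running nvidia-smi once as root should recreate the /dev/nvidia* nodes.
        "$NVIDIA_SMI" -L >/dev/null 2>&1 || echo "nvidia-smi failed" >&2
    else
        echo "nvidia-smi not found at $NVIDIA_SMI" >&2
    fi
    # Never abort the rest of rc.local because of the GPU setup.
    return 0
}

warm_up_nvidia
```

the function always returns 0 so that a missing or failing nvidia-smi
does not stop the remaining boot commands.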

cheers,
    axel.

On Wed, Jun 15, 2011 at 10:43 AM, Francesco Pietra
<chiendarret_at_gmail.com> wrote:
> Previous "probably" was understatement. Problems were not solved. As
> far as I can understand, the graphic cards are sometimes seen,
> sometimes not.
>
> The simulation (pressure equilibration) was completed successfully.
> Next run (just a continuation of previous pressure equilibration)
> failed, again 'Device Emulation (CPU)', see log file below. Attempted
> again, same error.
>
> # modinfo nvidia
> filename:       /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
> alias:          char-major-195-*
> supported:      external
> license:        NVIDIA
> alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
> alias:          pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
> alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
> alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
> depends:        i2c-core
> vermagic:       2.6.38-2-amd64 SMP mod_unload modversions
> parm:           NVreg_EnableVia4x:int
> parm:           NVreg_EnableALiAGP:int
> parm:           NVreg_ReqAGPRate:int
> parm:           NVreg_EnableAGPSBA:int
> parm:           NVreg_EnableAGPFW:int
> parm:           NVreg_Mobile:int
> parm:           NVreg_ResmanDebugLevel:int
> parm:           NVreg_RmLogonRC:int
> parm:           NVreg_ModifyDeviceFiles:int
> parm:           NVreg_DeviceFileUID:int
> parm:           NVreg_DeviceFileGID:int
> parm:           NVreg_DeviceFileMode:int
> parm:           NVreg_RemapLimit:int
> parm:           NVreg_UpdateMemoryTypes:int
> parm:           NVreg_InitializeSystemMemoryAllocations:int
> parm:           NVreg_UseVBios:int
> parm:           NVreg_RMEdgeIntrCheck:int
> parm:           NVreg_UsePageAttributeTable:int
> parm:           NVreg_EnableMSI:int
> parm:           NVreg_MapRegistersEarly:int
> parm:           NVreg_RegisterForACPIEvents:int
> parm:           NVreg_RegistryDwords:charp
> parm:           NVreg_RmMsg:charp
> parm:           NVreg_NvAGP:int
>
> However:
>
> $ nvidia-smi -L
> Could not open device /dev/nvidia1 (no such file)
> Failed to initialize NVML: unknown error.
>
>
> I am unable to draw technical conclusions from this 'unknown error'. I
> wonder whether other information can be extracted to fix the problems.
>
> Thanks for advice (and your patience in following this thread).
>
> francesco
>
>
>
> ---------- Forwarded message ----------
> From: Francesco Pietra <chiendarret_at_gmail.com>
> Date: Wed, Jun 15, 2011 at 9:45 AM
> Subject: Fwd: namd-l: cuda error cudastreamcreate. SOLVED (probably)
> To: Jim Phillips <jim_at_ks.uiuc.edu>, Ajasja Ljubetič
> <ajasja.ljubetic_at_gmail.com>, NAMD <namd-l_at_ks.uiuc.edu>
>
>
> IT MAY BE OF INTEREST TO NAMD/DEBIAN USERS. HOWEVER, THE END QUESTION
> BELOW (IN UPPERCASE) IS DIRECTED SPECIFICALLY TO NAMD
>
> Following suggestions by Lennart Sorensen at "amd64_at_lists.debian.org",
> my problem was the presence of a nvidia driver at
> /lib/modules/2.6.38-2-amd64/updates/dkms/, which prevented rebuilding.
> On the two commands below the correct driver, dated 15 June 2011, was
> built for my linux headers.
>
> apt-get remove nvidia-kernel-dkms (which also removes nvidia.ko)
>
> apt-get install nvidia-kernel-dkms
>
>
> Debian amd64 wheezy packages installed were:
>
> gcc-4.4, 4.5, 4.6
> libcuda1 270.41.19-1
> libgl1-nvidia-glx 270.41.19-1
> libnvidia-ml1 270.41.19-1
> linux-headers-2.6-amd64  (2.6.38+34)
> linux-headers-2.6.38-2-amd64  (2.6.38-5)
> linux-headers-2.6.38-2-common (2.6.38-5)
> linux-image-2.6-amd64 (2.6.38+34)
> linux-image-2.6.38-2-amd64 (2.6.38-5)
> linux-kbuild-2.6.38 (2.6.38-1)
> nvidia-cuda-dev 3.2.16-2
> nvidia-cuda-toolkit 3.2.16-2
> nvidia-glx 270.41.19-1
> nvidia-installer-cleanup 20110515+1
> nvidia-kernel-common 20110515+1
> nvidia-kernel-dkms 270.41.19-1
> nvidia-smi 270.41.19-1
>
> Now:
>
> $ nvidia-smi -L
> GPU 0: GeForce GTX 470 (UUID: N/A)
> GPU 1: GeForce GTX 470 (UUID: N/A)
>
> # modinfo nvidia
> filename:       /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
> alias:          char-major-195-*
> supported:      external
> license:        NVIDIA
> alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
> alias:          pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
> alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
> alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
> depends:        i2c-core
> vermagic:       2.6.38-2-amd64 SMP mod_unload modversions
> parm:           NVreg_EnableVia4x:int
> parm:           NVreg_EnableALiAGP:int
> parm:           NVreg_ReqAGPRate:int
> parm:           NVreg_EnableAGPSBA:int
> parm:           NVreg_EnableAGPFW:int
> parm:           NVreg_Mobile:int
> parm:           NVreg_ResmanDebugLevel:int
> parm:           NVreg_RmLogonRC:int
> parm:           NVreg_ModifyDeviceFiles:int
> parm:           NVreg_DeviceFileUID:int
> parm:           NVreg_DeviceFileGID:int
> parm:           NVreg_DeviceFileMode:int
> parm:           NVreg_RemapLimit:int
> parm:           NVreg_UpdateMemoryTypes:int
> parm:           NVreg_InitializeSystemMemoryAllocations:int
> parm:           NVreg_UseVBios:int
> parm:           NVreg_RMEdgeIntrCheck:int
> parm:           NVreg_UsePageAttributeTable:int
> parm:           NVreg_EnableMSI:int
> parm:           NVreg_MapRegistersEarly:int
> parm:           NVreg_RegisterForACPIEvents:int
> parm:           NVreg_RegistryDwords:charp
> parm:           NVreg_RmMsg:charp
> parm:           NVreg_NvAGP:int
>
> With such settings, NAMD simulation
>
> charmrun $NAMD_HOME/bin/namd2 ++local +p6 +idlepoll ++verbose
> filename.conf 2>&1 | tee filename.log
>
> (NAMD_CVS-2011-06-04_Linux-x86_64-CUDA.tar.gz) started correctly,
> using both gtx 470 cards, running overnight.
>
> This morning, a second run to continue previous pressure equilibration
> (using commands from console memory; there is only X server, no
> desktop, and the X server had not been started) failed to start, with
> log:
>
> Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
> Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
> Info: 1 NAMD  CVS-2011-06-04  Linux-x86_64-CUDA  6    gig64  francesco
> Info: Running on 6 processors, 6 nodes, 1 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.00989103 s
> Pe 2 sharing CUDA device 0 first 0 next 3
> Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'Device
> Emulation (CPU)'  Mem: 0MB  Rev: 9999.9999
> FATAL ERROR: CUDA error cudaStreamCreate on Pe 2 (gig64 device 0): no
> CUDA-capable device is available
>
> where 'Device Emulation (CPU)', instead of gtx 470, is indicative of
> failure. After some info commands, as above, on a second attempt NAMD
> simulation started regularly:
>
> Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
> Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
> Info: 1 NAMD  CVS-2011-06-04  Linux-x86_64-CUDA  6    gig64  francesco
> Info: Running on 6 processors, 6 nodes, 1 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.00345588 s
> Did not find +devices i,j,k,... argument, using all
> Pe 0 sharing CUDA device 0 first 0 next 2
> Pe 1 sharing CUDA device 1 first 1 next 3
> Pe 1 physical rank 1 binding to CUDA device 1 on gig64: 'GeForce GTX
> 470'  Mem: 1279MB  Rev: 2.0
> Pe 0 physical rank 0 binding to CUDA device 0 on gig64: 'GeForce GTX
> 470'  Mem: 1279MB  Rev: 2.0
> Pe 3 sharing CUDA device 1 first 1 next 5
> Pe 2 sharing CUDA device 0 first 0 next 4
> Pe 3 physical rank 3 binding to CUDA device 1 on gig64: 'GeForce GTX
> 470'  Mem: 1279MB  Rev: 2.0
> Pe 5 sharing CUDA device 1 first 1 next 1
> Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX
> 470'  Mem: 1279MB  Rev: 2.0
> Pe 5 physical rank 5 binding to CUDA device 1 on gig64: 'GeForce GTX
> 470'  Mem: 1279MB  Rev: 2.0
> Pe 4 sharing CUDA device 0 first 0 next 0
> Pe 4 physical rank 4 binding to CUDA device 0 on gig64: 'GeForce GTX
> 470'  Mem: 1279MB  Rev: 2.0
> Info: 1.64104 MB of memory in use based on CmiMemoryUsage
> Info: Configuration file is press-04.conf
> Info: Working in the current directory
> /home/francesco/3b.complex_press04_NAF++/mod1.4
> TCL: Suspending until startup complete.
>
>
> QUESTION TO NAMD:
> what does device emulation cpu in log output "Pe 2 physical rank 2
> binding to CUDA device 0 on gig64: 'Device Emulation (CPU)'  Mem: 0MB
> Rev: 9999.9999" mean? I don't understand what is going wrong there.
>
> Thanks a lot
> francesco pietra
>
>
> ---------- Forwarded message ----------
> From: Francesco Pietra <chiendarret_at_gmail.com>
> Date: Tue, Jun 14, 2011 at 6:45 PM
> Subject: Re: namd-l: cuda error cudastreamcreate
> To: Jim Phillips <jim_at_ks.uiuc.edu>
>
>
> On Tue, Jun 14, 2011 at 6:02 PM, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>> On Tue, 14 Jun 2011, Francesco Pietra wrote:
>>
>>> nvidia-smi -r (or nvidia-smi -a)
>>> NVIDIA: could not open the device file /dev/nvidia1 (no such file)
>>> Failed to initialize NVML: unknown error.
>>>
>>> If "nvidia-smi" is for Tesla only, how to check GTX 470?
>>
>> It's not Tesla-only (see tests below).  -Jim
>>
>> jim_at_lisboa>nvidia-smi -L
>> GPU 0: GeForce GTX 285 (UUID: N/A)
>>
>> jim_at_aberdeen>nvidia-smi -L
>> GPU 0: Tesla C870 (UUID:
>> GPU-798dee8502c5e13c-7dd72cfe-6069e259-8fd36a96-5163bf00fbbcb8e9f61eda54)
>> GPU 1: Tesla C870 (UUID:
>> GPU-ed96e9c4afb70d35-694f6869-981de52a-23e64327-917becef3aa20bfd0d66432c)
>> GPU 2: GeForce 9800 GTX/9800 GTX+ (UUID: N/A)
>
> It does not work with my installation:
>
> $ which nvidia-smi
> /usr/bin/nvidia-smi
>
> $ nvidia-smi -L (or any other option of this command)
> could not open device file /dev/nvidiactl (no such device or address).
>
> I am using the Debian installation of nvidia.ko. I wonder whether it
> would be better for me to shift to the nvidia directions suggested by
> Ajasja. However, Debian Linux is not mentioned there. Ubuntu is
> similar, but for commands only.
>
> Well, it is becoming painful.
>
> francesco
>
>

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
