Re: namd on nvidia 302.17

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Thu Sep 27 2012 - 08:54:10 CDT

There are, for me, two ways of getting CUDA to work: (a) install the
driver according to NVIDIA's instructions (which is probably what you
are suggesting); (b) rely on Debian amd64, which provides a
precompiled NVIDIA driver. I adopted (b) because upgrading is
automatic and Debian is notoriously reliable.
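
For reference, this is roughly what the two routes look like on my
machine; the installer file name in (a) is only an example, and the
package names in (b) are the ones that appear in the dpkg listing I
posted:

# (a) the NVIDIA way: stop X, then run the installer from nvidia.com
/etc/init.d/gdm3 stop                # or whatever display manager is in use
sh NVIDIA-Linux-x86_64-295.41.run    # example file name; use the offered version
# (b) the Debian way: precompiled driver built as a DKMS kernel module
apt-get install nvidia-kernel-dkms nvidia-glx nvidia-smi nvidia-xconfig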

I did not note which CUDA driver I had just before the "fatal"
upgrade, but it had been months since I last upgraded. The version on
my amd64 notebook is 295.53; I probably upgraded from that version.

Now, on amd64, version 304.48.1 is available, while version 302.17-3
is installed on my system, along with the basic nvidia-kernel-dkms, as
I posted initially. All of this is under cuda-toolkit version 4
(although the toolkit is not used in the "Debian way" of installing).
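
In case it is useful, this is roughly how I compare the CUDA runtime
library that the NAMD binary carries (I believe the CUDA builds ship
their own libcudart) with the driver version that the kernel module
reports; the directory name is that of my download, adjust the path:

cat /proc/driver/nvidia/version      # driver version reported by the kernel module
ls NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA | grep -i cudart
ldd NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA/namd2 | grep -i cuda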

The output of

dpkg -l |grep nvidia

modinfo nvidia

which I posted initially, indicates, in my experience, that everything
is working correctly. On this basis, I suspected that 302.17-3 is too
new for the current NAMD builds, even though everything is under
toolkit 4 (or the Debian equivalent).
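
To separate a NAMD-side problem from a driver/runtime problem, a
minimal test of the very call that fails (cudaGetDeviceCount) may
help; this is only a sketch, assuming the toolkit's nvcc is on the
PATH (the file name count.cu is arbitrary):

cat > count.cu <<'EOF'
// Minimal check: ask the CUDA runtime how many devices it can see.
#include <cstdio>
#include <cuda_runtime.h>
int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA devices visible: %d\n", n);
    return 0;
}
EOF
nvcc count.cu -o count && ./count

If this small program also reports an initialization error, the
problem lies in the driver/runtime pairing rather than in NAMD itself.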

I could try to install the 295 driver in place of 302, but someone
probably knows better than I do what to expect. Moving forward is
easy; going back, with any OS, is a matter for experts.
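
If a downgrade were attempted, my understanding (untested) is that apt
can install a specific version as long as it is still offered by the
archive; the version string below is only a placeholder:

apt-cache policy nvidia-glx nvidia-kernel-dkms   # versions still offered
apt-get install nvidia-glx=295.xx-y nvidia-kernel-dkms=295.xx-y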

I am not sure that all I said is correct. I am a biochemist, not a
software expert.

Thanks for your kind attention.

francesco pietra

On Thu, Sep 27, 2012 at 2:58 PM, Aron Broom <broomsday_at_gmail.com> wrote:
> So one potential problem here: is 302.17 a development driver, or just the
> one Linux installs by itself from the proprietary drivers? It looks to me
> like the newest development driver is version 295.41. I'm not confident
> that you'd be able to run NAMD without the development driver installed.
> The installation is manual, and it should overwrite whatever driver you
> have there. I recommend a trip to the CUDA development zone webpage.
>
> ~Aron
>
> On Thu, Sep 27, 2012 at 3:52 AM, Francesco Pietra <chiendarret_at_gmail.com>
> wrote:
>>
>> Hello:
>> I have tried the NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA with
>> nvidia version 302.17:
>>
>> Running command: namd2 heat-01.conf +p6 +idlepoll
>>
>> Charm++: standalone mode (not using charmrun)
>> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
>> CharmLB> Load balancer assumes all CPUs are same.
>> Charm++> Running on 1 unique compute nodes (12-way SMP).
>> Charm++> cpu topology info is gathered in 0.001 seconds.
>> Info: NAMD CVS-2012-09-26 for Linux-x86_64-multicore-CUDA
>> Info:
>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>> Info: for updates, documentation, and support information.
>> Info:
>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>> Info: in all publications reporting results obtained with NAMD.
>> Info:
>> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
>> Info: Built Wed Sep 26 02:25:08 CDT 2012 by jim on lisboa.ks.uiuc.edu
>> Info: 1 NAMD CVS-2012-09-26 Linux-x86_64-multicore-CUDA 6 gig64
>> francesco
>> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
>> Info: CPU topology information available.
>> Info: Charm++/Converse parallel runtime startup completed at 0.085423 s
>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
>> initialization error
>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 1 (gig64):
>> initialization error
>> ------------- Processor 3 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
>> initialization error
>>
>> Program finished.
>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 4 (gig64):
>> initialization error
>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 2 (gig64):
>> initialization error
>>
>>
>> As I received (nearly) no comments on these failures, I can only
>> imagine that either (i) my question, disregarding obvious issues, was
>> too silly to merit attention, or (ii) it is well known that nvidia
>> version 302.17 is incompatible with current NAMD builds for GNU/Linux.
>>
>> In any event, given the way the metapackages work, it is probably
>> impossible within Debian GNU/Linux wheezy to go back to a previous
>> version of nvidia. On the other hand, the stable version of the OS
>> provides a much too old version of nvidia. Therefore, my question is:
>>
>> Is there any chance to compile NAMD against the installed nvidia
>> version 302.17?
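>>
>> (What I have in mind is a source build along the lines sketched
>> below, as far as I understand the notes shipped with the NAMD source;
>> the CUDA toolkit path is only an example:
>>
>> tar xzf NAMD_2.9_Source.tar.gz && cd NAMD_2.9_Source
>> tar xf charm-6.4.0.tar && cd charm-6.4.0
>> ./build charm++ multicore-linux64 --with-production
>> cd ..
>> ./config Linux-x86_64-g++ --charm-arch multicore-linux64 \
>>     --with-cuda --cuda-prefix /usr/local/cuda
>> cd Linux-x86_64-g++ && make
>>
>> Would such a build be expected to work against driver 302.17?)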
>>
>> Thanks for any advice. Without access to NAMD-CUDA I am currently
>> unable to answer a question raised by the reviewers of a manuscript
>> (the CPU cluster was shut down long ago, as it became too expensive
>> for our budget).
>>
>> francesco pietra
>>
>> On Wed, Sep 26, 2012 at 4:08 PM, Francesco Pietra <chiendarret_at_gmail.com>
>> wrote:
>> > I forgot to mention that I am at final version 2.9 of namd.
>> > f.
>> >
>> > On Wed, Sep 26, 2012 at 4:05 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>> >> I'm not certain, but I think the driver version needs to match the CUDA
>> >> toolkit version that NAMD uses, and I think the library file NAMD comes
>> >> with
>> >> is toolkit 4.0 or something of that sort.
>> >>
>> >> ~Aron
>> >>
>> >>
>> >> On Wed, Sep 26, 2012 at 9:58 AM, Francesco Pietra
>> >> <chiendarret_at_gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi:
>> >>> Following updating/upgrading of Debian GNU-Linux amd64 wheezy,
>> >>> minimizations do not run anymore on GTX-680:
>> >>>
>> >>> CUDA error in cudaGetDeviceCount on Pe 3, Pe 4, Pe 6:
>> >>> initialization error.
>> >>>
>> >>> The two GTX cards are normally activated with
>> >>> nvidia-smi -L
>> >>> nvidia-smi -pm 1
>> >>>
>> >>> The Xorg driver and the nvidia packages are all at the same version:
>> >>>
>> >>> francesco_at_gig64:~$ dpkg -l |grep nvidia
>> >>> ii glx-alternative-nvidia 0.2.2
>> >>> amd64 allows the selection of NVIDIA as GLX provider
>> >>> ii libgl1-nvidia-alternatives 302.17-3
>> >>> amd64 transition libGL.so* diversions to
>> >>> glx-alternative-nvidia
>> >>> ii libgl1-nvidia-glx:amd64 302.17-3
>> >>> amd64 NVIDIA binary OpenGL libraries
>> >>> ii libglx-nvidia-alternatives 302.17-3
>> >>> amd64 transition libgl.so diversions to
>> >>> glx-alternative-nvidia
>> >>> ii libnvidia-ml1:amd64 302.17-3
>> >>> amd64 NVIDIA management library (NVML) runtime library
>> >>> ii nvidia-alternative 302.17-3
>> >>> amd64 allows the selection of NVIDIA as GLX provider
>> >>> ii nvidia-glx 302.17-3
>> >>> amd64 NVIDIA metapackage
>> >>> ii nvidia-installer-cleanup 20120630+3
>> >>> amd64 Cleanup after driver installation with the
>> >>> nvidia-installer
>> >>> ii nvidia-kernel-common 20120630+3
>> >>> amd64 NVIDIA binary kernel module support files
>> >>> ii nvidia-kernel-dkms 302.17-3
>> >>> amd64 NVIDIA binary kernel module DKMS source
>> >>> ii nvidia-smi 302.17-3
>> >>> amd64 NVIDIA System Management Interface
>> >>> ii nvidia-support 20120630+3
>> >>> amd64 NVIDIA binary graphics driver support files
>> >>> ii nvidia-vdpau-driver:amd64 302.17-3
>> >>> amd64 NVIDIA vdpau driver
>> >>> ii nvidia-xconfig 302.17-2
>> >>> amd64 X configuration tool for non-free NVIDIA drivers
>> >>> ii xserver-xorg-video-nvidia 302.17-3
>> >>> amd64 NVIDIA binary Xorg driver
>> >>> francesco_at_gig64:~$
>> >>>
>> >>>
>> >>> root_at_gig64:/home/francesco# modinfo nvidia
>> >>> filename: /lib/modules/3.2.0-2-amd64/updates/dkms/nvidia.ko
>> >>> alias: char-major-195-*
>> >>> version: 302.17
>> >>> supported: external
>> >>> license: NVIDIA
>> >>> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
>> >>> alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
>> >>> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
>> >>> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
>> >>> depends: i2c-core
>> >>> vermagic: 3.2.0-2-amd64 SMP mod_unload modversions
>> >>> parm: NVreg_EnableVia4x:int
>> >>> parm: NVreg_EnableALiAGP:int
>> >>> parm: NVreg_ReqAGPRate:int
>> >>> parm: NVreg_EnableAGPSBA:int
>> >>> parm: NVreg_EnableAGPFW:int
>> >>> parm: NVreg_Mobile:int
>> >>> parm: NVreg_ResmanDebugLevel:int
>> >>> parm: NVreg_RmLogonRC:int
>> >>> parm: NVreg_ModifyDeviceFiles:int
>> >>> parm: NVreg_DeviceFileUID:int
>> >>> parm: NVreg_DeviceFileGID:int
>> >>> parm: NVreg_DeviceFileMode:int
>> >>> parm: NVreg_RemapLimit:int
>> >>> parm: NVreg_UpdateMemoryTypes:int
>> >>> parm: NVreg_InitializeSystemMemoryAllocations:int
>> >>> parm: NVreg_UseVBios:int
>> >>> parm: NVreg_RMEdgeIntrCheck:int
>> >>> parm: NVreg_UsePageAttributeTable:int
>> >>> parm: NVreg_EnableMSI:int
>> >>> parm: NVreg_MapRegistersEarly:int
>> >>> parm: NVreg_RegisterForACPIEvents:int
>> >>> parm: NVreg_RegistryDwords:charp
>> >>> parm: NVreg_RmMsg:charp
>> >>> parm: NVreg_NvAGP:int
>> >>> root_at_gig64:/home/francesco#
>> >>>
>> >>> I have also tried with recently used MD files, same problem:
>> >>> francesco_at_gig64:~/tmp$ charmrun namd2 heat-01.conf +p6 +idlepoll 2>&1
>> >>> | tee heat-01.log
>> >>> Running command: namd2 heat-01.conf +p6 +idlepoll
>> >>>
>> >>> Charm++: standalone mode (not using charmrun)
>> >>> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
>> >>> CharmLB> Load balancer assumes all CPUs are same.
>> >>> Charm++> Running on 1 unique compute nodes (12-way SMP).
>> >>> Charm++> cpu topology info is gathered in 0.001 seconds.
>> >>> Info: NAMD CVS-2012-06-20 for Linux-x86_64-multicore-CUDA
>> >>> Info:
>> >>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>> >>> Info: for updates, documentation, and support information.
>> >>> Info:
>> >>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>> >>> Info: in all publications reporting results obtained with NAMD.
>> >>> Info:
>> >>> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
>> >>> Info: Built Wed Jun 20 02:24:32 CDT 2012 by jim on lisboa.ks.uiuc.edu
>> >>> Info: 1 NAMD CVS-2012-06-20 Linux-x86_64-multicore-CUDA 6 gig64
>> >>> francesco
>> >>> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
>> >>> Info: CPU topology information available.
>> >>> Info: Charm++/Converse parallel runtime startup completed at
>> >>> 0.00989199 s
>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 5 (gig64):
>> >>> initialization error
>> >>> ------------- Processor 5 Exiting: Called CmiAbort ------------
>> >>> Reason: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 5 (gig64):
>> >>> initialization error
>> >>>
>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 1 (gig64):
>> >>> initialization error
>> >>> Program finished.
>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
>> >>> initialization error
>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 2 (gig64):
>> >>> initialization error
>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 4 (gig64):
>> >>> initialization error
>> >>> francesco_at_gig64:~/tmp$
>> >>>
>> >>>
>> >>> This is a shared-memory machine.
>> >>> Does driver version 302.17 work for you?
>> >>>
>> >>> Thanks
>> >>> francesco pietra
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Aron Broom M.Sc
>> >> PhD Student
>> >> Department of Chemistry
>> >> University of Waterloo
>> >>
>>
>
>
>
> --
> Aron Broom M.Sc
> PhD Student
> Department of Chemistry
> University of Waterloo
>
