Re: namd on nvidia 302.17

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Thu Sep 27 2012 - 12:39:30 CDT

SOLVED. Although the tests "dpkg -l | grep nvidia" and "modinfo nvidia"
did not reveal it, there was a mismatch between the CUDA runtime and the
driver. By the time this became clear, a new "apt-get upgrade" had
installed a mixture of versions 302 and 304, creating a mess. I had to
correct this manually by installing the specific version 304 of every
package with "apt-get install <package>=<version>".
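
For anyone hitting the same problem, a sketch of the manual fix (package
names are from the "dpkg -l" output quoted below in this thread; the
304.48-1 version string is only illustrative and must first be checked
against what apt actually offers):

  apt-get install nvidia-glx=304.48-1 nvidia-kernel-dkms=304.48-1 \
      libgl1-nvidia-glx=304.48-1 nvidia-smi=304.48-1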

Given how much time was lost on trivial problems - and posted to the NAMD
list as NAMD problems that did not exist (I apologize for that) - I now
think it would be better (at least for people using the OS for scientific
purposes) to install the driver the "nvidia way" rather than the "Debian
way", so that the driver stays fixed when Debian is upgraded. As I am
presently short of time, I have decided not to upgrade Debian again until
I have enough free time to change the "way".
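
If one stays with the "Debian way", the driver packages can at least be
put on hold, so that a routine "apt-get upgrade" cannot pull in a mixed
set of versions again; a sketch using dpkg's hold mechanism:

  echo "nvidia-kernel-dkms hold" | dpkg --set-selections
  echo "nvidia-glx hold" | dpkg --set-selections
  apt-get upgrade    # held packages are now skipped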

Thanks
francesco Pietra

On Thu, Sep 27, 2012 at 3:54 PM, Francesco Pietra <chiendarret_at_gmail.com> wrote:
> For me there are two ways of getting CUDA to work: (a) install the
> driver according to nvidia's instructions (as is probably implied in
> what you suggested); (b) rely on Debian amd64, which furnishes a
> precompiled nvidia driver. I adopted (b) because upgrading is automatic
> and Debian is notoriously reliable.
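>
> (For (a), the usual procedure - sketched here from memory, and the
> installer file name depends on the version downloaded from nvidia.com -
> is to stop X and run the .run installer as root:
>
> /etc/init.d/gdm3 stop
> sh NVIDIA-Linux-x86_64-<version>.run
> )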
>
> I did not take note of the CUDA driver version I had just before the
> "fatal" upgrade, but I had not upgraded for months. The version now on
> my amd64 notebook is 295.53; I probably upgraded from that version.
>
> Now, on amd64, version 304.48.1 is available, while version 302.17-3 is
> installed on my system, along with the basic nvidia-kernel-dkms, as I
> posted initially. All of this is under cuda-toolkit version 4 (although
> the toolkit is not used in the "Debian way" of my installation).
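>
> (To compare the installed and candidate versions of each package before
> deciding, e.g.:
>
> apt-cache policy nvidia-glx nvidia-kernel-dkms
> )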
>
> The output of
>
> dpkg -l |grep nvidia
>
> modinfo nvidia
>
> which I posted initially, indicates, in my experience, that everything
> is working correctly. On this basis, I suspected that 302.17-3 is too
> recent for current namd builds, although everything is under toolkit 4
> (or the equivalent).
>
> I could try to install the 295 driver in place of the 302 one, but
> someone probably knows better than me what to expect. Moving forward is
> easy; going back, with any OS, is a matter for experts.
>
> I am not sure that all I said is correct. I am a biochemist, not a
> software expert.
>
> Thanks for your kind attention.
>
> francesco pietra
>
> On Thu, Sep 27, 2012 at 2:58 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>> So one potential problem here: is 302.17 a development driver, or just
>> the one Linux installs itself from the proprietary drivers? It looks to
>> me like the newest development driver is version 295.41. I'm not
>> confident that you'd be able to run NAMD without the development driver
>> installed. The installation is manual, and it should overwrite whatever
>> driver is there. I recommend a trip to the CUDA development zone
>> webpage.
>>
>> ~Aron
>>
>> On Thu, Sep 27, 2012 at 3:52 AM, Francesco Pietra <chiendarret_at_gmail.com>
>> wrote:
>>>
>>> Hello:
>>> I have tried the NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA with
>>> nvidia version 302.17:
>>>
>>> Running command: namd2 heat-01.conf +p6 +idlepoll
>>>
>>> Charm++: standalone mode (not using charmrun)
>>> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
>>> CharmLB> Load balancer assumes all CPUs are same.
>>> Charm++> Running on 1 unique compute nodes (12-way SMP).
>>> Charm++> cpu topology info is gathered in 0.001 seconds.
>>> Info: NAMD CVS-2012-09-26 for Linux-x86_64-multicore-CUDA
>>> Info:
>>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>>> Info: for updates, documentation, and support information.
>>> Info:
>>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>>> Info: in all publications reporting results obtained with NAMD.
>>> Info:
>>> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
>>> Info: Built Wed Sep 26 02:25:08 CDT 2012 by jim on lisboa.ks.uiuc.edu
>>> Info: 1 NAMD CVS-2012-09-26 Linux-x86_64-multicore-CUDA 6 gig64
>>> francesco
>>> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
>>> Info: CPU topology information available.
>>> Info: Charm++/Converse parallel runtime startup completed at 0.085423 s
>>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
>>> initialization error
>>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 1 (gig64):
>>> initialization error
>>> ------------- Processor 3 Exiting: Called CmiAbort ------------
>>> Reason: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
>>> initialization error
>>>
>>> Program finished.
>>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 4 (gig64):
>>> initialization error
>>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 2 (gig64):
>>> initialization error
>>>
>>>
>>> As I received (nearly) no comments on these failures, I can only
>>> imagine that either (i) my question - obvious issues aside - was too
>>> silly to merit attention, or (ii) it is well known that nvidia version
>>> 302.17 is incompatible with current namd builds for GNU-Linux.
>>>
>>> At any rate, given the metapackage scheme, it is probably impossible
>>> within Debian GNU-Linux wheezy to go back to a previous version of
>>> nvidia. On the other hand, the stable version of the OS furnishes a
>>> much too old version of nvidia. Therefore, my question is:
>>>
>>> Is there any chance of compiling namd against the installed nvidia
>>> version 302.17?
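>>>
>>> (If compiling from source is the answer, the recipe in the NAMD
>>> release notes is roughly the following - file names and the toolkit
>>> path are illustrative:
>>>
>>> tar xzf NAMD_2.9_Source.tar.gz && cd NAMD_2.9_Source
>>> tar xf charm-6.4.0.tar && cd charm-6.4.0
>>> ./build charm++ multicore-linux64 --with-production
>>> cd ..
>>> ./config Linux-x86_64-g++ --charm-arch multicore-linux64 --with-cuda --cuda-prefix /usr/local/cuda
>>> cd Linux-x86_64-g++ && make
>>> )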
>>>
>>> Thanks for any advice. Without access to namd-cuda I am currently
>>> unable to answer a question raised by the reviewers of a manuscript
>>> (the CPU cluster was shut down long ago, as it became too expensive
>>> for our budget).
>>>
>>> francesco pietra
>>>
>>> On Wed, Sep 26, 2012 at 4:08 PM, Francesco Pietra <chiendarret_at_gmail.com>
>>> wrote:
>>> > I forgot to mention that I am using the final release 2.9 of namd.
>>> > f.
>>> >
>>> > On Wed, Sep 26, 2012 at 4:05 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>> >> I'm not certain, but I think the driver version needs to match the CUDA
>>> >> toolkit version that NAMD uses, and I think the library file NAMD comes
>>> >> with
>>> >> is toolkit 4.0 or something of that sort.
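>>> >>
>>> >> (One way to check: the multicore-CUDA binaries ship their own
>>> >> libcudart next to namd2, so - directory name illustrative -
>>> >> something like
>>> >>
>>> >> ls NAMD_CVS_Linux-x86_64-multicore-CUDA/libcudart.so*
>>> >>
>>> >> shows which runtime version the build expects.)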
>>> >>
>>> >> ~Aron
>>> >>
>>> >>
>>> >> On Wed, Sep 26, 2012 at 9:58 AM, Francesco Pietra
>>> >> <chiendarret_at_gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Hi:
>>> >>> Following an update/upgrade of Debian GNU-Linux amd64 wheezy,
>>> >>> minimizations no longer run on the GTX-680:
>>> >>>
>>> >>> CUDA error in cudaGetDeviceCount on Pe 3, Pe 4, Pe 6: initialization
>>> >>> error.
>>> >>>
>>> >>> The two GTX cards are regularly activated with:
>>> >>> nvidia-smi -L
>>> >>> nvidia-smi -pm 1
>>> >>>
>>> >>> The X server driver and all the nvidia packages are the same version:
>>> >>>
>>> >>> francesco_at_gig64:~$ dpkg -l |grep nvidia
>>> >>> ii glx-alternative-nvidia 0.2.2
>>> >>> amd64 allows the selection of NVIDIA as GLX provider
>>> >>> ii libgl1-nvidia-alternatives 302.17-3
>>> >>> amd64 transition libGL.so* diversions to
>>> >>> glx-alternative-nvidia
>>> >>> ii libgl1-nvidia-glx:amd64 302.17-3
>>> >>> amd64 NVIDIA binary OpenGL libraries
>>> >>> ii libglx-nvidia-alternatives 302.17-3
>>> >>> amd64 transition libgl.so diversions to
>>> >>> glx-alternative-nvidia
>>> >>> ii libnvidia-ml1:amd64 302.17-3
>>> >>> amd64 NVIDIA management library (NVML) runtime library
>>> >>> ii nvidia-alternative 302.17-3
>>> >>> amd64 allows the selection of NVIDIA as GLX provider
>>> >>> ii nvidia-glx 302.17-3
>>> >>> amd64 NVIDIA metapackage
>>> >>> ii nvidia-installer-cleanup 20120630+3
>>> >>> amd64 Cleanup after driver installation with the
>>> >>> nvidia-installer
>>> >>> ii nvidia-kernel-common 20120630+3
>>> >>> amd64 NVIDIA binary kernel module support files
>>> >>> ii nvidia-kernel-dkms 302.17-3
>>> >>> amd64 NVIDIA binary kernel module DKMS source
>>> >>> ii nvidia-smi 302.17-3
>>> >>> amd64 NVIDIA System Management Interface
>>> >>> ii nvidia-support 20120630+3
>>> >>> amd64 NVIDIA binary graphics driver support files
>>> >>> ii nvidia-vdpau-driver:amd64 302.17-3
>>> >>> amd64 NVIDIA vdpau driver
>>> >>> ii nvidia-xconfig 302.17-2
>>> >>> amd64 X configuration tool for non-free NVIDIA drivers
>>> >>> ii xserver-xorg-video-nvidia 302.17-3
>>> >>> amd64 NVIDIA binary Xorg driver
>>> >>> francesco_at_gig64:~$
>>> >>>
>>> >>>
>>> >>> root_at_gig64:/home/francesco# modinfo nvidia
>>> >>> filename: /lib/modules/3.2.0-2-amd64/updates/dkms/nvidia.ko
>>> >>> alias: char-major-195-*
>>> >>> version: 302.17
>>> >>> supported: external
>>> >>> license: NVIDIA
>>> >>> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
>>> >>> alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
>>> >>> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
>>> >>> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
>>> >>> depends: i2c-core
>>> >>> vermagic: 3.2.0-2-amd64 SMP mod_unload modversions
>>> >>> parm: NVreg_EnableVia4x:int
>>> >>> parm: NVreg_EnableALiAGP:int
>>> >>> parm: NVreg_ReqAGPRate:int
>>> >>> parm: NVreg_EnableAGPSBA:int
>>> >>> parm: NVreg_EnableAGPFW:int
>>> >>> parm: NVreg_Mobile:int
>>> >>> parm: NVreg_ResmanDebugLevel:int
>>> >>> parm: NVreg_RmLogonRC:int
>>> >>> parm: NVreg_ModifyDeviceFiles:int
>>> >>> parm: NVreg_DeviceFileUID:int
>>> >>> parm: NVreg_DeviceFileGID:int
>>> >>> parm: NVreg_DeviceFileMode:int
>>> >>> parm: NVreg_RemapLimit:int
>>> >>> parm: NVreg_UpdateMemoryTypes:int
>>> >>> parm: NVreg_InitializeSystemMemoryAllocations:int
>>> >>> parm: NVreg_UseVBios:int
>>> >>> parm: NVreg_RMEdgeIntrCheck:int
>>> >>> parm: NVreg_UsePageAttributeTable:int
>>> >>> parm: NVreg_EnableMSI:int
>>> >>> parm: NVreg_MapRegistersEarly:int
>>> >>> parm: NVreg_RegisterForACPIEvents:int
>>> >>> parm: NVreg_RegistryDwords:charp
>>> >>> parm: NVreg_RmMsg:charp
>>> >>> parm: NVreg_NvAGP:int
>>> >>> root_at_gig64:/home/francesco#
>>> >>>
>>> >>> I have also tried with recently used MD input files; same problem:
>>> >>> francesco_at_gig64:~/tmp$ charmrun namd2 heat-01.conf +p6 +idlepoll 2>&1
>>> >>> | tee heat-01.log
>>> >>> Running command: namd2 heat-01.conf +p6 +idlepoll
>>> >>>
>>> >>> Charm++: standalone mode (not using charmrun)
>>> >>> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
>>> >>> CharmLB> Load balancer assumes all CPUs are same.
>>> >>> Charm++> Running on 1 unique compute nodes (12-way SMP).
>>> >>> Charm++> cpu topology info is gathered in 0.001 seconds.
>>> >>> Info: NAMD CVS-2012-06-20 for Linux-x86_64-multicore-CUDA
>>> >>> Info:
>>> >>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>>> >>> Info: for updates, documentation, and support information.
>>> >>> Info:
>>> >>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>>> >>> Info: in all publications reporting results obtained with NAMD.
>>> >>> Info:
>>> >>> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
>>> >>> Info: Built Wed Jun 20 02:24:32 CDT 2012 by jim on lisboa.ks.uiuc.edu
>>> >>> Info: 1 NAMD CVS-2012-06-20 Linux-x86_64-multicore-CUDA 6 gig64
>>> >>> francesco
>>> >>> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
>>> >>> Info: CPU topology information available.
>>> >>> Info: Charm++/Converse parallel runtime startup completed at
>>> >>> 0.00989199 s
>>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 5 (gig64):
>>> >>> initialization error
>>> >>> ------------- Processor 5 Exiting: Called CmiAbort ------------
>>> >>> Reason: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 5 (gig64):
>>> >>> initialization error
>>> >>>
>>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 1 (gig64):
>>> >>> initialization error
>>> >>> Program finished.
>>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
>>> >>> initialization error
>>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 2 (gig64):
>>> >>> initialization error
>>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 4 (gig64):
>>> >>> initialization error
>>> >>> francesco_at_gig64:~/tmp$
>>> >>>
>>> >>>
>>> >>> This is a shared-memory machine.
>>> >>> Does driver version 302.17 work for you?
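>>> >>>
>>> >>> (A check independent of namd: the toolkit's SDK samples include
>>> >>> deviceQuery, which calls the same cudaGetDeviceCount and also
>>> >>> prints the driver and runtime versions; the paths below are those
>>> >>> of the version-4 SDK and may differ on your system:
>>> >>>
>>> >>> cd ~/NVIDIA_GPU_Computing_SDK/C/src/deviceQuery && make
>>> >>> ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
>>> >>> )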
>>> >>>
>>> >>> Thanks
>>> >>> francesco pietra
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Aron Broom M.Sc
>>> >> PhD Student
>>> >> Department of Chemistry
>>> >> University of Waterloo
>>> >>
>>>
>>
>>
>>
>> --
>> Aron Broom M.Sc
>> PhD Student
>> Department of Chemistry
>> University of Waterloo
>>
