Re: namd on nvidia 302.17

From: Aron Broom (broomsday_at_gmail.com)
Date: Thu Sep 27 2012 - 12:56:23 CDT

glad to hear you fixed it. I guess that means you can use the normal
"non-development" driver to run NAMD? That is good news.

On Thu, Sep 27, 2012 at 1:39 PM, Francesco Pietra <chiendarret_at_gmail.com> wrote:

> SOLVED. Although not revealed by the tests "dpkg -l | grep nvidia" and
> "modinfo nvidia", there was a mismatch between the runtime and the
> driver. When this became clear, a new "apt-get upgrade" installed a
> mixture of versions 302 and 304, creating a mess. I had to correct this
> manually by installing the specific version 304 for all packages with
> "apt-get install <package>=<version>".
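>
> For illustration only, the fix looked roughly like the lines below. The
> exact 304 version string is not written down here, so treat the
> placeholders as assumptions and check "apt-cache policy <package>" first:
>
>   apt-get install nvidia-kernel-dkms=<304-version> nvidia-glx=<304-version> \
>                   nvidia-smi=<304-version> xserver-xorg-video-nvidia=<304-version>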
>
> Given how much time was lost on trivial problems - and posted to the NAMD
> list as non-existent NAMD problems (I apologize for that) - I now think
> it would be better (at least for people using the OS for scientific
> purposes) to install the driver the "nvidia way" rather than the
> "Debian way", so that it stays fixed when upgrading Debian. As I am
> presently short of time, I have decided not to upgrade Debian again
> until I have enough free time to change the "way" (see the sketch below).
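>
> In the meantime, a sketch of what could keep apt from touching the driver
> packages on routine upgrades; the package list is an assumption based on
> what is installed here, and the same would be done for the other nvidia
> packages:
>
>   echo nvidia-kernel-dkms hold | dpkg --set-selections
>   echo nvidia-glx hold | dpkg --set-selections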
>
> Thanks
> francesco Pietra
>
> On Thu, Sep 27, 2012 at 3:54 PM, Francesco Pietra <chiendarret_at_gmail.com>
> wrote:
> > For me there are two ways of getting CUDA to work: (a) install the
> > driver according to nvidia (as is probably implied in what you
> > suggested); (b) rely on Debian amd64, which furnishes a precompiled
> > nvidia driver. I adopted (b) because upgrading is automatic and Debian
> > is notoriously highly reliable.
> >
> > I did not take note of the cuda driver I had just before the "fatal"
> > upgrade, but it had been months since I last upgraded. The version noted
> > on my amd64 notebook is 295.53; probably I upgraded from that version.
> >
> > Now, on amd64, version 304.48.1 is available, while on my system
> > version 302.17-3 is installed, along with the basic
> > nvidia-kernel-dkms, as I posted initially. All under cuda-toolkit
> > version 4 (although this is not used in the "Debian way" of my
> > installation).
> >
> > The output of
> >
> > dpkg -l |grep nvidia
> >
> > modinfo nvidia
> >
> > which I posted initially, indicates, in my experience, that everything
> > is working correctly. On this basis, I suspected that 302.17-3 is too
> > advanced for current namd builds, although everything is under toolkit
> > 4 (or an equivalent setup).
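> >
> > (A quick cross-check that the user-space driver and the loaded kernel
> > module agree - /proc/driver/nvidia/version is standard, the rest is
> > only a guess on my part:
> >
> >   cat /proc/driver/nvidia/version   # version of the loaded kernel module
> >   nvidia-smi                        # version the user-space driver reports;
> >                                     # it errors out on a driver/library mismatch
> >
> > If the two disagree, that would be a mismatch that "dpkg -l" and
> > "modinfo nvidia" alone do not show.)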
> >
> > I could try to install the 295 driver in place of 302, but probably
> > someone knows better than me what to expect. Moving forward is easy;
> > going back, with any OS, is a matter for experts.
> >
> > I am not sure that all I said is correct. I am a biochemist, not a
> > software expert.
> >
> > Thanks for your kind attention.
> >
> > francesco pietra
> >
> > On Thu, Sep 27, 2012 at 2:58 PM, Aron Broom <broomsday_at_gmail.com> wrote:
> >> So one potential problem here: is 302.17 a development driver, or just
> >> the one Linux installs itself from the proprietary drivers? It looks to
> >> me like the absolutely newest development driver is ver 295.41. I'm not
> >> confident that you'd be able to run NAMD without the development driver
> >> installed. The installation is manual, and it should overwrite whatever
> >> driver you have there. I recommend a trip to the CUDA development zone
> >> webpage.
> >>
> >> ~Aron
> >>
> >> On Thu, Sep 27, 2012 at 3:52 AM, Francesco Pietra
> >> <chiendarret_at_gmail.com> wrote:
> >>>
> >>> Hello:
> >>> I have tried the NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA with
> >>> nvidia version 302.17:
> >>>
> >>> Running command: namd2 heat-01.conf +p6 +idlepoll
> >>>
> >>> Charm++: standalone mode (not using charmrun)
> >>> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
> >>> CharmLB> Load balancer assumes all CPUs are same.
> >>> Charm++> Running on 1 unique compute nodes (12-way SMP).
> >>> Charm++> cpu topology info is gathered in 0.001 seconds.
> >>> Info: NAMD CVS-2012-09-26 for Linux-x86_64-multicore-CUDA
> >>> Info:
> >>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> >>> Info: for updates, documentation, and support information.
> >>> Info:
> >>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> >>> Info: in all publications reporting results obtained with NAMD.
> >>> Info:
> >>> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
> >>> Info: Built Wed Sep 26 02:25:08 CDT 2012 by jim on lisboa.ks.uiuc.edu
> >>> Info: 1 NAMD CVS-2012-09-26 Linux-x86_64-multicore-CUDA 6 gig64 francesco
> >>> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
> >>> Info: CPU topology information available.
> >>> Info: Charm++/Converse parallel runtime startup completed at 0.085423 s
> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
> >>> initialization error
> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 1 (gig64):
> >>> initialization error
> >>> ------------- Processor 3 Exiting: Called CmiAbort ------------
> >>> Reason: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
> >>> initialization error
> >>>
> >>> Program finished.
> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 4 (gig64):
> >>> initialization error
> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 2 (gig64):
> >>> initialization error
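> >>>
> >>> (To separate a NAMD problem from a CUDA runtime problem, one could run
> >>> a minimal standalone check of cudaGetDeviceCount; this is only a sketch
> >>> and assumes nvcc from the toolkit is on the PATH:
> >>>
> >>>   cat > devcount.cu <<'EOF'
> >>>   #include <cstdio>
> >>>   #include <cuda_runtime.h>
> >>>   int main() {
> >>>       int n = 0;
> >>>       // same call that NAMD makes at startup
> >>>       cudaError_t err = cudaGetDeviceCount(&n);
> >>>       if (err != cudaSuccess) {
> >>>           std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
> >>>           return 1;
> >>>       }
> >>>       std::printf("CUDA devices found: %d\n", n);
> >>>       return 0;
> >>>   }
> >>>   EOF
> >>>   nvcc -o devcount devcount.cu && ./devcount
> >>>
> >>> If this also reports an initialization error, the problem lies in the
> >>> driver/runtime installation rather than in NAMD.)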
> >>>
> >>>
> >>> As I received (nearly) no comments on these failures, I can only imagine
> >>> that either (i) my question - disregarding obvious issues - was too silly
> >>> to merit attention, or (ii) it is well known that nvidia version 302.17
> >>> is incompatible with current namd builds for GNU-Linux.
> >>>
> >>> At any rate, given the metapackage structure, it is probably impossible
> >>> within Debian GNU-Linux wheezy to go back to a previous version of
> >>> nvidia. On the other hand, the stable version of the OS furnishes a
> >>> much too old version of nvidia. Therefore, my question is:
> >>>
> >>> Is there any chance to compile namd against the installed nvidia
> >>> version 302.17?
> >>>
> >>> Thanks for any advice. Without access to namd-cuda I am currently unable
> >>> to answer a question raised by the reviewers of a manuscript (the CPU
> >>> cluster was shut down long ago, as it became too expensive for our
> >>> budget).
> >>>
> >>> francesco pietra
> >>>
> >>> On Wed, Sep 26, 2012 at 4:08 PM, Francesco Pietra
> >>> <chiendarret_at_gmail.com> wrote:
> >>> > I forgot to mention that I am at final version 2.9 of namd.
> >>> > f.
> >>> >
> >>> > On Wed, Sep 26, 2012 at 4:05 PM, Aron Broom <broomsday_at_gmail.com> wrote:
> >>> >> I'm not certain, but I think the driver version needs to match the
> >>> >> CUDA toolkit version that NAMD uses, and I think the library file
> >>> >> NAMD comes with is toolkit 4.0 or something of that sort.
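> >>> >>
> >>> >> For instance, from the directory holding the namd2 binary, something
> >>> >> like the following should show which CUDA runtime it was linked
> >>> >> against (the exact library file name is an assumption on my part):
> >>> >>
> >>> >>   ldd ./namd2 | grep cudart
> >>> >>   ls libcudart.so.*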
> >>> >>
> >>> >> ~Aron
> >>> >>
> >>> >>
> >>> >> On Wed, Sep 26, 2012 at 9:58 AM, Francesco Pietra
> >>> >> <chiendarret_at_gmail.com> wrote:
> >>> >>>
> >>> >>> Hi:
> >>> >>> Following an update/upgrade of Debian GNU-Linux amd64 wheezy,
> >>> >>> minimizations no longer run on the GTX-680:
> >>> >>>
> >>> >>> CUDA error in CudaGetDeviceCount on Pe3, Pe4, Pe6. Initialization
> >>> >>> error.
> >>> >>>
> >>> >>> The two GTX are regularly activated with
> >>> >>> nvidia-smi -L
> >>> >>> nvidia-smi -pm 1
> >>> >>>
> >>> >>> The X server driver and the nvidia packages are all the same version:
> >>> >>>
> >>> >>> francesco_at_gig64:~$ dpkg -l |grep nvidia
> >>> >>> ii  glx-alternative-nvidia       0.2.2       amd64  allows the selection of NVIDIA as GLX provider
> >>> >>> ii  libgl1-nvidia-alternatives   302.17-3    amd64  transition libGL.so* diversions to glx-alternative-nvidia
> >>> >>> ii  libgl1-nvidia-glx:amd64      302.17-3    amd64  NVIDIA binary OpenGL libraries
> >>> >>> ii  libglx-nvidia-alternatives   302.17-3    amd64  transition libgl.so diversions to glx-alternative-nvidia
> >>> >>> ii  libnvidia-ml1:amd64          302.17-3    amd64  NVIDIA management library (NVML) runtime library
> >>> >>> ii  nvidia-alternative           302.17-3    amd64  allows the selection of NVIDIA as GLX provider
> >>> >>> ii  nvidia-glx                   302.17-3    amd64  NVIDIA metapackage
> >>> >>> ii  nvidia-installer-cleanup     20120630+3  amd64  Cleanup after driver installation with the nvidia-installer
> >>> >>> ii  nvidia-kernel-common         20120630+3  amd64  NVIDIA binary kernel module support files
> >>> >>> ii  nvidia-kernel-dkms           302.17-3    amd64  NVIDIA binary kernel module DKMS source
> >>> >>> ii  nvidia-smi                   302.17-3    amd64  NVIDIA System Management Interface
> >>> >>> ii  nvidia-support               20120630+3  amd64  NVIDIA binary graphics driver support files
> >>> >>> ii  nvidia-vdpau-driver:amd64    302.17-3    amd64  NVIDIA vdpau driver
> >>> >>> ii  nvidia-xconfig               302.17-2    amd64  X configuration tool for non-free NVIDIA drivers
> >>> >>> ii  xserver-xorg-video-nvidia    302.17-3    amd64  NVIDIA binary Xorg driver
> >>> >>> francesco_at_gig64:~$
> >>> >>>
> >>> >>>
> >>> >>> root_at_gig64:/home/francesco# modinfo nvidia
> >>> >>> filename: /lib/modules/3.2.0-2-amd64/updates/dkms/nvidia.ko
> >>> >>> alias: char-major-195-*
> >>> >>> version: 302.17
> >>> >>> supported: external
> >>> >>> license: NVIDIA
> >>> >>> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
> >>> >>> alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
> >>> >>> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
> >>> >>> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
> >>> >>> depends: i2c-core
> >>> >>> vermagic: 3.2.0-2-amd64 SMP mod_unload modversions
> >>> >>> parm: NVreg_EnableVia4x:int
> >>> >>> parm: NVreg_EnableALiAGP:int
> >>> >>> parm: NVreg_ReqAGPRate:int
> >>> >>> parm: NVreg_EnableAGPSBA:int
> >>> >>> parm: NVreg_EnableAGPFW:int
> >>> >>> parm: NVreg_Mobile:int
> >>> >>> parm: NVreg_ResmanDebugLevel:int
> >>> >>> parm: NVreg_RmLogonRC:int
> >>> >>> parm: NVreg_ModifyDeviceFiles:int
> >>> >>> parm: NVreg_DeviceFileUID:int
> >>> >>> parm: NVreg_DeviceFileGID:int
> >>> >>> parm: NVreg_DeviceFileMode:int
> >>> >>> parm: NVreg_RemapLimit:int
> >>> >>> parm: NVreg_UpdateMemoryTypes:int
> >>> >>> parm: NVreg_InitializeSystemMemoryAllocations:int
> >>> >>> parm: NVreg_UseVBios:int
> >>> >>> parm: NVreg_RMEdgeIntrCheck:int
> >>> >>> parm: NVreg_UsePageAttributeTable:int
> >>> >>> parm: NVreg_EnableMSI:int
> >>> >>> parm: NVreg_MapRegistersEarly:int
> >>> >>> parm: NVreg_RegisterForACPIEvents:int
> >>> >>> parm: NVreg_RegistryDwords:charp
> >>> >>> parm: NVreg_RmMsg:charp
> >>> >>> parm: NVreg_NvAGP:int
> >>> >>> root_at_gig64:/home/francesco#
> >>> >>>
> >>> >>> I have also tried with recently used MD files, same problem:
> >>> >>> francesco_at_gig64:~/tmp$ charmrun namd2 heat-01.conf +p6 +idlepoll 2>&1 | tee heat-01.log
> >>> >>> Running command: namd2 heat-01.conf +p6 +idlepoll
> >>> >>>
> >>> >>> Charm++: standalone mode (not using charmrun)
> >>> >>> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
> >>> >>> CharmLB> Load balancer assumes all CPUs are same.
> >>> >>> Charm++> Running on 1 unique compute nodes (12-way SMP).
> >>> >>> Charm++> cpu topology info is gathered in 0.001 seconds.
> >>> >>> Info: NAMD CVS-2012-06-20 for Linux-x86_64-multicore-CUDA
> >>> >>> Info:
> >>> >>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> >>> >>> Info: for updates, documentation, and support information.
> >>> >>> Info:
> >>> >>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> >>> >>> Info: in all publications reporting results obtained with NAMD.
> >>> >>> Info:
> >>> >>> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
> >>> >>> Info: Built Wed Jun 20 02:24:32 CDT 2012 by jim on lisboa.ks.uiuc.edu
> >>> >>> Info: 1 NAMD CVS-2012-06-20 Linux-x86_64-multicore-CUDA 6 gig64 francesco
> >>> >>> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
> >>> >>> Info: CPU topology information available.
> >>> >>> Info: Charm++/Converse parallel runtime startup completed at 0.00989199 s
> >>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 5 (gig64):
> >>> >>> initialization error
> >>> >>> ------------- Processor 5 Exiting: Called CmiAbort ------------
> >>> >>> Reason: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 5 (gig64):
> >>> >>> initialization error
> >>> >>>
> >>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 1 (gig64):
> >>> >>> initialization error
> >>> >>> Program finished.
> >>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
> >>> >>> initialization error
> >>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 2 (gig64):
> >>> >>> initialization error
> >>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 4 (gig64):
> >>> >>> initialization error
> >>> >>> francesco_at_gig64:~/tmp$
> >>> >>>
> >>> >>>
> >>> >>> This is a shared-mem machine.
> >>> >>> Does driver version 302.17 work for you?
> >>> >>>
> >>> >>> Thanks
> >>> >>> francesco pietra
> >>> >>>
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Aron Broom M.Sc
> >>> >> PhD Student
> >>> >> Department of Chemistry
> >>> >> University of Waterloo
> >>> >>
> >>>
> >>
> >>
> >>
> >> --
> >> Aron Broom M.Sc
> >> PhD Student
> >> Department of Chemistry
> >> University of Waterloo
> >>
>

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
