Re: namd on nvidia 302.17

From: Aron Broom (broomsday_at_gmail.com)
Date: Thu Sep 27 2012 - 07:58:40 CDT

So one potential problem here: is 302.17 a development driver, or just the
one your Linux distribution installs from its proprietary driver packages?
It looks to me like the newest development driver is version 295.41. I'm
not confident that you'd be able to run NAMD without the development driver
installed. The installation is manual, and it should overwrite whatever
driver you have there. I recommend a trip to the CUDA development zone
webpage.
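
As a rough illustration of the version check discussed here, below is a minimal Python sketch. It assumes the driver version is read from `modinfo nvidia` output shaped like the listing quoted later in this thread. The helper names (parse_driver_version, driver_at_least) and the 295.41 threshold are illustrative assumptions, not part of NAMD or the NVIDIA tools. It only compares version numbers; it cannot tell whether CUDA will actually initialize.

```python
import re

# Sample text shaped like the `modinfo nvidia` output quoted in this thread.
SAMPLE_MODINFO = """\
filename:       /lib/modules/3.2.0-2-amd64/updates/dkms/nvidia.ko
alias:          char-major-195-*
version:        302.17
supported:      external
license:        NVIDIA
"""


def parse_driver_version(modinfo_text):
    """Return the driver version as a tuple of ints, or None if absent."""
    # Anchor on the start of a line so "vermagic:" is never mistaken for "version:".
    match = re.search(r"^version:\s*(\d+(?:\.\d+)*)\s*$", modinfo_text, re.MULTILINE)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def driver_at_least(installed, required):
    """Compare versions numerically, so (302, 17) ranks above (295, 41)."""
    return installed is not None and installed >= required


if __name__ == "__main__":
    installed = parse_driver_version(SAMPLE_MODINFO)
    # 295.41 is the development driver mentioned above; treat it as a
    # placeholder threshold, not a documented NAMD requirement.
    print(installed, driver_at_least(installed, (295, 41)))
```

On a live system the input text could instead be taken from the standard output of `modinfo nvidia`, for example through Python's subprocess module.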

~Aron

On Thu, Sep 27, 2012 at 3:52 AM, Francesco Pietra <chiendarret_at_gmail.com> wrote:

> Hello:
> I have tried the NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA with
> nvidia version 302.17:
>
> Running command: namd2 heat-01.conf +p6 +idlepoll
>
> Charm++: standalone mode (not using charmrun)
> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
> CharmLB> Load balancer assumes all CPUs are same.
> Charm++> Running on 1 unique compute nodes (12-way SMP).
> Charm++> cpu topology info is gathered in 0.001 seconds.
> Info: NAMD CVS-2012-09-26 for Linux-x86_64-multicore-CUDA
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
> Info: Built Wed Sep 26 02:25:08 CDT 2012 by jim on lisboa.ks.uiuc.edu
> Info: 1 NAMD CVS-2012-09-26 Linux-x86_64-multicore-CUDA 6 gig64
> francesco
> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.085423 s
> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
> initialization error
> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 1 (gig64):
> initialization error
> ------------- Processor 3 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
> initialization error
>
> Program finished.
> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 4 (gig64):
> initialization error
> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 2 (gig64):
> initialization error
>
>
> As I received (nearly) no comments on these failures, I can only imagine
> that either (i) my question, obvious issues aside, was too silly to merit
> attention, or (ii) it is well known that nvidia version 302.17 is
> incompatible with current NAMD builds for GNU/Linux.
>
> In any event, because the driver is installed through metapackages, it is
> probably impossible within Debian GNU/Linux wheezy to go back to a
> previous version of nvidia. On the other hand, the stable release of the
> OS provides a much too old version of nvidia. Therefore, my question is:
>
> Is there any chance of compiling NAMD against the installed nvidia
> version 302.17?
>
> Thanks for any advice. Without access to namd-cuda I am currently unable
> to answer a question raised by the reviewers of a manuscript (the CPU
> cluster was shut down long ago, as it became too expensive for our
> budget).
>
> francesco pietra
>
> On Wed, Sep 26, 2012 at 4:08 PM, Francesco Pietra <chiendarret_at_gmail.com> wrote:
> > I forgot to mention that I am using the final 2.9 release of NAMD.
> > f.
> >
> > On Wed, Sep 26, 2012 at 4:05 PM, Aron Broom <broomsday_at_gmail.com> wrote:
> >> I'm not certain, but I think the driver version needs to match the CUDA
> >> toolkit version that NAMD uses, and I think the library file NAMD comes
> >> with is toolkit 4.0 or something of that sort.
> >>
> >> ~Aron
> >>
> >>
> >> On Wed, Sep 26, 2012 at 9:58 AM, Francesco Pietra <chiendarret_at_gmail.com> wrote:
> >>>
> >>> Hi:
> >>> After an update/upgrade of Debian GNU/Linux amd64 wheezy,
> >>> minimizations no longer run on the GTX-680:
> >>>
> >>> CUDA error in cudaGetDeviceCount on Pe 3, Pe 4, Pe 6: initialization error.
> >>>
> >>> The two GTX cards are activated properly with
> >>> nvidia-smi -L
> >>> nvidia-smi -pm 1
> >>>
> >>> The X server driver and the nvidia packages are at the same version:
> >>>
> >>> francesco_at_gig64:~$ dpkg -l |grep nvidia
> >>> ii  glx-alternative-nvidia      0.2.2       amd64  allows the selection of NVIDIA as GLX provider
> >>> ii  libgl1-nvidia-alternatives  302.17-3    amd64  transition libGL.so* diversions to glx-alternative-nvidia
> >>> ii  libgl1-nvidia-glx:amd64     302.17-3    amd64  NVIDIA binary OpenGL libraries
> >>> ii  libglx-nvidia-alternatives  302.17-3    amd64  transition libgl.so diversions to glx-alternative-nvidia
> >>> ii  libnvidia-ml1:amd64         302.17-3    amd64  NVIDIA management library (NVML) runtime library
> >>> ii  nvidia-alternative          302.17-3    amd64  allows the selection of NVIDIA as GLX provider
> >>> ii  nvidia-glx                  302.17-3    amd64  NVIDIA metapackage
> >>> ii  nvidia-installer-cleanup    20120630+3  amd64  Cleanup after driver installation with the nvidia-installer
> >>> ii  nvidia-kernel-common        20120630+3  amd64  NVIDIA binary kernel module support files
> >>> ii  nvidia-kernel-dkms          302.17-3    amd64  NVIDIA binary kernel module DKMS source
> >>> ii  nvidia-smi                  302.17-3    amd64  NVIDIA System Management Interface
> >>> ii  nvidia-support              20120630+3  amd64  NVIDIA binary graphics driver support files
> >>> ii  nvidia-vdpau-driver:amd64   302.17-3    amd64  NVIDIA vdpau driver
> >>> ii  nvidia-xconfig              302.17-2    amd64  X configuration tool for non-free NVIDIA drivers
> >>> ii  xserver-xorg-video-nvidia   302.17-3    amd64  NVIDIA binary Xorg driver
> >>> francesco_at_gig64:~$
> >>>
> >>>
> >>> root_at_gig64:/home/francesco# modinfo nvidia
> >>> filename: /lib/modules/3.2.0-2-amd64/updates/dkms/nvidia.ko
> >>> alias: char-major-195-*
> >>> version: 302.17
> >>> supported: external
> >>> license: NVIDIA
> >>> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
> >>> alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
> >>> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
> >>> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
> >>> depends: i2c-core
> >>> vermagic: 3.2.0-2-amd64 SMP mod_unload modversions
> >>> parm: NVreg_EnableVia4x:int
> >>> parm: NVreg_EnableALiAGP:int
> >>> parm: NVreg_ReqAGPRate:int
> >>> parm: NVreg_EnableAGPSBA:int
> >>> parm: NVreg_EnableAGPFW:int
> >>> parm: NVreg_Mobile:int
> >>> parm: NVreg_ResmanDebugLevel:int
> >>> parm: NVreg_RmLogonRC:int
> >>> parm: NVreg_ModifyDeviceFiles:int
> >>> parm: NVreg_DeviceFileUID:int
> >>> parm: NVreg_DeviceFileGID:int
> >>> parm: NVreg_DeviceFileMode:int
> >>> parm: NVreg_RemapLimit:int
> >>> parm: NVreg_UpdateMemoryTypes:int
> >>> parm: NVreg_InitializeSystemMemoryAllocations:int
> >>> parm: NVreg_UseVBios:int
> >>> parm: NVreg_RMEdgeIntrCheck:int
> >>> parm: NVreg_UsePageAttributeTable:int
> >>> parm: NVreg_EnableMSI:int
> >>> parm: NVreg_MapRegistersEarly:int
> >>> parm: NVreg_RegisterForACPIEvents:int
> >>> parm: NVreg_RegistryDwords:charp
> >>> parm: NVreg_RmMsg:charp
> >>> parm: NVreg_NvAGP:int
> >>> root_at_gig64:/home/francesco#
> >>>
> >>> I have also tried with recently used MD files, with the same problem:
> >>> francesco_at_gig64:~/tmp$ charmrun namd2 heat-01.conf +p6 +idlepoll 2>&1 | tee heat-01.log
> >>> Running command: namd2 heat-01.conf +p6 +idlepoll
> >>>
> >>> Charm++: standalone mode (not using charmrun)
> >>> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
> >>> CharmLB> Load balancer assumes all CPUs are same.
> >>> Charm++> Running on 1 unique compute nodes (12-way SMP).
> >>> Charm++> cpu topology info is gathered in 0.001 seconds.
> >>> Info: NAMD CVS-2012-06-20 for Linux-x86_64-multicore-CUDA
> >>> Info:
> >>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> >>> Info: for updates, documentation, and support information.
> >>> Info:
> >>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> >>> Info: in all publications reporting results obtained with NAMD.
> >>> Info:
> >>> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
> >>> Info: Built Wed Jun 20 02:24:32 CDT 2012 by jim on lisboa.ks.uiuc.edu
> >>> Info: 1 NAMD CVS-2012-06-20 Linux-x86_64-multicore-CUDA 6 gig64
> >>> francesco
> >>> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
> >>> Info: CPU topology information available.
> >>> Info: Charm++/Converse parallel runtime startup completed at 0.00989199 s
> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 5 (gig64):
> >>> initialization error
> >>> ------------- Processor 5 Exiting: Called CmiAbort ------------
> >>> Reason: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 5 (gig64):
> >>> initialization error
> >>>
> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 1 (gig64):
> >>> initialization error
> >>> Program finished.
> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
> >>> initialization error
> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 2 (gig64):
> >>> initialization error
> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 4 (gig64):
> >>> initialization error
> >>> francesco_at_gig64:~/tmp$
> >>>
> >>>
> >>> This is a shared-memory machine.
> >>> Does version 302.17 work for you?
> >>>
> >>> Thanks
> >>> francesco pietra
> >>>
> >>
> >>
> >>
> >> --
> >> Aron Broom M.Sc
> >> PhD Student
> >> Department of Chemistry
> >> University of Waterloo
> >>
>
>

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:22:07 CST