Re: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0 (thomasASUS): CUDA driver version is insufficient for CUDA runtime version

From: David Hardy (dhardy_at_ks.uiuc.edu)
Date: Mon Oct 22 2012 - 12:00:48 CDT

In reply to: Aron Broom: "Re: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0 (thomasASUS): CUDA driver version is insufficient for CUDA runtime version"

I'm addressing just a few of the issues from the previous comments.

NAMD uses the GPU to calculate only the short-range non-bonded interactions. With multiple time stepping, the "nonbondedFreq" parameter controls how often these interactions are calculated. Also, there are two main GPU compute kernels: one that calculates the energy and another that omits the energy calculation when it is not needed. The second is faster, so performance improves when "outputEnergies" is set higher than its default value of 1, for example to 100.
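
As a rough sketch, these settings might appear in a NAMD configuration file as shown below. Apart from the outputEnergies value of 100 mentioned above, the values are illustrative assumptions rather than recommendations, and any change to the multiple time stepping parameters should be checked for energy conservation on your own system.

```
# Hypothetical excerpt from a NAMD configuration file.
# All values are illustrative assumptions, except outputEnergies 100,
# which is the example given in the paragraph above.

# Integration time step in fs (assumed here).
timestep            2.0

# Multiple time stepping: evaluate the short-range non-bonded
# interactions (the part that runs on the GPU) every step.
nonbondedFreq       1

# Evaluate full electrostatics every 2 steps (4 fs at a 2 fs time step).
fullElectFrequency  2

# Request energies only every 100 steps, so that the faster GPU kernel
# without the energy calculation can be used on the steps in between.
outputEnergies      100
```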

Most of the compute-intensive parts of NAMD are tuned well enough that hyper-threading is likely to increase cache misses, which will reduce performance.

-Dave

On Oct 22, 2012, at 10:10 AM, Aron Broom wrote:

> Norman, isn't it actually the opposite of what you wrote? Doesn't NAMD do the electrostatic calculations on the GPU and everything else on the CPU? So having fullelectfrequency at 4 fs while the timestep is 2 fs will be an improvement because the code stays on the CPU for two steps in a row before going to the GPU.
>
> Thomas, I strongly second what Norman says about hyper-threading: telling NAMD to use hyperthreads is extremely punishing to performance. Test +p2, +p3, and +p4; one of them should work fairly well.
>
> Is there a really good reason for using the newest CUDA release rather than 4.x? NAMD comes with its own CUDA library, so maybe it doesn't matter, but still, it wasn't made for 5.0.
>
> On the same note, it looks like you installed the latest non-development drivers; you might want to install the latest development drivers instead (295 or something).
>
> Do you need to use charmrun? You should download the binaries for NAMD_2.9_Linux-x86_64-multicore-CUDA, and then you should just be able to run: namd2 +p n +idlepoll myconfig.namd
>
> ~Aron
>
> On Mon, Oct 22, 2012 at 2:01 AM, Norman Geist <norman.geist_at_uni-greifswald.de> wrote:
> Hi Thomas,
>
>
>
> As NAMD is only partly ported to the GPU, it needs to switch between GPU and CPU at every timestep. To prevent NAMD from doing this, you can use a higher value for fullelectfrequency, for instance 4, to let NAMD stay on the GPU for 4 steps before returning to the CPU to do the electrostatics. This will harm energy conservation and comes with a slight drift in temperature, but that can be controlled with a low-damping Langevin thermostat.
>
>
>
> Nevertheless, there should be a speedup of about 2-3 times compared to CPU only without this hack, and about 5-10 times with it. As you have a mobile chipset, you should check the following things:
>
>
>
> 1. Make sure the GPU is allowed to run in performance mode rather than energy-saving mode. (nvidia-smi)
>
> 2. Make sure it’s running on PCIe 2 or higher. (nvidia-smi)
>
> 3. Try comparing the timings for increasing numbers of CPU cores, with and without the GPU.
>
> This will show whether you are oversubscribing the GPU or the PCIe link.
>
> 4. Are you really sure that your notebook has 8 physical cores?
>
> It doesn’t make much sense to oversubscribe the GPU with HT cores.
>
> 5. Why do you need to set the +devices?
>
>
>
> Let us know
>
> Norman Geist.
>
>
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Thomas Evangelidis
> Sent: Saturday, October 20, 2012 18:47
> To: namd-l
> Subject: Re: namd-l: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0 (thomasASUS): CUDA driver version is insufficient for CUDA runtime version
>
>
>
> Hi again,
>
> I managed to install the latest NVIDIA drivers (NVIDIA-Linux-x86_64-304.51) and the latest production CUDA-5.0 release on my Asus N56V with an i7-3610QM and a GeForce GT 650M. The trick for getting NAMD to find my GPU was to pass "+devices 0" explicitly on the command line. The whole command line looked like this:
>
> ${NAMD_HOME}/charmrun ++local +p8 ${NAMD_HOME}/namd2 +idlepoll +devices 0 production_default.amberff.octahedron.namd
>
> I used the precompiled binaries NAMD_CVS-2012-09-22_Linux-x86_64-multicore and NAMD_CVS-2012-09-22_Linux-x86_64-multicore-CUDA to benchmark my system, which is a truncated octahedron containing a protein (2796 atoms; the force field I use is Amber99SB-NMR1-ILDN), 131788 TIP4P-Ew water atoms (32947 waters; each TIP4P-Ew water counts as 4 atoms in the Amber .prmtop), and 93 Na and 113 Cl ions, namely 131788+93+113+2796=134790 atoms in total. Surprisingly, the performance without the GPU is better, as you can see below.
>
> With the GPU:
> Info: Benchmark time: 8 CPUs 0.238132 s/step 1.37808 days/ns 359.961 MB memory
>
> Without the GPU:
> Info: Benchmark time: 8 CPUs 0.206626 s/step 1.19575 days/ns 720.852 MB memory
>
> The only case where I get better performance with the GPU is when I run NAMD in serial mode:
>
> With the GPU:
> Info: Benchmark time: 1 CPUs 0.26001 s/step 1.50469 days/ns 256.984 MB memory
>
> Without the GPU:
> Info: Benchmark time: 1 CPUs 0.808154 s/step 4.67682 days/ns 504.398 MB memory
>
>
> For the apoa1 benchmark, NAMD complained about "++local", so I used the following command line:
>
> ${NAMD_HOME}//charmrun +p8 ${NAMD_HOME}//namd2 +idlepoll +devices 0 apoa1.namd
>
> This time the performance was almost the same with and without the GPU:
>
> With the GPU:
> Info: Benchmark time: 8 CPUs 0.22935 s/step 2.65451 days/ns 280.375 MB memory
>
> Without the GPU:
> Info: Benchmark time: 8 CPUs 0.223781 s/step 2.59006 days/ns 696.184 MB memory
>
>
> Is there any parameter I can tweak to get better GPU performance for my system??? Below is the GPU assignment when I run on all available cores.
>
> Pe 7 physical rank 7 will use CUDA device of pe 4
> Pe 2 physical rank 2 will use CUDA device of pe 4
> Pe 3 physical rank 3 will use CUDA device of pe 4
> Pe 6 physical rank 6 will use CUDA device of pe 4
> Pe 1 physical rank 1 will use CUDA device of pe 4
> Pe 5 physical rank 5 will use CUDA device of pe 4
> Pe 4 physical rank 4 binding to CUDA device 0 on thomasASUS: 'GeForce GT 650M' Mem: 2047MB Rev: 3.0
> Pe 0 physical rank 0 will use CUDA device of pe 4
>
>
>
> Thanks,
> Thomas
>
>
>
>
> --
> Aron Broom M.Sc
> PhD Student
> Department of Chemistry
> University of Waterloo
>

--
David J. Hardy, Ph.D.
Theoretical and Computational Biophysics
Beckman Institute, University of Illinois
dhardy_at_ks.uiuc.edu
http://www.ks.uiuc.edu/~dhardy/
