Re: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0 (thomasASUS): CUDA driver version is insufficient for CUDA runtime version

From: Aron Broom (broomsday_at_gmail.com)
Date: Mon Oct 22 2012 - 10:10:54 CDT


Norman, isn't it actually the opposite of what you wrote? Doesn't NAMD do
the electrostatic calculations on the GPU and everything else on the CPU?
So evaluating full electrostatics every 4 fs while the timestep is 2 fs
(fullelectfrequency 2) will be an improvement, because the code stays on
the CPU for two steps in a row before going to the GPU.
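
To keep the units straight: fullelectfrequency is given in NAMD as a number
of timesteps, not as a time, so the interval in fs is the product with the
timestep. A minimal Python sketch of that arithmetic (the function name is
made up for illustration, it is not part of NAMD):

```python
def full_elect_interval_fs(timestep_fs, full_elect_frequency):
    """Time in fs between full electrostatics evaluations.

    timestep_fs: integration timestep in femtoseconds.
    full_elect_frequency: number of timesteps between full evaluations.
    """
    return timestep_fs * full_elect_frequency


# With a 2 fs timestep, a value of 2 means every 4 fs and 4 means every 8 fs.
for freq in (1, 2, 4):
    print(freq, full_elect_interval_fs(2.0, freq), "fs")
```

So Norman's example value of 4 with a 2 fs timestep would mean full
electrostatics every 8 fs, while the 4 fs case above corresponds to 2.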

Thomas, I strongly second what Norman says about hyper-threading:
telling NAMD to use hyperthreads is extremely punishing to performance.
Test +p2, +p3, and +p4; one of them should work fairly well.

Is there a really good reason for using the newest CUDA release rather
than 4.x? NAMD comes with its own CUDA library, so maybe it doesn't
matter, but still, it wasn't made for 5.0.

On the same note, it looks like you installed the latest non-development
drivers; you might want to install the latest development drivers instead
(295 or something).

Do you need to use charmrun? You should download the binaries for
NAMD_2.9_Linux-x86_64-multicore-CUDA, and then you should just be able to
run the following, where n is the number of cores:

    namd2 +p n +idlepoll myconfig.namd
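
If you want to automate the +p2/+p3/+p4 comparison, here is a rough Python
sketch. It assumes the multicore-CUDA namd2 binary is on your PATH and that
the log contains "Info: Benchmark time:" lines in the format you posted; the
helper names (build_command, parse_benchmarks, sweep) are my own and not
part of NAMD. It uses the attached form +pN, as in your +p8 command.

```python
import re
import subprocess

# Matches lines such as:
# Info: Benchmark time: 8 CPUs 0.238132 s/step 1.37808 days/ns 359.961 MB memory
BENCH_RE = re.compile(
    r"Info: Benchmark time: (\d+) CPUs ([\d.]+) s/step "
    r"([\d.]+) days/ns ([\d.]+) MB memory"
)


def build_command(namd2, cores, config):
    """Build the argument list for one multicore run with the given core count."""
    return [namd2, f"+p{cores}", "+idlepoll", config]


def parse_benchmarks(log_text):
    """Extract every benchmark line from a NAMD log as a list of dicts."""
    return [
        {
            "cpus": int(m.group(1)),
            "s_per_step": float(m.group(2)),
            "days_per_ns": float(m.group(3)),
            "memory_mb": float(m.group(4)),
        }
        for m in BENCH_RE.finditer(log_text)
    ]


def sweep(namd2, config, core_counts=(2, 3, 4)):
    """Run NAMD once per core count and collect the benchmark lines.

    Each call runs the whole simulation described by the config file,
    so use a short test configuration.
    """
    results = {}
    for cores in core_counts:
        proc = subprocess.run(
            build_command(namd2, cores, config),
            capture_output=True, text=True, check=True,
        )
        results[cores] = parse_benchmarks(proc.stdout)
    return results
```

Calling sweep("namd2", "myconfig.namd") would then give you the s/step
values for each core count side by side.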

~Aron

On Mon, Oct 22, 2012 at 2:01 AM, Norman Geist <
norman.geist_at_uni-greifswald.de> wrote:

> Hi Thomas,
>
> As NAMD is only partly ported to GPU, it needs to switch between GPU and
> CPU at every timestep. To prevent NAMD from doing this, you can use a
> higher value for fullelectfrequency, for instance 4, to let NAMD stay at
> the GPU for 4 steps, before returning to CPU to do the electrostatic stuff.
> This will harm energy conservation and comes with a slight drift in
> temperature, but this can be controlled with a low-damping Langevin
> thermostat.
>
> Nevertheless, there should be a speedup of about 2-3 times compared to CPU
> only without this hack and about 5-10 times with it. As you have a mobile
> chipset, you should check the following things:
>
> 1. Make sure the GPU is allowed to run in performance rather than
> energy-saving mode (nvidia-smi).
>
> 2. Make sure it's running on PCIe 2 or higher (nvidia-smi).
>
> 3. Try comparing the timings for increasing numbers of CPUs with and
> without the GPU. This will show if you oversubscribe the GPU or the PCIe
> bus.
>
> 4. Are you really sure that your notebook has 8 physical cores? It
> doesn't make much sense to oversubscribe the GPU with HT cores.
>
> 5. Why do you need to set +devices?
>
> Let us know
>
> Norman Geist.
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> behalf of Thomas Evangelidis
> Sent: Saturday, 20 October 2012 18:47
> To: namd-l
> Subject: Re: namd-l: FATAL ERROR: CUDA error in cudaGetDeviceCount on
> Pe 0 (thomasASUS): CUDA driver version is insufficient for CUDA runtime
> version
>
> Hi again,
>
> I managed to install the latest NVIDIA drivers
> (NVIDIA-Linux-x86_64-304.51) and the latest production CUDA-5.0 release on
> my Asus N56V with an i7-3610QM and a GeForce GT 650M. The trick for NAMD to
> find my GPU was to pass "+devices 0" explicitly on the command line. The
> whole command line looked like this:
>
> ${NAMD_HOME}/charmrun ++local +p8 ${NAMD_HOME}/namd2 +idlepoll +devices 0
> production_default.amberff.octahedron.namd
>
> I used the precompiled binaries NAMD_CVS-2012-09-22_Linux-x86_64-multicore
> and NAMD_CVS-2012-09-22_Linux-x86_64-multicore-CUDA to monitor the
> performance on my system, which is a truncated octahedron with a protein
> (the ff I use is Amber99SB-NMR1-ILDN), 131788 TIP4P-Ew water atoms (32947
> waters; each TIP4P-Ew counts as 4 atoms in the Amber .prmtop), 93 Na and
> 113 Cl ions, namely 131788+93+113+2796=134790 atoms in total (the protein
> accounts for the remaining 2796). Surprisingly, the performance without the
> GPU is better, as you can see below.
>
> With the GPU:
> Info: Benchmark time: 8 CPUs 0.238132 s/step 1.37808 days/ns 359.961 MB
> memory
>
> Without the GPU:
> Info: Benchmark time: 8 CPUs 0.206626 s/step 1.19575 days/ns 720.852 MB
> memory
>
> The only case I get better performance with the GPU is when I run NAMD in
> serial mode:
>
> With the GPU:
> Info: Benchmark time: 1 CPUs 0.26001 s/step 1.50469 days/ns 256.984 MB
> memory
>
> Without the GPU:
> Info: Benchmark time: 1 CPUs 0.808154 s/step 4.67682 days/ns 504.398 MB
> memory
>
>
> For the apoa1 benchmark, NAMD complained about "++local" so I used the
> following command line:
>
> ${NAMD_HOME}//charmrun +p8 ${NAMD_HOME}//namd2 +idlepoll +devices 0
> apoa1.namd
>
> This time the performance was almost the same with and without the GPU:
>
> With the GPU:
> Info: Benchmark time: 8 CPUs 0.22935 s/step 2.65451 days/ns 280.375 MB
> memory
>
> Without the GPU:
> Info: Benchmark time: 8 CPUs 0.223781 s/step 2.59006 days/ns 696.184 MB
> memory
>
>
> Is there any parameter I can tweak to get better GPU performance for my
> system? Below is the GPU assignment when I run on all available cores.
>
> Pe 7 physical rank 7 will use CUDA device of pe 4
> Pe 2 physical rank 2 will use CUDA device of pe 4
> Pe 3 physical rank 3 will use CUDA device of pe 4
> Pe 6 physical rank 6 will use CUDA device of pe 4
> Pe 1 physical rank 1 will use CUDA device of pe 4
> Pe 5 physical rank 5 will use CUDA device of pe 4
> Pe 4 physical rank 4 binding to CUDA device 0 on thomasASUS: 'GeForce GT
> 650M' Mem: 2047MB Rev: 3.0
> Pe 0 physical rank 0 will use CUDA device of pe 4
>
>
>
> Thanks,
> Thomas
>
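
For reference, the benchmark lines quoted above can be compared directly.
The Python sketch below converts s/step to days/ns and computes the
CPU-only/GPU ratio for each pair of runs. The timesteps (2 fs for the
octahedron system, 1 fs for apoa1) are not stated in the post; I inferred
them from the ratio of the reported s/step and days/ns values, so treat
them as an assumption. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 86400.0
FS_PER_NS = 1.0e6


def days_per_ns(s_per_step, timestep_fs):
    """Convert wall-clock seconds per step into days per simulated ns."""
    steps_per_ns = FS_PER_NS / timestep_fs
    return s_per_step * steps_per_ns / SECONDS_PER_DAY


def gpu_speedup(cpu_s_per_step, gpu_s_per_step):
    """Ratio above 1 means the GPU run is faster than the CPU-only run."""
    return cpu_s_per_step / gpu_s_per_step


# (CPU-only s/step, GPU s/step) copied from the quoted benchmark lines.
runs = {
    "octahedron, 8 CPUs": (0.206626, 0.238132),
    "octahedron, 1 CPU": (0.808154, 0.26001),
    "apoa1, 8 CPUs": (0.223781, 0.22935),
}

for label, (cpu, gpu) in runs.items():
    print(f"{label}: GPU speedup x{gpu_speedup(cpu, gpu):.2f}")
```

Only the single-core run gives a ratio above 1, which is consistent with
the CPU threads oversubscribing the one GPU in the +p8 runs.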

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
