Re: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0 (thomasASUS): CUDA driver version is insufficient for CUDA runtime version

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Oct 23 2012 - 00:37:16 CDT

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf
of Aron Broom
Sent: Monday, 22 October 2012 17:11
To: Norman Geist
Cc: Thomas Evangelidis; Namd Mailing List
Subject: Re: namd-l: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0
(thomasASUS): CUDA driver version is insufficient for CUDA runtime version

 

Hi Aron,

 

Norman, isn't it actually the opposite of what you wrote? Doesn't NAMD do
the electrostatic calculations on the GPU and everything else on the CPU?
So setting fullElectFrequency to 4 fs while the timestep is 2 fs would be an
improvement because the code stays on the CPU for two steps in a row before
going back to the GPU.

No, NAMD does the non-bonded work on the GPU and, since recently, also the
energy computation. What really pulls down the performance of the GPUs, and
prevents NAMD from being another 5 times faster, is the full electrostatics
that still has to be done on the CPU.

Thomas, I strongly second what Norman says about hyper-threading: telling
NAMD to use hyperthreads is extremely punishing to performance.
Test +p2, +p3, and +p4; one should work fairly well.

Is there a really good reason for using the newest CUDA release rather than
4.x? NAMD comes with its own CUDA library, so maybe it doesn't matter, but
still, it wasn't made for 5.0.

On the same note, it looks like you installed the latest non-development
drivers; you might want to instead install the latest development drivers
(295 or something).

Do you need to use charmrun? You could download the
NAMD_2.9_Linux-x86_64-multicore-CUDA binaries, and then you should just be
able to run: namd2 +p n +idlepoll myconfig.namd
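The multicore binary and the core-count scan suggested above can be combined into a quick benchmark loop; a sketch, assuming the multicore-CUDA namd2 is on the PATH and with myconfig.namd as a placeholder config name:

```
# Scan core counts to find the sweet spot below the hyper-threading limit,
# then compare the "Benchmark time" lines from the logs.
for n in 1 2 3 4; do
    namd2 +p$n +idlepoll myconfig.namd > bench_p$n.log
done
grep "Benchmark time" bench_p*.log
```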

~Aron

On Mon, Oct 22, 2012 at 2:01 AM, Norman Geist
<norman.geist_at_uni-greifswald.de> wrote:

Hi Thomas,

 

as NAMD is only partly ported to the GPU, it needs to switch between GPU and
CPU at every timestep. To prevent NAMD from doing this, you can use a higher
value for fullElectFrequency, for instance 4, to let NAMD stay on the GPU
for 4 steps before returning to the CPU to do the electrostatics. This
will harm energy conservation and comes with a slight drift in temperature,
but that can be controlled with a weakly damped Langevin thermostat.
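In NAMD config terms, this hack might look like the following sketch (the parameter names are real NAMD keywords, but the values here are purely illustrative, not a recommendation for any particular system):

```
timestep            2.0   ;# 2 fs per step
nonbondedFreq       1     ;# short-range non-bonded every step (GPU)
fullElectFrequency  4     ;# full electrostatics only every 4 steps (CPU)
langevin            on
langevinDamping     1     ;# weak damping to control the temperature drift
langevinTemp        300
```

fullElectFrequency must be a multiple of nonbondedFreq; the larger the gap between them, the less often NAMD has to come back to the CPU, at the cost of worse energy conservation.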

 

Nevertheless, there should be a speedup of about 2-3x compared to CPU only
without this hack, and about 5-10x with it. As you have a mobile chipset, you
should check the following things:

 

1. Make sure the GPU is allowed to run in performance rather than energy-
saving mode (nvidia-smi).

2. Make sure it's running on PCIe 2.0 or higher (nvidia-smi).

3. Try comparing the timings for rising numbers of CPUs, with and
without the GPU. This will show whether you oversubscribe the GPU or the
PCIe bus.

4. Are you really sure that your notebook has 8 physical cores?
It doesn't make much sense to oversubscribe the GPU with HT cores.

5. Why do you need to set +devices?
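Points 1 and 2 above can be checked from the command line; a sketch (the exact output fields vary between driver versions):

```
# Full query: look for "Performance State" (P0 = full performance)
# and the PCIe link generation/width under "GPU Link Info".
nvidia-smi -q

# Restrict the query to power/performance details on newer drivers.
nvidia-smi -q -d POWER
```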

 

Let us know

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf
of Thomas Evangelidis
Sent: Saturday, 20 October 2012 18:47
To: namd-l
Subject: Re: namd-l: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0
(thomasASUS): CUDA driver version is insufficient for CUDA runtime version

 

Hi again,

I managed to install the latest NVIDIA drivers (NVIDIA-Linux-x86_64-304.51)
and the latest production CUDA-5.0 release on my AsusN56V with i7-3610QM and
GeForce GT 650M. The trick for NAMD to find my GPU was to explicitly pass
"+devices 0" on the command line. The whole command line looked like this:

${NAMD_HOME}/charmrun ++local +p8 ${NAMD_HOME}/namd2 +idlepoll +devices 0
production_default.amberff.octahedron.namd

I used the precompiled binaries NAMD_CVS-2012-09-22_Linux-x86_64-multicore
and NAMD_CVS-2012-09-22_Linux-x86_64-multicore-CUDA to monitor the
performance on my system, which is a truncated octahedron containing a
protein (the force field I use is Amber99SB-NMR1-ILDN), 131788 TIP4P-Ew
water atoms (32947 waters; each TIP4P-Ew water counts as 4 atoms in the
Amber .prmtop), and 93 Na and 113 Cl ions; i.e.
131788+93+113+2796=134790 atoms in total. Surprisingly, the performance
without the GPU is better, as you can see below.

With the GPU:
Info: Benchmark time: 8 CPUs 0.238132 s/step 1.37808 days/ns 359.961 MB
memory

Without the GPU:
Info: Benchmark time: 8 CPUs 0.206626 s/step 1.19575 days/ns 720.852 MB
memory

The only case I get better performance with the GPU is when I run NAMD in
serial mode:

With the GPU:
Info: Benchmark time: 1 CPUs 0.26001 s/step 1.50469 days/ns 256.984 MB
memory

Without the GPU:
Info: Benchmark time: 1 CPUs 0.808154 s/step 4.67682 days/ns 504.398 MB
memory
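The benchmark lines above are related by simple arithmetic: with a 2 fs timestep, one ns takes 500000 steps, so days/ns follows directly from s/step. A small check of the reported numbers (the last figure is just the ratio of the two serial timings):

```python
def days_per_ns(sec_per_step, timestep_fs=2.0):
    """Convert NAMD's s/step benchmark figure to days/ns."""
    steps_per_ns = 1e6 / timestep_fs          # 1 ns = 1e6 fs
    return sec_per_step * steps_per_ns / 86400.0  # 86400 s per day

print(round(days_per_ns(0.238132), 5))  # 8 CPUs + GPU  -> 1.37808
print(round(days_per_ns(0.206626), 5))  # 8 CPUs, no GPU -> 1.19575
print(round(0.808154 / 0.26001, 2))     # serial GPU speedup -> 3.11
```

So the serial run shows roughly the 2-3x GPU speedup Norman mentioned, while on 8 threads the GPU run is actually slower than the CPU-only run.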

For the apoa1 benchmark, NAMD complained about "++local", so I used the
following command line:

${NAMD_HOME}/charmrun +p8 ${NAMD_HOME}/namd2 +idlepoll +devices 0
apoa1.namd

This time the performance was almost the same with and without the GPU:

With the GPU:
Info: Benchmark time: 8 CPUs 0.22935 s/step 2.65451 days/ns 280.375 MB
memory

Without the GPU:
Info: Benchmark time: 8 CPUs 0.223781 s/step 2.59006 days/ns 696.184 MB
memory

Is there any parameter I can tweak to get better GPU performance for my
system? Below is the GPU assignment when I run on all available cores.

Pe 7 physical rank 7 will use CUDA device of pe 4
Pe 2 physical rank 2 will use CUDA device of pe 4
Pe 3 physical rank 3 will use CUDA device of pe 4
Pe 6 physical rank 6 will use CUDA device of pe 4
Pe 1 physical rank 1 will use CUDA device of pe 4
Pe 5 physical rank 5 will use CUDA device of pe 4
Pe 4 physical rank 4 binding to CUDA device 0 on thomasASUS: 'GeForce GT
650M' Mem: 2047MB Rev: 3.0
Pe 0 physical rank 0 will use CUDA device of pe 4

Thanks,
Thomas

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:22:11 CST