From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Oct 23 2012 - 00:37:16 CDT
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Aron Broom
Sent: Monday, 22 October 2012 17:11
To: Norman Geist
Cc: Thomas Evangelidis; Namd Mailing List
Subject: Re: namd-l: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0
(thomasASUS): CUDA driver version is insufficient for CUDA runtime version
 
Hi Aron,
 
Norman, isn't it actually the opposite of what you wrote? Doesn't NAMD do
the electrostatic calculations on the GPU and everything else on the CPU?
So having fullelectfrequency at 4 fs while the timestep is 2 fs will be an
improvement because the code stays on the CPU for two steps in a row before
going to the GPU.
No, NAMD does the non-bonded work on the GPU and, as of recently, the
energy computation as well. What really pulls down GPU performance and
keeps NAMD from being another 5 times faster is the electrostatics work
that still has to be done on the CPU.
Thomas, I strongly second what Norman says about hyper-threading: telling
NAMD to use hyperthreads is extremely punishing to performance. Test +p2,
+p3, and +p4; one of them should work fairly well.
Is there a really good reason for using the newest CUDA release, rather than
4.x? NAMD comes with its own CUDA library so maybe it doesn't matter, but
still, it wasn't made for 5.0.
On the same note, it looks like you installed the latest non-development
drivers; you might want to install the latest development drivers instead
(295 or something).
Do you need to use charmrun?  You should download the binaries for
NAMD_2.9_Linux-x86_64-multicore-CUDA, and then you should just be able to
run: namd2 +p n +idlepoll myconfig.namd
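The core-count scan suggested above (+p2, +p3, +p4 with the multicore binary) can be scripted. Below is a minimal Python sketch that only builds the command lines and does not run anything. The helper name `namd_scan_commands` is hypothetical, and the choice to add +idlepoll only for GPU runs is an assumption based on the commands quoted in this thread.

```python
import shlex


def namd_scan_commands(namd2, config, core_counts=(2, 3, 4), use_gpu=True):
    """Build one multicore namd2 command line per core count.

    namd2  : path to the namd2 binary (multicore or multicore-CUDA build)
    config : NAMD configuration file, e.g. "myconfig.namd"
    """
    commands = []
    for n in core_counts:
        argv = [namd2, f"+p{n}"]
        if use_gpu:
            # Assumption: +idlepoll is passed only for CUDA runs,
            # as in the command lines quoted in this thread.
            argv.append("+idlepoll")
        argv.append(config)
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for cmd in namd_scan_commands("namd2", "myconfig.namd"):
        print(cmd)
```

Running each printed command and comparing the "Info: Benchmark time:" lines should show at which core count the timings stop improving.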
~Aron
On Mon, Oct 22, 2012 at 2:01 AM, Norman Geist
<norman.geist_at_uni-greifswald.de> wrote:
Hi Thomas,
 
As NAMD is only partly ported to the GPU, it needs to switch between GPU
and CPU at every timestep. To avoid this, you can use a higher value for
fullelectfrequency, for instance 4, to let NAMD stay on the GPU for 4 steps
before returning to the CPU to do the electrostatic part. This will harm
energy conservation and comes with a slight drift in temperature, but that
can be controlled with a low-damping Langevin thermostat.
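Note that fullelectfrequency is given in timesteps, not in femtoseconds, so the time between full electrostatics evaluations depends on the timestep. A small Python sketch of that arithmetic follows. The function name is hypothetical, and the 2 fs timestep is the value mentioned elsewhere in this thread rather than something read from the actual config file.

```python
def full_elect_interval_fs(timestep_fs, full_elect_frequency):
    """Time in fs between two full electrostatics evaluations."""
    return timestep_fs * full_elect_frequency


if __name__ == "__main__":
    timestep_fs = 2.0  # assumed timestep, as discussed in this thread
    for freq in (1, 2, 4):
        interval = full_elect_interval_fs(timestep_fs, freq)
        print(f"fullelectfrequency {freq}: full electrostatics every {interval} fs")
```

With a 2 fs timestep, a value of 2 corresponds to 4 fs and a value of 4 to 8 fs between full electrostatics evaluations.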
 
Nevertheless, there should be a speedup of about 2-3 times compared to CPU
only without this hack and about 5-10 times with it. As you have a mobile
chipset, you should check the following things:
 
1.       Make sure the GPU is allowed to run in performance rather than
energy-saving mode. (nvidia-smi)
2.       Make sure it's running on PCIe 2 or higher. (nvidia-smi)
3.       Try comparing the timings for increasing numbers of CPU cores with
and without the GPU. This will show whether you oversubscribe the GPU or
the PCIe bus.
4.       Are you really sure that your notebook has 8 physical cores? It
doesn't make much sense to oversubscribe the GPU with HT cores.
5.       Why do you need to set +devices?
 
Let us know
Norman Geist.
 
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Thomas Evangelidis
Sent: Saturday, 20 October 2012 18:47
To: namd-l
Subject: Re: namd-l: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0
(thomasASUS): CUDA driver version is insufficient for CUDA runtime version
 
Hi again,
I managed to install the latest NVIDIA drivers (NVIDIA-Linux-x86_64-304.51)
and the latest production CUDA-5.0 release on my Asus N56V with an i7-3610QM
and a GeForce GT 650M. The trick for getting NAMD to find my GPU was to pass
"+devices 0" explicitly on the command line. The whole command line looked
like this:
${NAMD_HOME}/charmrun ++local +p8 ${NAMD_HOME}/namd2 +idlepoll +devices 0
production_default.amberff.octahedron.namd
I used the precompiled binaries NAMD_CVS-2012-09-22_Linux-x86_64-multicore
and NAMD_CVS-2012-09-22_Linux-x86_64-multicore-CUDA to monitor the
performance on my system, which is a truncated octahedron with a protein
(2796 atoms; the ff I use is Amber99SB-NMR1-ILDN), 131788 TIP4P-Ew water
atoms (32947 waters; each TIP4P-Ew counts 4 atoms in the Amber .prmtop),
93 Na and 113 Cl ions, i.e. 131788+93+113+2796=134790 atoms in total.
Surprisingly, the performance without the GPU is better, as you can see
below.
With the GPU:
Info: Benchmark time: 8 CPUs 0.238132 s/step 1.37808 days/ns 359.961 MB memory
Without the GPU:
Info: Benchmark time: 8 CPUs 0.206626 s/step 1.19575 days/ns 720.852 MB memory
The only case I get better performance with the GPU is when I run NAMD in
serial mode:
With the GPU:
Info: Benchmark time: 1 CPUs 0.26001 s/step 1.50469 days/ns 256.984 MB memory
Without the GPU:
Info: Benchmark time: 1 CPUs 0.808154 s/step 4.67682 days/ns 504.398 MB memory
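To compare these runs more directly, the "Info: Benchmark time:" lines can be parsed and turned into a GPU speedup factor. The following is a minimal Python sketch; the regular expression is written only against the log lines quoted in this message, and the names `parse_benchmark` and `gpu_speedup` are hypothetical.

```python
import re

# Matches the benchmark lines quoted in this thread; \s+ before "memory"
# also tolerates a line wrapped by the mail client.
BENCH_RE = re.compile(
    r"Info: Benchmark time: (\d+) CPUs ([\d.]+) s/step "
    r"([\d.]+) days/ns ([\d.]+) MB\s+memory"
)


def parse_benchmark(line):
    """Extract core count, s/step, days/ns and memory from one log line."""
    match = BENCH_RE.search(line)
    if match is None:
        raise ValueError(f"not a benchmark line: {line!r}")
    return {
        "cpus": int(match.group(1)),
        "s_per_step": float(match.group(2)),
        "days_per_ns": float(match.group(3)),
        "memory_mb": float(match.group(4)),
    }


def gpu_speedup(cpu_line, gpu_line):
    """CPU-only time per step divided by GPU time per step (>1 means GPU wins)."""
    return parse_benchmark(cpu_line)["s_per_step"] / parse_benchmark(gpu_line)["s_per_step"]


if __name__ == "__main__":
    gpu8 = "Info: Benchmark time: 8 CPUs 0.238132 s/step 1.37808 days/ns 359.961 MB memory"
    cpu8 = "Info: Benchmark time: 8 CPUs 0.206626 s/step 1.19575 days/ns 720.852 MB memory"
    gpu1 = "Info: Benchmark time: 1 CPUs 0.26001 s/step 1.50469 days/ns 256.984 MB memory"
    cpu1 = "Info: Benchmark time: 1 CPUs 0.808154 s/step 4.67682 days/ns 504.398 MB memory"
    print(f"8 cores: {gpu_speedup(cpu8, gpu8):.2f}x")
    print(f"1 core:  {gpu_speedup(cpu1, gpu1):.2f}x")
```

For the numbers above this gives a factor of about 0.87 on 8 cores (the GPU run is slower) and about 3.1 on a single core.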
For the apoa1 benchmark, NAMD complained about "++local", so I used the
following command line:
${NAMD_HOME}//charmrun +p8 ${NAMD_HOME}//namd2 +idlepoll +devices 0
apoa1.namd
This time the performance was almost the same with and without the GPU:
With the GPU:
Info: Benchmark time: 8 CPUs 0.22935 s/step 2.65451 days/ns 280.375 MB memory
Without the GPU:
Info: Benchmark time: 8 CPUs 0.223781 s/step 2.59006 days/ns 696.184 MB memory
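The days/ns figures in these logs follow from s/step and the timestep, which explains why apoa1 shows roughly twice the days/ns of the production system at a similar s/step. The Python sketch below shows the conversion. The timesteps used (2 fs for the production run, 1 fs for apoa1) are inferred from the reported numbers, not stated in the message, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86400.0
FS_PER_NS = 1.0e6


def days_per_ns(s_per_step, timestep_fs):
    """Convert wall-clock seconds per step into days per simulated nanosecond."""
    steps_per_ns = FS_PER_NS / timestep_fs
    return s_per_step * steps_per_ns / SECONDS_PER_DAY


if __name__ == "__main__":
    # Production system with GPU on 8 cores, assumed 2 fs timestep
    print(f"{days_per_ns(0.238132, 2.0):.5f} days/ns")
    # apoa1 with GPU on 8 cores, assumed 1 fs timestep
    print(f"{days_per_ns(0.22935, 1.0):.5f} days/ns")
```

Because of the different timesteps, the two benchmarks are better compared in s/step than in days/ns.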
Is there any parameter I can tweak to get better GPU performance for my
system??? Below is the GPU assignment when I run on all available cores. 
Pe 7 physical rank 7 will use CUDA device of pe 4
Pe 2 physical rank 2 will use CUDA device of pe 4
Pe 3 physical rank 3 will use CUDA device of pe 4
Pe 6 physical rank 6 will use CUDA device of pe 4
Pe 1 physical rank 1 will use CUDA device of pe 4
Pe 5 physical rank 5 will use CUDA device of pe 4
Pe 4 physical rank 4 binding to CUDA device 0 on thomasASUS: 'GeForce GT 650M'  Mem: 2047MB  Rev: 3.0
Pe 0 physical rank 0 will use CUDA device of pe 4
Thanks,
Thomas
-- Aron Broom M.Sc PhD Student Department of Chemistry University of Waterloo
This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:41 CST