AW: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0 (thomasASUS): CUDA driver version is insufficient for CUDA runtime version

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Oct 23 2012 - 01:04:04 CDT

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf
of Thomas Evangelidis
Sent: Monday, October 22, 2012 23:43
To: NAMD Mailing List
Subject: Re: namd-l: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 0
(thomasASUS): CUDA driver version is insufficient for CUDA runtime version

 

Thank you all for your comments! I'll try to address all your questions
below:

@Norman

As NAMD is only partly ported to the GPU, it needs to switch between GPU and
CPU at every timestep. To reduce this, you can use a higher value for
fullElectFrequency, for instance 4, to let NAMD stay on the GPU for 4 steps
before returning to the CPU for the full electrostatics. This will harm
energy conservation and comes with a slight drift in temperature, but that
can be controlled with a low-damping Langevin thermostat.
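As a rough sketch, that multiple-timestepping setup (parameter names per the NAMD user's guide; the values here are illustrative, not tuned for any particular system) could look like this in the config file:

```
# Multiple timestepping: keep NAMD on the GPU for several steps
stepsPerCycle       20
nonBondedFreq        2
fullElectFrequency   4

# Weakly coupled Langevin thermostat to control the resulting drift
langevin            on
langevinDamping     1
langevinTemp        300
```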

I set stepsPerCycle 20, nonBondedFreq 2, fullElectFrequency 4, but still the
non-CUDA binary runs faster.
Without the GPU:
Info: Initial time: 4 CPUs 0.127345 s/step 0.736951 days/ns 690.531 MB
memory
With the GPU:
Info: Initial time: 4 CPUs 0.132138 s/step 0.764688 days/ns 325.098 MB
memory
 

 

Nevertheless, there should be a speedup of about 2-3 times compared to CPU
only without this hack, and about 5-10 times with it. As you have a mobile
chipset, you should check the following things:

 

1. Make sure the GPU is allowed to run in performance rather than
energy-saving mode. (nvidia-smi)

I did: nvidia-smi -i 0 -c 3
         nvidia-smi -i 0 -pm 1
Unfortunately my GPU supports neither performance monitoring with nvidia-smi
nor setting the GPU Operation Mode to COMPUTE (--gom=1).

 

True, try nvidia-settings then.

 

2. Make sure it's running on PCIE 2 or higher (nvidia-smi)

From my graphics card specifications I know it supports PCI Express 2.0 and
PCI Express 3.0, but I cannot get that information from nvidia-smi.

 

Same as above.
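On drivers where `nvidia-smi -q` does report the bus, the link generation shows up in a PCIe section; a sketch of extracting it (the sample text below imitates that section, since this GPU apparently doesn't expose it — on a supported card you would pipe the real `nvidia-smi -q` output instead):

```shell
# Sample excerpt in the shape of the `nvidia-smi -q` PCIe section
sample='        PCIe Generation
            Max                 : 2
            Current             : 2
        Link Width
            Max                 : 16x
            Current             : 16x'

# Grab the current link generation (first "Current" after the header)
gen=$(printf '%s\n' "$sample" | awk '/PCIe Generation/{f=1} f && /Current/{print $3; exit}')
echo "Current PCIe generation: $gen"
```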

3. Try comparing the timings for increasing numbers of CPUs with and
without the GPU.

This will show if you oversubscribe the GPU or the PCIE.

The following statistics were measured using the default amber ff parameters
taken from http://ambermd.org/namd/namd_amber.html

Without the GPU:
Info: Initial time: 1 CPUs 0.815043 s/step 4.71669 days/ns 496.727 MB memory
Info: Initial time: 2 CPUs 0.433734 s/step 2.51003 days/ns 561.973 MB memory
Info: Initial time: 3 CPUs 0.296255 s/step 1.71444 days/ns 579.684 MB memory
Info: Initial time: 4 CPUs 0.240091 s/step 1.38942 days/ns 685.805 MB memory
Info: Initial time: 5 CPUs 0.301051 s/step 1.74219 days/ns 802.668 MB memory
Info: Initial time: 6 CPUs 0.261714 s/step 1.51455 days/ns 600.703 MB memory
Info: Initial time: 7 CPUs 0.230685 s/step 1.33498 days/ns 660.172 MB memory
Info: Initial time: 8 CPUs 0.234672 s/step 1.35805 days/ns 721.027 MB memory

With the GPU:
Info: Initial time: 1 CPUs 0.259564 s/step 1.50211 days/ns 275.074 MB memory
Info: Initial time: 2 CPUs 0.242015 s/step 1.40055 days/ns 304.035 MB memory
Info: Initial time: 3 CPUs 0.240104 s/step 1.38949 days/ns 308.801 MB memory
Info: Initial time: 4 CPUs 0.236633 s/step 1.36941 days/ns 332.348 MB memory
Info: Initial time: 5 CPUs 0.241345 s/step 1.39667 days/ns 338.609 MB memory
Info: Initial time: 6 CPUs 0.239742 s/step 1.3874 days/ns 342.359 MB memory
Info: Initial time: 7 CPUs 0.236587 s/step 1.36914 days/ns 372.566 MB memory
Info: Initial time: 8 CPUs 0.241 s/step 1.39468 days/ns 367.969 MB memory
 

So as you can see there is a nice three-fold speedup from 1 CPU to
1 CPU + GPU. So basically it's working.

But as you can also see, you can't get additional speedup by
oversubscribing the GPU. I don't have much experience with the consumer
cards, so I can't say if this is the expected behavior, but I guess not.
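For what it's worth, the three-fold figure falls straight out of the measured s/step values above, e.g. for the 1-CPU case:

```shell
# Speedup = CPU-only s/step divided by CPU+GPU s/step
# (numbers taken from the 1-CPU timings in this thread)
cpu=0.815043
gpu=0.259564
speedup=$(awk -v c="$cpu" -v g="$gpu" 'BEGIN { printf "%.2f", c/g }')
echo "1 CPU vs 1 CPU + GPU speedup: ${speedup}x"
```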

 

Maybe the lack of support for Optimus on Linux is the problem here. Possibly
you are only using the low-performance chip meant for energy saving, while
the really powerful GPU is just sleeping, as a Linux system doesn't
automatically switch to it when performance is needed. There are a bunch of
open-source projects out there that bring Optimus support to Linux. Maybe
try one out and see if you can get the performance switching to work. But
check your BIOS first to see if you are lucky enough to be able to disable
Optimus there.
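If the discrete GPU really is parked by Optimus, one of those projects is Bumblebee; assuming its optirun wrapper is installed (an assumption on my part, not something confirmed in this thread), the launch would look roughly like:

```shell
# Force the job onto the discrete NVIDIA GPU via Bumblebee's optirun
optirun namd2 +p4 +idlepoll myconfig.namd
```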

 

Good luck.

4. Are you really sure that your notebook has 8 physical cores??

It doesn't make much sense to oversubscribe the GPU with HT cores.

i7 processors have 4 physical cores, which Hyper-Threading presents as 8
logical threads.
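One way to see that split on Linux is the per-CPU sibling lists; a sketch (the sample data imitates `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list` on a 4-core Hyper-Threaded i7 — on the real machine you would cat those files instead):

```shell
# Each physical core appears once per hardware thread; de-duplicating
# the sibling lists counts physical cores.
siblings='0,4
1,5
2,6
3,7
0,4
1,5
2,6
3,7'

physical=$(printf '%s\n' "$siblings" | sort -u | wc -l)
logical=$(printf '%s\n' "$siblings" | wc -l)
echo "physical cores: $physical, logical CPUs: $logical"
```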

5. Why do you need to set the +devices?

Because otherwise I get:

FATAL ERROR: CUDA error on Pe 1 (thomasASUS device 0): All CUDA devices are
in prohibited mode, of compute capability 1.0, or otherwise unusable.

Possibly this has to do with Optimus technology, NAMD finds just the Intel
on-board graphics card.

@Aron

Thomas, I strongly second what Norman says about the hyper-threading:
telling NAMD to use hyperthreads is extremely punishing to performance.
Test +p2, +p3, and +p4; one should work fairly well.

Is there a way to disable hyperthreading apart from just using +p4 or less?

 

Is there a really good reason for using the newest CUDA release, rather than
4.x? NAMD comes with its own CUDA library, so maybe it doesn't matter, but
still, it wasn't made for 5.0.

It has better support for Kepler architecture.
 

On the same note, it looks like you installed the latest non-development
drivers, you might want to instead install the latest development (295 or
something)

I installed the latest available, 304.51.
 

Do you need to use charmrun? You should download the binaries for
NAMD_2.9_Linux-x86_64-multicore-CUDA, and then you should just be able to
run: namd2 +p n +idlepoll myconfig.namd

Apparently not. I discovered it later.

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:41 CST