Re: how to improve the occupation of GPU with NAMD-2.7b3-CUDA

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Mon Oct 11 2010 - 22:09:15 CDT

2010/10/11 Jianing Song <sjn_sk_at_hotmail.com>:
> Dear axel,

dear jianing,

please always keep the mailing list in cc:. thanks.

> Thank you very much for your reply.
>
>> on what kind of analysis do you base this assertion?
> In the first case, the CPUTime from the .log file of 1CPU+GPU is 719 sec,
> while the CPUTime from .log file of 8CPU is 583 sec.
> In the second case, the CPUTime from the .log file of 1CPU+GPU is 1549 sec,
> while the CPUTime from .log file of 8CPU is 1352 sec.
> Then could I say that the GPU shows limited acceleration capability?

no. you have to realize that the GPU, as one device, can only
be occupied by one task (i.e. process) at a time, so - strictly speaking -
comparing 1 CPU vs. 1 CPU + 1 GPU is the "normal" way to
measure GPU acceleration. to put numbers on it: with 8 cores
taking 583 sec, a single core would need roughly 8 x 583 = 4664 sec
(assuming near-linear CPU scaling), so 719 sec for 1 CPU + 1 GPU
is about a 6.5x speedup; the second case gives 8 x 1352 / 1549,
i.e. about 7x.

>> you get about 6x acceleration out of the first case
>> and about 7x out of the second case. you only seem
>> to have one GPU, so when oversubscribing it, you can
>> only get limited additional acceleration out of it, since
>> the GPU is already mostly occupied by the first host thread.
>
> "6x acceleration" and "7xacceleration" , you mean that GPU accelerate
> capacity is compared with just one CPU. I mean that we should only compare
> the results of 1CPU+GPU with ones of 1CPU, not 8CPU.

yes. technical limitations aside, in principle you could add
more GPUs to your machine, right?

> Definitely - our group's cluster has only one GPU.
> I couldn't understand "you only seem to have one GPU, so when
> oversubscribing it, you can only get limited additional acceleration
> out of it, since the GPU is already mostly occupied by the first host
> thread". Could you please explain it to me?

well, as i mentioned before, the GPU is a single device that
can only be occupied by one task at a time. now, in the current
NAMD GPU code, only part of the total execution, the calculation
of the non-bonded interactions, is offloaded to the GPU. other
calculations are still performed on the CPU. that leaves the GPU
unoccupied at times, and the NAMD implementation allows you to
"oversubscribe", i.e. have two or more CPU tasks launch GPU
kernels on the same device. these executions, however, are
forced to be serialized, and depending on how much time is
spent in the GPU kernel versus the rest of the code, this is
more or less effective.
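
to make this concrete, here is a toy, purely hypothetical CUDA
sketch (none of this is NAMD code) where two host threads stand
in for two CPU tasks sharing one GPU. with nvcc's default settings
both threads launch into the same legacy default stream, so the
device executes their kernels strictly one after the other:

// oversubscribe.cu - hypothetical toy, not NAMD code.
// build (roughly): nvcc -std=c++11 oversubscribe.cu
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *x, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < iters; ++k)
            v = v * 0.999f + 0.001f; // dummy math standing in for non-bonded work
        x[i] = v;
    }
}

void cpu_task(float *d_x, int n) {
    // each "CPU task" offloads its work, then waits for the result.
    // both threads use the legacy default stream, so the two kernels
    // are serialized on the device.
    busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n, 10000);
    cudaDeviceSynchronize();
}

int main() {
    const int n = 1 << 20;
    float *d_a = 0, *d_b = 0;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));
    cudaMemset(d_b, 0, n * sizeof(float));

    std::thread t1(cpu_task, d_a, n);
    std::thread t2(cpu_task, d_b, n); // oversubscribes the single GPU
    t1.join();
    t2.join();

    cudaFree(d_a);
    cudaFree(d_b);
    printf("both tasks done\n");
    return 0;
}

since each toy task keeps the device fully busy, the second thread
gains almost nothing here; in NAMD the gain comes from the stretches
where the first task runs CPU-only code and leaves the device idle.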

> The utilization of the single GPU over the whole simulation is only
> 50%-60% (using the "nvidia-smi -a" command to get this utilization).
> Could it be possible to raise the utilization with multiple GPUs,
> since the utilization of one GPU is so low?

nvidia-smi -a doesn't show anything along those lines on
my machine, but i don't have the latest driver and the latest
hardware, so maybe it is different for you.

you have to distinguish between utilization and occupancy:
utilization is the fraction of time the GPU is busy at all, while
occupancy is how much of the device's resources (resident warps)
a kernel uses while it runs. it is quite difficult to get high
occupancy with MD, particularly when the math is as cheap to
compute as in a standard classical MD force field (that is why
those functional forms were chosen in the first place, after all).
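
if you want to see the distinction in code: CUDA toolkits much
newer than the one this thread is about expose an occupancy query.
the sketch below, with a placeholder kernel rather than anything
from NAMD, computes the fraction of resident threads a kernel can
achieve per multiprocessor:

// occupancy_check.cu - illustrative sketch; the occupancy API shown
// here did not exist in the 2010-era toolkit discussed in this thread.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void pair_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 0.5f; // placeholder for force-field math
}

int main() {
    const int blockSize = 256;
    int maxBlocks = 0, maxThreadsPerSM = 0, device = 0;

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocks, pair_kernel, blockSize, 0 /* dynamic smem */);
    cudaDeviceGetAttribute(&maxThreadsPerSM,
                           cudaDevAttrMaxThreadsPerMultiProcessor, device);

    // occupancy: resident threads this kernel achieves per SM, relative
    // to the hardware maximum. a GPU can be 100% "utilized" (busy all
    // the time) and still run at low occupancy.
    double occupancy = (double)(maxBlocks * blockSize) / maxThreadsPerSM;
    printf("occupancy at block size %d: %.2f\n", blockSize, occupancy);
    return 0;
}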

i am fairly certain that already with two CPU tasks per GPU, you
will have kernels occasionally waiting for the GPU to be available.

you should certainly get better overall performance with more GPUs.
on the other hand, you also have to consider that you need a certain
number of atoms per GPU to be efficient, or else the speedup from
using the GPU will be offset by the cost of transferring data in and
out of the GPU. overall, due to memory and bus bandwidth limits,
cache sizes, and memory hierarchies, performance measurements
and estimates become quite difficult.
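
a hypothetical way to see the transfer-vs-compute trade-off yourself
is to time the copies and a trivial kernel for growing problem sizes.
the sketch below (illustrative only, not NAMD code) does just that
with CUDA events:

// xfer_vs_compute.cu - times host-to-device copy, a trivial kernel,
// and device-to-host copy for several problem sizes.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f; // placeholder for the real force computation
}

int main() {
    const int sizes[] = {1 << 12, 1 << 16, 1 << 20, 1 << 24};
    for (int n : sizes) {
        std::vector<float> h(n, 1.0f);
        float *d = 0;
        cudaMalloc(&d, n * sizeof(float));

        cudaEvent_t e0, e1, e2, e3;
        cudaEventCreate(&e0); cudaEventCreate(&e1);
        cudaEventCreate(&e2); cudaEventCreate(&e3);

        cudaEventRecord(e0);
        cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(e1);
        scale<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(e2);
        cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaEventRecord(e3);
        cudaEventSynchronize(e3);

        float in_ms, k_ms, out_ms;
        cudaEventElapsedTime(&in_ms, e0, e1);
        cudaEventElapsedTime(&k_ms, e1, e2);
        cudaEventElapsedTime(&out_ms, e2, e3);
        printf("n=%8d  copy-in %.3f ms  kernel %.3f ms  copy-out %.3f ms\n",
               n, in_ms, k_ms, out_ms);

        cudaEventDestroy(e0); cudaEventDestroy(e1);
        cudaEventDestroy(e2); cudaEventDestroy(e3);
        cudaFree(d);
    }
    return 0;
}

with such a cheap kernel the copies dominate at every size; a real
non-bonded kernel does far more work per byte transferred, which is
what tips the balance once there are enough atoms per GPU.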

in conclusion, there are currently three major approaches to
using GPUs for MD:

a) keep the data completely on the GPU.
   this gives the best performance, but you cannot run in parallel.

b) try to keep data on the GPU as much as possible, transfer
   only when needed, and use a 1 CPU (= MPI task) to 1 GPU ratio.
   this is easy to program: one can launch kernels asynchronously
   and try to use the CPU for some tasks in between (see the
   sketch after this list). this is potentially the way to get the
   absolute fastest execution, but it is also somewhat wasteful.

c) allow oversubscription of a GPU. this will increase GPU
   utilization, but requires smart programming and scheduling of
   kernels. this will give you the most bang for your buck, but
   only if you are not too greedy.
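
as a minimal illustration of approach b) - under the assumption of
a single toy "non-bonded" kernel and a placeholder bonded-force
loop, neither of which is NAMD's actual code - the pattern looks
roughly like this in CUDA:

// overlap.cu - sketch of approach b): one CPU task driving one GPU,
// launching the non-bonded kernel asynchronously and computing other
// force terms on the CPU in the meantime.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void nonbonded_kernel(float *f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] += 1.0f; // placeholder for pair interactions
}

void bonded_forces_on_cpu(std::vector<float> &f) {
    for (size_t i = 0; i < f.size(); ++i) f[i] += 0.5f; // placeholder
}

int main() {
    const int n = 1 << 20;
    float *d_f = 0, *h_nb = 0;
    std::vector<float> h_bonded(n, 0.0f);

    cudaMalloc(&d_f, n * sizeof(float));
    cudaMemset(d_f, 0, n * sizeof(float));
    cudaMallocHost(&h_nb, n * sizeof(float)); // pinned, needed for async copy

    cudaStream_t s;
    cudaStreamCreate(&s);

    // the launch and the copy return immediately; they are queued on
    // the stream and executed in order on the device ...
    nonbonded_kernel<<<(n + 255) / 256, 256, 0, s>>>(d_f, n);
    cudaMemcpyAsync(h_nb, d_f, n * sizeof(float),
                    cudaMemcpyDeviceToHost, s);

    // ... so the CPU is free to compute the bonded terms concurrently
    bonded_forces_on_cpu(h_bonded);

    cudaStreamSynchronize(s); // wait for the GPU results
    for (int i = 0; i < n; ++i) h_nb[i] += h_bonded[i]; // total force

    printf("f[0] = %f\n", h_nb[0]);
    cudaStreamDestroy(s);
    cudaFreeHost(h_nb);
    cudaFree(d_f);
    return 0;
}

the cudaStreamSynchronize() at the end is where the CPU would stall
if the GPU work took longer than the CPU work; balancing the two is
what makes this approach potentially the fastest.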

there is a wealth of material (papers and presentations) on GPU
programming in general, and on the GPU code in NAMD in
particular, listed at this URL:

http://www.ks.uiuc.edu/Research/gpu/

>> list what kind of hardware you are using.
> CPU information: Intel(R) Xeon(R) CPU E5620 *2
> GPU information: Nvidia C2050

ok. that is more or less what i expected.

you are getting pretty decent performance out of your GPU.

cheers,
    axel.

>
>
> Thanks in advance!
>
> Jianing

-- 
Dr. Axel Kohlmeyer    akohlmey_at_gmail.com
http://sites.google.com/site/akohlmey/
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
