RE: how to improve the occupation of GPU with NAMD-2.7b3-CUDA

From: Jianing Song (sjn_sk_at_hotmail.com)
Date: Wed Oct 13 2010 - 20:45:59 CDT

Dear axel,

Thank you very much for your help. It was so kind of you to give such a detailed reply. Thanks again.
 
Best wishes!

Jianing

> Date: Mon, 11 Oct 2010 23:09:15 -0400
> Subject: Re: namd-l: how to improve the occupation of GPU with NAMD-2.7b3-CUDA
> From: akohlmey_at_gmail.com
> To: sjn_sk_at_hotmail.com
> CC: namd-l_at_ks.uiuc.edu
>
> 2010/10/11 Jianing Song <sjn_sk_at_hotmail.com>:
> > Dear axel,
>
> dear jianing,
>
> please always keep the mailing list in cc:. thanks.
>
> > Thank you very much for your reply.
> >
> >> on what kind of analysis do you base this assertion?
> > In the first case, the CPUTime from the .log file of the 1 CPU + GPU run is 719 sec,
> > while the CPUTime from the .log file of the 8 CPU run is 583 sec.
> > In the second case, the CPUTime from the .log file of the 1 CPU + GPU run is 1549 sec,
> > while the CPUTime from the .log file of the 8 CPU run is 1352 sec.
> > Could I then say that the GPU shows limited acceleration capability?
>
> no. you have to realize that the GPU, as one device, can only
> be occupied by one task (i.e. process) at a time, so - strictly speaking -
> comparing 1 CPU vs. 1 CPU + 1 GPU is the "normal"
> way to measure GPU acceleration.
>
> >> you get about 6x acceleration out of the first case
> >> and about 7x out of the second case. you only seem
> >> to have one GPU, so when oversubscribing it, you can
> >> only get limited additional acceleration out of it, since the GPU
> >> is already mostly occupied by the first host thread.
> >
> > "6x acceleration" and "7xacceleration" , you mean that GPU accelerate
> > capacity is compared with just one CPU. I mean that we should only compare
> > the results of 1CPU+GPU with ones of 1CPU, not 8CPU.
>
> yes. technical limitations aside, in principle you could add
> more GPUs to your machine, right?
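>
> [editor's note: the 6x-7x figures can be checked against the timings
> quoted above. A minimal sketch, assuming the 8-CPU run scales roughly
> linearly, so a single CPU would take about 8x the 8-CPU time:]

```python
# Estimate GPU speedup from the CPUTime values quoted above, under the
# assumption that the 8-CPU run scales roughly linearly with CPU count,
# so the 1-CPU time is approximated as 8x the 8-CPU time.
def gpu_speedup(t_1cpu_gpu, t_8cpu, ncpu=8):
    t_1cpu_est = t_8cpu * ncpu       # estimated 1-CPU time (assumption)
    return t_1cpu_est / t_1cpu_gpu   # 1 CPU vs. 1 CPU + 1 GPU

print(round(gpu_speedup(719, 583), 1))    # first case:  ~6.5x
print(round(gpu_speedup(1549, 1352), 1))  # second case: ~7.0x
```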
>
> > Indeed, our group's cluster has only one GPU.
> > I couldn't understand "you only seem to have one GPU, so when
> > oversubscribing it, you can only get additional acceleration out of it,
> > since the GPU is already mostly occupied by the first host thread". Could
> > you please explain it to me?
>
> well, as i mentioned before, the GPU is just one device and
> can only be occupied by one task at a time. now, in the current
> NAMD GPU code, only part of the total execution, the calculation
> of the non-bonded interactions, is offloaded to the GPU. other
> calculations are still performed on the CPU. that leaves the GPU
> unoccupied at times, and the NAMD implementation allows you to
> "oversubscribe", i.e. have two or more CPU tasks launch GPU
> kernels on the same device. these executions, however, are
> forced to be serialized, and depending on the ratio of time
> spent in the GPU kernels to the rest, this is more or less
> effective.
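>
> [editor's note: the serialization effect can be illustrated with a toy
> model. This is an illustration only, not NAMD's actual scheduler: assume
> each host task spends a fraction f of a timestep in a GPU kernel, the
> kernels of different tasks serialize on the single device, and the CPU
> portions overlap freely:]

```python
# Toy model of oversubscribing one GPU (illustration only, not NAMD's
# actual scheduler): n_tasks host processes each spend a fraction f of
# a step in a GPU kernel; all kernels serialize on the single device.
def step_time(n_tasks, f, t_step=1.0):
    cpu_part = (1.0 - f) * t_step    # CPU work, overlaps across tasks
    gpu_part = n_tasks * f * t_step  # GPU work, serialized on one GPU
    return max(cpu_part, gpu_part)   # whichever resource dominates

def throughput(n_tasks, f):
    # steps completed per unit time, summed over all tasks
    return n_tasks / step_time(n_tasks, f)

# small GPU fraction: oversubscription helps until the GPU saturates
for n in (1, 2, 3, 4):
    print(n, round(throughput(n, 0.25), 2))
```

With f = 0.25 the throughput grows until three tasks fill the GPU and a fourth adds nothing; with a large f (say 0.6) already the second task gains little, consistent with the serialization caveat above.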
>
> > The utilization of one GPU over the whole simulation is only 50%-60% (using
> > the "nvidia-smi -a" command to get this utilization). Would it be possible to
> > raise utilization by using multiple GPUs, since the utilization of one GPU is
> > so low?
>
> nvidia-smi -a doesn't show anything along those lines on
> my machine, but i don't have the latest driver and the latest
> hardware, so maybe it is different for you.
>
> you have to distinguish between utilization and occupancy.
> it is quite difficult to get high occupancy with MD, particularly
> when the math is as cheap to compute as in a standard
> classical MD force field (that is why those functional forms
> were chosen in the first place, after all).
>
> i am fairly certain that already with two CPU tasks per GPU, you
> will have kernels occasionally waiting for the GPU to be available.
>
> you should certainly get a better overall performance with more GPUs.
> on the other hand, you also have to consider that you need a certain
> number of atoms per GPU to be efficient, or else the speedup from
> using the GPU will be offset by the cost of transferring data in and
> out of the GPU. overall, due to memory and bus bandwidth limits,
> cache sizes and memory hierarchies, performance measurements
> and estimates are getting quite difficult.
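>
> [editor's note: the transfer-cost trade-off can be sketched with a crude
> cost model. The constants below are invented for illustration and are
> not measured NAMD timings: the GPU only pays off once the per-atom
> savings outweigh the fixed host-device transfer and launch overhead:]

```python
# Crude cost model for the atoms-per-GPU break-even point. All
# constants are invented for illustration; they are not NAMD timings.
PER_ATOM_CPU = 1.0    # arbitrary time units per atom on the CPU
PER_ATOM_GPU = 0.1    # per-atom cost once the data is on the GPU
OVERHEAD = 5000.0     # fixed host<->device transfer + launch cost

def cpu_time(natoms):
    return natoms * PER_ATOM_CPU

def gpu_time(natoms):
    return OVERHEAD + natoms * PER_ATOM_GPU

# break-even atom count: overhead / (per-atom saving on the GPU)
breakeven = OVERHEAD / (PER_ATOM_CPU - PER_ATOM_GPU)
print(round(breakeven))  # below this size the GPU is a net loss
```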
>
> in conclusion, there are currently three major approaches to
> using GPUs for MD.
>
> a) keep the data completely on the GPU
> this gives the best performance, but you cannot run in parallel
>
> b) try to keep data on the GPU as much as possible and transfer
> only when needed, and have a 1 CPU (=MPI task) to 1 GPU ratio.
> this is easy to program and one can launch kernels asynchronously
> and try to use the CPU for some tasks in between. this is
> potentially the way to get the absolute fastest execution, but it is
> also somewhat wasteful.
>
> c) allow oversubscription of a GPU. this will increase GPU utilization,
> but requires smart programming and scheduling of kernels.
> this will give you the most bang for your buck, but only if you
> are not too greedy.
>
> there is a wealth of material (papers and presentations) on GPU
> programming applications and the GPU code implementation in
> NAMD in particular listed at this URL:
>
> http://www.ks.uiuc.edu/Research/gpu/
>
> >> list what kind of hardware you are using.
> > CPU information: Intel(R) Xeon(R) CPU E5620 *2
> > GPU information: Nvidia C2050
>
> ok. that is more or less what i expected.
>
> you are getting a pretty decent performance out of your GPU.
>
> cheers,
> axel.
>
> >
> >
> > Thanks in advance!
> >
> > Jianing
> >
> >
> >
>
>
>
> --
> Dr. Axel Kohlmeyer akohlmey_at_gmail.com
> http://sites.google.com/site/akohlmey/
>
> Institute for Computational Molecular Science
> Temple University, Philadelphia PA, USA.
                                               

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:54:36 CST