From: Jianing Song (sjn_sk_at_hotmail.com)
Date: Wed Oct 13 2010 - 20:45:59 CDT
Thank you very much for your help. You are so kind to give such a detailed reply. Thanks again.
> Date: Mon, 11 Oct 2010 23:09:15 -0400
> Subject: Re: namd-l: how to improve the occupation of GPU with NAMD-2.7b3-CUDA
> From: akohlmey_at_gmail.com
> To: sjn_sk_at_hotmail.com
> CC: namd-l_at_ks.uiuc.edu
> 2010/10/11 Jianing Song <sjn_sk_at_hotmail.com>:
> > Dear axel,
> dear jianing,
> please always keep the mailing list in cc:. thanks.
> > Thank you very much for your reply.
> >> on what kind of analysis do you base this assertion?
> > In the first case, the CPUTime from the .log file of 1CPU+GPU is 719 sec,
> > while the CPUTime from .log file of 8CPU is 583 sec.
> > In the second case, the CPUTime from the .log file of 1CPU+GPU is 1549 sec,
> > while the CPUTime from .log file of 8CPU is 1352 sec.
> > Then could I say that the GPU shows only a limited acceleration capacity?
> no. you have to realize that the GPU, as one device, can only
> be occupied by one task (i.e. process) at a time, so - strictly speaking -
> comparing 1 CPU vs. 1 CPU + 1 GPU is the "normal"
> way to measure GPU acceleration.
> >> you get about 6x acceleration out of the first case
> >> and about 7x out of the second case. you only seem
> >> to have one GPU, so when oversubscribing it, you can
> >> only get additional acceleration out of it, since the GPU
> >> is already mostly occupied by the first host thread.
> > By "6x acceleration" and "7x acceleration", you mean that the GPU's
> > acceleration is measured against just one CPU. I mean that we should only
> > compare the results of 1 CPU + GPU with those of 1 CPU, not 8 CPUs.
> yes. technical limitations aside, in principle you could add
> more GPUs to your machine, right?
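For reference, the 6x and 7x figures can be reproduced from the CPUTime values quoted above, assuming the 8-CPU runs scale close to linearly (a simplification; the timings and core count are from this thread, the helper function is illustrative):

```python
# Estimate GPU acceleration from the quoted CPUTime values by
# converting the 8-CPU timing into an estimated 1-CPU timing
# (assumes near-linear scaling across the 8 cores).
def gpu_speedup(t_1cpu_gpu, t_ncpu, ncpu=8):
    est_1cpu = t_ncpu * ncpu          # estimated serial runtime
    return est_1cpu / t_1cpu_gpu

print(round(gpu_speedup(719, 583), 1))    # first case:  ~6.5x
print(round(gpu_speedup(1549, 1352), 1))  # second case: ~7.0x
```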
> > Definitely, the cluster of our group has one GPU only.
> > I couldn't understand "you only seem to have one GPU, so when
> > oversubscribing it, you can only get additional acceleration out of it,
> > since the GPU is already mostly occupied by the first host thread". Could
> > you please explain it to me?
> well, as i mentioned before, the GPU is just one device and
> can only be occupied by one task at a time. now, in the current
> NAMD GPU code, only part of the total execution, the calculation
> of the non-bonded interactions, is offloaded to the GPU. other
> calculations are still performed on the CPU. that leaves the GPU
> unoccupied at times, and the NAMD implementation allows you to
> "oversubscribe", i.e. have two or more CPU tasks launch GPU
> kernels on the same device. these executions, however, are
> forced to be serialized, and depending on the amount of time
> spent on the GPU kernel versus the rest, this is more or less efficient.
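The serialization described here can be pictured with a toy model (plain Python threads standing in for CPU tasks and a lock standing in for the single GPU; no real CUDA involved, all names and timings are made up for illustration):

```python
import threading
import time

gpu_lock = threading.Lock()  # one GPU: kernel launches from different tasks serialize

def cpu_task(name, cpu_ms, gpu_ms, kernel_log):
    # CPU-side work (bonded terms, integration, ...) can overlap freely...
    time.sleep(cpu_ms / 1000.0)
    # ...but the "GPU kernel" needs exclusive access to the one device.
    with gpu_lock:
        kernel_log.append(name)
        time.sleep(gpu_ms / 1000.0)

kernel_log = []
tasks = [threading.Thread(target=cpu_task, args=(f"task{i}", 10, 30, kernel_log))
         for i in range(4)]
for t in tasks:
    t.start()
for t in tasks:
    t.join()
print(len(kernel_log))  # all 4 "kernels" ran, but strictly one at a time
```

The more time each task spends holding the lock (the GPU kernel) relative to its CPU work, the more waiting dominates; that is why oversubscription only pays off when the GPU would otherwise sit idle often enough.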
> > The utilization of one GPU over the whole simulation is only 50%-60%
> > (obtained with the "nvidia-smi -a" command). Could it be possible to
> > raise the utilization with multiple GPUs, since the utilization of one
> > GPU is so low?
> nvidia-smi -a doesn't show anything along those lines on
> my machine, but i don't have the latest driver and the latest
> hardware, so maybe it is different for you.
> you have to distinguish between utilization and occupancy.
> it is quite difficult to get high occupancy with MD, particularly
> when the math is as easy to compute as in a standard
> classical MD force field (that is why those functional forms
> were chosen in the first place, after all).
> i am fairly certain that already with two CPU tasks per GPU, you
> will have kernels occasionally waiting for the GPU to be available.
> you should certainly get a better overall performance with more GPUs.
> on the other hand, you also have to consider that you need a certain
> number of atoms per GPU to be efficient, or else the speedup through
> using the GPU will be offset by the cost of transferring data in and
> out of the GPU. overall, due to memory and bus bandwidth limits,
> cache sizes and memory hierarchies, performance measurements
> and estimates are getting quite difficult.
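A back-of-the-envelope sketch of this point, using Amdahl's law extended with a transfer-cost term; all numbers below are hypothetical, not measurements from this thread:

```python
# Amdahl-style estimate of overall speedup when only a fraction of the
# work is offloaded to the GPU and host<->device transfers add overhead.
def overall_speedup(f_offload, kernel_speedup, transfer_frac=0.0):
    # f_offload:      fraction of the original runtime moved to the GPU
    # kernel_speedup: how much faster that fraction runs on the GPU
    # transfer_frac:  copy time as a fraction of the original runtime
    return 1.0 / ((1.0 - f_offload)
                  + f_offload / kernel_speedup
                  + transfer_frac)

# e.g. 85% of the time in non-bonded work, a 20x faster kernel:
print(round(overall_speedup(0.85, 20.0), 2))        # no transfer cost
print(round(overall_speedup(0.85, 20.0, 0.05), 2))  # with 5% transfer overhead
```

With too few atoms per GPU, f_offload shrinks and transfer_frac grows, so the overall speedup collapses toward 1 even if the kernel itself is very fast.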
> in conclusion, there are currently three major approaches to
> using GPUs for MD.
> a) keep the data completely on the GPU
> this gives the best performance, but you cannot run in parallel
> b) try to keep data on the GPU as much as possible and transfer
> only when needed, and have a 1 CPU (=MPI task) to 1 GPU ratio.
> this is easy to program and one can launch kernels asynchronously
> and try to use the CPU for some tasks in between. this is
> potentially the way to get the absolute fastest execution, but it is
> also somewhat wasteful.
> c) allow oversubscription of a GPU. this will increase GPU utilization,
> but requires smart programming and scheduling of kernels.
> this will give you the most bang for your buck, but only if you
> are not too greedy.
> there is a wealth of material (papers and presentations) on GPU
> programming applications and the GPU code implementation in
> NAMD in particular listed at this URL:
> >> list what kind of hardware you are using.
> > CPU information: Intel(R) Xeon(R) CPU E5620 *2
> > GPU information: Nvidia C2050
> ok. that is more or less what i expected.
> you are getting a pretty decent performance out of your GPU.
> > Thanks in advance!
> > Jianing
> Dr. Axel Kohlmeyer akohlmey_at_gmail.com
> Institute for Computational Molecular Science
> Temple University, Philadelphia PA, USA.
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 05:23:17 CST