Re: AMD-PhenomII-1075_GTX470 NAMD-CUDA performance

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Jun 06 2011 - 07:22:45 CDT

Maybe you're right, but why do my results match others I have seen? Of course I
checked whether my machine is as fast as it should be, and it looks like it is. So
there was no reason for me to believe that my machine is badly configured, and
I've tested a lot.
Also, the speedup I see cannot come from the non-accelerated part and the
additional CPUs, because the GPU run is three times faster than the CPU run, and
I understood this to be the expected speedup.
I could try to increase the offload to the GPU. Do you have any advice on which
configuration parameters to test for that?

Thanks

Norman Geist.

-----Original Message-----
From: Axel Kohlmeyer [mailto:akohlmey_at_gmail.com]
Sent: Monday, June 6, 2011 13:58
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: AMD-PhenomII-1075_GTX470 NAMD-CUDA performance

On Mon, Jun 6, 2011 at 7:49 AM, Norman Geist
<norman.geist_at_uni-greifswald.de> wrote:
> Axel, sure I tested my configuration before telling the world. My results
> are very similar to others I found, so let's say my machine is well
> configured. This machine had a dual-core processor until I upgraded to a
> six-core a few days ago. Before that, the utilisation with both cores was
> about 6%. Don't say this is to be expected, because the same gpu runs at
> 100% with 1 core with acemd. There's much spare capacity on the tesla for namd.

norman,

those are very unusual numbers. somehow you are running a
configuration that spends _very_ little time in the non-bonded
kernel and thus the GPU is not utilized a lot. thus what you
are seeing is mostly the speedup in the non-accelerated part
of NAMD.

for a single task you should see at least 20% utilization.
what is the speedup that you get for your system when using
a single NAMD task when going from running on the CPU only
to using the GPU?
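
A minimal sketch of how that single-task speedup could be read off two NAMD
logs, assuming both runs print the usual "Info: Benchmark time: N CPUs X s/step
Y days/ns ..." lines. The function names and the log file names are
hypothetical and not part of NAMD; the speedup is simply the CPU-only time per
step divided by the GPU time per step.

```shell
#!/bin/sh
# Average the s/step values from the "Info: Benchmark time:" lines of a NAMD log.
# Assumption: the value sits in the field directly before the token "s/step".
avg_s_per_step() {
    awk '/^Info: Benchmark time:/ {
             for (i = 1; i < NF; i++)
                 if ($(i + 1) == "s/step") { sum += $i; n++ }
         }
         END { if (n > 0) printf "%.6f\n", sum / n }' "$1"
}

# Speedup = (CPU-only seconds per step) / (GPU seconds per step).
# $1: log of the CPU-only run, $2: log of the CUDA run (hypothetical names).
speedup() {
    cpu=$(avg_s_per_step "$1")
    gpu=$(avg_s_per_step "$2")
    awk -v c="$cpu" -v g="$gpu" 'BEGIN { if (g > 0) printf "%.2f\n", c / g }'
}
```

For example, "speedup cpu_only.log cuda.log" prints one number; both logs
should come from runs with the same input and a single NAMD task (+p1) so that
only the GPU offload differs.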

> The result of the processor change from 2 to 6 was an expected three times
> faster execution and a three times higher utilisation of the gpu (now near
> 20%). All my tests with increasing numbers of cores (also with the dual
> core) that share a gpu are always three times faster than the cpus alone. 6

the GPU utilization number printed by nvidia-smi is not capturing
the entire performance of the machine, same as the top command
is not a good benchmarking tool.

> cores with 1 Tesla C2050 is the same as 6 cores with 2 Tesla C2050 and every
> core I add results in a speedup. So I could, if I had them, add more cores to
> see what happens and when the limit is reached somewhere. But until now,
> there's no bottleneck in pcie bandwidth or gpu utilisation with namd.

i disagree. you are for some unknown reason not moving a lot of work to the GPU.
this doesn't hold for the general use of GPUs with NAMD.

axel.

> The graph shows the monitoring before and after the cpu change with namd and
> a 1.3 million atom system.
>
> Norman Geist.
>
>
> -----Original Message-----
> From: Axel Kohlmeyer [mailto:akohlmey_at_gmail.com]
> Sent: Monday, June 6, 2011 13:30
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: namd-l: AMD-PhenomII-1075_GTX470 NAMD-CUDA performance
>
> On Mon, Jun 6, 2011 at 7:11 AM, Norman Geist
> <norman.geist_at_uni-greifswald.de> wrote:
>> Dear Axel,
>>
>> What I tried to say was that one thing is of course the bandwidth of the
>> pcie bus. But what about the utilization of the gpu? If I have an
>> oversubscribed configuration that would still allow communication
>> between cpu and gpu due to enough pcie bandwidth, that wouldn't help me if
>> my gpu is already fully utilized. And that's what I tried to say. I ran a
>> 1.3 million atom system on a tesla C2050, shared by 6 cpu cores, and the
>> utilization of the gpu is about 20 percent. I haven't worked with the
>
> i wouldn't be surprised to see such a low utilization since
> most of the time is probably spent on moving data in and
> out of the GPU. a 6x oversubscription is pretty extreme.
> the main benefit is probably in the non-GPU accelerated parts
> (which is still a significant amount of work for such a large system).
>
> have you made a systematic test running different numbers
> of host processes?
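
A sketch of what such a systematic scan could look like, assuming the same
charmrun launch line used elsewhere in this thread. It only prints one command
per process count, so the list can be checked before anything is run. The
function name, the config file name, and the scan_pN.log names are
placeholders; +devices is the argument NAMD itself mentions in the startup log
quoted further down.

```shell
#!/bin/sh
# Print one charmrun command per host process count (1..max) for a fixed
# set of CUDA devices. Nothing is executed here; pipe the output to "sh"
# to actually start the runs one after another.
scan_commands() {
    conf=$1      # NAMD configuration file (placeholder name)
    devices=$2   # device list passed to +devices, e.g. "0" or "0,1"
    max=$3       # largest number of host processes to try
    p=1
    while [ "$p" -le "$max" ]; do
        # $NAMD_HOME is left unexpanded on purpose; it is resolved when
        # the printed command is run.
        printf 'charmrun $NAMD_HOME/bin/namd2 ++local +p%d +idlepoll +devices %s %s > scan_p%d.log 2>&1\n' \
            "$p" "$devices" "$conf" "$p"
        p=$((p + 1))
    done
}
```

Calling "scan_commands filename.conf 0 6" lists six runs that all share device
0, and "scan_commands filename.conf 0,1 6" lists the same runs spread over both
cards, so the two series can be compared step time against step time.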
>
> please note that high-end GeForce cards can have as many
> or even more (GTX 580!) cores than a C2050.
>
> the huge difference in cost between the Tesla and the GeForce
> does not automatically translate into different performance for
> a code that doesn't benefit from the features of the Tesla
> (4x double precision units, more memory, ECC memory,
> more reliability). in fact, turning on the ECC of the Tesla may
> even slow it down.
>
> axel.
>
>> geforce cards yet, but I can imagine that the utilization of the gpu would
>> be much higher there because of the fewer cuda cores, which means that further
>> cpu cores wouldn't help there, while they would with the tesla C2050. Maybe
>> my post was too general and not specific enough to namd, sorry for this.
>>
>> Norman Geist.
>>
>>
>> -----Original Message-----
>> From: Axel Kohlmeyer [mailto:akohlmey_at_gmail.com]
>> Sent: Monday, June 6, 2011 12:50
>> To: Norman Geist
>> Cc: Francesco Pietra; Namd Mailing List
>> Subject: Re: namd-l: AMD-PhenomII-1075_GTX470 NAMD-CUDA performance
>>
>> On Mon, Jun 6, 2011 at 5:51 AM, Norman Geist
>> <norman.geist_at_uni-greifswald.de> wrote:
>>> Hi Francesco,
>>>
>>> As your output shows, both gtx cards were in use.
>>>
>>> Pe 1 sharing CUDA device 1 -> This is gtx 1
>>> Pe 0 sharing CUDA device 0 -> This is gtx 2
>>>
>>> The driver you get from nvidia and the one from your os are the same, I think. The
>>> nvidia driver must be compiled for your platform, the os driver already is.
>>>
>>> Whether more gpus bring better performance depends heavily on your hardware and
>>> system size. Just try whether 6 cpus sharing one gtx is slower than or the same as
>>> 6 cpus sharing 2 gtx cards. I think the oversubscription of such a gtx reaches its
>>> limit very quickly and you should get better performance when using both cards.
>>
>> of course, oversubscribing GPUs can only help up to a point. it doesn't
>> create more GPUs, it only allows you to use them more efficiently. think of
>> it like hyperthreading. that also is a trick to improve utilization of
>> the different units on the CPU, but it cannot replace a full processor core
>> and its efficiency is limited by how much the different units of the CPU are
>> occupied and by the available memory bandwidth.
>>
>>> Not so if using a Tesla C2050. This card can be shared by more than 6 cores
>>> without running into a bottleneck if plugged into a pcie 2.0 x16 slot.
>>
>> this is nonsense. as far as the CUDA code in NAMD is concerned there is not
>> much of a difference between a Tesla and a GeForce card. In fact the high-end
>> GeForce cards are often faster due to having higher memory and processor clocks.
>> there is, however, the bottleneck of having sufficient PCI-e bus bandwidth
>> available, but that affects both types of cards.
>>
>> axel.
>>
>>> Best regards.
>>>
>>> Norman Geist.
>>>
>>> -----Original Message-----
>>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Francesco Pietra
>>> Sent: Monday, June 6, 2011 11:16
>>> To: NAMD
>>> Subject: Fwd: namd-l: AMD-PhenomII-1075_GTX470 NAMD-CUDA performance
>>>
>>> I forgot to show the output log:
>>>
>>> Charm++> scheduler running in netpoll mode.
>>> Charm++> Running on 1 unique compute nodes (6-way SMP).
>>> Charm++> cpu topology info is gathered in 0.000 seconds.
>>> Info: NAMD CVS-2011-06-04 for Linux-x86_64-CUDA
>>>
>>> Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
>>> Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
>>> Info: 1 NAMD  CVS-2011-06-04  Linux-x86_64-CUDA  6    gig64  francesco
>>> Info: Running on 6 processors, 6 nodes, 1 physical nodes.
>>> Info: CPU topology information available.
>>> Info: Charm++/Converse parallel runtime startup completed at 0.00653386 s
>>> Pe 3 sharing CUDA device 1 first 1 next 5
>>> Pe 3 physical rank 3 binding to CUDA device 1 on gig64: 'GeForce GTX 470'  Mem: 1279MB  Rev: 2.0
>>> Pe 1 sharing CUDA device 1 first 1 next 3
>>> Pe 1 physical rank 1 binding to CUDA device 1 on gig64: 'GeForce GTX 470'  Mem: 1279MB  Rev: 2.0
>>> Pe 5 sharing CUDA device 1 first 1 next 1
>>> Did not find +devices i,j,k,... argument, using all
>>> Pe 5 physical rank 5 binding to CUDA device 1 on gig64: 'GeForce GTX 470'  Mem: 1279MB  Rev: 2.0
>>> Pe 0 sharing CUDA device 0 first 0 next 2
>>> Pe 0 physical rank 0 binding to CUDA device 0 on gig64: 'GeForce GTX 470'  Mem: 1279MB  Rev: 2.0
>>> Pe 2 sharing CUDA device 0 first 0 next 4
>>> Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX 470'  Mem: 1279MB  Rev: 2.0
>>> Pe 4 sharing CUDA device 0 first 0 next 0
>>> Pe 4 physical rank 4 binding to CUDA device 0 on gig64: 'GeForce GTX 470'  Mem: 1279MB  Rev: 2.0
>>> Info: 1.64104 MB of memory in use based on CmiMemoryUsage
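
Whether both cards are in use can be checked from the startup lines above. A
minimal sketch, assuming an unwrapped NAMD log on disk in which each process
reports itself with a "Pe N physical rank N binding to CUDA device D ..." line;
the function name is hypothetical.

```shell
#!/bin/sh
# Count how many processes (Pe) bind to each CUDA device, using the
# "... binding to CUDA device D ..." lines of a NAMD startup log.
pes_per_device() {
    awk '/binding to CUDA device/ {
             # The device index is the field right after "CUDA device".
             for (i = 2; i < NF; i++)
                 if ($i == "device" && $(i - 1) == "CUDA") { count[$(i + 1)]++; break }
         }
         END { for (d in count) printf "device %s: %d Pe\n", d, count[d] }' "$1" | sort
}
```

Run as "pes_per_device filename.log". For the log shown here it should report
three Pe on device 0 and three on device 1, that is, both GTX 470 cards are
bound and each is shared by three processes.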
>>>
>>>
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Francesco Pietra <chiendarret_at_gmail.com>
>>> Date: Mon, Jun 6, 2011 at 9:54 AM
>>> Subject: namd-l: AMD-PhenomII-1075_GTX470 NAMD-CUDA performance
>>> To: NAMD <namd-l_at_ks.uiuc.edu>
>>>
>>>
>>> Hello:
>>>
>>> I have assembled a gaming machine with
>>>
>>> Gigabyte GA890FXA-UD5
>>> AMD PhenomII 1075T (3.0 GHz)
>>> 2xGTX-470
>>> AMD Edition 1280MB GDDRV DX11 DUAL DVI / MINI HDMI SLI ATX
>>> 2x 1TB HD software RAID1
>>> 16 GB RAM DDR3 1600 MHz
>>> Debian amd64 wheezy
>>> NAMD_CVS-2011-06-04_Linux-x86_64-CUDA.tar.gz
>>> No X server (ssh to machines with X server)
>>>
>>> In my .bashrc:
>>>
>>> NAMD_HOME=/usr/local/namd-cuda_4Jun2010nb
>>> # put the NAMD bin directory (not the namd2 binary itself) on PATH
>>> PATH="$NAMD_HOME/bin:$PATH"; export NAMD_HOME PATH
>>>
>>> if [ -n "$LD_LIBRARY_PATH" ]; then
>>>    export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$NAMD_HOME"
>>> else
>>>    export LD_LIBRARY_PATH="$NAMD_HOME"
>>> fi
>>>
>>>
>>> I launched a RAMD run on a >200,000-atom system with
>>>
>>> charmrun $NAMD_HOME/bin/namd2 ++local +p6 +idlepoll ++verbose filename.conf 2>&1 | tee filename.log
>>>
>>> It runs fine, approximately ten times faster (judging from the "The last
>>> velocity output" messages written every ten steps) than an 8-CPU
>>> shared-memory machine with dual Opteron 2.2 GHz.
>>>
>>> I did nothing to indicate which GTX-470 to use. Can both be used?
>>> Is the performance the same with the nvidia-provided
>>> cuda driver as with the one available with the OS (Debian)? Sorry for the
>>> last two naive questions, perhaps resulting from the stress of the
>>> enterprise. I assume that "nvidia-smi" is of no use for these graphics
>>> cards.
>>>
>>> Thanks a lot for advice
>>>
>>> francesco pietra
>>>
>>>
>>>
>>
>>
>>
>> --
>> Dr. Axel Kohlmeyer
>> akohlmey_at_gmail.com  http://goo.gl/1wk0
>>
>> Institute for Computational Molecular Science
>> Temple University, Philadelphia PA, USA.
>>
>>

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 05:24:02 CST