Re: Why CPU Usage is low when I run ibverbs-smp-cuda version NAMD

From: Bin He (binhe_at_hustunique.com)
Date: Tue Nov 11 2014 - 06:19:06 CST

Hi,

Thanks a lot for your kind reply.

I am sorry that the timing data I provided earlier was confusing.

So I tested again with the default binaries (downloaded from the NAMD website).

The binaries I used:

NAMD_2.10b1_Linux-x86_64-multicore-CUDA.tar.gz

NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA.tar.gz

Hardware (per node):

CPU: E5-2670 x 2

GPU: K20m x 2

Network: IB

Commands:

1 node with multicore-CUDA:

./namd2 +p16 +devices 0,1 ../workload/f1atpase2000/f1atpase.namd

1 node with ibverbs-smp-CUDA:

/home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/charmrun ++p 16 ++ppn 8 ++nodelist nodelist ++scalable-start ++verbose /home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/namd2 +devices 0,1 /home/gpuusr/binhe/namd/workload/f1atpase2000/f1atpase.namd

With "++local", the application can not start. So I have to run with
nodelist.

nodelist content (node330 is listed twice so that charmrun starts two
++ppn 8 processes on the node, one per GPU):

  group main ++shell ssh
  host node330
  host node330

2 nodes with ibverbs-smp-CUDA:

/home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/charmrun ++p 32 ++ppn 8 ++nodelist nodelist2node ++scalable-start ++verbose /home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/namd2 +devices 0,1 /home/gpuusr/binhe/namd/workload/f1atpase2000/f1atpase.namd

nodelist2node content:

  group main ++shell ssh
  host node330
  host node330
  host node329
  host node329

4 nodes with ibverbs-smp-CUDA:

/home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/charmrun ++p 64 ++ppn 8 ++nodelist nodelist4node ++scalable-start ++verbose /home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/namd2 +devices 0,1 /home/gpuusr/binhe/namd/workload/f1atpase2000/f1atpase.namd

nodelist4node content:

  group main ++shell ssh
  host node330
  host node330
  host node329
  host node329
  host node328
  host node328
  host node332
  host node332

Timings

f1atpase (numsteps 2000; outputEnergies 100):

version             CPU/node   GPU/node   nodes   time (s)
multicore-CUDA      16         2          1       90
ibverbs-smp-CUDA    16         2          1       111.24
ibverbs-smp-CUDA    16         2          2       60
ibverbs-smp-CUDA    16         2          4       35

Actually, the ibverbs-smp-CUDA version scales reasonably well. BUT the CPU
usage reported by top:
Cpu(s): 53.1%us, 29.0%sy, 0.0%ni, 17.9%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
shows that not all computing resources are used well.
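
To see where the system and idle time sit, per-core sampling helps. A
sketch (assuming the sysstat package is installed on the node):

# sample all cores once per second, five times, while namd2 is running
mpstat -P ALL 1 5

If the %sys time is concentrated on a few cores per node, those are
probably the communication threads rather than the worker threads.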

We can also see that ibverbs-smp-CUDA is slower than multicore-CUDA on a
single node. Yes, network bandwidth and latency may contribute, but the
ibverbs version without CUDA scales well, and its CPU usage is excellent
when running on several nodes.

So I do not think network bandwidth and latency are the key cause. How can
I increase the CPU usage and speed up NAMD?

Thanks

Binhe

------------------------
Best Regards!
Bin He
Member of IT
Unique Studio
Room 811,Building LiangSheng,1037 Luoyu Road, Wuhan 430074,P.R. China
☎:(+86) 13163260252
Weibo:何斌_HUST
Email:binhe_at_hustunique.com
Email:binhe22_at_gmail.com

2014-11-11 16:48 GMT+08:00 Norman Geist <norman.geist_at_uni-greifswald.de>:

> Ok, you actually DON'T have a problem! You are comparing apples with
> oranges. To compare the performance of different binaries, you SHOULD use
> the same hardware. So you would want to test the ibverbs version on the
> machine with 4 GPUs + 16 cores, or, vice versa, the multicore binary on
> one of the 2-GPU/12-core nodes.
>
>
>
> Apart from that, using multiple nodes introduces a new bottleneck:
> network bandwidth and latency. So you will always have losses due to the
> additional overhead and your CPUs spending time waiting for communication
> rather than working. This varies with system size (Amdahl's law). BUT
> actually your scaling isn't that bad: from 2 to 4 nodes it scales by 46%
> instead of the ideal 50% (you are missing the 1-node case, btw.).
>
>
>
> So don't worry about CPU usage, only about the actual timings. Also try
> namd2 with "+idlepoll", which can improve parallel scaling across the
> network.
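>
> For example, a sketch of the charmrun line with the flag added (paths and
> process counts elided):
>
> charmrun ... namd2 +idlepoll +devices 0,1 f1atpase.namd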
>
> Also, for CUDA and small systems, try this in the config:
>
>
>
> twoawayx yes
>
>
>
> Only if that brings an improvement, try
>
>
>
> twoawayx yes
>
> twoawayy yes
>
>
>
> Only if that brings an improvement, try
>
>
>
> twoawayx yes
>
> twoawayy yes
>
> twoawayz yes
>
>
>
> In most cases twoawayx alone is enough or already too much.
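>
> For example, step one would just add a single line to the simulation
> config (a sketch; NAMD config keywords are case-insensitive):
>
> # halve patch size along x, creating more patches to parallelize over
> twoAwayX yes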
>
>
>
> Norman Geist.
>
>
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> behalf of Bin He
> Sent: Monday, 10 November 2014 20:51
> To: Norman Geist
> Cc: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: Why CPU Usage is low when I run ibverbs-smp-cuda
> version NAMD
>
>
>
> 1. Using the servers mentioned above, I got the result:
>
>
>
> multicores-CUDA:
>
> GPU   CORE   TIME
> 4     16     64s
>
> ibverbs-smp-cuda:
>
> GPU          CORE          NODE   TIME
> 2 per node   12 per node   2      57s
> 2 per node   12 per node   4      37s
>
>
>
> When running ibverbs-smp-cuda, the CPU user usage is less than 50% and
> the system usage is about 30%.
>
> The CPU usage looks very poor. What I want to do is find the reason why
> the CPU usage is so strange.
>
>
>
> 2. If I want to get the best performance with CUDA, which parameters in
> the config file can I modify?
>
>
>
>
>
>
>
>
> ------------------------
> Best Regards!
> Bin He
> Member of IT
> Unique Studio
> Room 811,Building LiangSheng,1037 Luoyu Road, Wuhan 430074,P.R. China
> ☎:(+86) 13163260252
> Weibo:何斌_HUST
> Email:binhe_at_hustunique.com
> Email:binhe22_at_gmail.com
>
>
>
>
>
> 2014-11-10 14:53 GMT+08:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
> What you observe might be expected, as the CUDA code of NAMD is
> officially tuned for the multicore version. BUT do you actually notice
> any performance difference regarding time/step?
>
>
>
> Norman Geist.
>
>
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> behalf of Bin He
> Sent: Saturday, 8 November 2014 08:25
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: Why CPU Usage is low when I run ibverbs-smp-cuda
> version NAMD
>
>
>
> Hi everyone,
>
>
>
> I am new to NAMD.
>
>
>
> A description of our cluster:
>
> CPU: E5-2670 (8 cores)
>
> Memory: 32 GB
>
> Sockets: 2
>
> Network: IB
>
> GPU: K20m x 2
>
> CUDA: 6.5
>
> Workload: f1atpase (numsteps 2000)
>
>
>
> When I run the multicore NAMD version, the CPU usage is about 100% and
> the GPU usage is about 50%.
>
> CMD: ./namd2 +p16 +devices 0,1 ../workload/f1atpase/f1atpase.namd
>
> CPU time is about 88 s.
>
> When I run the ibverbs-smp-cuda version, the CPU usage is only about
> 40% us and 30% sy. The GPU usage is about 50%.
>
> CMD: /home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/charmrun
> ++p 60 ++ppn 15 ++nodelist nodelist ++scalable-start ++verbose
> /home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/namd2
> +devices 0,1 /home/gpuusr/binhe/namd/workload/f1atpase/f1atpase.namd
>
> CPU time is about 37 s.
>
>
>
> When I try to use +setcpuaffinity, the result is worse.
>
> So what is wrong with my setup?
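>
> For reference, a minimal sketch of an explicit mapping I could try
> instead of plain +setcpuaffinity (hypothetical values: 16 cores per node,
> two processes of 7 worker threads each, communication threads pinned to
> cores 0 and 8):
>
> charmrun ++p 14 ++ppn 7 ++nodelist nodelist namd2 +setcpuaffinity +pemap 1-7,9-15 +commap 0,8 +devices 0,1 f1atpase.namd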
>
> Thanks
>
>
> ------------------------
> Best Regards!
> Bin He
> Member of IT
> Unique Studio
>
>

This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:21:21 CST