Re: slow down when running 2 simulations on 1 node

From: Gerald Keller (gerald.keller_at_uni-wuerzburg.de)
Date: Sun Dec 29 2019 - 07:29:48 CST

Hi all,

here is an update for everyone who replied.

I am now using CPU affinity with +pemap, distributing the processes only over physical cores (no HT) and keeping at
least one physical core idle.

The slowdown still occurs when I start a second NAMD run on a GPU server that has only one CPU socket.

I then tried to run two NAMD jobs on a GPU server that has two CPU sockets. For each NAMD instance I chose physical
cores belonging to a single socket: with 10 physical cores per socket, I used 9 physical cores on socket 1 for NAMD
run 1 and 9 physical cores on socket 2 for NAMD run 2 (on GPU 0 and GPU 1, respectively; see the sketch below).
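
For illustration, the two launches look roughly like this (core IDs, GPU indices, and file names are placeholders
rather than my exact command lines; the actual per-socket core numbering should be checked with lscpu first):

namd2 +p 9 +setcpuaffinity +pemap 0-8 +devices 0 +idlepoll run1.namd > run1.log &
namd2 +p 9 +setcpuaffinity +pemap 10-18 +devices 1 +idlepoll run2.namd > run2.log &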

In the two-socket case, the simulation speed shows no slowdown at all!

Too bad that, in my case, this seems to work only on multi-socket servers.

Best regards
Gerald

>>> Miro Astore <miro.astore_at_gmail.com> 27.12.19 23.43 Uhr >>>
I've had this issue even when I reduced write frequencies by 10x. I don't think this is the bottleneck.

On Sat, Dec 28, 2019 at 05:01, Bryan Roessler <bryanroessler_at_gmail.com> wrote:

Depending on your output frequency settings, you may also be running into a disk I/O bottleneck.

On Wed, Dec 18, 2019 at 7:08 AM Norman Geist <norman.geist_at_uni-greifswald.de> wrote:

You’d also need to find out the real physical->logical core mapping (HT); a quick way to check it is sketched below
the examples. Each physical core is split into two virtual cores, so using two virtual cores that map to the same
physical core will result in a slowdown. This layout can differ between architectures and can be very tricky with
multiple sockets (not in your case). In your case the mapping can be one of two options (simplified):
 
1. The mapping is 0-13 real and 14-27 virtual
2. The mapping is even IDs real, odd IDs virtual
 
You want to use physical cores exclusively.
 
So for option 1, you would map the following for two simulations, each using 6 cores:
 
1st replica: 0-5
2nd replica: 6-11
 
So for option 2, you would map the following for two simulations, each using 6 cores (command-line sketch below):
 
1st replica: 0,2,4,6,8,10
2nd replica: 12,14,16,18,20,22
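
As a rough sketch (hypothetical config file names, assuming layout option 2 and one GPU per run), the two launches
would then look like:

namd2 +p 6 +setcpuaffinity +pemap 0,2,4,6,8,10 +devices 0 +idlepoll sim1.conf
namd2 +p 6 +setcpuaffinity +pemap 12,14,16,18,20,22 +devices 1 +idlepoll sim2.conf

Which layout your machine actually uses can be checked e.g. with "lscpu --extended" or by reading
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list (the IDs listed there share one physical core).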
 
Apart from this, you can still hit a memory-bandwidth bottleneck when too many cores are active.
 
Bests
Norman Geist
 
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Gerald Keller
Sent: Wednesday, December 18, 2019 12:33
To: dhardy_at_ks.uiuc.edu; namd-l_at_ks.uiuc.edu
Cc: giacomo.fiorin_at_gmail.com
Subject: Re: namd-l: slow down when running 2 simulations on 1 node

 
Hi Dave,

 

the GPUs are selected properly. NAMD is executed via a bash script that, among other things, sets the environment
variable CUDA_VISIBLE_DEVICES.
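
The script is essentially along these lines (a simplified sketch, not the actual script; the script name and file
names are placeholders):

#!/bin/bash
# first argument selects the GPU; only this device will be visible to NAMD
export CUDA_VISIBLE_DEVICES=$1
shift
# all remaining arguments are passed through to namd2
namd2 +idlepoll "$@"

called e.g. as: ./run_namd.sh 0 +p 12 +setcpuaffinity +pemap 0-11 config.namd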

 

Best regards

Gerald

 

 

>>> David Hardy <dhardy_at_ks.uiuc.edu> 12/16/19 7:36 PM >>>
Hi Gerald,

 

I think your slowdown might be due to accidentally using both GPUs for each process.

 

By default, NAMD will use all devices that it finds. You should add "+devices 0" to the first NAMD invocation to
restrict it to GPU 0, and "+devices 1" to the second to restrict it to GPU 1.
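
For example, something like the following (config file names are just placeholders):

namd2 +p 18 +devices 0 replica1.namd > replica1.log
namd2 +p 18 +devices 1 replica2.namd > replica2.log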
 

NAMD is already CPU-intensive enough on each thread that it generally does not benefit from hyperthreading.

 

Best regards,

Dave

 

--
David J. Hardy, Ph.D.
Beckman Institute
University of Illinois at Urbana-Champaign
405 N. Mathews Ave., Urbana, IL 61801
dhardy_at_ks.uiuc.edu, http://www.ks.uiuc.edu/~dhardy/
 
On Dec 14, 2019, at 9:51 AM, Gerald Keller <gerald.keller_at_uni-wuerzburg.de> wrote:
 
Thank you all for your suggestions! 
 
I tried setting CPU affinity but the simulation speed still slows down when starting the second replica.
 
On a node with an Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz (1 socket, 14 cores, 28 with hyperthreading) I tried:
 
For the first replica on GPU 0 I used: namd2 +setcpuaffinity +pemap 0-11 +p 12 +idlepoll
The second on GPU 1: namd2 +setcpuaffinity +pemap 11-23 +p 12 +idlepoll
 
I also tried for the 2nd replica: namd2 +setcpuaffinity +pemap 11-23:2 +p 6 +idlepoll
 
Giacomo mentioned that hyperthreading has to be disabled. I thought NAMD supported hyperthreading?
 
Best regards
Gerald
>>> Giacomo Fiorin <giacomo.fiorin_at_gmail.com> 12.12.19 20.27 Uhr >>>
Hello Gerald, I would go with Victor's and Julio's suggestions, but also try making sure that HyperThreading is
disabled, i.e. that there are 40 physical CPU cores and not 20. In /proc/cpuinfo, look for the keyword "ht" among the
CPU flags; a quick check is sketched below.
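
For example, something like:

grep -o -m1 -w ht /proc/cpuinfo
lscpu | grep -i 'thread(s) per core'

The first line prints "ht" if the flag is advertised; the second reports 2 threads per core when HyperThreading is
active and 1 when it is off.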
 
It is likewise good to keep in mind that unless a program runs entirely on the GPU, transferring data between the GPU
and the CPU goes via circuitry that is most of the time shared among the devices on one motherboard.
 
Giacomo
 
On Thu, Dec 12, 2019 at 2:14 PM Julio Maia <jmaia_at_ks.uiuc.edu> wrote:
Hi, 
If you’re not setting the correct affinities, PEs from different replicas might compete for the same cores in your
machine.
Please try to set CPU affinities for PEs for each replica and try again. You can check how it’s done here:
https://www.ks.uiuc.edu/Research/namd/2.13/ug/node105.html
 
Thanks,
 
 
On Dec 12, 2019, at 2:09 AM, Gerald Keller <gerald.keller_at_uni-wuerzburg.de> wrote:
 
Hi everyone, 
in our working group we compute on our own GPU nodes, with no queueing system, and we do not run jobs across multiple nodes.
When we run two replicas of plain MD on one node with a total of 2 GPUs and 40 CPUs, we noticed that the
simulation speed slows down when the second replica is started.
1x NAMD on 1 node using 1 GPU and 18 CPUs:
Info: Benchmark time: 18 CPUs 0.00742875 s/step 
Info: Benchmark time: 18 CPUs 0.0073947 s/step 
Info: Benchmark time: 18 CPUs 0.00747593 s/step
Info: Benchmark time: 18 CPUs 0.00752931 s/step
Info: Benchmark time: 18 CPUs 0.00744549 s/step
Info: Benchmark time: 18 CPUs 0.00746218 s/step
TIMING: 500  CPU: 3.86542, 0.0073741/step  Wall: 3.90971, 0.0074047/step
TIMING: 980  CPU: 7.43293, 0.00730715/step  Wall: 7.49914, 0.00738945/step
TIMING: 1000  CPU: 7.58503, 0.007605/step  Wall: 7.65193, 0.0076393/step
TIMING: 1500  CPU: 11.2973, 0.0073617/step  Wall: 11.3969, 0.00763561/step
TIMING: 2000  CPU: 15.0195, 0.00745355/step  Wall: 15.1411, 0.0075375/step
2x NAMD on 1 node, 1 GPU and 18 CPUs for each replica:
Info: Benchmark time: 18 CPUs 0.0115988 s/step
Info: Benchmark time: 18 CPUs 0.0116316 s/step
Info: Benchmark time: 18 CPUs 0.0118586 s/step
Info: Benchmark time: 18 CPUs 0.0115375 s/step
Info: Benchmark time: 18 CPUs 0.0114114 s/step
Info: Benchmark time: 18 CPUs 0.0117798 s/step
TIMING: 500  CPU: 6.0915, 0.0113823/step  Wall: 6.18421, 0.0114815/step
TIMING: 1000  CPU: 11.8594, 0.0126053/step  Wall: 12.0109, 0.0127244/step
TIMING: 1500  CPU: 17.564, 0.0114935/step  Wall: 17.7579, 0.0116048/step
TIMING: 2000  CPU: 23.3157, 0.0119276/step  Wall: 23.5628, 0.0119936/step
If we run 1x NAMD on 1 node using 1 GPU and 18 CPUs and start another simulation with Amber on the other GPU, there is
no influence on the NAMD simulation speed.
Does anyone have an idea why this is happening and how to solve the problem? Because of limited resources, we sometimes
have to run only one simulation per GPU.
Thank you in advance for your suggestions!
Best regards
Gerald
 
--
Giacomo Fiorin
Associate Professor of Research, Temple University, Philadelphia, PA
Research collaborator, National Institutes of Health, Bethesda, MD
http://goo.gl/Q3TBQU
https://github.com/giacomofiorin
 
 
 

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2019 - 23:21:04 CST