Re: slow down when running 2 simulations on 1 node

From: Miro Astore (miro.astore_at_gmail.com)
Date: Fri Dec 27 2019 - 16:42:25 CST

I've had this issue even when I reduced write frequencies by 10x. I don't
think this is the bottleneck.

On Sat, Dec 28, 2019 at 05:01, Bryan Roessler <bryanroessler_at_gmail.com>
wrote:

> Depending on your output frequency settings, you may also be running into
> a disk I/O bottleneck.
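>
> For example, these are the NAMD config keywords that control how often
> trajectory, restart, and energy output are written (values are purely
> illustrative):
>
> dcdfreq        5000
> restartfreq    5000
> outputEnergies 5000
> xstFreq        5000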
>
> On Wed, Dec 18, 2019 at 7:08 AM Norman Geist <
> norman.geist_at_uni-greifswald.de> wrote:
>
>> You'd also need to find out the real physical->logical core mapping (HT).
>> Each physical core is split into 2 virtual cores, so using two virtual
>> cores that map to the same physical core will result in a slowdown. This
>> layout can differ between architectures and can be very tricky with
>> multiple sockets (not in your case). In your case the mapping can be one
>> of two options (simplified); one way to check is sketched after the list:
>>
>>
>>
>> 1. The mapping is 0-13 real and 14-27 virtual
>>
>> 2. The mapping is each even core real, each odd core virtual
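>>
>>
>> One way to check which layout applies on a Linux node is to read the
>> hyperthread sibling lists, e.g. with lscpu or from sysfs:
>>
>> # logical CPUs that share a CORE id (or a siblings list) are
>> # hyperthreads of the same physical core
>> lscpu --extended=CPU,CORE,SOCKET
>> grep -H . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list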
>>
>>
>>
>> You want to use physical cores exclusively.
>>
>>
>>
>> For option 1, you'd use the following mapping for two simulations, each
>> using 6 cores:
>>
>>
>>
>> 1st replica: 0-5
>>
>> 2nd replica: 6-11
>>
>>
>>
>> For option 2, you'd use the following mapping for two simulations, each
>> using 6 cores (see the launch-line sketch after this list):
>>
>>
>>
>> 1st replica: 0,2,4,6,8,10
>>
>> 2nd replica: 12,14,16,18,20,22
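>>
>>
>> As a minimal sketch (config and log file names are placeholders, and the
>> +devices flag restricts each run to one GPU), the launch lines for option
>> 1 would be:
>>
>> namd2 +p 6 +setcpuaffinity +pemap 0-5 +devices 0 rep1.conf > rep1.log &
>> namd2 +p 6 +setcpuaffinity +pemap 6-11 +devices 1 rep2.conf > rep2.log &
>>
>> and for option 2 (0-10:2 expands to 0,2,4,6,8,10):
>>
>> namd2 +p 6 +setcpuaffinity +pemap 0-10:2 +devices 0 rep1.conf > rep1.log &
>> namd2 +p 6 +setcpuaffinity +pemap 12-22:2 +devices 1 rep2.conf > rep2.log &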
>>
>>
>>
>> Apart from this, you can still hit a memory bandwidth bottleneck when too
>> many cores are active.
>>
>>
>>
>> Bests
>>
>> Norman Geist
>>
>>
>>
>> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
>> Behalf Of *Gerald Keller
>> *Sent:* Wednesday, December 18, 2019 12:33
>> *To:* dhardy_at_ks.uiuc.edu; namd-l_at_ks.uiuc.edu
>> *Cc:* giacomo.fiorin_at_gmail.com
>> *Subject:* Re: namd-l: slow down when running 2 simulations on 1 node
>>
>>
>>
>> Hi Dave,
>>
>>
>>
>> the GPUs are selected properly. NAMD is executed with a bash script that
>> sets, among other things, the environment variable CUDA_VISIBLE_DEVICES.
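>>
>>
>> A minimal sketch of such a wrapper (file names are hypothetical):
>>
>> #!/bin/bash
>> # expose only GPU 0 to this NAMD process; the second replica's wrapper
>> # would export CUDA_VISIBLE_DEVICES=1 instead
>> export CUDA_VISIBLE_DEVICES=0
>> namd2 +p 12 +setcpuaffinity +pemap 0-11 +idlepoll rep1.conf > rep1.log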
>>
>>
>>
>> Best regards
>>
>> Gerald
>>
>>
>>
>>
>>
>> >>> David Hardy <dhardy_at_ks.uiuc.edu> 12/16/19 7:36 PM >>>
>>
>> Hi Gerald,
>>
>>
>>
>> I think your slowdown might be due to accidentally using both GPUs for
>> each process.
>>
>>
>>
>> By default, NAMD will use all devices that it finds. You should add
>> "+devices 0" to the first NAMD invocation to restrict it to GPU 0, and
>> "+devices 1" to the second to restrict it to GPU 1.
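>>
>>
>> For example (config file names are placeholders), the two runs would then
>> be started as:
>>
>> namd2 +p 18 +idlepoll +devices 0 first.conf
>> namd2 +p 18 +idlepoll +devices 1 second.conf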
>>
>>
>>
>> NAMD is already CPU-intensive enough on each thread that it generally
>> does not benefit from hyperthreading.
>>
>>
>>
>> Best regards,
>>
>> Dave
>>
>>
>>
>> --
>>
>> David J. Hardy, Ph.D.
>>
>> Beckman Institute
>>
>> University of Illinois at Urbana-Champaign
>>
>> 405 N. Mathews Ave., Urbana, IL 61801
>>
>> dhardy_at_ks.uiuc.edu, http://www.ks.uiuc.edu/~dhardy/
>>
>>
>>
>> On Dec 14, 2019, at 9:51 AM, Gerald Keller <
>> gerald.keller_at_uni-wuerzburg.de> wrote:
>>
>>
>>
>> Thank you all for your suggestions!
>>
>>
>>
>> I tried setting CPU affinity, but the simulation speed still slows down
>> when the second replica starts.
>>
>>
>>
>> On a node with an Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz (1 socket, 14
>> cores, 28 with hyperthreading) I tried:
>>
>>
>>
>> For the first replica on GPU 0 I used: namd2 +setcpuaffinity +pemap 0-11
>> +p 12 +idlepoll
>>
>> The second on GPU 1: namd2 +setcpuaffinity +pemap 11-23 +p 12 +idlepoll
>>
>>
>>
>> also tried:
>>
>>
>>
>> 1st replica: namd2 +setcpuaffinity +pemap 0-11:2 +p 6 +idlepoll
>>
>> 2nd replica: namd2 +setcpuaffinity +pemap 11-23:2 +p 6 +idlepoll
>>
>>
>>
>> Giacomo mentioned that hyperthreading has to be disabled. I thought NAMD
>> would support hyperthreading?
>>
>>
>>
>> Best regards
>>
>> Gerald
>>
>>
>>
>> >>> Giacomo Fiorin <giacomo.fiorin_at_gmail.com> 12/12/19 8:27 PM >>>
>>
>> Hello Gerald, I would go with Victor's and Julio's suggestions, but also
>> try to make sure that HyperThreading is disabled, i.e. that there are 40
>> physical CPU cores and not 20. In /proc/cpuinfo, look for the keyword
>> "ht" among the CPU features.
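>>
>>
>> For example:
>>
>> lscpu | grep -i 'thread(s) per core'
>> grep -wo ht /proc/cpuinfo | sort -u
>>
>> "Thread(s) per core: 2" means HyperThreading is enabled, so only half of
>> the CPUs reported by the OS are physical cores.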
>>
>>
>>
>> It is likewise good to keep in mind that unless a program runs entirely
>> on the GPU, data transfers between the GPU and the CPU go through
>> circuitry that is most of the time shared among the devices on one
>> motherboard.
>>
>>
>>
>> Giacomo
>>
>>
>>
>> On Thu, Dec 12, 2019 at 2:14 PM Julio Maia <jmaia_at_ks.uiuc.edu> wrote:
>>
>> Hi,
>>
>> If you’re not setting the correct affinities, PEs from different replicas
>> might compete for the same cores in your machine.
>>
>> Please try to set CPU affinities for PEs for each replica and try again.
>> You can check how it’s done here:
>> https://www.ks.uiuc.edu/Research/namd/2.13/ug/node105.html
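>>
>>
>> A quick way to verify afterwards which cores each run was actually pinned
>> to (assuming the processes are named namd2) is:
>>
>> for pid in $(pgrep namd2); do taskset -cp "$pid"; done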
>>
>>
>>
>> Thanks,
>>
>>
>>
>>
>>
>> On Dec 12, 2019, at 2:09 AM, Gerald Keller <
>> gerald.keller_at_uni-wuerzburg.de> wrote:
>>
>>
>>
>> Hi everyone,
>>
>> in our working group we compute on our own GPU nodes, with no queue
>> system, and we do not compute across multiple nodes.
>> When we run two replicas of plain MD on 1 node with 2 GPUs and 40 CPUs in
>> total, we noticed that the simulation speed slows down when the second
>> replica starts.
>>
>> 1x NAMD on 1 node using 1 GPU and 18 CPUs:
>>
>> Info: Benchmark time: 18 CPUs 0.00742875 s/step
>> Info: Benchmark time: 18 CPUs 0.0073947 s/step
>> Info: Benchmark time: 18 CPUs 0.00747593 s/step
>> Info: Benchmark time: 18 CPUs 0.00752931 s/step
>> Info: Benchmark time: 18 CPUs 0.00744549 s/step
>> Info: Benchmark time: 18 CPUs 0.00746218 s/step
>>
>> TIMING: 500 CPU: 3.86542, 0.0073741/step Wall: 3.90971, 0.0074047/step
>> TIMING: 980 CPU: 7.43293, 0.00730715/step Wall: 7.49914, 0.00738945/step
>> TIMING: 1000 CPU: 7.58503, 0.007605/step Wall: 7.65193, 0.0076393/step
>> TIMING: 1500 CPU: 11.2973, 0.0073617/step Wall: 11.3969, 0.00763561/step
>> TIMING: 2000 CPU: 15.0195, 0.00745355/step Wall: 15.1411, 0.0075375/step
>>
>>
>> 2x NAMD on 1 node 1 GPU and 18 CPUs for each replica:
>>
>> Info: Benchmark time: 18 CPUs 0.0115988 s/step
>> Info: Benchmark time: 18 CPUs 0.0116316 s/step
>> Info: Benchmark time: 18 CPUs 0.0118586 s/step
>> Info: Benchmark time: 18 CPUs 0.0115375 s/step
>> Info: Benchmark time: 18 CPUs 0.0114114 s/step
>> Info: Benchmark time: 18 CPUs 0.0117798 s/step
>>
>> TIMING: 500 CPU: 6.0915, 0.0113823/step Wall: 6.18421, 0.0114815/step
>> TIMING: 1000 CPU: 11.8594, 0.0126053/step Wall: 12.0109, 0.0127244/step
>> TIMING: 1500 CPU: 17.564, 0.0114935/step Wall: 17.7579, 0.0116048/step
>> TIMING: 2000 CPU: 23.3157, 0.0119276/step Wall: 23.5628, 0.0119936/step
>>
>> If we run 1x NAMD on 1 node using 1 GPU and 18 CPUs and start another
>> simulation with Amber on the other GPU, there is no influence on the NAMD
>> simulation speed.
>>
>> Does anyone have an idea why this is happening and how to solve the
>> problem? Because of limited resources, we sometimes have to run only one
>> simulation per GPU.
>>
>> Thank you in advance for your suggestions!
>>
>> Best regards
>> Gerald
>>
>>
>>
>>
>>
>> --
>>
>> Giacomo Fiorin
>>
>> Associate Professor of Research, Temple University, Philadelphia, PA
>>
>> Research collaborator, National Institutes of Health, Bethesda, MD
>>
>> http://goo.gl/Q3TBQU
>> https://github.com/giacomofiorin
>>
>>
>>
>

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2019 - 23:21:04 CST