From: Josh Vermaas (joshua.vermaas_at_gmail.com)
Date: Thu Mar 19 2020 - 10:59:06 CDT
The serious developers may know more, but I have two ways of rationalizing this:
1. You may be hitting the limits of PCI transfers between CPU and GPU that
NAMD has conventionally done every timestep. Timestep integration in 2.13
is on the CPU, with many force calculations on the GPU. So every step all
the positions need to be uploaded to the GPU and the forces need to be
downloaded to the CPU. If it's a bandwidth issue, there will be a system
size (20k atoms?) where this performance loss goes away.
2. Your +pemap argument is squirrely. Why isn't the second one 16-31?
You've got 32 cores total (might be reported as 64 in /proc/cpuinfo if it
includes hyperthread equivalents). On Intel systems I've worked with, the
second half of the reported CPUs are really just the hyperthreads for the
first set of cores, so 32-47 would map onto the same core set as your first
simulation I think. I don't know if AMD does this differently, but I'd also
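
One way to check point 2 rather than guess: on Linux, the kernel lists the
hyperthread siblings of each logical CPU in
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The sketch below
reads those files and reports which logical CPUs in the second map share a
physical core with the first. The function names are made up for this
example, and the parser only handles plain comma/range lists like "0-15", not
the full +pemap syntax, so treat it as a rough check and not a NAMD tool.

```python
from pathlib import Path


def parse_cpu_list(spec):
    """Expand a plain CPU list such as "0-15,32-47" into a set of ints."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_siblings(sysfs_root="/sys/devices/system/cpu"):
    """Map each logical CPU to the logical CPUs on the same physical core."""
    siblings = {}
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        cpu = int(path.parent.parent.name[3:])  # directory name is "cpuN"
        siblings[cpu] = parse_cpu_list(path.read_text())
    return siblings


def shared_cores(pemap_a, pemap_b, siblings):
    """Logical CPUs in pemap_b whose physical core is also used by pemap_a."""
    used = set()
    for cpu in parse_cpu_list(pemap_a):
        # Fall back to the CPU itself if no topology data is available.
        used |= siblings.get(cpu, {cpu})
    return sorted(cpu for cpu in parse_cpu_list(pemap_b) if cpu in used)


if __name__ == "__main__":
    overlap = shared_cores("0-15", "32-47", read_siblings())
    print("CPUs in the second map sharing a core with the first:", overlap)
```

An empty list means the two maps land on different physical cores on that
machine; a non-empty one means the two runs are competing for the same cores.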
On Thu, Mar 19, 2020 at 11:32 AM Stefano Guglielmo <
> Dear all,
> thanks for your advice and sorry for my late reply. I finally managed to
> optimize performance for a single simulation.
> Now I am trying to run two simulations in parallel using NAMD 2.13
> multicore-CUDA version. I used the following option to run the two
> +p16 +idlepoll +setcpuaffinity +devices 0 +pemap 0-15
> +p16 +idlepoll +setcpuaffinity +devices 1 +pemap 32-47.
> For two systems of comparable dimension I observed a sizeable performance
> loss when starting the second simulation (from 0.017 s/step to 0.028
> s/step). In your opinion is this reasonable or shall I tune some options
> differently/use a different version of NAMD?
> Thanks in advance for sharing advice,
> all the best
> On Thu, Mar 5, 2020 at 22:03 Josh Vermaas <
> joshua.vermaas_at_gmail.com> wrote:
>> Don't forget to compare against multicore builds. On one node with shared
>> memory, those builds often win for maximum single-GPU throughput. Since you have
>> 2 on the same node, an SMP build without communication threads may win.
>> On Thu, Mar 5, 2020, 10:23 AM Victor Kwan <vkwan8_at_uwo.ca> wrote:
>>> Hi Stefano,
>>> Since you already have a system in mind, you can compare the time it
>>> takes to perform a 10ps simulation with different setups.
>>> > one or both gpu, number of cores
>>> * With NAMD 2.13 comes a large improvement in dual gpu/single node
>>> performance and we observe almost linear scaling when going from 1 to 2
>>> * 16core/GPU is sufficient, from our experience 6-8core/GPU is the lower
>>> * For GPU runs, hyperthreading should not affect performance.
>>> > pemap/commap options
>>> * check the topology matrix printed by nvidia-smi topo -m - leaving CPU/GPU affinity
>>> as default should be fine.
>>> On Thu, Mar 5, 2020 at 10:12 AM Stefano Guglielmo <
>>> stefano.guglielmo_at_unito.it> wrote:
>>>> Dear NAMD users,
>>>> I am using a workstation with an AMD Ryzen Threadripper 2990WX 32-Core
>>>> Processor, 128 GB RAM and two RTX 2080 Ti cards with NVlink; I am here to
>>>> ask for suggestions on what could be the "best" options to run a single
>>>> simulation on a 200K atom system with NAMD 2.13 (one or both gpu, number of
>>>> cores, hyperthreading or not, pemap/commap options...)
>>>> Thanks in advance for your time
>>>> Stefano GUGLIELMO PhD
>>>> Assistant Professor of Medicinal Chemistry
>>>> Department of Drug Science and Technology
>>>> Via P. Giuria 9
>>>> 10125 Turin, ITALY
>>>> ph. +39 (0)11 6707178
This archive was generated by hypermail 2.1.6 : Thu Dec 31 2020 - 23:17:13 CST