Re: NAMD performance on a supercomputer with Intel Xeon Platinum 8160 and 100Gb Intel Omni-Path Full-Fat Tree

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Tue Nov 30 2021 - 09:50:28 CST

Vlad,

a few more comments and a possible explanation for your observations.

First off, it is **very** important to understand the parallelization
approach used in NAMD. Unlike most parallel scientific applications, NAMD
uses a more abstract parallelization scheme by running on top of the
charm++ library (which may use different modes of communication for
exchanging messages between CPU cores). With plain MPI and MPI+OpenMP
parallelization, an MD system is decomposed into subdomains matching the
number of MPI ranks (with threads optionally on top of that), so the
number of MPI ranks and threads per rank determines how the system is
split up. The MD loop then progresses mostly as if it were a serial
execution and inserts communication where necessary. This allows for
rather low communication overhead but introduces synchronization points
which can interfere with strong parallel scaling.
With charm++, however, you parallelize into a "virtual parallel machine"
with an (essentially arbitrarily) chosen number of "parallel compute
units". This is also referred to as over-decomposition. While this adds
some overhead, it is the basis of NAMD's superior ability to do load
balancing and to hide communication overhead by overlapping the
communication of completed work units with computation on others. This
works very well as long as the number of physical nodes remains small to
moderate, but becomes a problem at high node counts, when only a few work
units are left per CPU core. In that case the ability to hide
communication overhead is diminished, since there is little work left to
hide behind. This is where the "twoaway" options can help, because they
make the subdomains smaller and thus create more work units. That
increases the overhead somewhat, but this is usually more than compensated
for by the gains in load balancing and communication hiding. That is,
until you reach the limit of parallel scaling. Unlike MPI-based MD codes,
where using too many nodes only introduces a moderate overhead, NAMD's
performance will drop dramatically when using too many CPU cores for a
given system and its decomposition.
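
As an illustration only: the "twoaway" settings are plain keywords in the
NAMD configuration file. The option names below are taken from the NAMD
documentation, but whether and where they help is something you have to
benchmark for your particular system and node count; treat this as a
minimal sketch, not a recommendation.

  # hypothetical excerpt of a NAMD .conf file for a large system at high
  # node counts: split patches along x to create more, smaller work units
  twoAwayX      yes
  # enabling the other directions creates even more (and even smaller)
  # work units, which usually only pays off at very high core counts
  #twoAwayY     yes
  #twoAwayZ     yes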

Now about the drop in performance when using all CPU cores per node. This
is not uncommon for regular Linux clusters. There are multiple possible
explanations:
- That using 47 cores per node is faster than using all 48 cores is almost
always due to what is called "OS jitter" or "OS overhead". This means
there are other processes running on the compute nodes and they also need
to occupy the CPU occasionally. The scheduler in the kernel will try to
balance those to give all of them a "fair chance" to do their job, but
that can often lead to bouncing processes between CPU cores (and thus to
invalidating/refreshing CPU caches). This can be alleviated by using
processor and memory affinity, and if one CPU core is **not** used by the
application, the situation improves significantly, since all OS and kernel
processes will be bounced by the scheduler onto that one CPU core while
the remaining cores can be dedicated to the pinned compute processes (see
the example launch line after this list).
- With a large number of CPU cores per node, the available memory
bandwidth is shared and caches are also shared. That can be rather severe
for applications requiring a lot of reading and writing to memory, like
quantum chemical calculations. For MD, the size of the CPU cache is the
more relevant consideration. If you reduce the number of tasks per node to
half, each task effectively has double the CPU cache available, and if
your compute kernel has memory requirements just under/over that threshold
for the problem at hand, you may see a significant difference in per-core
performance.
- With a large number of CPU cores per node, the communication bandwidth
becomes a scarce resource that can be performance limiting, especially
when there are other users of the interconnect (like a parallel file
system) and when the communication bandwidth is limited by the technology
used. The considerations are similar to those for memory bandwidth. In
this case, careful benchmarking of the different communication options
that NAMD offers is advisable, either within the same binary or via
different compilation settings (where you could have a dedicated
communication process/thread); see the sketches after this list. What can
also make a difference is how individual nodes are placed/allocated by the
job scheduler and what the so-called bisection bandwidth of the cluster's
high-speed network is. The latter requires working with the cluster's
system administrators to figure out the optimal settings for the hardware
in question.
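
To make the affinity point concrete, here is a minimal sketch of a launch
line for a hypothetical SMP build of NAMD on one of these 2 x 24-core
nodes. The flags (+p, ++ppn, +setcpuaffinity, +pemap, +commap) are the
ones described in the NAMD/charm++ running instructions, but the process
layout and the core numbering are assumptions; check the actual core
numbering on the machine (e.g. with lstopo) and benchmark before adopting
anything like this:

  # hypothetical single-node example: one SMP process per socket,
  # 23 worker threads each; cores 0 and 24 are reserved for the
  # communication threads and whatever OS/kernel processes need to run
  # (assumes cores 0-23 are on socket 0 and 24-47 on socket 1)
  charmrun +p 46 ++ppn 23 ./namd2 +setcpuaffinity \
    +pemap 1-23,25-47 +commap 0,24 run.namd > run.log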
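
Similarly, the choice of communication layer and of an SMP build with a
dedicated communication thread per process is made when charm++ (and then
NAMD) is compiled. The target names below exist in recent charm++
versions, but which layer performs best on an Omni-Path machine is exactly
what needs benchmarking, and the compiler choice here is only a
placeholder:

  # hypothetical: build charm++ with one of the available network layers,
  # each as an SMP build (dedicated communication thread per process)
  ./build charm++ verbs-linux-x86_64 smp --with-production   # InfiniBand verbs
  ./build charm++ ucx-linux-x86_64   smp --with-production   # UCX
  ./build charm++ ofi-linux-x86_64   smp --with-production   # libfabric/OFI
  ./build charm++ mpi-linux-x86_64   smp --with-production   # plain MPI

  # then point the NAMD build at the chosen charm++ tree, e.g.
  ./config Linux-x86_64-g++ --charm-arch verbs-linux-x86_64-smp
  cd Linux-x86_64-g++ && make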

On Tue, Nov 30, 2021 at 10:14 AM Vlad Cojocaru <
vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:

> Thanks for your thoughts !
>
> One thing that seemed weird during our tests on this site was that the
> performance and parallel scaling rapidly degraded when using all 48 cores
> available per node (2 CPUs with 24 cores each) . We actually saw negative
> scaling after as few as 16 nodes. Then, when using 47, 32, and 24
> cores/node, we got better parallel efficiency up to higher node counts,
> with the best efficiency obtained using just half of the cores available
> on each node (24). In the end, when running on 48 nodes, we achieved the
> most ns/day when using 24 cores/node. However, the resources requested in
> the project had to be calculated assuming all 48 cores/node, regardless
> of how many we are actually using.
>
> I haven't experienced anything like this on other sites (similar systems,
> same configuration files). Normally using all cores available per node has
> always given the best performance. So, I am wondering whether there is
> anything obvious that could explain such behavior?
>
> Best
> Vlad
>
> On 11/30/21 15:33, Giacomo Fiorin wrote:
>
> Something in addition to what Axel says (all of which is absolutely true,
> even the counter-intuitive part about making the single-node performance
> artificially slower to get through the bottom-most tier of technical
> review).
>
> One possible issue to look at is how the cluster's network is utilized by
> other users/applications. In a local cluster that I use, the InfiniBand
> network is also used by the nodes to access data storage and there are many
> other users processing MRI, cryo-EM or bioinformatics data (all
> embarrassingly-parallel by design). So the InfiniBand network is
> constantly busy and does not necessarily offer very low latency for NAMD or
> other message-passing applications.
>
> Something that helped in that case was building Charm++ on top of the UCX
> library instead of IBverbs directly. I am wholly unfamiliar with the
> details of how UCX works, but in essence it provides better utilization of
> the network when the ratio of compute cores vs. network links is high. If
> the cluster's staff has a copy of UCX, try that. It wasn't easy to build,
> but it paid off specifically for those runs that were communication-bound.
>
> The main significant addition in 2.15 is the AVX-512 tiles algorithm,
> which would help with the most expensive Intel CPUs like those, but would
> also make the computation part faster with the caveat that Axel mentioned.
>
> Giacomo
>
> On Tue, Nov 30, 2021 at 6:16 AM Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:
>
>> Actually, if you compile NAMD with better optimization than the
>> system-provided executable, your parallel efficiency will go down. Please
>> recall Amdahl's law: the parallel efficiency is determined by the ratio of
>> time spent in parallel execution to time spent in serial execution.
>>
>> A better optimized executable will spend even less time computing and
>> thus have relatively more parallel overhead.
>>
>> To get better parallel efficiency, you have to avoid or reduce all
>> non-parallel operations (like output or the use of features such as Tcl
>> scripting), make your computations more expensive by increasing the cutoff
>> or the system size, or make the executable slower by compiling a less
>> optimized version.
>>
>> --
>> Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
>> College of Science & Technology, Temple University, Philadelphia PA, USA
>> International Centre for Theoretical Physics, Trieste, Italy
>>
>> On Tue, Nov 30, 2021, 05:32 Vlad Cojocaru <
>> vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:
>>
>>> Dear all,
>>>
>>> We submitted a proposal to run some extensive atomistic simulations with
>>> NAMD of systems ranging from 500 K to 2 M atoms on a supercomputer
>>> with Intel Xeon Platinum 8160 processors and a 100 Gb Intel Omni-Path
>>> full-fat-tree interconnect.
>>>
>>> Apparently, our project may fail the technical evaluation because during
>>> our tests we did not achieve 75 % parallel efficiency between 2 and 48
>>> nodes (each node has 2 CPUs with 24 cores each). We have tested the NAMD
>>> 2.14 provided by default at the site and we do not know how this was
>>> built. Looking at the NAMD benchmarks available for the Frontera
>>> supercomputer (quite similar architecture, if I understand it correctly,
>>> but for larger systems), it seems we should achieve much better
>>> performance and parallel efficiency up to 48/64 nodes on this
>>> architecture with NAMD 2.15 (maybe even 2.14) than we actually did in
>>> our tests.
>>>
>>> So, my reasoning is that probably the NAMD built by default was not
>>> really carefully optimized.
>>>
>>> I would appreciate it if anyone who has experience with building and
>>> optimizing NAMD on such an architecture could recommend
>>> compiler/MPI/configuration options for building a NAMD with better
>>> performance and parallel efficiency. If I have some clear ideas about
>>> how to optimize NAMD, maybe I could make the case for our project not to
>>> fail the technical evaluation.
>>>
>>> Thank you very much for any advice
>>>
>>> Best wishes
>>> Vlad
>>>
>>>
>>>
>>> --
>>> Vlad Cojocaru, PD (Habil.), Ph.D.
>>> -----------------------------------------------
>>> Project Group Leader
>>> Department of Cell and Developmental Biology
>>> Max Planck Institute for Molecular Biomedicine
>>> Röntgenstrasse 20, 48149 Münster, Germany
>>> -----------------------------------------------
>>> Tel: +49-251-70365-324; Fax: +49-251-70365-399
>>> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
>>>
>>> http://www.mpi-muenster.mpg.de/43241/cojocaru
>>>
>>>
>>>
> --
> Vlad Cojocaru, PD (Habil.), Ph.D.
> -----------------------------------------------
> Project Group Leader
> Department of Cell and Developmental Biology
> Max Planck Institute for Molecular Biomedicine
> Röntgenstrasse 20, 48149 Münster, Germany
> -----------------------------------------------
> Tel: +49-251-70365-324; Fax: +49-251-70365-399
> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
> http://www.mpi-muenster.mpg.de/43241/cojocaru
>
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste, Italy.

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:12 CST