Re: NAMD performance on a supercomputer with Intel Xeon Platinum 8160 and 100Gb Intel Omni-Path Full-Fat Tree

From: James M Davis (jmdavis1_at_vcu.edu)
Date: Tue Nov 30 2021 - 12:25:46 CST

I should have said: build from scratch to support Omni-Path, or use the
Charm++ MPI layer with OpenMPI. There will still be testing involved to tune
the build and the performance. It still might not scale to 48 nodes x 48
cores, but you should be able to get to something better than 48 nodes x 24
cores.

----
Mike Davis
Technical Director: High Performance Research Computing
Virginia Commonwealth University
(804) 828-3885 (o) • (804) 307-3428(c)
https://chipc.vcu.edu
On Tue, Nov 30, 2021 at 1:07 PM James M Davis <jmdavis1_at_vcu.edu> wrote:
> A few notes from the 2.15 release notes. I think you will need to build
> from scratch for Omni-Path.
>
>> "Intel Omni-Path networks are incompatible with the pre-built verbs NAMD
>> binaries. Charm++ for verbs can be built with --with-qlogic to support
>> Omni-Path, but the Charm++ MPI network layer performs better than the verbs
>> layer.  Hangs have been observed with Intel MPI but not with OpenMPI, so
>> OpenMPI is preferred.  See "Compiling NAMD" below for MPI build
>> instructions.  NAMD MPI binaries may be launched directly with mpiexec
>> rather than via the provided charmrun script."
>
> https://www.ks.uiuc.edu/Research/namd/cvs/notes.html
>
> ----
> Mike Davis
> Technical Director: High Performance Research Computing
> Virginia Commonwealth University
> (804) 828-3885 (o) • (804) 307-3428(c)
> https://chipc.vcu.edu
>
>
> On Tue, Nov 30, 2021 at 1:00 PM Vermaas, Josh <vermaasj_at_msu.edu> wrote:
>
>> Hi Vlad,
>>
>>
>>
>> In addition to the great points Axel and Giacomo have made, I’d like to
>> point out that the 8160 is a 24-core processor, and that there are likely
>> 2 of them on a given node. In these two-socket configurations, where there
>> are two physical CPU dies, I’ve often found that the best performance is
>> achieved when you treat each socket as its own node and allocate twice as
>> many “tasks” as you have nodes. That way, each SMP task gets placed on its
>> own socket. If you don’t, each node tries to get all 48 cores across both
>> sockets to work together, which ends up saturating the UPI links between
>> the sockets and can be detrimental to performance.
>>
>>
>>
>> This is usually a bigger problem for SMP-based builds. In my experience,
>> CPU-only systems benefit from MPI-based builds, where the number of tasks
>> is equal to the number of CPU cores. Usually this is a performance win for
>> modestly sized systems, at the expense of scalability for really big
>> systems.
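>>
>> As a rough illustration (a sketch assuming an OpenMPI launcher on 48
>> dual-socket nodes; with a different scheduler or MPI the mapping flags
>> will differ), the two launch styles would look something like:
>>
>>   # SMP build: 2 ranks per node, one per socket, 23 workers + 1 comm thread each
>>   mpirun -np 96 --map-by ppr:1:socket --bind-to socket ./namd2 +ppn 23 mysim.namd
>>
>>   # pure MPI build: one rank per core
>>   mpirun -np 2304 --map-by core --bind-to core ./namd2 mysim.namd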
>>
>>
>>
>> -Josh
>>
>>
>>
>> *From: *<owner-namd-l_at_ks.uiuc.edu> on behalf of Vlad Cojocaru <
>> vlad.cojocaru_at_mpi-muenster.mpg.de>
>> *Organization: *MPI Muenster
>> *Reply-To: *"namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, Vlad Cojocaru <
>> vlad.cojocaru_at_mpi-muenster.mpg.de>
>> *Date: *Tuesday, November 30, 2021 at 10:18 AM
>> *To: *Giacomo Fiorin <giacomo.fiorin_at_gmail.com>, NAMD list <
>> namd-l_at_ks.uiuc.edu>, Axel Kohlmeyer <akohlmey_at_gmail.com>
>> *Cc: *HORIA-LEONARD BANCIU <horia.banciu_at_ubbcluj.ro>
>> *Subject: *Re: namd-l: NAMD performance on a supercomputer with Intel
>> Xeon Platinum 8160 and 100Gb Intel Omni-Path Full-Fat Tree
>>
>>
>>
>> Thanks for your thoughts !
>>
>> One thing that seemed weird during our tests on this site was that the
>> performance and parallel scaling degraded rapidly when using all 48 cores
>> available per node (2 CPUs with 24 cores each). We actually saw negative
>> scaling after as few as 16 nodes. Then, when using 47, 32, and 24
>> cores/node, we got better parallel efficiency at higher node counts, with
>> the best efficiency obtained using just half of the cores available on
>> each node (24). In the end, when running on 48 nodes, we achieved the most
>> ns/day when using 24 cores/node. However, the resources requested in the
>> project had to be calculated assuming all 48 cores/node, regardless of how
>> many we actually use.
>>
>> I haven't experienced anything like this on other sites (similar systems,
>> same configuration files). Normally using all cores available per node has
>> always given the best performance. So, I am wondering whether there is
>> anything obvious that could explain such behavior?
>>
>> Best
>> Vlad
>>
>> On 11/30/21 15:33, Giacomo Fiorin wrote:
>>
>> Something in addition to what Axel says (all of which is absolutely true,
>> even the counter-intuitive part about making the single-node performance
>> artificially slower to get through the bottom-most tier of technical
>> review).
>>
>>
>>
>> One possible issue to look at is how the cluster's network is utilized by
>> other users/applications.  In a local cluster that I use, the InfiniBand
>> network is also used by the nodes to access data storage and there are many
>> other users processing MRI, cryo-EM or bioinformatics data (all
>> embarrassingly-parallel by design).  So the InfiniBand network is
>> constantly busy and does not necessarily offer very low latency for NAMD or
>> other message-passing applications.
>>
>>
>>
>> Something that helped in that case was building Charm++ on top of the UCX
>> library instead of IBverbs directly.  I am wholly unfamiliar with the
>> details of how UCX works, but in essence it provides better utilization of
>> the network when the ratio of compute cores vs. network links is high.  If
>> the cluster's staff has a copy of UCX, try that.  It wasn't easy to build,
>> but it paid off specifically for those runs that were communication-bound.
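>>
>> A Charm++ build on the UCX machine layer looks roughly like this (a
>> sketch; the ompipmix process-manager option is an assumption and depends
>> on how jobs are launched on your cluster):
>>
>>   ./build charm++ ucx-linux-x86_64 ompipmix --with-production
>>
>> with NAMD then configured using --charm-arch ucx-linux-x86_64-ompipmix.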
>>
>>
>>
>> The most significant addition in 2.15 is the AVX-512 tiles algorithm,
>> which would help with high-end Intel CPUs like those by making the
>> computation part faster, with the caveat that Axel mentioned.
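>>
>> If you try 2.15, the AVX-512 path is chosen at build time; a sketch,
>> assuming the Intel compiler and that your NAMD tree ships a
>> Linux-AVX512-icc arch file (check the arch/ directory first):
>>
>>   ./config Linux-AVX512-icc --charm-arch ucx-linux-x86_64-ompipmix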
>>
>>
>>
>> Giacomo
>>
>>
>>
>> On Tue, Nov 30, 2021 at 6:16 AM Axel Kohlmeyer <akohlmey_at_gmail.com>
>> wrote:
>>
>> Actually, if you compile NAMD with better optimization than the
>> system-provided executable, your parallel efficiency will go down. Please
>> recall Amdahl's law: parallel efficiency is determined by the ratio of
>> time spent in parallel execution versus serial execution.
>>
>>
>>
>> A better-optimized executable will spend even less time computing and
>> thus have proportionally more parallel overhead.
>>
>>
>>
>> To get better parallel efficiency, you have to avoid or reduce
>> non-parallel operations (such as output or Tcl scripting), make your
>> computation more expensive (by increasing the cutoff or the system size),
>> or make the executable slower by compiling a less optimized version.
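>>
>> To put numbers on it: with serial fraction s, Amdahl's law gives speedup
>> S(N) = 1 / (s + (1 - s)/N), so parallel efficiency E(N) = S(N)/N =
>> 1 / (N*s + 1 - s). Even with only 1% serial work (s = 0.01), E(48) =
>> 1 / (0.48 + 0.99) is about 0.68, already below a 75% threshold, and a
>> faster-computing executable effectively increases s.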
>>
>> --
>> Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
>> College of Science & Technology, Temple University, Philadelphia PA, USA
>> International Centre for Theoretical Physics, Trieste, Italy
>>
>>
>>
>> On Tue, Nov 30, 2021, 05:32 Vlad Cojocaru <
>> vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:
>>
>> Dear all,
>>
>> We submitted a proposal to run some extensive atomistic simulations with
>> NAMD of systems ranging between 500 K and 2 M atoms on a supercomputer
>> with Intel Xeon Platinum 8160 processors and a 100 Gb Intel Omni-Path
>> full-fat-tree interconnect.
>>
>> Apparently, our project may fail the technical evaluation because during
>> our tests we did not achieve 75% parallel efficiency between 2 and 48
>> nodes (each node has 2 CPUs - 24 cores/CPU). We tested the NAMD 2.14
>> provided by default at the site, and we do not know how it was built.
>> Looking at the NAMD benchmarks available for the Frontera supercomputer
>> (quite similar architecture, if I understand it correctly, but for larger
>> systems), it seems that with NAMD 2.15 (maybe even 2.14) we should achieve
>> much better performance and parallel efficiency up to 48-64 nodes on this
>> architecture than we actually did in our tests.
>>
>> So, my reasoning is that the default NAMD build was probably not
>> carefully optimized.
>>
>> I would appreciate it if anyone with experience building and optimizing
>> NAMD on such an architecture could recommend compilers, MPI libraries, or
>> configuration options for building NAMD with better performance and
>> parallel efficiency. If I have some clear ideas about how to optimize
>> NAMD, maybe I could make the case for our project not to fail the
>> technical evaluation.
>>
>> Thank you very much for any advice
>>
>> Best wishes
>> Vlad
>>
>>
>>
>> --
>> Vlad Cojocaru, PD (Habil.), Ph.D.
>> -----------------------------------------------
>> Project Group Leader
>> Department of Cell and Developmental Biology
>> Max Planck Institute for Molecular Biomedicine
>> Röntgenstrasse 20, 48149 Münster, Germany
>> -----------------------------------------------
>> Tel: +49-251-70365-324; Fax: +49-251-70365-399
>> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
>>
>> http://www.mpi-muenster.mpg.de/43241/cojocaru
>>

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:12 CST