Re: NAMD performance on a supercomputer with Intel Xeon Platinum 8160 and 100Gb Intel Omni-Path Full-Fat Tree

From: James M Davis (jmdavis1_at_vcu.edu)
Date: Tue Nov 30 2021 - 12:07:32 CST

A few notes from the 2.15 release notes. I think you will need to build
from scratch for Omni-Path.

> Intel Omni-Path networks are incompatible with the pre-built verbs NAMD
> binaries. Charm++ for verbs can be built with --with-qlogic to support
> Omni-Path, but the Charm++ MPI network layer performs better than the verbs
> layer. Hangs have been observed with Intel MPI but not with OpenMPI, so
> OpenMPI is preferred. See "Compiling NAMD" below for MPI build
> instructions. NAMD MPI binaries may be launched directly with mpiexec
> rather than via the provided charmrun script.

https://www.ks.uiuc.edu/Research/namd/cvs/notes.html
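A build along those lines might look like the sketch below. This is only an outline, not site-specific instructions: the tarball names and version numbers are illustrative, and it assumes OpenMPI's compiler wrappers are already on your PATH.

```shell
# Sketch of an MPI-layer Charm++/NAMD build for Omni-Path.
# Assumes OpenMPI is loaded; archive names/versions are illustrative.
tar xf charm-6.10.2.tar && cd charm-6.10.2
# Build the Charm++ MPI network layer, which the release notes recommend
# over the verbs layer on Omni-Path:
./build charm++ mpi-linux-x86_64 --with-production
cd ..
tar xzf NAMD_2.14_Source.tar.gz && cd NAMD_2.14_Source
./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
cd Linux-x86_64-g++ && make -j8
# The resulting namd2 can then be launched directly with mpiexec,
# e.g.:  mpiexec -np 96 ./namd2 run.namd
```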

----
Mike Davis
Technical Director: High Performance Research Computing
Virginia Commonwealth University
(804) 828-3885 (o) • (804) 307-3428 (c)
https://chipc.vcu.edu
On Tue, Nov 30, 2021 at 1:00 PM Vermaas, Josh <vermaasj_at_msu.edu> wrote:
> Hi Vlad,
>
>
>
> In addition to the great points Axel and Giacomo have made, I’d like to
> point out that the 8160 is a 24-core processor, and that there are likely
> two of them on a given node. In these two-socket configurations, with two
> physical CPU dies, I’ve often found that the best performance is achieved
> when you treat each socket as its own node and allocate twice as many
> “tasks” as you have nodes. That way, each SMP task gets placed on its own
> socket. If you don’t, each node tries to get all 48 cores across both
> sockets to work together, which ends up saturating the UPI links between
> the sockets and can be detrimental to performance.
>
>
>
> This is usually a bigger problem for SMP-based builds. In my experience,
> CPU-only systems benefit from MPI-based builds, where the number of tasks
> is equal to the number of CPUs. Usually this is a performance win for
> modestly sized systems at the expense of scalability for really big systems.
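The socket-per-task placement described above could be requested from a Slurm scheduler roughly as follows. This is a hypothetical sketch: the script, input file, and binding flags are assumptions, and the exact binding syntax depends on your MPI implementation.

```shell
#!/bin/bash
# Hypothetical Slurm sketch: one SMP task per socket on 2 x 24-core nodes.
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2    # two tasks per node = one per socket
#SBATCH --cpus-per-task=24     # each task owns a full 24-core socket

# Bind each rank to its own socket so worker threads never straddle the
# UPI links between sockets (--bind-to socket is OpenMPI syntax):
mpiexec -np "$SLURM_NTASKS" --bind-to socket \
    namd2 +ppn 23 +setcpuaffinity run.namd   # 23 workers + 1 comm thread
```

With SMP builds, `+ppn` is usually set to one less than the cores per task so a core remains free for the Charm++ communication thread.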
>
>
>
> -Josh
>
>
>
> *From: *<owner-namd-l_at_ks.uiuc.edu> on behalf of Vlad Cojocaru <
> vlad.cojocaru_at_mpi-muenster.mpg.de>
> *Organization: *MPI Muenster
> *Reply-To: *"namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, Vlad Cojocaru <
> vlad.cojocaru_at_mpi-muenster.mpg.de>
> *Date: *Tuesday, November 30, 2021 at 10:18 AM
> *To: *Giacomo Fiorin <giacomo.fiorin_at_gmail.com>, NAMD list <
> namd-l_at_ks.uiuc.edu>, Axel Kohlmeyer <akohlmey_at_gmail.com>
> *Cc: *HORIA-LEONARD BANCIU <horia.banciu_at_ubbcluj.ro>
> *Subject: *Re: namd-l: NAMD performance on a supercomputer with Intel
> Xeon Platinum 8160 and 100Gb Intel Omni-Path Full-Fat Tree
>
>
>
> Thanks for your thoughts !
>
> One thing that seemed weird during our tests on this site was that the
> performance and parallel scaling degraded rapidly when using all 48 cores
> available per node (2 CPUs with 24 cores each). We actually saw negative
> scaling after as few as 16 nodes. Then, using 47, 32, and 24 cores/node,
> we got better parallel efficiency up to higher node counts, with the best
> efficiency obtained using just half of the cores available on each node
> (24). In the end, when running on 48 nodes, we achieved the most ns/day
> using 24 cores/node. However, the resources requested in the proposal had
> to be calculated as if we were using all 48 cores/node, regardless of how
> many we actually use.
>
> I haven't experienced anything like this on other sites (similar systems,
> same configuration files). Normally, using all cores available per node
> has always given the best performance. So I am wondering whether there is
> anything obvious that could explain such behavior?
>
> Best
> Vlad
>
> On 11/30/21 15:33, Giacomo Fiorin wrote:
>
> Something in addition to what Axel says (all of which is absolutely true,
> even the counter-intuitive part about making the single-node performance
> artificially slower to get through the bottom-most tier of technical
> review).
>
>
>
> One possible issue to look at is how the cluster's network is utilized by
> other users/applications.  In a local cluster that I use, the InfiniBand
> network is also used by the nodes to access data storage and there are many
> other users processing MRI, cryo-EM or bioinformatics data (all
> embarrassingly-parallel by design).  So the InfiniBand network is
> constantly busy and does not necessarily offer very low latency for NAMD or
> other message-passing applications.
>
>
>
> Something that helped in that case was building Charm++ on top of the UCX
> library instead of IBverbs directly.  I am wholly unfamiliar with the
> details of how UCX works, but in essence it provides better utilization of
> the network when the ratio of compute cores vs. network links is high.  If
> the cluster's staff has a copy of UCX, try that.  It wasn't easy to build,
> but it paid off specifically for those runs that were communication-bound.
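If UCX is available, switching Charm++ to it is essentially a different build target. A sketch only; the process-launcher option here assumes OpenMPI with PMIx and should be matched to your site's launcher:

```shell
# Sketch: build Charm++ on the UCX machine layer instead of verbs.
# 'ompipmix' assumes OpenMPI/PMIx is used for process launching;
# a Slurm site might use 'slurmpmi2' instead.
./build charm++ ucx-linux-x86_64 ompipmix --with-production
```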
>
>
>
> The most significant addition in 2.15 is the AVX-512 tiles algorithm,
> which would help with the most expensive Intel CPUs like those, but would
> also make the computation part faster, with the caveat that Axel mentioned.
>
>
>
> Giacomo
>
>
>
> On Tue, Nov 30, 2021 at 6:16 AM Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:
>
> Actually, if you compile NAMD with better optimization than the
> system-provided executable, your parallel efficiency will go down. Please
> recall Amdahl's law: parallel efficiency is determined by the ratio of
> time spent in parallel execution to time spent in serial execution.
>
>
>
> A better-optimized executable will spend even less time computing, so the
> parallel overhead accounts for a larger fraction of the runtime.
>
>
>
> To get better parallel efficiency, you have to avoid or reduce all
> non-parallel operations (such as output or Tcl scripting), make your
> computation more expensive by increasing the cutoff or the system size,
> or make the executable slower by compiling a less optimized version.
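Axel's point is easy to quantify with Amdahl's law. A small sketch; the 1% serial fraction below is an assumed number, chosen only to illustrate the scale of the effect:

```shell
# Amdahl's law: speedup S(N) = 1 / (s + (1 - s)/N) for serial fraction s
# on N cores. Even s = 1% caps efficiency on 48 cores well below 75%:
# S(48) ≈ 32.65, i.e. about 68% efficiency. Speeding up only the parallel
# part effectively raises s, pushing efficiency lower still.
amdahl() { awk -v s="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / (s + (1 - s) / n) }'; }
amdahl 0.01 48    # → 32.65
```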
>
> --
> Dr. Axel Kohlmeyer akohlmey_at_gmail.com http://goo.gl/1wk0
> College of Science & Technology, Temple University, Philadelphia PA, USA
> International Centre for Theoretical Physics, Trieste, Italy
>
>
>
> On Tue, Nov 30, 2021, 05:32 Vlad Cojocaru <
> vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:
>
> Dear all,
>
> We submitted a proposal to run some extensive atomistic simulations with
> NAMD of systems ranging from 500 K to 2 M atoms on a supercomputer with
> Intel Xeon Platinum 8160 processors and a 100 Gb Intel Omni-Path full
> fat-tree interconnect.
>
> Apparently, our project may fail the technical evaluation because during
> our tests we did not achieve 75% parallel efficiency between 2 and 48
> nodes (each node has 2 CPUs, 24 cores/CPU). We tested the NAMD 2.14
> provided by default at the site, and we do not know how it was built.
> Looking at the NAMD benchmarks available for the Frontera supercomputer
> (quite similar architecture, if I understand correctly, but for larger
> systems), it seems we should achieve much better performance and parallel
> efficiency with NAMD 2.15 (maybe even 2.14) up to 48/64 nodes on this
> architecture than we actually did in our tests.
>
> So, my reasoning is that the default NAMD build was probably not
> carefully optimized.
>
> I would appreciate it if anyone with experience building and optimizing
> NAMD on such an architecture could recommend compiler/MPI/configuration
> options for building a NAMD with better performance and parallel
> efficiency. If I have some clear ideas about how to optimize NAMD, maybe
> I can make the case for our project not to fail the technical evaluation.
>
> Thank you very much for any advice
>
> Best wishes
> Vlad
>
>
>
> --
> Vlad Cojocaru, PD (Habil.), Ph.D.
> -----------------------------------------------
> Project Group Leader
> Department of Cell and Developmental Biology
> Max Planck Institute for Molecular Biomedicine
> Röntgenstrasse 20, 48149 Münster, Germany
> -----------------------------------------------
> Tel: +49-251-70365-324; Fax: +49-251-70365-399
> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
>
> http://www.mpi-muenster.mpg.de/43241/cojocaru
>
>
>
> --
> Vlad Cojocaru, PD (Habil.), Ph.D.
> -----------------------------------------------
> Project Group Leader
> Department of Cell and Developmental Biology
> Max Planck Institute for Molecular Biomedicine
> Röntgenstrasse 20, 48149 Münster, Germany
> -----------------------------------------------
> Tel: +49-251-70365-324; Fax: +49-251-70365-399
> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
> http://www.mpi-muenster.mpg.de/43241/cojocaru
>

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:12 CST