From: James M Davis (jmdavis1_at_vcu.edu)
Date: Tue Nov 30 2021 - 12:07:32 CST
A few notes from the 2.15 release notes. I think you will need to build
from scratch for Omni-Path.

> "Intel Omni-Path networks are incompatible with the pre-built verbs NAMD
> binaries. Charm++ for verbs can be built with --with-qlogic to support
> Omni-Path, but the Charm++ MPI network layer performs better than the verbs
> layer. Hangs have been observed with Intel MPI but not with OpenMPI, so
> OpenMPI is preferred. See "Compiling NAMD" below for MPI build
> instructions. NAMD MPI binaries may be launched directly with mpiexec
> rather than via the provided charmrun script."

----
Mike Davis
Technical Director: High Performance Research Computing
Virginia Commonwealth University
(804) 828-3885 (o) • (804) 307-3428 (c)
https://urldefense.com/v3/__https://chipc.vcu.edu__;!!DZ3fjg!tBqcYe25cf24sZ9_e4pE4S0MfI7TUwx4UR9D0O_0i4e4sXoGWogadgTbIibB33t_1A$

On Tue, Nov 30, 2021 at 1:00 PM Vermaas, Josh <vermaasj_at_msu.edu> wrote:

> Hi Vlad,
>
> In addition to the great points Axel and Giacomo have made, I’d like to
> point out that the 8160 is a 24-core processor, and that there are likely 2
> of them on a given node. In these two-socket configurations, where there
> are two physical CPU dies, I’ve often found that the best performance is
> achieved when you treat each socket as its own node, and allocate twice as
> many “tasks” as you have nodes. That way, each SMP task gets placed on
> its own socket. If you don’t, each node is trying to get all 48 cores
> across both sockets to work together, which ends up saturating the UPI
> links between the sockets, and can be detrimental to performance.
>
> This is usually a bigger problem for SMP-based builds. In my experience,
> CPU-only systems benefit from MPI-based builds, where the number of tasks
> is equal to the number of CPUs. Usually this is a performance win for
> modestly sized systems at the expense of scalability for really big systems.
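To make the arithmetic behind the one-task-per-socket suggestion concrete, here is a minimal Python sketch. The function names are hypothetical, and the core ranges assume that core IDs are numbered contiguously within each socket, which is not guaranteed on a real machine (check with lscpu or hwloc before pinning anything). It only computes a layout; it does not launch NAMD.

```python
def socket_layout(nodes, sockets_per_node=2, cores_per_socket=24):
    """Return (tasks, cores_per_task) when each socket is treated as its own node.

    Defaults match the dual-socket Xeon Platinum 8160 nodes described in the thread.
    """
    tasks = nodes * sockets_per_node
    return tasks, cores_per_socket


def core_ranges(sockets_per_node=2, cores_per_socket=24):
    """Inclusive (first, last) core IDs for each task on one node.

    ASSUMPTION: core IDs are contiguous per socket (0-23 on socket 0, 24-47 on
    socket 1). Real systems may interleave IDs, so verify before relying on this.
    """
    return [
        (s * cores_per_socket, (s + 1) * cores_per_socket - 1)
        for s in range(sockets_per_node)
    ]


if __name__ == "__main__":
    tasks, cores = socket_layout(48)
    print(f"48 nodes -> {tasks} tasks with {cores} cores each")
    print("per-node core ranges:", core_ranges())
```

For the 48-node case discussed in the thread this gives 96 tasks of 24 cores each, i.e. twice as many tasks as nodes.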
>
> -Josh
>
> From: <owner-namd-l_at_ks.uiuc.edu> on behalf of Vlad Cojocaru
> <vlad.cojocaru_at_mpi-muenster.mpg.de>
> Organization: MPI Muenster
> Reply-To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, Vlad Cojocaru
> <vlad.cojocaru_at_mpi-muenster.mpg.de>
> Date: Tuesday, November 30, 2021 at 10:18 AM
> To: Giacomo Fiorin <giacomo.fiorin_at_gmail.com>, NAMD list
> <namd-l_at_ks.uiuc.edu>, Axel Kohlmeyer <akohlmey_at_gmail.com>
> Cc: HORIA-LEONARD BANCIU <horia.banciu_at_ubbcluj.ro>
> Subject: Re: namd-l: NAMD performance on a supercomputer with Intel
> Xeon Platinum 8160 and 100Gb Intel Omni-Path Full-Fat Tree
>
> Thanks for your thoughts!
>
> One thing that seemed weird during our tests on this site was that the
> performance and parallel scaling rapidly degraded when using all 48 cores
> available per node (2 CPUs with 24 cores each). We actually saw negative
> scaling after as few as 16 nodes. Then, when using 47, 32, and 24
> cores/node, we got better parallel efficiency up to higher node counts, with
> the best efficiency obtained using just half of the cores available on each
> node (24). In the end, when running on 48 nodes, we achieved the most
> ns/day when using 24 cores/node. However, we had to calculate the resources
> requested in the project using all 48 cores/node, regardless of how
> many we actually use.
>
> I haven't experienced anything like this on other sites (similar systems,
> same configuration files). Normally using all cores available per node has
> always given the best performance. So, I am wondering whether there is
> anything obvious that could explain such behavior?
>
> Best
> Vlad
>
> On 11/30/21 15:33, Giacomo Fiorin wrote:
>
> Something in addition to what Axel says (all of which is absolutely true,
> even the counter-intuitive part about making the single-node performance
> artificially slower to get through the bottom-most tier of technical
> review).
>
> One possible issue to look at is how the cluster's network is utilized by
> other users/applications. In a local cluster that I use, the InfiniBand
> network is also used by the nodes to access data storage and there are many
> other users processing MRI, cryo-EM or bioinformatics data (all
> embarrassingly-parallel by design). So the InfiniBand network is
> constantly busy and does not necessarily offer very low latency for NAMD or
> other message-passing applications.
>
> Something that helped in that case was building Charm++ on top of the UCX
> library instead of IBverbs directly. I am wholly unfamiliar with the
> details of how UCX works, but in essence it provides better utilization of
> the network when the ratio of compute cores vs. network links is high. If
> the cluster's staff has a copy of UCX, try that. It wasn't easy to build,
> but it paid off specifically for those runs that were communication-bound.
>
> The main significant addition in 2.15 is the AVX-512 tiles algorithm,
> which would help with the most expensive Intel CPUs like those, but would
> also make the computation part faster with the caveat that Axel mentioned.
>
> Giacomo
>
> On Tue, Nov 30, 2021 at 6:16 AM Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:
>
> Actually, if you optimize how NAMD is compiled better than the
> system-provided executable, your parallel efficiency will go down. Please
> recall Amdahl's law: the parallel efficiency is determined by the ratio of
> time spent on parallel execution to time spent on serial execution.
>
> A better optimized executable will spend even less time computing and
> thus have more parallel overhead.
>
> To get better parallel efficiency, you have to avoid or reduce all
> non-parallel operations like output or use of features like Tcl scripting, or
> make your computations more expensive by increasing the cutoff or the
> system size, or make the executable slower by compiling a less optimized
> version.
>
> --
> Dr. Axel Kohlmeyer akohlmey_at_gmail.com https://urldefense.com/v3/__http://goo.gl/1wk0__;!!DZ3fjg!tBqcYe25cf24sZ9_e4pE4S0MfI7TUwx4UR9D0O_0i4e4sXoGWogadgTbIib9XSAtIQ$
> College of Science & Technology, Temple University, Philadelphia PA, USA
> International Centre for Theoretical Physics, Trieste, Italy
>
> On Tue, Nov 30, 2021, 05:32 Vlad Cojocaru
> <vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:
>
> Dear all,
>
> We submitted a proposal to run some extensive atomistic simulations with
> NAMD of systems ranging from 500 K to 2 M atoms on a supercomputer
> with Intel Xeon Platinum 8160 processors and a 100Gb Intel Omni-Path
> Full-Fat Tree interconnect.
>
> Apparently, our project may fail the technical evaluation because during
> our tests we did not achieve a 75 % parallel efficiency between 2 and 48
> nodes (each node has 2 CPUs - 24 cores/CPU). We have tested the NAMD
> 2.14 provided by default at the site and we do not know how this was
> built. Looking at the NAMD benchmarks available for the Frontera
> supercomputer (quite similar architecture if I understand it correctly
> but for larger systems), it seems we should definitely achieve with NAMD
> 2.15 (maybe even 2.14) much better performance and parallel efficiency
> up to 48/64 nodes on this architecture than we actually achieved in our
> tests.
>
> So, my reasoning is that the default NAMD build was probably not
> very carefully optimized.
>
> I would appreciate it if anyone who has experience with building and
> optimizing NAMD on such an architecture could recommend any
> compiler/MPI/configuration/options for building NAMD with better
> performance and parallel efficiency. If I have some clear ideas about
> how to optimize NAMD, maybe I could make the case for our project to not
> fail the technical evaluation.
>
> Thank you very much for any advice
>
> Best wishes
> Vlad
>
> --
> Vlad Cojocaru, PD (Habil.), Ph.D.
> -----------------------------------------------
> Project Group Leader
> Department of Cell and Developmental Biology
> Max Planck Institute for Molecular Biomedicine
> Röntgenstrasse 20, 48149 Münster, Germany
> -----------------------------------------------
> Tel: +49-251-70365-324; Fax: +49-251-70365-399
> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
> https://urldefense.com/v3/__http://www.mpi-muenster.mpg.de/43241/cojocaru__;!!DZ3fjg!ouau8vpkIDbQ8KrgRCSrc8Ng4YRHk1w7tQfeHsxoB5VnnkEQuC3CQj5uCvq0Gx8Paw$
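
For anyone wanting to check benchmark numbers against the 75 % criterion mentioned in the thread, the following is a minimal Python sketch. It assumes that efficiency is measured relative to the smallest run (for example 2 nodes) using throughput in ns/day; the site's exact definition is not given in the thread. The second function is a deliberately simple Amdahl-style model (a fixed non-parallel overhead plus perfectly divisible compute time) to illustrate Axel's point; all names and numbers are hypothetical, not measurements.

```python
def parallel_efficiency(base_nodes, base_rate, nodes, rate):
    """Efficiency of a run on `nodes` relative to a baseline run.

    Rates are throughputs (e.g. ns/day). ASSUMPTION: ideal scaling means the
    rate grows linearly with node count from the baseline.
    """
    ideal_rate = base_rate * nodes / base_nodes
    return rate / ideal_rate


def amdahl_efficiency(t_parallel, t_overhead, workers):
    """Efficiency under the toy model time(n) = t_overhead + t_parallel / n."""
    t_one = t_overhead + t_parallel
    t_n = t_overhead + t_parallel / workers
    speedup = t_one / t_n
    return speedup / workers


if __name__ == "__main__":
    # Hypothetical throughputs: 10 ns/day on 2 nodes, 180 ns/day on 48 nodes.
    print(parallel_efficiency(2, 10.0, 48, 180.0))

    # Same fixed overhead, but the compute part twice as fast (a better
    # optimized binary): the modelled efficiency drops.
    print(amdahl_efficiency(100.0, 1.0, 24))
    print(amdahl_efficiency(50.0, 1.0, 24))
```

In this model, halving the compute time while the non-parallel overhead stays fixed lowers the efficiency even though every run finishes sooner, which is the counter-intuitive effect described above.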
This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:12 CST