From: James M Davis (jmdavis1_at_vcu.edu)
Date: Tue Nov 30 2021 - 12:25:46 CST
I should have said: build from scratch to support Omni-Path, or use the
Charm++ MPI layer with OpenMPI. There will still be testing involved to
tweak the system and the performance. It still might not scale to 48
nodes x 48 cores/node, but you should be able to get to something higher
than 48 x 24.
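
For reference, a rough sketch of that MPI build path. The build and
config commands mirror the documented steps ("Compiling NAMD" in the
release notes quoted below); the Charm++/NAMD versions and directory
names are placeholders, and the site's OpenMPI compiler wrappers
(mpicxx) are assumed to be on the PATH:

    # Charm++ on the MPI network layer, using the site's OpenMPI
    tar xf charm-6.10.2.tar.gz && cd charm-6.10.2
    ./build charm++ mpi-linux-x86_64 --with-production

    # NAMD built against that Charm++
    cd ../namd-2.14
    ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
    cd Linux-x86_64-g++ && make

    # MPI binaries may be launched directly with mpiexec rather than via
    # the charmrun script (per the release notes)
    mpiexec -n 1152 ./namd2 run.namd    # e.g., 48 nodes x 24 ranks/node
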
----
Mike Davis
Technical Director: High Performance Research Computing
Virginia Commonwealth University
(804) 828-3885 (o) • (804) 307-3428 (c)
https://chipc.vcu.edu

On Tue, Nov 30, 2021 at 1:07 PM James M Davis <jmdavis1_at_vcu.edu> wrote:

> A few notes from the 2.15 release notes. I think you will need to build
> from scratch for Omni-Path.
>
>> Intel Omni-Path networks are incompatible with the pre-built verbs NAMD
>> binaries. Charm++ for verbs can be built with --with-qlogic to support
>> Omni-Path, but the Charm++ MPI network layer performs better than the
>> verbs layer. Hangs have been observed with Intel MPI but not with
>> OpenMPI, so OpenMPI is preferred. See "Compiling NAMD" below for MPI
>> build instructions. NAMD MPI binaries may be launched directly with
>> mpiexec rather than via the provided charmrun script.
>
> https://www.ks.uiuc.edu/Research/namd/cvs/notes.html
>
> ----
> Mike Davis
> Technical Director: High Performance Research Computing
> Virginia Commonwealth University
> (804) 828-3885 (o) • (804) 307-3428 (c)
> https://chipc.vcu.edu
>
> On Tue, Nov 30, 2021 at 1:00 PM Vermaas, Josh <vermaasj_at_msu.edu> wrote:
>
>> Hi Vlad,
>>
>> In addition to the great points Axel and Giacomo have made, I’d like to
>> point out that the 8160 is a 24-core processor, and that there are
>> likely 2 of them on a given node. In these two-socket configurations,
>> where there are two physical CPU dies, I’ve often found that the best
>> performance is achieved when you treat each socket as its own node and
>> allocate 2x the number of “tasks” as you have nodes. That way, each SMP
>> task gets placed on its own socket. If you don’t, each node is trying
>> to get all 48 cores across both sockets to work together, which ends up
>> saturating the UPI links between the sockets and can be detrimental to
>> performance.
>>
>> This is usually a bigger problem for SMP-based builds. In my
>> experience, CPU-only systems benefit from MPI-based builds, where the
>> number of tasks is equal to the number of cores. Usually this is a
>> performance win for modestly sized systems, at the expense of
>> scalability for really big systems.
>>
>> -Josh
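
To illustrate the socket-per-task layout Josh describes, here is a
minimal Slurm sketch. It assumes an SMP build of NAMD that can be
launched with srun (an MPI-SMP build, for example); the node count,
binary name, and input file are placeholders:

    #!/bin/bash
    #SBATCH --nodes=4                # placeholder node count
    #SBATCH --ntasks-per-node=2      # one SMP task per socket
    #SBATCH --cpus-per-task=24       # 24-core Xeon Platinum 8160 per socket

    # One NAMD process per socket: 23 worker threads plus one
    # communication thread, pinned so each process stays on its socket.
    srun namd2 +ppn 23 +setcpuaffinity run.namd
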
>> From: <owner-namd-l_at_ks.uiuc.edu> on behalf of Vlad Cojocaru
>> <vlad.cojocaru_at_mpi-muenster.mpg.de>
>> Organization: MPI Muenster
>> Reply-To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, Vlad Cojocaru
>> <vlad.cojocaru_at_mpi-muenster.mpg.de>
>> Date: Tuesday, November 30, 2021 at 10:18 AM
>> To: Giacomo Fiorin <giacomo.fiorin_at_gmail.com>, NAMD list
>> <namd-l_at_ks.uiuc.edu>, Axel Kohlmeyer <akohlmey_at_gmail.com>
>> Cc: HORIA-LEONARD BANCIU <horia.banciu_at_ubbcluj.ro>
>> Subject: Re: namd-l: NAMD performance on a supercomputer with Intel
>> Xeon Platinum 8160 and 100Gb Intel Omni-Path Full-Fat Tree
>>
>> Thanks for your thoughts!
>>
>> One thing that seemed weird during our tests on this site was that the
>> performance and parallel scaling degraded rapidly when using all 48
>> cores available per node (2 CPUs with 24 cores each). We actually saw
>> negative scaling after as few as 16 nodes. Then, when using 47, 32, and
>> 24 cores/node, we got better parallel efficiency at higher node counts,
>> with the best efficiency obtained using just half of the cores
>> available on each node (24). In the end, when running on 48 nodes, we
>> achieved the most ns/day when using 24 cores/node. However, the
>> resources requested in the project had to be calculated using all 48
>> cores/node, regardless of how many we actually use.
>>
>> I haven't experienced anything like this on other sites (similar
>> systems, same configuration files). Normally, using all cores available
>> per node has always given the best performance. So, I am wondering
>> whether there is anything obvious that could explain such behavior?
>>
>> Best
>> Vlad
>>
>> On 11/30/21 15:33, Giacomo Fiorin wrote:
>>
>> Something in addition to what Axel says (all of which is absolutely
>> true, even the counter-intuitive part about making the single-node
>> performance artificially slower to get through the bottom-most tier of
>> technical review).
>>
>> One possible issue to look at is how the cluster's network is utilized
>> by other users/applications. In a local cluster that I use, the
>> InfiniBand network is also used by the nodes to access data storage,
>> and there are many other users processing MRI, cryo-EM, or
>> bioinformatics data (all embarrassingly parallel by design). So the
>> InfiniBand network is constantly busy and does not necessarily offer
>> very low latency for NAMD or other message-passing applications.
>>
>> Something that helped in that case was building Charm++ on top of the
>> UCX library instead of IBverbs directly (a build sketch follows after
>> this message). I am wholly unfamiliar with the details of how UCX
>> works, but in essence it provides better utilization of the network
>> when the ratio of compute cores vs. network links is high. If the
>> cluster's staff has a copy of UCX, try that. It wasn't easy to build,
>> but it paid off specifically for those runs that were
>> communication-bound.
>>
>> The main significant addition in 2.15 is the AVX-512 tiles algorithm,
>> which would help with the most expensive Intel CPUs like those, but it
>> would also make the computation part faster, with the caveat that Axel
>> mentioned.
>>
>> Giacomo
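
As a concrete version of the UCX suggestion above, a hedged build
sketch. The ucx-linux-x86_64 target exists in recent Charm++ releases,
but the versions, the process-management option (ompipmix here), and the
resulting Charm++ arch string are placeholders that depend on the site's
stack:

    # Charm++ with the UCX machine layer instead of IBverbs
    cd charm-6.10.2
    ./build charm++ ucx-linux-x86_64 ompipmix --with-production

    # NAMD on top of the UCX-backed Charm++
    cd ../namd-2.14
    ./config Linux-x86_64-g++ --charm-arch ucx-linux-x86_64-ompipmix
    cd Linux-x86_64-g++ && make
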
>> On Tue, Nov 30, 2021 at 6:16 AM Axel Kohlmeyer <akohlmey_at_gmail.com>
>> wrote:
>>
>> Actually, if you optimize how NAMD is compiled better than the
>> system-provided executable, your parallel efficiency will go down.
>> Please recall Amdahl's law: the parallel efficiency is determined by
>> the relation of time spent on parallel execution versus serial
>> execution.
>>
>> A better-optimized executable will spend even less time computing and
>> thus have more parallel overhead.
>>
>> To get better parallel efficiency, you have to avoid or reduce all
>> non-parallel operations like output or use of features like Tcl
>> scripting, or make your computations more expensive by increasing the
>> cutoff or the system size, or make the executable slower by compiling a
>> less optimized version.
>>
>> --
>> Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
>> College of Science & Technology, Temple University, Philadelphia PA, USA
>> International Centre for Theoretical Physics, Trieste, Italy
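
Axel's point can be made concrete with Amdahl's law; the serial
fractions below are invented purely for illustration. With serial
fraction s, the speedup on N nodes is S(N) = 1/(s + (1 - s)/N), so the
parallel efficiency is

    E(N) = S(N)/N = 1 / (s*N + (1 - s))

For s = 0.01, E(48) = 1/(0.48 + 0.99), about 68%. If a better-optimized
build halves the parallel compute time while the serial part stays
fixed, the effective serial fraction roughly doubles to s = 0.02, and
E(48) = 1/(0.96 + 0.98), about 52%: faster in absolute ns/day, yet worse
on an efficiency metric.
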
>> On Tue, Nov 30, 2021, 05:32 Vlad Cojocaru
>> <vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:
>>
>> Dear all,
>>
>> We submitted a proposal to run some extensive atomistic simulations
>> with NAMD of systems ranging between 500 K and 2 M atoms on a
>> supercomputer with Intel Xeon Platinum 8160 processors and a 100Gb
>> Intel Omni-Path Full-Fat Tree interconnect.
>>
>> Apparently, our project may fail the technical evaluation because
>> during our tests we did not achieve 75% parallel efficiency between 2
>> and 48 nodes (each node has 2 CPUs, 24 cores/CPU). We tested the NAMD
>> 2.14 provided by default at the site, and we do not know how it was
>> built. Looking at the NAMD benchmarks available for the Frontera
>> supercomputer (quite similar architecture if I understand it correctly,
>> but for larger systems), it seems we should achieve much better
>> performance and parallel efficiency with NAMD 2.15 (maybe even 2.14) up
>> to 48/64 nodes on this architecture than we actually achieved in our
>> tests.
>>
>> So, my reasoning is that the NAMD built by default was probably not
>> carefully optimized.
>>
>> I would appreciate it if anyone who has experience with building and
>> optimizing NAMD on such an architecture could recommend
>> compiler/MPI/configuration options for building a NAMD with better
>> performance and parallel efficiency. If I have some clear ideas about
>> how to optimize NAMD, maybe I could make the case for our project not
>> to fail the technical evaluation.
>>
>> Thank you very much for any advice.
>>
>> Best wishes
>> Vlad
>>
>> --
>> Vlad Cojocaru, PD (Habil.), Ph.D.
>> -----------------------------------------------
>> Project Group Leader
>> Department of Cell and Developmental Biology
>> Max Planck Institute for Molecular Biomedicine
>> Röntgenstrasse 20, 48149 Münster, Germany
>> -----------------------------------------------
>> Tel: +49-251-70365-324; Fax: +49-251-70365-399
>> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
>> http://www.mpi-muenster.mpg.de/43241/cojocaru
This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:12 CST