Re: NAMD performance on a supercomputer with Intel Xeon Platinum 8160 and 100Gb Intel Omni-Path Full-Fat Tree

From: David Hardy (dhardy_at_ks.uiuc.edu)
Date: Mon Dec 06 2021 - 16:51:36 CST

Sorry to have come to this thread late.

I see that no one has suggested any pemap/commap settings. For Frontera (Intel Xeon Platinum 8280, so 28 cores per socket, 56 per node), we see best results running 4 ranks (processes) per node:

     +ppn 13 +pemap 4-55:2,5-55:2 +commap 0,2,1,3

For Intel Xeon Platinum 8160, I would similarly try 4 ranks per node for the 48 cores:

     +ppn 11 +pemap 4-47:2,5-47:2 +commap 0,2,1,3
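For clarity, pemap/commap use start-end:stride notation, so "+pemap 4-47:2,5-47:2" pins the 44 PE threads to cores 4,6,...,46 and then 5,7,...,47, while "+commap 0,2,1,3" puts the four communication threads on the remaining cores 0-3. You can sanity-check an expansion with plain seq:

```shell
# Expand the two range:stride groups in "+pemap 4-47:2,5-47:2".
# seq FIRST INCREMENT LAST prints an inclusive arithmetic sequence.
seq 4 2 47 | tr '\n' ' '; echo   # cores 4 6 8 ... 46
seq 5 2 47 | tr '\n' ' '; echo   # cores 5 7 9 ... 47
```

Together that is 44 cores, matching 4 ranks x 11 PE threads, with cores 0-3 left free for the comm threads.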

Also, in NAMD 2.15 we have additional optimizations for Intel CPUs. There is an AVX512 build that introduces special kernels, based on a port of the CUDA tile-list kernels to AVX-512, offering up to 2x the performance of the traditional CPU kernels. There is also an SKX (Skylake) build that improves the traditional CPU kernel performance using OpenMP directives. I mention both because the AVX512 build is not compatible with all features: attempting to use it with an unsupported feature will either fail or fall back to a slower implementation. The SKX build, by contrast, should be compatible with all features.

You can build these versions of NAMD as follows:

     ./config Linux-AVX512-icc ...

     ./config Linux-SKX-icc ...

and you have to use Intel's compiler to build (we have been using version 19.0.0.117 20180804 in house and 19.1.1.217 20200306 on Frontera).
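Put together, a full build of the AVX512 binary looks roughly like this. This is only a sketch: the Charm++ tree, version, and resulting charm-arch name are placeholders to adjust for your site, and it assumes the Intel compilers are already on your PATH.

```shell
# Build Charm++ first, then configure and compile the AVX512 NAMD binary.
cd charm-v6.10.2                          # placeholder Charm++ tree
./build charm++ verbs-linux-x86_64 icc smp --with-production
cd ../namd
./config Linux-AVX512-icc --charm-arch verbs-linux-x86_64-icc-smp
make -C Linux-AVX512-icc -j8
```

The same steps with `./config Linux-SKX-icc` give you the SKX binary for comparison.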

Here are the results from a 2-node test I recently ran with STMV (~1M atoms) on Frontera to illustrate the difference in performance you might see between these versions:

     BUILD     ns/day    speedup
     x86_64    1.89547   1.0000
     SKX       3.17095   1.6729
     AVX512    4.17481   2.2025

Hope this helps,
Dave

--
David J. Hardy, Ph.D.
Beckman Institute
University of Illinois at Urbana-Champaign
405 N. Mathews Ave., Urbana, IL 61801
dhardy_at_ks.uiuc.edu, http://www.ks.uiuc.edu/~dhardy/
> On Nov 30, 2021, at 12:25 PM, James M Davis <jmdavis1_at_vcu.edu> wrote:
> 
> I should have said: build from scratch to support Omni-Path, or use the Charm++ MPI layer with OpenMPI. There will still be testing involved to tweak the setup and the performance, and it still might not scale to 48 nodes x 48 cores, but you should be able to get higher than 48 x 24.
> ----
> Mike Davis
> Technical Director: High Performance Research Computing 
> Virginia Commonwealth University
> (804) 828-3885 (o) • (804) 307-3428(c)
> https://chipc.vcu.edu
> 
> 
> On Tue, Nov 30, 2021 at 1:07 PM James M Davis <jmdavis1_at_vcu.edu> wrote:
> A few notes from the 2.15 release notes. I think you will need to build from scratch for Omnipath. 
> 
> "Intel Omni-Path networks are incompatible with the pre-built verbs NAMD binaries. Charm++ for verbs can be built with --with-qlogic to support Omni-Path, but the Charm++ MPI network layer performs better than the verbs layer. Hangs have been observed with Intel MPI but not with OpenMPI, so OpenMPI is preferred. See "Compiling NAMD" below for MPI build instructions. NAMD MPI binaries may be launched directly with mpiexec rather than via the provided charmrun script."
> https://www.ks.uiuc.edu/Research/namd/cvs/notes.html
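> Concretely, the MPI route from the notes might look like this. This is only a sketch: the module names, Charm++ tree, and rank counts are placeholders for whatever your site provides.

```shell
# Build Charm++ on the MPI machine layer with OpenMPI (Intel MPI has shown hangs),
# then build NAMD against it and launch with mpiexec instead of charmrun.
module load openmpi intel                      # placeholder, site-specific
cd charm-v6.10.2                               # placeholder Charm++ tree
./build charm++ mpi-linux-x86_64 smp --with-production
cd ../namd
./config Linux-x86_64-icc --charm-arch mpi-linux-x86_64-smp
make -C Linux-x86_64-icc -j8
mpiexec -n 4 Linux-x86_64-icc/namd2 +ppn 23 stmv.namd   # e.g. 2 nodes x 2 ranks
```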
> ----
> Mike Davis
> Technical Director: High Performance Research Computing 
> Virginia Commonwealth University
> (804) 828-3885 (o) • (804) 307-3428(c)
> https://chipc.vcu.edu
> 
> 
> On Tue, Nov 30, 2021 at 1:00 PM Vermaas, Josh <vermaasj_at_msu.edu> wrote:
> Hi Vlad,
> 
>  
> 
> In addition to the great points Axel and Giacomo have made, I'd like to point out that the 8160 is a 24-core processor, and that there are likely 2 of them on a given node. In these two-socket configurations, with two physical CPU dies, I've often found that the best performance comes from treating each socket as its own node and allocating 2x as many "tasks" as you have nodes, so that each SMP task gets placed on its own socket. If you don't, each node tries to get all 48 cores across both sockets to work together, which saturates the UPI links between the sockets and can be detrimental to performance.
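> In SLURM terms, that layout looks something like the following (a sketch only; the binary path, input file, and node count are placeholders):

```shell
# Two tasks per node, 24 cores each, bound to sockets, so each SMP rank
# stays on one die instead of saturating the UPI links between sockets.
srun --nodes=4 --ntasks-per-node=2 --cpus-per-task=24 --cpu-bind=sockets \
     ./namd2 +ppn 23 +setcpuaffinity stmv.namd
```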
> 
>  
> 
> This is usually a bigger problem for SMP-based builds. In my experience, CPU-only systems benefit from MPI-based builds, where the number of tasks equals the number of CPU cores. Usually this is a performance win for modestly sized systems, at the expense of scalability for really big systems.
> 
>  
> 
> -Josh
> 
>  
> 
> From: <owner-namd-l_at_ks.uiuc.edu> on behalf of Vlad Cojocaru <vlad.cojocaru_at_mpi-muenster.mpg.de>
> Organization: MPI Muenster
> Reply-To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, Vlad Cojocaru <vlad.cojocaru_at_mpi-muenster.mpg.de>
> Date: Tuesday, November 30, 2021 at 10:18 AM
> To: Giacomo Fiorin <giacomo.fiorin_at_gmail.com>, NAMD list <namd-l_at_ks.uiuc.edu>, Axel Kohlmeyer <akohlmey_at_gmail.com>
> Cc: HORIA-LEONARD BANCIU <horia.banciu_at_ubbcluj.ro>
> Subject: Re: namd-l: NAMD performance on a supercomputer with Intel Xeon Platinum 8160 and 100Gb Intel Omni-Path Full-Fat Tree
> 
>  
> 
> Thanks for your thoughts !
> 
> One thing that seemed weird during our tests on this site was that performance and parallel scaling degraded rapidly when using all 48 cores available per node (2 CPUs with 24 cores each). We actually saw negative scaling after as few as 16 nodes. With 47, 32, or 24 cores/node we got better parallel efficiency up to higher node counts, with the best efficiency obtained using just half of the cores available on each node (24). In the end, when running on 48 nodes, we achieved the most ns/day using 24 cores/node. However, the resources requested in the proposal had to be calculated assuming all 48 cores/node, regardless of how many we actually use.
> 
> I haven't experienced anything like this on other sites (similar systems, same configuration files); normally, using all available cores per node has given the best performance. So I am wondering whether there is anything obvious that could explain such behavior?
> 
> Best
> Vlad
> 
> On 11/30/21 15:33, Giacomo Fiorin wrote:
> 
> Something in addition to what Axel says (all of which is absolutely true, even the counter-intuitive part about making the single-node performance artificially slower to get through the bottom-most tier of technical review).
> 
>  
> 
> One possible issue to look at is how the cluster's network is utilized by other users/applications.  In a local cluster that I use, the InfiniBand network is also used by the nodes to access data storage and there are many other users processing MRI, cryo-EM or bioinformatics data (all embarrassingly-parallel by design).  So the InfiniBand network is constantly busy and does not necessarily offer very low latency for NAMD or other message-passing applications.
> 
>  
> 
> Something that helped in that case was building Charm++ on top of the UCX library instead of IBverbs directly.  I am wholly unfamiliar with the details of how UCX works, but in essence it provides better utilization of the network when the ratio of compute cores vs. network links is high.  If the cluster's staff has a copy of UCX, try that.  It wasn't easy to build, but it paid off specifically for those runs that were communication-bound.
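> For reference, building Charm++ on UCX looks roughly like this (a sketch; the Charm++ tree and the UCX install prefix are placeholders for whatever your cluster staff provides):

```shell
# Build Charm++ on the UCX machine layer instead of plain IBverbs.
cd charm-v6.10.2                          # placeholder Charm++ tree
./build charm++ ucx-linux-x86_64 smp --with-production \
        --basedir=/opt/ucx                # placeholder UCX install prefix
```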
> 
>  
> 
> The most significant addition in 2.15 is the AVX-512 tiles algorithm, which would help on high-end Intel CPUs like those: it makes the computation part faster, with the caveat that Axel mentioned.
> 
>  
> 
> Giacomo
> 
>  
> 
> On Tue, Nov 30, 2021 at 6:16 AM Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:
> 
> Actually, if you compile NAMD with better optimization than the system-provided executable, your parallel efficiency will go down. Recall Amdahl's law: parallel efficiency is determined by the ratio of time spent in parallel execution versus serial execution.
> 
>  
> 
> A better-optimized executable will spend even less time computing and thus incur proportionally more parallel overhead.
> 
>  
> 
> To get better parallel efficiency, you have to avoid or reduce all non-parallel operations (output, features like Tcl scripting), make your computation more expensive (by increasing the cutoff or the system size), or make the executable slower by compiling a less optimized version.
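> Amdahl's arithmetic makes this concrete. As an illustrative sketch (the 1% serial fraction here is made up for the example; a real NAMD run is far more parallel), even a parallel fraction of p = 0.99 drops efficiency below any 75% threshold well before 48 nodes x 48 cores = 2304 cores:

```shell
# Parallel efficiency under Amdahl's law: E(n) = S(n)/n, S(n) = 1/((1-p) + p/n),
# where p is the parallel fraction of the runtime (p = 0.99 is illustrative only).
awk -v p=0.99 'BEGIN { m = split("1 8 48 96 2304", nlist);
  for (i = 1; i <= m; i++) { n = nlist[i];
    printf "%5d cores: efficiency %.2f\n", n, (1/((1-p)+p/n))/n } }'
```

> With p = 0.99 this prints efficiencies of 1.00, 0.93, 0.68, 0.51, and 0.04. A faster executable effectively lowers p, which is why single-node speedups hurt the efficiency metric.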
> 
> --
> Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
> College of Science & Technology, Temple University, Philadelphia PA, USA
> International Centre for Theoretical Physics, Trieste, Italy
> 
>  
> 
> On Tue, Nov 30, 2021, 05:32 Vlad Cojocaru <vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:
> 
> Dear all,
> 
> We submitted a proposal to run some extensive atomistic simulations with NAMD, on systems ranging between 500 K and 2 M atoms, on a supercomputer with Intel Xeon Platinum 8160 processors and a 100 Gb Intel Omni-Path Full-Fat Tree interconnect.
> 
> Apparently, our project may fail the technical evaluation because during our tests we did not achieve 75% parallel efficiency between 2 and 48 nodes (each node has 2 CPUs, 24 cores/CPU). We tested the NAMD 2.14 provided by default at the site, and we do not know how it was built. Looking at the NAMD benchmarks available for the Frontera supercomputer (a quite similar architecture, if I understand correctly, though for larger systems), it seems we should achieve much better performance and parallel efficiency with NAMD 2.15 (maybe even 2.14) up to 48/64 nodes on this architecture than we actually did in our tests.
> 
> So, my reasoning is that the default build of NAMD was probably not carefully optimized.
> 
> I would appreciate it if anyone with experience building and optimizing NAMD on such an architecture could recommend a compiler/MPI/configuration and build options for better performance and parallel efficiency. If I have some clear ideas about how to optimize NAMD, maybe I can make the case for our project not to fail the technical evaluation.
> 
> Thank you very much for any advice
> 
> Best wishes
> Vlad
> 
> 
> 
> -- 
> Vlad Cojocaru, PD (Habil.), Ph.D.
> -----------------------------------------------
> Project Group Leader
> Department of Cell and Developmental Biology
> Max Planck Institute for Molecular Biomedicine
> Röntgenstrasse 20, 48149 Münster, Germany
> -----------------------------------------------
> Tel: +49-251-70365-324; Fax: +49-251-70365-399
> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
> http://www.mpi-muenster.mpg.de/43241/cojocaru
> 
> 
> 
> 
> 
> -- 
> Vlad Cojocaru, PD (Habil.), Ph.D.
> -----------------------------------------------
> Project Group Leader
> Department of Cell and Developmental Biology
> Max Planck Institute for Molecular Biomedicine
> Röntgenstrasse 20, 48149 Münster, Germany
> -----------------------------------------------
> Tel: +49-251-70365-324; Fax: +49-251-70365-399
> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
> http://www.mpi-muenster.mpg.de/43241/cojocaru

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:12 CST