From: Vlad Cojocaru (vlad.cojocaru_at_mpi-muenster.mpg.de)
Date: Tue Nov 30 2021 - 09:14:04 CST
Thanks for your thoughts!
One thing that seemed weird during our tests on this site was that the
performance and parallel scaling degraded rapidly when using all 48
cores available per node (2 CPUs with 24 cores each). We actually saw
negative scaling after as few as 16 nodes. Then, when using 47, 32, and
24 cores/node, we got better parallel efficiency up to higher node counts,
with the best efficiency obtained using just half of the cores available
on each node (24). In the end, when running on 48 nodes, we achieved the
most ns/day when using 24 cores/node. However, we had to calculate the
resources requested in the project on the basis of all 48 cores/node,
regardless of how many we are actually using.
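For reference, the parallel efficiency discussed here can be computed from the measured ns/day relative to the smallest run. The sketch below is only an illustration: the function names and the ns/day figures are made up and are not measurements from these tests. It also shows the node-hours charged per simulated ns when whole nodes are billed regardless of how many cores are used.

```python
def parallel_efficiency(base_nodes, base_ns_day, nodes, ns_day):
    """Efficiency of a run relative to a baseline run on fewer nodes."""
    speedup = ns_day / base_ns_day   # measured speedup over the baseline
    ideal = nodes / base_nodes       # speedup expected with perfect scaling
    return speedup / ideal


def node_hours_per_ns(nodes, ns_day):
    """Node-hours billed per simulated ns when whole nodes are charged."""
    return nodes * 24.0 / ns_day


# Hypothetical numbers, for illustration only (not benchmark results).
baseline = (2, 10.0)    # (nodes, ns/day)
large_run = (48, 150.0)

eff = parallel_efficiency(*baseline, *large_run)
cost = node_hours_per_ns(*large_run)
print(f"efficiency: {eff:.3f}, node-hours per ns: {cost:.2f}")
print("meets 75% threshold:", eff >= 0.75)
```

Because the charge depends only on the node count, a run with 24 cores/node that delivers more ns/day is also cheaper per ns than a slower run with 48 cores/node on the same nodes.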
I haven't experienced anything like this on other sites (similar
systems, same configuration files). Until now, using all cores available
per node has always given the best performance. So, I am wondering
whether there is anything obvious that could explain such behavior?
On 11/30/21 15:33, Giacomo Fiorin wrote:
> Something in addition to what Axel says (all of which is absolutely
> true, even the counter-intuitive part about making the single-node
> performance artificially slower to get through the bottom-most tier of
> technical review).
> One possible issue to look at is how the cluster's network is utilized
> by other users/applications. In a local cluster that I use, the
> InfiniBand network is also used by the nodes to access data storage
> and there are many other users processing MRI, cryo-EM or
> bioinformatics data (all embarrassingly-parallel by design). So the
> InfiniBand network is constantly busy and does not necessarily offer
> very low latency for NAMD or other message-passing applications.
> Something that helped in that case was building Charm++ on top of the
> UCX library instead of IBverbs directly. I am wholly unfamiliar with
> the details of how UCX works, but in essence it provides better
> utilization of the network when the ratio of compute cores vs. network
> links is high. If the cluster's staff has a copy of UCX, try that.
> It wasn't easy to build, but it paid off specifically for those runs
> that were communication-bound.
> The main significant addition in 2.15 is the AVX-512 tiles algorithm,
> which would help with the most expensive Intel CPUs like those, but
> would also make the computation part faster with the caveat that Axel
> On Tue, Nov 30, 2021 at 6:16 AM Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:
> Actually, if you optimize how NAMD is compiled better than the
> system-provided executable, your parallel efficiency will go down.
> Please recall Amdahl's law: the parallel efficiency is determined
> by the ratio of time spent on parallel execution to time spent on
> serial execution.
> A better optimized executable will spend even less time computing
> and thus have relatively more parallel overhead.
> To get better parallel efficiency, you have to avoid or reduce all
> non-parallel operations, such as output or features like Tcl
> scripting; make your computations more expensive by increasing
> the cutoff or the system size; or make the executable slower by
> compiling a less optimized version.
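Amdahl's law as described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a fixed serial time per step and a perfectly parallel remainder; the function names and the time values are hypothetical, not NAMD measurements.

```python
def wall_time(serial_time, parallel_time, workers):
    """Time per step on `workers` cores: only the parallel part is divided."""
    return serial_time + parallel_time / workers


def amdahl_efficiency(serial_time, parallel_time, workers):
    """Parallel efficiency (speedup / workers) predicted by Amdahl's law."""
    t1 = wall_time(serial_time, parallel_time, 1)
    tn = wall_time(serial_time, parallel_time, workers)
    return (t1 / tn) / workers


# Hypothetical timings in arbitrary units: same serial part, but the
# "optimized" build halves the time of the parallel (compute) part.
serial = 1.0
default_build = 99.0
optimized_build = 49.5
workers = 48

eff_default = amdahl_efficiency(serial, default_build, workers)
eff_optimized = amdahl_efficiency(serial, optimized_build, workers)

print(f"default build:   efficiency {eff_default:.3f}")
print(f"optimized build: efficiency {eff_optimized:.3f}")
```

In this toy model the optimized build finishes each step sooner, yet its parallel efficiency is lower, because the unchanged serial part makes up a larger share of the run.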
> Dr. Axel Kohlmeyer akohlmey_at_gmail.com https://urldefense.com/v3/__http://goo.gl/1wk0__;!!DZ3fjg!vqb8Aw2vjDMitJSEPbF74htFKf2mY9JVQk1mwfueoHqSLJz36Td3erIQ1SNH2zl4Jw$
> College of Science & Technology, Temple University, Philadelphia
> PA, USA
> International Centre for Theoretical Physics, Trieste, Italy
> On Tue, Nov 30, 2021, 05:32 Vlad Cojocaru
> <vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:
> Dear all,
> We submitted a proposal to run some extensive atomistic
> simulations with
> NAMD of systems ranging from 500 K to 2 M atoms on a
> with Intel Xeon Platinum 8160 processors and 100Gb Intel
> Full-Fat Tree interconnection.
> Apparently, our project may fail the technical evaluation because
> during our tests we did not achieve 75% parallel efficiency between
> 2 and 48 nodes (each node has 2 CPUs with 24 cores/CPU). We have
> tested the NAMD 2.14 provided by default at the site and we do not
> know how this was built.
> Looking at the NAMD benchmarks available for the Frontera
> supercomputer (quite similar architecture if I understand it
> but for larger systems), it seems we should definitely achieve
> with NAMD
> 2.15 (maybe even 2.14) much better performance and parallel
> up to 48/64 nodes on this architecture than we actually
> achieved in our
> So, my reasoning is that probably the NAMD built by default was not
> really carefully optimized.
> I would appreciate it if anyone who has experience with building and
> optimizing NAMD on such an architecture could recommend any
> compiler/MPI/configuration/options for building an NAMD with a
> performance and parallel efficiency. If I have some clear ideas about
> how to optimize NAMD, maybe I could make the case for our project to
> not fail the technical evaluation.
> Thank you very much for any advice
> Best wishes
> Vlad Cojocaru, PD (Habil.), Ph.D.
> Project Group Leader
> Department of Cell and Developmental Biology
> Max Planck Institute for Molecular Biomedicine
> Röntgenstrasse 20, 48149 Münster, Germany
> Tel: +49-251-70365-324; Fax: +49-251-70365-399
> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
--
Vlad Cojocaru, PD (Habil.), Ph.D.
-----------------------------------------------
Project Group Leader
Department of Cell and Developmental Biology
Max Planck Institute for Molecular Biomedicine
Röntgenstrasse 20, 48149 Münster, Germany
-----------------------------------------------
Tel: +49-251-70365-324; Fax: +49-251-70365-399
Email: vlad.cojocaru[at]mpi-muenster.mpg.de
https://urldefense.com/v3/__http://www.mpi-muenster.mpg.de/43241/cojocaru__;!!DZ3fjg!vqb8Aw2vjDMitJSEPbF74htFKf2mY9JVQk1mwfueoHqSLJz36Td3erIQ1SMxxRzYIA$
This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:12 CST