Re: Occasional performance slow down using NAMD with Xeon Phi

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Apr 04 2016 - 01:47:27 CDT

I can confirm that we have also observed such slowdowns when using accelerators. In our case, we noticed a more or less linear drift of the time/step when using GPUs. One can overcome this by setting ldbperiod to a very large number of steps, so that NAMD never does any load balancing after the initial one. This is, of course, only a workaround for well-behaved systems without much fluctuation in local density. I can imagine that implicit-solvent systems, or systems with an uneven density distribution, will suffer from the missing load balancing.
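
As a minimal sketch of the ldbperiod workaround, the fragment below could be added to a NAMD configuration file. The value is an assumption chosen only for illustration (it is not a number from this thread); it simply needs to exceed the total number of steps in the run so that no second load-balancing phase is reached.

```
# Sketch only: push periodic load balancing beyond the end of the run.
# 100000000 is an illustrative value, assumed to be larger than the
# total number of steps simulated; adjust it to your own run length.
ldbPeriod    100000000
```

The other load-balancing settings are left at their defaults here, so the initial load balancing still takes place. For systems whose local density changes during the run, this may hurt performance, as noted above.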

Norman Geist

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Mattia Felice Palermo
> Sent: Friday, April 1, 2016 16:54
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: Occasional performance slow down using NAMD with Xeon
> Phi
>
> Dear NAMD developers and users,
>
> I've been using NAMD 2.10 compiled with Xeon Phi support at a
> supercomputing facility, and I'm experiencing occasional slowdowns
> whose cause I'm having a hard time figuring out. The HPC machine is
> a cluster with more than 500 nodes, each with two 8-core Intel
> Haswell CPUs and two Intel Phi 7120p cards; the nodes are connected through InfiniBand.
> Simulations are launched through a PBS scheduler, and I'm allocating two
> nodes, on each of which I use 8 cores and two MIC cards, for a total of 16
> cores and 4 MIC cards. Most of the PBS jobs run fine, but it occasionally
> happens that they slow down without any apparent reason. I found out that,
> when simulations are slow, there is something off with NAMD load
> balancing. I am attaching plots of the average and maximum load values
> (output from NAMD) as a function of the runtime for a normal simulation
> and a slow one. I have not found any documentation about the meaning of
> these numbers, but it is evident that when the simulation runs slowly, the
> average load value is higher, and the maximum one is also higher and
> oscillates far more than in a "normal" simulation. The slowdowns
> happen regardless of the nature of the simulated system. All the systems
> I've tried have periodic boundary conditions, and the PME grid is set
> manually (to keep the automatic procedure from changing it from run to run).
>
> I have contacted the user support of the HPC facility and they said
> everything looks fine from the hardware point of view and that it might be
> an issue with NAMD's MIC support, which is reported to be
> experimental in the NAMD documentation.
>
> I know these kinds of issues are quite hard to debug since a lot of variables
> are in play, but do you have any clue what might be the source of these
> slowdowns, if we exclude hardware problems? And also, does anyone have
> a clear explanation of the meaning of the load balancing values output from
> NAMD? I searched through the documentation and have not found any
> reference to them.
>
> Thanks for your attention; I'm of course available to provide more details
> if necessary.
>
> Mattia
