Occasional performance slowdown using NAMD with Xeon Phi

From: Mattia Felice Palermo (mattiafelice.palerm2_at_unibo.it)
Date: Fri Apr 01 2016 - 09:53:37 CDT

Dear NAMD developers and users,

I've been using NAMD 2.10 compiled with Xeon Phi support on a supercomputing facility, and I'm experiencing occasional slowdowns whose cause I'm having a hard time identifying. The HPC machine is a cluster with more than 500 nodes, each with two 8-core Intel Haswell CPUs and two Intel Xeon Phi 7120P cards, and the nodes are connected through InfiniBand.
Simulations are launched through a PBS scheduler. I allocate two nodes and use 8 cores and two MIC cards on each, for a total of 16 cores and 4 MIC cards. Most of the PBS jobs run fine, but occasionally they slow down for no apparent reason. I found that, when simulations are slow, something is off with NAMD's load balancing. I am attaching plots of the average and maximum load values (as printed by NAMD) as a function of runtime for a normal simulation and a slow one. I have not found any documentation on the meaning of these numbers, but it is evident that when the simulation runs slowly, the average load value is higher, and the maximum is also higher and oscillates far more than in a "normal" simulation. The slowdowns happen regardless of the nature of the simulated system. All the systems I've tried have periodic boundary conditions, and the PME grid is set manually (to prevent the automatic procedure from changing it from run to run).
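
For anyone who wants to compare a normal and a slow run in the same way, here is a minimal sketch of how the two load values could be pulled from a NAMD log. It assumes the load balancer lines have the shape "LDB: TIME <t> LOAD: AVG <avg> MAX <max> ..."; that shape is an assumption and should be checked against your own log. The helper names parse_ldb_lines and summarize are hypothetical and not part of NAMD.

```python
import re
import statistics

# Assumed line shape (check your own log before relying on it):
#   LDB: TIME <t> LOAD: AVG <avg> MAX <max> ...
LDB_RE = re.compile(r"^LDB: TIME\s+(\S+)\s+LOAD: AVG\s+(\S+)\s+MAX\s+(\S+)")


def parse_ldb_lines(lines):
    """Return (time, avg_load, max_load) tuples for every matching line."""
    records = []
    for line in lines:
        match = LDB_RE.match(line.strip())
        if not match:
            continue
        try:
            records.append(tuple(float(g) for g in match.groups()))
        except ValueError:
            # Skip lines whose fields are not plain numbers.
            continue
    return records


def summarize(records):
    """Summarize one run: mean AVG load, peak MAX load, spread of MAX load."""
    if not records:
        return None
    avg_loads = [r[1] for r in records]
    max_loads = [r[2] for r in records]
    return {
        "mean_avg_load": statistics.fmean(avg_loads),
        "peak_max_load": max(max_loads),
        # Population standard deviation as a rough measure of oscillation.
        "max_load_stdev": statistics.pstdev(max_loads),
    }


if __name__ == "__main__":
    import sys

    # Usage: python ldb_summary.py normal.log slow.log
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, summarize(parse_ldb_lines(handle)))
```

Running it on the log of a normal job and the log of a slow job gives the same three numbers for each, which makes the difference in average load and in the oscillation of the maximum load easier to quantify than by reading the plots.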

I have contacted the user support of the HPC facility. They said that everything looks fine from the hardware point of view and that it might be an issue with NAMD's support for MICs, which is reported as experimental in the NAMD documentation.

I know this kind of issue is quite hard to debug since many variables are in play, but do you have any idea what might be causing these slowdowns, if we exclude hardware problems? Also, does anyone have a clear explanation of the meaning of the load balancing values printed by NAMD? I searched through the documentation and found no reference to them.

Thank you for your attention. I am of course available to provide more details if necessary.


