NAMD Myrinet issue - Strange behavior

From: Edward Patrick Obrien (edobrien_at_Glue.umd.edu)
Date: Thu Apr 07 2005 - 12:04:31 CDT

Hi All,
   We've been using NAMD over Myrinet (GM-2, i686-MPI-Linux) and for the
most part it seems to be working well. However, I've noticed an occasional
problem on the lead nodes of jobs. Our nodes are dual-processor Xeons, but
we've noticed that on the lead node NAMD seems to kick off extra processes
(on the order of a dozen or more). Also, we see a massive number of CPU
interrupts (as reported by vmstat) and a corresponding increase in system
CPU, which suggests heavy I/O activity. I have a graph of CPU activity on
the node at http://www.lobos.nih.gov/~tim/node-graph.gif that illustrates
this. Note that there is no data for some periods due to the load spikes.
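
For anyone who wants to check their own nodes for the same symptom, the
interrupt rate that vmstat shows can be approximated by sampling the
cumulative "intr" counter in /proc/stat twice. This is only a minimal
sketch, assuming a Linux node; the function names are made up for
illustration and are not part of NAMD or GM.

```python
import time


def total_interrupts(proc_stat_text):
    """Return the cumulative interrupt count from the text of /proc/stat.

    The first number on the 'intr' line is the total number of
    interrupts serviced since boot.
    """
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "intr":
            return int(fields[1])
    raise ValueError("no 'intr' line found")


def interrupt_rate(before_text, after_text, seconds):
    """Interrupts per second between two /proc/stat snapshots."""
    delta = total_interrupts(after_text) - total_interrupts(before_text)
    return delta / seconds


def sample_rate(interval=1.0, path="/proc/stat"):
    """Take two snapshots 'interval' seconds apart and return the rate."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return interrupt_rate(before, after, interval)
```

Calling sample_rate() on the lead node and on another node of the same
job would give a rough comparison of the interrupt load on each.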

Another symptom of this is radically varying completion times. Anticipated
completion times vary from around 100 hours to over 6000 hours! The jobs
actually seem to finish roughly on schedule, though. As a reference, one of
the affected jobs runs on 4 nodes (eight 2.66 GHz Xeon CPUs total) and we
are simulating a system with 18,257 atoms.

We have not noticed this problem when running over GigE, but our GigE jobs
only use 2 nodes. I am not sure if this is a NAMD problem or an issue with
our GM drivers. I originally thought it might be a GM problem, but the
extra NAMD processes give me pause on that conclusion.

Has anybody seen this behavior before?
Thanks,
Ed
