Re: Problem with NAMD on Opteron cluster

From: Marc Q. Ma (qma_at_oak.njit.edu)
Date: Mon Nov 14 2005 - 08:37:55 CST

Harald,

We solved our problems by doing the following:

1. Reinstall NAMD. We recompiled NAMD so that it is linked against the
current Myrinet communication libraries.

2. Use mpiexec instead of "charmrun" or "mpiexec charmrun" to control
the parallel jobs.

After making these changes, we have been running NAMD without problems,
and the performance has been very good. If the above does not solve your
problem, I will put you in touch with our sys admin, who knows more
about the details and can help you further.

cheers!

Marc

On Nov 11, 2005, at 11:36 AM, tepper_at_amolf.nl wrote:

> Dear NAMD users/developers,
>
> We experience a lot of problems with compiling/running NAMD (2.6) on our
> recently installed Opteron cluster (dual processor nodes type 248 and
> GigaBit interconnect). Any help would be appreciated.
> Maybe it is good to say that our problems seem related to the posts by
> Marc Ma (Sep 01/2005) and Ralph Jimenez (Sep 29/2005) but we found no
> responses there that would completely solve our problems.
>
> Here is a summary of our starting situation:
> *) We have only the GNU and Portland compilers available, and were able
> to compile charm and namd only with the GNU ones.
> *) We have compiled versions both with charmrun and with MPI.
> *) Results with both are similar, and also similar to the downloaded
> precompiled binaries (AMD / 64 / TCP).
>
> Here are the problems:
> *) The code runs more or less fine on 2 processors (one node), although
> CPU time seems not to be used to a maximum: when doing 'top' we see
> proc. 1 using 99% in 'user' mode and proc. 2 using 60-80% in 'sys' mode.
> *) This only gets worse on 4 and more processors: more and more time is
> spent in 'sys' mode on many processors and some look even totally dead.
> These findings seem similar to Dr. Ma's posting.
> *) The performance also gets worse over time, basically after the first
> and later 'load balancing' statements in the output file. Here we see
> behavior similar to Dr. Jimenez, namely 'negative timing values'. The one
> response to that previous posting says maybe there is just something wrong
> with the local timer. I would really hope the solution is that simple. Can
> anyone give suggestions on how to find this out and/or where to specify
> the timer during the installation process?
>
> Thanks in advance for any help.
>
> Harald Tepper
> Amsterdam
>
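
On the timer question in the quoted message: the following is a minimal
sketch, not taken from either poster, of one way to check whether a clock
on a given node ever steps backwards, which is what 'negative timing
values' would suggest. The function name count_backward_steps and the
sample count are made up for illustration. It only exercises the clocks
Python exposes on that node; it says nothing about which timer NAMD or
Charm++ was actually built to use.

```python
import time


def count_backward_steps(clock, samples=200_000):
    """Read `clock` repeatedly; count readings earlier than the previous one.

    Returns (number_of_backward_steps, most_negative_delta_in_seconds).
    """
    backward = 0
    worst = 0.0
    prev = clock()
    for _ in range(samples):
        now = clock()
        delta = now - prev
        if delta < 0:
            # The clock went backwards between two consecutive reads.
            backward += 1
            worst = min(worst, delta)
        prev = now
    return backward, worst


if __name__ == "__main__":
    # time.time is the wall clock; time.monotonic is documented never to go
    # backwards, so it serves as a reference for comparison.
    for name, clock in (("time.time", time.time),
                        ("time.monotonic", time.monotonic)):
        backward, worst = count_backward_steps(clock)
        print(f"{name}: {backward} backward steps, worst delta {worst:.9f} s")
```

If the wall clock reports backward steps on a compute node while the
monotonic clock does not, that would be consistent with the "something
wrong with the local timer" explanation, though it would not prove it.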
