Re: hanging at startup phase

From: Allison Heath (aheath_at_houston.rr.com)
Date: Fri May 21 2004 - 10:29:49 CDT

Hi,
I found the problem. Several of the nodes didn't have the local loopback
interface up, so everyone could ping them and they could ping everyone, but
they couldn't ping themselves. So the namd process on each of those nodes was
waiting forever, trying to reach itself. I guess namd doesn't have anything
that times out and reports an error in this case. Also, namd seems to hang if
one of the nodes goes down. It would be nice (and probably pretty easy to put
in?) if there were timeouts / errors in these cases.
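
For anyone who hits the same symptom, one way to check a node before
launching is to have it open a TCP connection to itself with a timeout. The
following is only a minimal Python sketch, not part of NAMD or charmrun; the
function name can_reach_self, the default address 127.0.0.1, and the
2-second timeout are illustrative assumptions.

```python
import socket


def can_reach_self(address="127.0.0.1", timeout=2.0):
    """Return True if this host can open a TCP connection to itself.

    Hypothetical helper: binds a listening socket on `address`, then
    connects back to it. Any socket error or timeout counts as failure.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
            # Port 0 asks the OS for any free ephemeral port.
            server.bind((address, 0))
            server.listen(1)
            port = server.getsockname()[1]
            # The timeout keeps a dead loopback from hanging the check.
            with socket.create_connection((address, port), timeout=timeout):
                return True
    except OSError:
        # Covers bind failures, unreachable network, and timeouts.
        return False


if __name__ == "__main__":
    if can_reach_self():
        print("loopback reachable")
    else:
        print("loopback NOT reachable")
```

Running this on every node (for example over ssh) before starting a
multi-node job would flag a node that cannot talk to itself, instead of
leaving the job to hang.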

Everything seems to run just fine now. Thank you for the info,

Allison Heath

----- Original Message -----
From: "Gengbin Zheng" <gzheng_at_ks.uiuc.edu>
To: "allison" <aheath_at_houston.rr.com>
Cc: <namd-l_at_ks.uiuc.edu>
Sent: Friday, May 21, 2004 12:59 AM
Subject: Re: namd-l: hanging at startup phase

>
> Hi,
>
> Have you tried running on only one node but with 2 processes (using
> the ++local charmrun option)?
>
> Step 0 only creates some internal data structures for communication
> followed by a quiescence detection. I suspect it hangs at the quiescence
> detection where the parallel system is waiting for all processors to
> finish processing all messages.
>
> "Info: REMOVING COM VELOCITY -0.0259478 -0.0273245 -0.014764" printout
> should appear in step 3. Each startup phase/step is followed by a
> quiescence detection. I suspect the quiescence detection somehow fails on
> your machine, but I don't know why.
>
> btw, can other people using the same cluster run namd without problem?
>
> Gengbin
>
> On Wed, 19 May 2004, allison wrote:
>
> > Hello,
> > I am fairly new to using NAMD. I am trying to run it on a cluster running
> > Debian where each node has 2 processors. If I launch two processes on one
> > node using charmrun everything works fine. If I try to use multiple nodes
> > everything works fine until:
> > Info: Entering startup phase 0 with 22577 kB of memory in use
> >
> > If I just try to use two nodes it gets further into the startup phase with:
> > Info: REMOVING COM VELOCITY -0.0259478 -0.0273245 -0.014764
> >
> > Both times it just hangs at those lines. I haven't let it run for more
> > than about an hour before just killing it. One of the nodes shows two namd
> > processes running full tilt, but all of the other spawned processes on the
> > other nodes are sleeping. I asked a few people I know who have run similar
> > namd simulations and they said they had never seen it take this long.
> >
> > Any idea about what's going on? or a way I could get more information about
> > what it's trying to do at these points?
> >
> > Thank you
> >
> > Allison Heath
> >
>
