Re: hanging at startup phase

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Fri May 21 2004 - 10:53:23 CDT

Hi,

Good to hear that you solved the problem.
The Charm++ runtime has some built-in error detection. If one node (that
is, one namd process) is gone, charmrun should detect it and initiate a
global shutdown of all namd processes. I don't know why it failed to do so
in your case. Maybe your network misconfiguration kept the global shutdown
synchronization from getting through.
When delivering an application message, Charm++ makes a best effort: when
a socket times out, it keeps retrying. I agree this can and should be
improved to use a large timeout value and report an error once it expires.
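
To illustrate the kind of improvement meant here, below is a minimal
sketch in Python of a connection attempt that is bounded by a timeout and
a retry limit and then reports an error instead of retrying forever. It is
not Charm++ code; the function name, its parameters and the default values
are all hypothetical.

```python
import socket
import time


def connect_with_retries(host, port, timeout_s=5.0, max_attempts=3,
                         delay_s=1.0, connect=socket.create_connection):
    """Open a TCP connection, giving up after max_attempts tries.

    Hypothetical helper for illustration only; not part of Charm++ or NAMD.
    The connect argument exists so the function can be exercised without
    a real network.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Each attempt is bounded by timeout_s rather than blocking forever.
            return connect((host, port), timeout=timeout_s)
        except OSError as exc:  # socket timeouts are OSError subclasses
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_s)  # short pause before the next attempt
    # All attempts failed: report the problem instead of hanging.
    raise ConnectionError(
        f"could not reach {host}:{port} after {max_attempts} attempts"
    ) from last_error
```

Pointing such a check at the node's own address before launching a job
would be one way to surface a missing loopback interface as an explicit
error rather than a hang.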

Gengbin

On Fri, 21 May 2004, Allison Heath wrote:

> Hi,
> I found the problem. Several of the nodes didn't have local loopback up, so
> everyone could ping them and they could ping everyone, but they couldn't
> ping themselves. So the namd process on such a node was waiting forever,
> trying to reach itself. I guess namd doesn't have anything that times out
> and reports an error in this case. Also, namd seems to hang if one of the
> nodes goes down. It would be nice (and probably pretty easy to put in?) if
> there were timeouts / errors in these cases.
>
> Everything seems to run just fine now. Thank you for the info,
>
> Allison Heath
>
> ----- Original Message -----
> From: "Gengbin Zheng" <gzheng_at_ks.uiuc.edu>
> To: "allison" <aheath_at_houston.rr.com>
> Cc: <namd-l_at_ks.uiuc.edu>
> Sent: Friday, May 21, 2004 12:59 AM
> Subject: Re: namd-l: hanging at startup phase
>
>
> >
> > Hi,
> >
> > Have you tried running on only one node but with 2 processes (using
> > ++local charmrun option)?
> >
> > Step 0 only creates some internal data structures for communication
> > followed by a quiescence detection. I suspect it hangs at the quiescence
> > detection where the parallel system is waiting for all processors to
> > finish processing all messages.
> >
> > "Info: REMOVING COM VELOCITY -0.0259478 -0.0273245 -0.014764" printout
> > should appear in step 3. Each startup phase/step is followed by a
> > quiescence detection. I suspect the quiescence detection somehow fails on
> > your machine, but I don't know why.
> >
> > btw, can other people using the same cluster run namd without problems?
> >
> > Gengbin
> >
> > On Wed, 19 May 2004, allison wrote:
> >
> > > Hello,
> > > I am fairly new to using NAMD. I am trying to run it on a cluster running
> > > Debian where each node has 2 processors. If I launch two processes on one
> > > node using charmrun everything works fine. If I try to use multiple nodes
> > > everything works fine until:
> > > Info: Entering startup phase 0 with 22577 kB of memory in use
> > >
> > > If I just try to use two nodes it gets further into the startup phase with:
> > > Info: REMOVING COM VELOCITY -0.0259478 -0.0273245 -0.014764
> > >
> > > Both times it just hangs at those lines. I haven't let it run for more
> > > than about an hour before just killing it. One of the nodes shows two namd
> > > processes running full tilt, but all of the other spawned processes on the
> > > other nodes are sleeping. I asked a few people I know who have run similar
> > > namd simulations and they said they had never seen it take this long.
> > >
> > > Any idea about what's going on? or a way I could get more information about
> > > what it's trying to do at these points?
> > >
> > > Thank you
> > >
> > > Allison Heath
> > >
> >
>
