From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Fri Mar 12 2004 - 13:18:01 CST
Hi, Leandro and min,
NAMD is not fault tolerant yet, which means that if one node goes down,
the whole job will be killed by the job scheduler, the MPI environment
(e.g. in mpi-linux), or the charmrun process on the master node (e.g. in
net-linux). The major concern in the current implementation is a clean
shutdown of NAMD that leaves no stray processes behind after a crash.
Thanks to the load balancing capability in Charm++, NAMD can adaptively
migrate work among processors, so when one node is heavily loaded and
not responding in a timely fashion, the load balancer can migrate the work
away to other processors. This, however, only works while the whole program
is still running.
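As a rough illustration of the idea (not the actual Charm++ load
balancer), the following Python sketch greedily moves work objects from an
overloaded processor to the least loaded one. The function name, the
`loads` layout and the `threshold` parameter are all assumptions made up
for this example.

```python
def rebalance(loads, threshold=1.25):
    """Greedy sketch: move work objects off overloaded processors.

    loads: dict mapping processor id -> list of per-object costs
           (hypothetical layout, modified in place).
    Returns a list of (cost, source, destination) migrations.
    """
    migrations = []
    total = sum(sum(objs) for objs in loads.values())
    average = total / len(loads)
    # Visit the most heavily loaded processors first.
    for src in sorted(loads, key=lambda p: sum(loads[p]), reverse=True):
        # Shed the largest objects first while this processor is overloaded.
        for cost in sorted(loads[src], reverse=True):
            if sum(loads[src]) <= threshold * average:
                break
            dst = min(loads, key=lambda p: sum(loads[p]))
            # Only migrate if the move actually reduces the imbalance.
            if dst == src or sum(loads[dst]) + cost >= sum(loads[src]):
                continue
            loads[src].remove(cost)
            loads[dst].append(cost)
            migrations.append((cost, src, dst))
    return migrations


if __name__ == "__main__":
    loads = {0: [4, 4, 4, 4], 1: [1], 2: [1]}
    print(rebalance(loads))
    print(loads)
```

Note that the sketch assumes every processor in `loads` is still alive and
able to receive work, which is exactly the limitation described above.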
To achieve fault tolerance in NAMD, one can install a fault-tolerant MPI
package and build NAMD on top of it. I don't know of any stable
implementation available for download, although the literature has studied
the problem extensively. Most of the traditional methods are still limited
to checkpointing the program state to disk and restarting the program from
the checkpoints when a crash happens.
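A minimal sketch of that traditional checkpoint/restart approach, written
in Python purely for illustration (this is not how NAMD or Charm++ store
their state). The function names, the state dictionary and the checkpoint
interval are assumptions invented for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to disk atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename within the same directory


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every=100):
    """Toy simulation loop that resumes from the last checkpoint on disk."""
    state = load_checkpoint(path, {"step": 0, "value": 0.0})
    while state["step"] < total_steps:
        state["value"] += 1.0  # stand-in for one timestep of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

If the job is killed between checkpoints, rerunning `run` with the same
path repeats at most `checkpoint_every - 1` steps. The job still has to be
restarted by hand or by the scheduler, which is the limitation of this
approach.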
For non-MPI cluster versions of NAMD, which are built on the net- versions
of Charm++ (which use UDP/TCP sockets), fault tolerance needs
to come directly from the Charm++ run-time. This work is in progress at PPL.
I have implemented a prototype fault tolerance protocol for Charm++
that allows a Charm++ program to continue running on the remaining
processors without downtime or a manual restart. It will take time, though,
to port NAMD2 onto the new Charm++ with fault tolerance features, since
NAMD2 was designed without fault tolerance support in mind. The next
generation of NAMD will have fault tolerance support.
On Fri, 12 Mar 2004 yu275197_at_yorku.ca wrote:
> In my experience (running a 70-node Mac cluster) the simulation will stop if one
> system turns off. However, if the system slows down for some reason or freezes
> (as in it won't respond to keyboard commands), NAMD simply redistributes the
> workload to another node and keeps running :)
> Quoting Leandro Martinez <lmartinez_at_iqm.unicamp.br>:
> > Hi all,
> > Suppose I have a cluster with, let's say, 20 nodes,
> > running a simulation with NAMD. If one of the nodes goes
> > down (due to overheating, or anything else), does the
> > simulation stop altogether?
> > Thanks,
> > Leandro.
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:38:29 CST