Paper Citing NAMD - Abstract
Besseron, Xavier; Guitier, Thierry
Optimised Recovery with a Coordinated Checkpoint/Rollback Protocol for Domain Decomposition Applications
MODELLING, COMPUTATION AND OPTIMIZATION IN INFORMATION SYSTEMS AND MANAGEMENT SCIENCES, PROCEEDINGS, 14:497-506, 2008
Fault-tolerance protocols play an important role in today long runtime scientific parallel applications. The probability of a failure may be important due to the number of unreliable components involved during an execution. In this paper we present our approach and preliminary results about a new checkpoint/rollback protocol based on a coordinated scheme. One feature of this protocol is that fault recovery only requires a partial restart of other processes thanks to the availability of an abstract representation of the execution. Simulations on a domain decomposition application show that the amount of computations required to restart and the number of involved processes are reduced compared to the classical global rollback protocol.