Besseron, Xavier; Gautier, Thierry
Optimised Recovery with a Coordinated Checkpoint/Rollback Protocol for Domain Decomposition Applications
MODELLING, COMPUTATION AND OPTIMIZATION IN INFORMATION SYSTEMS AND MANAGEMENT SCIENCES, PROCEEDINGS, 14:497-506, 2008

Fault-tolerance protocols play an important role in today's long-running parallel scientific applications. The probability of a failure can be high because of the number of unreliable components involved in an execution. In this paper we present our approach and preliminary results for a new checkpoint/rollback protocol based on a coordinated scheme. One feature of this protocol is that fault recovery requires only a partial restart of the other processes, thanks to the availability of an abstract representation of the execution. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared to the classical global rollback protocol.
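
The abstract gives no details of the protocol, so the following is only a toy illustration of why a dependency graph can allow a partial restart; it is not the authors' algorithm. It assumes a hypothetical 1D domain decomposition with one domain per process, in which domain d at iteration k depends on domains d-1, d and d+1 at iteration k-1, a coordinated checkpoint of every domain at iteration `checkpoint_iter`, and surviving processes that still hold their state at iteration `failure_iter` but none of the intermediate states. All names are invented for this sketch.

```python
def tasks_to_reexecute(n_domains, failed, checkpoint_iter, failure_iter):
    """Return the (domain, iteration) tasks needed to rebuild the lost state.

    Walks the assumed 3-point stencil dependency graph backwards from the
    lost task (failed, failure_iter) until the checkpointed iteration.
    """
    needed = set()
    frontier = {failed}  # domains whose state is required at iteration k
    for k in range(failure_iter, checkpoint_iter, -1):
        needed.update((d, k) for d in frontier)
        # Each required domain needs itself and its neighbours one iteration earlier.
        frontier = {
            n
            for d in frontier
            for n in (d - 1, d, d + 1)
            if 0 <= n < n_domains
        }
    return needed


def global_rollback_cost(n_domains, checkpoint_iter, failure_iter):
    """Tasks re-executed when every process restarts from the checkpoint."""
    return n_domains * (failure_iter - checkpoint_iter)


if __name__ == "__main__":
    partial = tasks_to_reexecute(n_domains=16, failed=8,
                                 checkpoint_iter=0, failure_iter=4)
    involved = {d for d, _ in partial}
    print(len(partial), "tasks vs", global_rollback_cost(16, 0, 4))  # 16 tasks vs 64
    print(len(involved), "processes vs", 16)                         # 7 processes vs 16
```

Under these assumptions the set of tasks to re-execute forms a cone that widens by one domain on each side per iteration, so the saving shrinks as the interval between the checkpoint and the failure grows.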
