CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, Vol. 30, no. 3, pp. 501-514
Main Authors: Shahzad, Faisal; Thies, Jonas; Kreutzer, Moritz; Zeiser, Thomas; Hager, Georg; Wellein, Gerhard
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01-03-2019
Description
Summary: Fault tolerance and power consumption are two of the prime challenges that the High Performance Computing (HPC) community anticipates in using future generations of supercomputers efficiently. Checkpoint/Restart (CR) has been and still is the most widely used technique for dealing with hard failures. Application-level CR is the most efficient CR technique in terms of overhead, but it requires considerable implementation effort. This work presents the implementation of our C++-based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extensible library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be used out of the box, and the library can easily be extended with further data types. To reduce overhead, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node-level checkpointing. Second, CRAFT provides an easier interface for dynamic process recovery based on User-Level Failure Mitigation (ULFM), which significantly reduces the complexity and effort of implementing failure detection and communication recovery mechanisms. By using both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the challenges addressed by the library, its design, and its use. The associated overheads are analyzed using benchmarks.
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2018.2866794
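
Illustration: The summary describes application-level checkpointing, in which the application itself decides what state to save and when. The following is a minimal sketch of that general idea in standard C++ only. It does not use or reproduce the CRAFT API; `SolverState`, `write_checkpoint`, `read_checkpoint`, and `run` are hypothetical names chosen for this example, and the binary file layout is an assumption. Asynchronous checkpointing, SCR, and ULFM-based recovery are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical application state: an iteration counter plus a data array.
struct SolverState {
    std::size_t iteration = 0;
    std::vector<double> values;
};

// Write the state to `path`. Data goes to a temporary file that is then
// renamed, so on POSIX systems a crash mid-write leaves the previous
// checkpoint intact.
bool write_checkpoint(const SolverState& s, const std::string& path) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        const std::size_t n = s.values.size();
        out.write(reinterpret_cast<const char*>(&s.iteration), sizeof s.iteration);
        out.write(reinterpret_cast<const char*>(&n), sizeof n);
        out.write(reinterpret_cast<const char*>(s.values.data()),
                  static_cast<std::streamsize>(n * sizeof(double)));
        if (!out) return false;
    }  // the stream is closed here, before the rename
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Load the state from `path`. Returns false (leaving `s` untouched) if no
// readable checkpoint exists.
bool read_checkpoint(SolverState& s, const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    SolverState loaded;
    std::size_t n = 0;
    in.read(reinterpret_cast<char*>(&loaded.iteration), sizeof loaded.iteration);
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    if (!in) return false;
    loaded.values.resize(n);
    in.read(reinterpret_cast<char*>(loaded.values.data()),
            static_cast<std::streamsize>(n * sizeof(double)));
    if (!in) return false;
    s = std::move(loaded);
    return true;
}

// Iterate until `stop_at`, resuming from the checkpoint at `path` if one
// exists and writing a new checkpoint every `checkpoint_every` iterations.
SolverState run(std::size_t stop_at, std::size_t checkpoint_every,
                const std::string& path) {
    assert(checkpoint_every > 0);
    SolverState s;
    if (!read_checkpoint(s, path)) {
        s.values.assign(4, 0.0);  // no checkpoint found: fresh start
    }
    while (s.iteration < stop_at) {
        for (double& v : s.values) v += 1.0;  // stand-in for real work
        ++s.iteration;
        if (s.iteration % checkpoint_every == 0) write_checkpoint(s, path);
    }
    return s;
}
```

Calling `run(7, 5, path)` and then `run(10, 5, path)` mimics a failure after iteration 7: the second call resumes from the checkpoint written at iteration 5 rather than from zero. Even this small sketch needs hand-written serialization for each data type, which is the kind of implementation effort the summary says a checkpointing library is meant to reduce.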