The Reliability Wall for Exascale Supercomputing

Reliability is a key challenge to be understood to turn the vision of exascale supercomputing into reality. Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures, by incorporating fault-tolerance mechanisms to improve their reliability a...

Full description

Saved in:

Bibliographic Details
Published in:	IEEE transactions on computers Vol. 61; no. 6; pp. 767 - 779
Main Authors:	Yang, Xuejun, Wang, Zhiyuan, Xue, Jingling, Zhou, Yun
Format:	Journal Article
Language:	English
Published:	IEEE 01-06-2012
Subjects:	Checkpointing exascale Fault tolerance Fault tolerant systems performance metric reliability speedup Reliability theory reliability wall Software reliability Strontium
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	Reliability is a key challenge to be understood to turn the vision of exascale supercomputing into reality. Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures, by incorporating fault-tolerance mechanisms to improve their reliability and availability. As the benefits of fault-tolerance mechanisms rarely come without associated time and/or capital costs, reliability will limit the scalability of parallel applications. This paper introduces for the first time the concept of "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance. We quantify the effects of reliability on scalability, by proposing a reliability speedup, defining quantitatively the reliability wall, giving an existence theorem for the reliability wall, and categorizing a given system according to the time overhead incurred by fault tolerance. We also generalize these results into a general reliability speedup/wall framework by considering not only speedup but also costup. We analyze and extrapolate the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both employing checkpointing for fault tolerance, and have also studied the general reliability wall using Intrepid. These case studies provide insights on how to mitigate reliability-wall effects in system design and through hardware/software optimizations in peta/exascale supercomputing.
ISSN:	0018-9340 1557-9956
DOI:	10.1109/TC.2011.106