Fault-tolerant high-performance matrix multiplication: theory and practice

Bibliographic Details
Published in:	2001 International Conference on Dependable Systems and Networks pp. 47 - 56
Main Authors:	Gunnels, J.A., Katz, D.S., Quintana-Orti, E.S., Van de Geijn, R.A.
Format:	Conference Proceeding
Language:	English
Published:	2001
Subjects:	Contracts; Costs; Error correction; Fault tolerance; High performance computing; Laboratories; Linear algebra; NASA; Propulsion; Space technology

Description
Summary:	We extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C=AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.
ISBN:	9780769511016 0769511015
DOI:	10.1109/DSN.2001.941390
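
Illustration:	The summary does not spell out the detection methods, so the following is only a minimal sketch of the general checksum idea behind algorithmic fault tolerance, not the authors' implementation. It assumes a right-sided check with an all-ones vector w: since C=AB implies Cw = A(Bw), the two sides can be compared using only matrix-vector products. The helper names (matmul, matvec, checksum_mismatch) and the tolerance tol are hypothetical choices made for this example.

```python
def matmul(A, B):
    """Plain triple-loop product of two list-of-lists matrices."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matvec(M, x):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]


def checksum_mismatch(A, B, C, tol=1e-9):
    """Return True if C*w differs from A*(B*w) for w = all ones.

    tol is an assumed relative threshold that absorbs rounding error;
    a real implementation would have to choose it with care.
    """
    w = [1.0] * len(B[0])
    expected = matvec(A, matvec(B, w))  # two matrix-vector products
    actual = matvec(C, w)               # one more matrix-vector product
    scale = max(1.0, max(abs(e) for e in expected))
    return any(abs(a - e) > tol * scale for a, e in zip(actual, expected))


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = matmul(A, B)
clean = checksum_mismatch(A, B, C)      # no entry has been corrupted yet

C[0][1] += 0.5                          # simulate one corrupted entry of C
corrupted = checksum_mismatch(A, B, C)  # the check now flags a mismatch
```

In this sketch the check costs three matrix-vector products, which is small next to the product itself for large matrices. Once a mismatch is flagged, one simple response would be to recompute the affected product; the roll-back scheme described in the paper is not given in this record.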