Timely failure detection in a large distributed real-time system

Bibliographic Details
Published in: Proceedings of WORDS '94, The First Workshop on Object-Oriented Real-Time Dependable Systems, pp. 118-123
Main Authors: Ng, T.P.; Patel, V.N.
Format: Conference Proceeding
Language: English
Published: IEEE 1994
Description
Summary: The paper describes the experience of designing and implementing failure detection and reporting in a large distributed real-time system used for air traffic control (ATC). We believe that systematic analysis is needed to guide the failure detection design and to track the large number of failures that it deals with. Analysis, such as determining how fast failures have to be detected, should be performed carefully to avoid redesigns later. A comprehensive analysis also provides a basis for subsequently testing the design, during which fault injection and extended testing are needed to evaluate and debug it. Failure detectors should detect specific failures so that appropriate reports and recovery actions can be initiated after detection.
ISBN: 9780818670831, 0818670835
DOI: 10.1109/WORDS.1994.518680